
Advanced Slurm

Slurm environment variables

When a job scheduled by Slurm starts, it needs to know certain information about how it was scheduled: for example, its working directory or the nodes that were allocated to it. Slurm passes this information to the job via environment variables.

The following is a list of commonly used variables that are set by Slurm for each job.

| Variable | Description |
| --- | --- |
| $SLURM_JOB_ID | The job ID |
| $SLURM_SUBMIT_DIR | The path of the job submission directory |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission |
| $SLURM_JOB_NODELIST | List of the nodes assigned to the job |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu |
| $SLURM_MEM_PER_GPU | Memory per GPU |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem |
| $SLURM_GPUS | Number of GPUs requested |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS |
| $SLURM_NNODES | Total number of nodes in the job's resource allocation |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number |
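
As a quick illustration (the script and job name are hypothetical), a minimal batch script can simply print a few of these variables to the job's output file:

#!/bin/bash
#SBATCH --job-name=show-env        # hypothetical job name
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

# Print a few of the variables Slurm sets for this job
echo "Job ID        : $SLURM_JOB_ID"
echo "Submit dir    : $SLURM_SUBMIT_DIR"
echo "Node list     : $SLURM_JOB_NODELIST"
echo "Tasks (-n)    : $SLURM_NTASKS"
echo "CPUs per task : $SLURM_CPUS_PER_TASK"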

MesoBFC cluster-specific variables

| Variable | Description |
| --- | --- |
| $GPU_SCRATCH_DIR | Available only on the GPU partition. Local scratch folder backed by SSD. This folder is purged on demand |
| $JOBSCRATCH or $TMPDIR | Local scratch directory, limited in size. Deleted on job exit |
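
A common pattern is to stage data into the local scratch space, run the computation there, and copy the results back before the job ends. A minimal sketch (input.dat, output.dat and my_program are placeholders):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Stage the input data into the node-local scratch space
cp "$SLURM_SUBMIT_DIR/input.dat" "$JOBSCRATCH/"
cd "$JOBSCRATCH"

# Run the computation on the fast local disk
srun "$SLURM_SUBMIT_DIR/my_program" input.dat > output.dat

# Copy the results back: $JOBSCRATCH is deleted on job exit
cp output.dat "$SLURM_SUBMIT_DIR/"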

Job States

During its lifetime, a job passes through several states. The most common states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.

| State | Description |
| --- | --- |
| PD | Pending. Job is waiting for resource allocation |
| R | Running. Job has an allocation and is running |
| S | Suspended. Execution has been suspended and resources have been released for other jobs |
| CA | Cancelled. Job was explicitly cancelled by the user or the system administrator |
| CG | Completing. Job is in the process of completing. Some processes on some nodes may still be active |
| CD | Completed. Job has terminated all processes on all nodes with an exit code of zero |
| F | Failed. Job has terminated with a non-zero exit code or other failure condition |
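
These codes are what squeue prints in its ST column. For example, to list only your own pending and running jobs:

# The ST column of the output shows the state codes from the table above
squeue -u $USER -t PENDING,RUNNING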

A job may remain in the pending state for a while. The main reasons are:

(Resources) the job is waiting for resources to become available so that the job's resource request can be fulfilled.

(Priority) the job is not allowed to run because at least one higher-priority job is waiting for resources.

(Dependency) the job is waiting for another job to finish first (--dependency=… option).

(DependencyNeverSatisfied) the job is waiting for a dependency that can never be satisfied. Such a job will remain pending forever. Please cancel such jobs.

(QOSMaxCpuPerUserLimit) the job is not allowed to start because your currently running jobs consume all allowed CPU resources for your user in a specific partition. Wait for jobs to finish.
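
The pending reason is shown in the last column of the squeue output; it can also be selected explicitly with the %r output field, for instance:

# List your pending jobs together with the reason they are not yet running
squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.10T %r"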

Job Monitoring

Use the squeue command to get a high-level overview of all active (running and pending) jobs in the cluster.

Use the scontrol command to show more detailed information about a job. For example: $ scontrol show job 500

Use jobinfo or seff to view information about a running or completed job.
For example:

$ jobinfo 1863
Name                : 15_15_100
User                : xxxxx
Account             : yyyyy
Partition           : mpi
Nodes               : node4-[13-14]
Cores               : 48
GPUs                : 0
State               : COMPLETED
ExitCode            : 0:0
Submit              : 2022-01-09T00:58:21
Start               : 2022-01-09T23:43:24
End                 : 2022-01-10T08:02:16
Waited              : 22:45:03
Reserved walltime   : 1-00:00:00
Used walltime       : 08:18:52
Used CPU time       : 16-15:00:06
% User (Computation): 98.91%
% System (I/O)      : 1.09%
Mem reserved        : 3872M/core
Max Mem used        : 57.63M (node4-13)
Max Disk Write      : 1.66G (node4-13)
Max Disk Read       : 13.95M (node4-13)
$ seff 1863
Job ID: 1863
Cluster: mesobfc
User/Group: xxxx/yyyyy
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 16-15:00:06
CPU Efficiency: 99.98% of 16-15:05:36 core-walltime
Job Wall-clock time: 08:18:52
Memory Utilized: 2.70 GB (estimated maximum)
Memory Efficiency: 1.49% of 181.50 GB (3.78 GB/core)

To monitor your jobs with standard Linux commands (ps, top, ...), you can connect to the compute nodes using ssh.

This works only for nodes on which you have running jobs. When the job finishes, you will be logged out of the node.
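
For example (node4-13 is simply the node from the jobinfo output above):

# Find out which node(s) your job is running on (%N is the node list)
squeue -u $USER -o "%.10i %.10T %N"

# Connect to one of them and watch your own processes
ssh node4-13
top -u $USER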

Job accounting data

After the job has left the queue, the sacct command reports its statistics. For example:

$ sacct -j 10889 

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10889        nf-shortV+        smp        xxx          6  COMPLETED      0:0
10889.batch       batch                   xxx          6  COMPLETED      0:0
10889.extern     extern                   xxx          6  COMPLETED      0:0

By default, the sacct command only shows information about the user's jobs from the current day. With the --starttime flag, the command looks back as far as the given date, e.g.:

sacct --user=$USER --starttime=YYYY-MM-DD

The --format flag can be used to choose which columns the command outputs:

sacct --user=$USER --format=var_1,var_2, ... ,var_N

| Variable | Description |
| --- | --- |
| Account | The account the job ran under. |
| AveCPU | Average (system + user) CPU time of all tasks in the job. |
| AveRSS | Average resident set size of all tasks in the job. |
| AveVMSize | Average virtual memory size of all tasks in the job. |
| CPUTime | Formatted (Elapsed time * CPU count) time used by a job or step. |
| Elapsed | The job's elapsed time, formatted as DD-HH:MM:SS. |
| ExitCode | The exit code returned by the job script or salloc. |
| JobID | The ID of the job. |
| JobName | The name of the job. |
| MaxRSS | Maximum resident set size of all tasks in the job. |
| MaxVMSize | Maximum virtual memory size of all tasks in the job. |
| MaxDiskRead | Maximum number of bytes read by all tasks in the job. |
| MaxDiskWrite | Maximum number of bytes written by all tasks in the job. |
| ReqCPUS | Requested number of CPUs. |
| ReqMem | Requested amount of memory. |
| ReqNodes | Requested number of nodes. |
| NCPUS | The number of CPUs used in the job. |
| NNodes | The number of nodes used in the job. |
| User | The username of the person who ran the job. |

A full list of variables for the --format flag can be obtained with the --helpformat option.
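
For example, to report the elapsed time and memory usage of the job shown above (job ID 10889), one could use:

sacct -j 10889 --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,State,ExitCode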

Job dependencies

A job can be given the constraint that it only starts after another job has finished.

In the following example, we have two Jobs, A and B. We want Job B to start after Job A has successfully completed.

First we start Job A by submitting it via sbatch:

sbatch jobA.sh

Making note of the assigned job ID for Job A, we then submit Job B with the added condition that it only starts after Job A has successfully completed:

sbatch --dependency=afterok:jobID_A jobB.sh

If we want Job B to start after several other Jobs have completed, we can specify additional jobs, using a : as a delimiter:

sbatch --dependency=afterok:jobID_A:jobID_C:jobID_D jobB.sh

We can also tell Slurm to run Job B even if Job A fails, like so:

sbatch --dependency=afterany:jobID_A jobB.sh

Slurm supports a number of different dependency types, for example:

-d after:123456      # job can begin execution after the specified job has begun execution
-d afterany:123456   # job can begin execution after the specified job has finished
-d afternotok:123456 # job can begin execution after the specified job has failed (exit code not equal to zero)
-d afterok:123456    # job can begin execution after the specified job has finished successfully (exit code zero)
-d singleton         # job can begin execution after any previously launched job with the same name and user has finished
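
When scripting such pipelines, the job ID of Job A can be captured automatically with sbatch --parsable, which prints only the job ID. A short sketch, reusing jobA.sh and jobB.sh from the example above:

# Submit Job A and capture its job ID
jobid_A=$(sbatch --parsable jobA.sh)

# Submit Job B so that it only starts once Job A has completed successfully
sbatch --dependency=afterok:$jobid_A jobB.sh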
tip

One can use the -d option instead of --dependency.

Job Array

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial options (e.g. size, time limit, etc.), however it is possible to change some of these options after the job has begun execution using the scontrol command specifying the JobID of the array or individual ArrayJobID.

Full documentation here
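
For example, the time limit of a pending array could be changed with scontrol. In this sketch, job ID 123 is illustrative, and the <ArrayJobID>_<index> notation for addressing a single task is an assumption about the usual Slurm syntax:

# Change the time limit of the whole array (job ID 123 is illustrative)
scontrol update JobId=123 TimeLimit=02:00:00

# Change it for a single task only (assumed <ArrayJobID>_<index> notation)
scontrol update JobId=123_7 TimeLimit=02:00:00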

Some possibilities:

# Submit a job array with index values between 0 and 31
#SBATCH --array=0-31

# Submit a job array with index values of 1, 3, 5 and 7
#SBATCH --array=1,3,5,7

# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
#SBATCH --array=1-7:2

Example: processing many files

Running a calculation on several input files can be done with a single job script:

#!/bin/bash -l
#SBATCH --array=0-9                         # 10 tasks for 10 fastq files

INPUTS=(../fastqc/*.fq.gz)                  # INPUTS is a bash array holding the files, INPUTS[0] to INPUTS[9]
srun fastqc ${INPUTS[$SLURM_ARRAY_TASK_ID]} # $SLURM_ARRAY_TASK_ID takes the values 0 to 9
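
Each array task writes to its own output file; by default Slurm names these slurm-<ArrayJobID>_<TaskID>.out. A custom name can be set with the %A (array master job ID) and %a (task index) placeholders; the file name fastqc_%A_%a.out below is just an illustration:

#SBATCH --output=fastqc_%A_%a.out   # %A = array master job ID, %a = array task index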