Advanced Slurm
Slurm environment variables
When a job scheduled by Slurm starts, it needs certain information about how it was scheduled: for example, its working directory or the nodes that were allocated to it. Slurm passes this information to the job via environment variables.
The following is a list of commonly used variables that Slurm sets for each job; a short example script printing some of them follows the table.
Variable | Description |
---|---|
$SLURM_JOB_ID | The Job ID |
$SLURM_SUBMIT_DIR | The path of the job submission directory |
$SLURM_SUBMIT_HOST | The hostname of the node used for job submission |
$SLURM_JOB_NODELIST | List of the nodes assigned to the job |
$SLURM_CPUS_PER_TASK | Number of CPUs per task |
$SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node |
$SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node |
$SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU |
$SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu |
$SLURM_MEM_PER_GPU | Memory per GPU |
$SLURM_MEM_PER_NODE | Memory per node. Same as --mem |
$SLURM_GPUS | Number of GPUs requested |
$SLURM_NTASKS | Same as -n, --ntasks. The number of tasks |
$SLURM_NTASKS_PER_NODE | Number of tasks requested per node |
$SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket |
$SLURM_NTASKS_PER_CORE | Number of tasks requested per core |
$SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU |
$SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS |
$SLURM_NNODES | Total number of nodes in the job’s resource allocation |
$SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node |
$SLURM_ARRAY_JOB_ID | Job array’s master job ID number |
$SLURM_ARRAY_TASK_ID | Job array ID (index) number |
$SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array |
$SLURM_ARRAY_TASK_MAX | Job array’s maximum ID (index) number |
$SLURM_ARRAY_TASK_MIN | Job array’s minimum ID (index) number |
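As an illustration, a minimal job script can simply print some of these variables (a sketch; the resource values are placeholders):
#!/bin/bash -l
#SBATCH --job-name=env_demo
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
echo "Job ID        : $SLURM_JOB_ID"
echo "Submit dir    : $SLURM_SUBMIT_DIR"
echo "Node list     : $SLURM_JOB_NODELIST"
echo "Tasks         : $SLURM_NTASKS"
echo "CPUs per task : $SLURM_CPUS_PER_TASK"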
MesoBFC cluster-specific variables
Variable | Description |
---|---|
$GPU_SCRATCH_DIR | Available only on the GPU partition. Local SSD-based scratch directory. This directory is purged on demand |
$JOBSCRATCH or $TMPDIR | Local scratch directory, limited in size. Deleted on job exit |
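A sketch of how the local scratch directory might be used inside a job script (the program and file names are placeholders):
#!/bin/bash -l
#SBATCH --ntasks=1
cp "$SLURM_SUBMIT_DIR"/input.dat "$JOBSCRATCH"/   # stage the input to fast local scratch
cd "$JOBSCRATCH"
my_program input.dat > result.out                 # run against local scratch
cp result.out "$SLURM_SUBMIT_DIR"/                # copy results back: $JOBSCRATCH is deleted on job exit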
Job States
During its lifetime, a job passes through several states. The most common states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
State | Description |
---|---|
PD | Pending. Job is waiting for resource allocation |
R | Running. Job has an allocation and is running |
S | Suspended. Execution has been suspended and resources have been released for other jobs |
CA | Cancelled. Job was explicitly cancelled by the user or the system administrator |
CG | Completing. Job is in the process of completing. Some processes on some nodes may still be active |
CD | Completed. Job has terminated all processes on all nodes with an exit code of zero |
F | Failed. Job has terminated with non-zero exit code or other failure condition |
A job may remain in the pending state for a while. The main reasons are:
(Resources) the job is waiting for resources to become available so that its resource request can be fulfilled.
(Priority) the job is not allowed to run because at least one higher-priority job is waiting for resources.
(Dependency) the job is waiting for another job to finish first (--dependency=… option).
(DependencyNeverSatisfied) the job is waiting for a dependency that can never be satisfied. Such a job will remain pending forever; please cancel such jobs.
(QOSMaxCpuPerUserLimit) the job is not allowed to start because your currently running jobs already consume all the CPU resources allowed for your user in a specific partition. Wait for some of your jobs to finish.
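The reason why a job is pending is shown in the squeue output; for example, a command along these lines lists your pending jobs together with their reason (%r):
$ squeue -u $USER -t PENDING -o "%.10i %.20j %.10T %r"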
Job Monitoring
Use the squeue command to get a high-level overview of all active (running and pending) jobs in the cluster.
Use the scontrol command to show more detailed information about a job. For example: $ scontrol show job 500
Use jobinfo or seff to view information about a running or completed job. For example:
$ jobinfo 1863
Name : 15_15_100
User : xxxxx
Account : yyyyy
Partition : mpi
Nodes : node4-[13-14]
Cores : 48
GPUs : 0
State : COMPLETED
ExitCode : 0:0
Submit : 2022-01-09T00:58:21
Start : 2022-01-09T23:43:24
End : 2022-01-10T08:02:16
Waited : 22:45:03
Reserved walltime : 1-00:00:00
Used walltime : 08:18:52
Used CPU time : 16-15:00:06
% User (Computation): 98.91%
% System (I/O) : 1.09%
Mem reserved : 3872M/core
Max Mem used : 57.63M (node4-13)
Max Disk Write : 1.66G (node4-13)
Max Disk Read : 13.95M (node4-13)
$ seff 1863
Job ID: 1863
Cluster: mesobfc
User/Group: xxxx/yyyyy
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 16-15:00:06
CPU Efficiency: 99.98% of 16-15:05:36 core-walltime
Job Wall-clock time: 08:18:52
Memory Utilized: 2.70 GB (estimated maximum)
Memory Efficiency: 1.49% of 181.50 GB (3.78 GB/core)
To use Linux commands (ps, top, ...) to monitor your jobs, you can connect to the compute nodes using ssh.
This works only for nodes on which you have running jobs. When the job finishes, you will be kicked out of the node.
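For example, a session might look like this (node4-13 is a placeholder; use a node from your job's node list as shown by squeue or $SLURM_JOB_NODELIST):
$ ssh node4-13
$ top -u $USER                          # interactive view of your processes
$ ps -u $USER -o pid,pcpu,pmem,comm     # one-shot listing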
Job accounting data
After the job has left the queue, the sacct command can be used to report job statistics. For example:
$ sacct -j 10889
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10889 nf-shortV+ smp xxx 6 COMPLETED 0:0
10889.batch batch xxx 6 COMPLETED 0:0
10889.extern extern xxx 6 COMPLETED 0:0
By default, the sacct command only shows information about the user's jobs from the current day. With the --starttime flag, it looks further back, to the given date, e.g.:
sacct --user=$USER --starttime=YYYY-MM-DD
The --format flag can be used to choose the command output (a concrete example follows the table below):
sacct --user=$USER --format=var_1,var_2, ... ,var_N
Variable | Description |
---|---|
Account | The account the job ran under. |
AveCPU | Average (system + user) CPU time of all tasks in job. |
AveRSS | Average resident set size of all tasks in job. |
AveVMSize | Average Virtual Memory size of all tasks in job. |
CPUTime | Time used by the job or step (Elapsed time * CPU count). |
Elapsed | The job's elapsed time, formatted as DD-HH:MM:SS. |
ExitCode | The exit code returned by the job script or salloc. |
JobID | The id of the Job. |
JobName | The name of the Job. |
MaxRSS | Maximum resident set size of all tasks in job. |
MaxVMSize | Maximum Virtual Memory size of all tasks in job. |
MaxDiskRead | Maximum number of bytes read by all tasks in the job. |
MaxDiskWrite | Maximum number of bytes written by all tasks in the job. |
ReqCPUS | Requested number of CPUs. |
ReqMem | Requested amount of memory. |
ReqNodes | Requested number of nodes. |
NCPUS | The number of CPUs used in a job. |
NNodes | The number of nodes used in a job. |
User | The username of the person who ran the job. |
A full list of variables for the --format flag can be obtained with the --helpformat flag.
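For instance, to inspect the time and memory usage of a finished job, fields from the table above can be combined (the job ID is a placeholder):
$ sacct -j 10889 --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,ReqMem,State
$ sacct --helpformat    # prints all available fields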
Job dependencies
A job can be given the constraint that it only starts after another job has finished.
In the following example, we have two jobs, A and B. We want Job B to start after Job A has successfully completed.
First we start Job A by submitting it via sbatch:
sbatch jobA.sh
Making note of the job ID assigned to Job A, we then submit Job B with the added condition that it only starts after Job A has successfully completed:
sbatch --dependency=afterok:jobID_A jobB.sh
If we want Job B to start after several other jobs have completed, we can specify additional jobs, using a : as a delimiter:
sbatch --dependency=afterok:jobID_A:jobID_C:jobID_D jobB.sh
We can also tell Slurm to run Job B even if Job A fails, like so:
sbatch --dependency=afterany:jobID_A jobB.sh
Slurm supports a number of different dependency types, for example:
-d after:123456 # job can begin execution after the specified job has begun execution
-d afterany:123456 # job can begin execution after the specified job has finished
-d afternotok:123456 # job can begin execution after the specified job has failed (exit code not equal zero)
-d afterok:123456 # job can begin execution after the specified job has successfully finished (exit code zero)
-d singleton # job can begin execution after any previously submitted job with the same job name and user has finished
The -d option can be used as a shorthand for --dependency.
Job Array
Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial options (e.g. size, time limit, etc.), however it is possible to change some of these options after the job has begun execution using the scontrol command specifying the JobID of the array or individual ArrayJobID.
Full documentation: https://slurm.schedmd.com/job_array.html
Possible index specifications:
# Submit a job array with index values between 0 and 31
#SBATCH --array=0-31
# Submit a job array with index values of 1, 3, 5 and 7
#SBATCH --array=1,3,5,7
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
#SBATCH --array=1-7:2
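When submitting a job array, the output file name can include the array job ID and the task index via the %A and %a filename patterns, for example:
#SBATCH --array=0-9
#SBATCH --output=myjob_%A_%a.out   # %A = array master job ID, %a = array task index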
Example: processing many input files
Running a calculation on several input files can be done with a single job script:
#!/bin/bash -l
#SBATCH --array=0-9 # 10 jobs for 10 fastq files
INPUTS=(../fastqc/*.fq.gz)                      # bash array holding the input files, INPUTS[0] to INPUTS[9]
srun fastqc "${INPUTS[$SLURM_ARRAY_TASK_ID]}"   # $SLURM_ARRAY_TASK_ID runs from 0 to 9