Slurm basics
The MesoBFC cluster uses Slurm (Simple Linux Utility for Resource Management) for cluster/resource management and job scheduling.
Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on the allocated resources, and scheduling work for future execution.
A Slurm job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.
- Jobs are typically created with the `sbatch` command
- Steps are created with the `srun` command
- Tasks are requested at the job level with `--ntasks` (`-n`) or `--ntasks-per-node`
- CPUs are requested per task with `--cpus-per-task` or `-c`
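This job/step/task/CPU hierarchy can be sketched in a minimal batch script; the program name `./my_program` is a placeholder:

```shell
#!/bin/bash -l
#SBATCH --job-name="HIERARCHY_DEMO"
#SBATCH --ntasks=4           ## the job requests 4 tasks
#SBATCH --cpus-per-task=2    ## each task gets 2 CPUs (8 CPUs total)

# Each srun invocation below is one step of the job;
# every step launches the 4 requested tasks.
srun hostname                ## step 0: print the node each task runs on
srun ./my_program            ## step 1: run the actual program (placeholder)
```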
Slurm main commands
| Command | Description |
|---|---|
| `sbatch job.slurm` | Submit a batch job |
| `srun <command>` | Run a step or program inside a job |
| `scontrol show job <jobid>` | Show information about a running job |
| `sacct -j <jobid> <options>` | Show information about a completed job |
| `seff <jobid>` | Display the CPU and memory efficiency of a completed job |
| `scancel <jobid>` | Cancel/kill a job |
| `squeue` | Show running/pending jobs |
| `sinfo` | Show cluster information |
For more information about Slurm: https://slurm.schedmd.com/quickstart.html
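As a usage sketch (the job id 12345 is a placeholder), a completed job can be inspected by combining `sacct` and `seff`:

```shell
# Show elapsed time, memory high-water mark and exit state of job 12345
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State

# Summarize the CPU and memory efficiency of the same job
seff 12345
```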
Slurm configuration
In Slurm, nodes can be grouped into partitions: sets of nodes aggregated by shared characteristics or objectives, with associated limits on wall-clock time, job size, etc.
| Partition | Description | #Nodes (cores) | #GPU (type) | TimeLimit (default) | Limits |
|---|---|---|---|---|---|
| mpi1 | MPI based applications (avx512) | 36 (864) | - | TODO! | TODO! |
| mpi2 | MPI based applications (icelake) | 48 (2304) | - | TODO! | TODO! |
| gpu | Deep and machine learning applications | 17 (544) | 2x17 (A100) | TODO! | TODO! |
MPI partitions
These partitions are used for parallel MPI applications.
- The number of MPI tasks (`-n` option) must be divisible by 24 for `mpi1` and by 48 for `mpi2`: for example 24, 48… for `mpi1`; 48, 96… for `mpi2`.
- All node memory is allocated, since whole nodes are requested.
- Do not use the `mpirun` command to execute the application; use the `srun` command instead.
For MPI applications, do not use the `--nodes` (`-N`) Slurm option.
Use `--ntasks` (`-n`) instead to request tasks; Slurm allocates the nodes accordingly.
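The divisibility rule above can be checked before submitting; this is a small sketch with hypothetical helper variables, using 24 cores per node as on `mpi1`:

```shell
# Sanity-check: an mpi1 job must use whole 24-core nodes,
# so --ntasks must be a multiple of 24.
ntasks=48          # the value you plan to pass to --ntasks / -n
cores_per_node=24  # 24 for mpi1, 48 for mpi2

if [ $(( ntasks % cores_per_node )) -eq 0 ]; then
  echo "OK: ${ntasks} tasks fill $(( ntasks / cores_per_node )) whole node(s)"
else
  echo "Invalid: ${ntasks} is not a multiple of ${cores_per_node}"
fi
```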
Here is an example of an MPI job to adapt to your needs:
Caution: be sure to add the `-l` flag to bash (to set up the environment).
```bash
#!/bin/bash -l
# File: submission.SBATCH
#SBATCH --job-name="MPI_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00
#SBATCH --partition=mpi1
#SBATCH --ntasks=48 ## request 48 MPI tasks
#SBATCH --mem=0 ## request all node memory
# your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr ### PLEASE EDIT ME
#SBATCH --mail-type=END,FAIL # notify when the job ends/fails

module purge
module load openmpi-4.1.5/gcc-13.1.0

srun ./myMPI_Application listParams
```
Submit your job to Slurm:
```
$ sbatch mpi.slurm
```
GPU partition
This partition provides 17 AMD (Zen3) nodes, each with the following configuration:
- 252 GB of memory
- 32 cores
- 2x Nvidia A100
Slurm partition settings:
- Limit of TODO! GPUs per user
- Limit of TODO! GB of memory per job
- TODO! GB of memory allowed per CPU by default (use for example `--mem=72G` to request more, or `--mem-per-cpu=32G` to request more per CPU)
- Partition TimeLimit is TODO! days (`TODO!-00:00:00`)
- Default job WallTime is TODO! hours (`--time TODO!:00:00`) if none is given (use the `--time` option to override this value)
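Note that `sbatch` command-line options take precedence over the `#SBATCH` directives in the script, so defaults can be raised per submission without editing the script (the values here are illustrative):

```shell
# Request 2 days of walltime and 72 GB of memory for this run only
sbatch --time=2-00:00:00 --mem=72G gpu.slurm
```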
Interactive mode
Objective: request a GPU and open an interactive shell.
Example:
```
$ srun --partition=gpu --time=4:00:00 --gres=gpu:1 --pty bash
```
where
- `--partition=gpu` uses the gpu partition
- `--time` requests 4 hours (format HH:MM:SS)
- `--gres=gpu:1` requests 1 GPU on the node
- `--pty bash` opens a bash session
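Once the shell opens, a quick sanity check can confirm the allocation (assuming the session landed on a GPU node):

```shell
# List the GPU(s) visible on the node
nvidia-smi

# Slurm exposes the allocated device(s) to CUDA applications via this variable
echo $CUDA_VISIBLE_DEVICES
```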
Batch mode
Here is an example of a Slurm script to use:
```bash
#!/bin/bash -l
# File: submission.SBATCH
#SBATCH --job-name="GPU_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use the --mem option.
## Please don't use more than 128G.
#SBATCH --mem-per-cpu=32G
## your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr
#SBATCH --mail-type=END,FAIL

## view the allocated GPU card(s)
nvidia-smi

module purge
module load miniconda3-22.11.1/gcc-13.1.0
conda activate your_env && python GPU_program.py
```
Submit job to Slurm:
```
$ sbatch gpu.slurm
```
Check job status:
```
$ squeue
```
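In practice it helps to narrow the listing; both options below are standard `squeue` flags (the job id 12345 is a placeholder):

```shell
# Show only your own jobs
squeue -u $USER

# Show a single job in long format, with its state and time used
squeue -j 12345 -l
```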