
Bases of Slurm

The MesoBFC cluster uses Slurm (Simple Linux Utility for Resource Management) for cluster/resource management and job scheduling.

Slurm is responsible for allocating resources to users, providing a framework for starting, executing, and monitoring work on the allocated resources, and scheduling work for future execution.

Slurm terminology

A Slurm job consists of one or more steps, each consisting of one or more tasks, each of which uses one or more CPUs.

  • Jobs are typically created with the sbatch command.
  • Steps are created with the srun command.
  • Tasks are requested at the job level with --ntasks (-n) or --ntasks-per-node.
  • CPUs are requested per task with --cpus-per-task (-c).
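
As a minimal sketch of how these levels fit together (the executable name ./myApp is a placeholder):

#!/bin/bash -l
#SBATCH --ntasks=4          ## job level: 4 tasks
#SBATCH --cpus-per-task=2   ## per task: 2 CPUs each

srun ./myApp                ## step 1: launches the 4 tasks of ./myApp
srun hostname               ## step 2: another step in the same allocation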

Slurm main commands

Command                       Description
sbatch job.slurm              Submit a batch job
srun <command>                Run a step or program inside a job
scontrol show job <jobid>     Show information about a running job
sacct -j <jobid> <options>    Show information about a completed job
seff <jobid>                  Display the efficiency (CPU, memory) of a completed job
scancel <jobid>               Cancel/kill a job
squeue                        Show running/pending jobs
sinfo                         Show cluster information
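
For example, a typical job life cycle with these commands (the job id 123456 is hypothetical):

$ sbatch job.slurm                               # submit the job
$ squeue -u $USER                                # list your running/pending jobs
$ scontrol show job 123456                       # details while the job runs
$ sacct -j 123456 --format=JobID,State,Elapsed   # accounting once it has completed
$ seff 123456                                    # CPU/memory efficiency report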

For more information about Slurm: https://slurm.schedmd.com/quickstart.html

Slurm configuration

In Slurm, multiple nodes can be grouped into partitions: sets of nodes aggregated by shared characteristics or objectives, with associated limits on wall-clock time, job size, etc.

Partition   Description                              #Nodes (cores)   #GPU (type)   TimeLimit (default)   Limits
mpi1        MPI-based applications (avx512)          36 (864)         -             TODO!                 TODO!
mpi2        MPI-based applications (icelake)         48 (2304)        -             TODO!                 TODO!
gpu         Deep and machine learning applications   17 (544)         2x17 (A100)   TODO!                 TODO!

MPI partitions

These partitions are used for parallel MPI applications.

  • The number of MPI tasks (-n option) must be divisible by 24 for mpi1 and by 48 for mpi2: for example 24, 48, … for mpi1 and 48, 96, … for mpi2.
  • All node memory is allocated, since whole nodes are requested.
  • Do not use the mpirun command to launch your application; use the srun command instead.
caution

For MPI applications, do not use the --nodes (-N) Slurm option. Use --ntasks (-n) instead to request tasks; Slurm will allocate the necessary nodes.

Here is an example of an MPI job, to be adapted to your needs:

Caution: be sure to add the "-l" flag to bash (it sets up the environment).

mpi.slurm
#!/bin/bash -l
# SBATCH submission file

#SBATCH --job-name="MPI_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out

# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00

#SBATCH --partition=mpi1

#SBATCH --ntasks=48 ## request 48 MPI tasks
#SBATCH --mem=0 ## request all nodes memory

# your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr ### PLEASE EDIT ME
#SBATCH --mail-type=END,FAIL # notify when the job ends/fails

module purge
module load openmpi-4.1.5/gcc-13.1.0

srun ./myMPI_Application listParams

Submit your job to Slurm:

$ sbatch mpi.slurm
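
sbatch replies with the id of the submitted job, which you can then follow with squeue (the id 123456 is hypothetical):

$ squeue -j 123456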

GPU partition

This partition provides 17 AMD (Zen3) nodes, each with the following configuration:

  • 252 GB of memory
  • 32 cores
  • 2x Nvidia A100

Slurm partition settings:

  • Limit of TODO! GPUs per user
  • Limit of TODO! GB of memory per job
  • TODO! GB of memory allowed per CPU by default (use, for example, --mem=72G to request more per job or --mem-per-cpu=32G to request more per CPU, as shown below)
  • Partition TimeLimit is TODO! days (TODO!-00:00:00)
  • Default job WallTime is TODO! hours (--time=TODO!:00:00) if none is given (use the --time option to override this value)
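
These defaults can be overridden per job, either in the script or on the command line; as an illustration (the values are placeholders, to be adapted to the actual limits above):

$ sbatch --time=12:00:00 --mem=72G gpu.slurm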

Interactive mode

Objective: request a GPU and open an interactive shell.

Example:

$ srun --partition=gpu --time=4:00:00 --gres=gpu:1 --pty bash

where

  • --partition: use the gpu partition
  • --time: request 4 hours (format HH:MM:SS)
  • --gres=gpu:1: request 1 GPU on the node
  • --pty bash: open a bash session
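
Once the shell opens, you can run commands directly on the allocated node; leaving the shell releases the resources:

$ nvidia-smi   # the allocated GPU should be listed
$ exit         # end the session and release the GPU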

Batch mode

Here is an example of a Slurm script to use:

gpu.slurm
#!/bin/bash -l
# SBATCH submission file

#SBATCH --job-name="GPU_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out

# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00

#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

## To request more memory, use the --mem option.
## Please do not use more than 128G.
#SBATCH --mem-per-cpu=32G

## your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr
#SBATCH --mail-type=END,FAIL

## view allocated GPU cards
nvidia-smi

module purge
module load miniconda3-22.11.1/gcc-13.1.0

conda activate your_env && python GPU_program.py

Submit the job to Slurm:

$ sbatch gpu.slurm

Check job status:

$ squeue
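
Once the job has completed, seff summarizes how efficiently the requested CPU and memory were used (the job id 123456 is hypothetical):

$ seff 123456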