Slurm basics
The MesoBFC cluster uses Slurm (Simple Linux Utility for Resource Management) for cluster/resource management and job scheduling.
Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on the allocated resources, and scheduling work for future execution.
A Slurm job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.
- Jobs are typically created with the `sbatch` command
- Steps are created with the `srun` command
- Tasks are requested at the job level with `--ntasks` (`-n`) or `--ntasks-per-node`
- CPUs are requested per task with `--cpus-per-task` or `-c`
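This job/step/task/CPU hierarchy can be sketched in a minimal batch script; the program name `./my_program` is a placeholder:

```shell
#!/bin/bash -l
#SBATCH --job-name="HIERARCHY_DEMO"
#SBATCH --ntasks=4           ## the job requests 4 tasks
#SBATCH --cpus-per-task=2    ## each task gets 2 CPUs (8 CPUs total)

# Each srun invocation below is one step of the job;
# every step launches the 4 requested tasks.
srun hostname                ## step 0: print the node each task runs on
srun ./my_program            ## step 1: run the actual program (placeholder)
```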
Slurm main commands
| Command | Description |
|---|---|
| `sbatch job.slurm` | Submit a batch job |
| `srun <command>` | Run a step or program inside a job |
| `scontrol show job <jobid>` | Show information about a running job |
| `sacct -j <jobid> <options>` | Show information about a completed job |
| `seff <jobid>` | Display the CPU and memory efficiency of a completed job |
| `scancel <jobid>` | Cancel/kill a job |
| `squeue` | Show running/pending jobs |
| `sinfo` | Show cluster information |
For more information about Slurm: https://slurm.schedmd.com/quickstart.html
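As a usage sketch (the job id 12345 is a placeholder), a completed job can be inspected by combining `sacct` and `seff`:

```shell
# Show elapsed time, memory high-water mark and exit state of job 12345
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State

# Summarize the CPU and memory efficiency of the same job
seff 12345
```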
Slurm configuration
In Slurm, nodes can be grouped into partitions: sets of nodes aggregated by shared characteristics or objectives, with associated limits on wall-clock time, job size, etc.
| Partition | Description | #Nodes (cores) | #GPU (type) | TimeLimit (default) | Limits |
|---|---|---|---|---|---|
| mpi1 | MPI based applications (avx512) | 36 (864) | - | TODO! | TODO! |
| mpi2 | MPI based applications (icelake) | 48 (2304) | - | TODO! | TODO! |
| gpu | Deep and machine learning applications | 17 (544) | 2x17 (A100) | TODO! | TODO! |
MPI partitions
These partitions are used for parallel MPI applications.
- The number of MPI tasks (`-n` option) must be divisible by 24 for `mpi1` and by 48 for `mpi2`: for example 24, 48… for `mpi1`; 48, 96… for `mpi2`.
- All node memory is allocated, since whole nodes are requested.
- Do not use the `mpirun` command to execute the application; use the `srun` command instead.
For MPI applications, do not use the `--nodes` (`-N`) Slurm option.
Use `--ntasks` (`-n`) instead to request tasks; Slurm allocates the nodes accordingly.
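The divisibility rule above can be checked before submitting; this is a small sketch with hypothetical helper variables, using 24 cores per node as on `mpi1`:

```shell
# Sanity-check: an mpi1 job must use whole 24-core nodes,
# so --ntasks must be a multiple of 24.
ntasks=48          # the value you plan to pass to --ntasks / -n
cores_per_node=24  # 24 for mpi1, 48 for mpi2

if [ $(( ntasks % cores_per_node )) -eq 0 ]; then
  echo "OK: ${ntasks} tasks fill $(( ntasks / cores_per_node )) whole node(s)"
else
  echo "Invalid: ${ntasks} is not a multiple of ${cores_per_node}"
fi
```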
Here is an example of an MPI job to adapt to your needs:
Caution: be sure to add the `-l` flag to bash (to set up the environment).
```bash
#!/bin/bash -l
# File: submission.SBATCH
#SBATCH --job-name="MPI_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00
#SBATCH --partition=mpi1
#SBATCH --ntasks=48 ## request 48 MPI tasks
#SBATCH --mem=0 ## request all node memory
# your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr ### PLEASE EDIT ME
#SBATCH --mail-type=END,FAIL # notify when the job ends/fails

module purge
module load openmpi-4.1.5/gcc-13.1.0

srun ./myMPI_Application listParams
```
Submit your job to Slurm:
```
$ sbatch mpi.slurm
```
GPU partition
This partition provides 17 AMD (Zen3) nodes, each with the following configuration:
- 252 GB of memory
- 32 cores
- 2x Nvidia A100
Slurm partition settings:
- Limit of TODO! GPUs per user
- Limit of TODO! GB of memory per job
- TODO! GB of memory allowed per CPU by default (use for example `--mem=72G` to request more, or `--mem-per-cpu=32G` to request more per CPU)
- Partition TimeLimit is TODO! days (`TODO!-00:00:00`)
- Default job WallTime is TODO! hours (`--time TODO!:00:00`) if none is given (use the `--time` option to override this value)
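Note that `sbatch` command-line options take precedence over the `#SBATCH` directives in the script, so defaults can be raised per submission without editing the script (the values here are illustrative):

```shell
# Request 2 days of walltime and 72 GB of memory for this run only
sbatch --time=2-00:00:00 --mem=72G gpu.slurm
```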
Interactive mode
Objective: request a GPU and open an interactive shell.
Example:
```
$ srun --partition=gpu --time=4:00:00 --gres=gpu:1 --pty bash
```
where
- `--partition=gpu` uses the gpu partition
- `--time` requests 4 hours (format HH:MM:SS)
- `--gres=gpu:1` requests 1 GPU on the node
- `--pty bash` opens a bash session
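Once the shell opens, a quick sanity check can confirm the allocation (assuming the session landed on a GPU node):

```shell
# List the GPU(s) visible on the node
nvidia-smi

# Slurm exposes the allocated device(s) to CUDA applications via this variable
echo $CUDA_VISIBLE_DEVICES
```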
Batch mode
Here is an example of a Slurm script to use:
```bash
#!/bin/bash -l
# File: submission.SBATCH
#SBATCH --job-name="GPU_JOB"
#SBATCH --output=%x.%J.out ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss), max is 8 days
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use the --mem option.
## Please don't use more than 128G.
#SBATCH --mem-per-cpu=32G
## your email address for notifications
#SBATCH --mail-user=votreadresseufc@univ-fcomte.fr
#SBATCH --mail-type=END,FAIL

## view the allocated GPU card(s)
nvidia-smi

module purge
module load miniconda3-22.11.1/gcc-13.1.0
conda activate your_env && python GPU_program.py
```
Submit job to Slurm:
```
$ sbatch gpu.slurm
```
Check job status:
```
$ squeue
```
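In practice it helps to narrow the listing; both options below are standard `squeue` flags (the job id 12345 is a placeholder):

```shell
# Show only your own jobs
squeue -u $USER

# Show a single job in long format, with its state and time used
squeue -j 12345 -l
```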