Many scientific codes and libraries can be parallelized via MPI or OpenMP/multiprocessing.

Avoid submitting inefficient Jobs!

If your code can be parallelized only partially (serial parts remaining), familiarize yourself with Amdahl's law and make sure your Job efficiency is still well above 50%.
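As a rough illustration of Amdahl's law, the following sketch computes the speedup S(N) = 1 / ((1 - p) + p / N) and the efficiency S(N) / N for a parallel fraction p on N cores. The function name amdahl and the value p = 0.95 are assumptions chosen for this example; measure the parallel fraction of your own code.

```shell
# amdahl P N: print speedup and efficiency for a code whose parallel
# fraction is P (between 0 and 1) when it runs on N cores.
# Hypothetical helper for illustration, not part of Slurm.
amdahl() {
    awk -v p="$1" -v n="$2" 'BEGIN {
        speedup = 1 / ((1 - p) + p / n)
        printf "speedup=%.2f efficiency=%.0f%%\n", speedup, 100 * speedup / n
    }'
}

amdahl 0.95 16   # prints speedup=9.14 efficiency=57%
amdahl 0.95 32   # prints speedup=12.55 efficiency=39%
```

In this example a code that is 95% parallel stays above 50% efficiency on 16 cores but drops below it on 32 cores, so requesting more cores would mostly waste resources.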

Default Values

Slurm parameters like --ntasks and --cpus-per-task default to 1 if omitted.

Pure MPI Jobs (n tasks)

#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks
#SBATCH --ntasks=n
# If possible, run all tasks on one node
#SBATCH --nodes=1

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

If --nodes=1 is omitted and all cluster nodes are almost full, Slurm might distribute a variable number of tasks on a variable number of nodes. Try to avoid this scenario by always setting a minimal number of nodes via --nodes.

srun is the Slurm application launcher/job dispatcher for parallel MPI Jobs and (in this case) inherits all the settings from sbatch. This is the preferred way to start your MPI-parallelized application.

Discouraged use of mpirun

The use of mpirun is heavily discouraged when queuing your Job via Slurm.

Pure MPI Jobs (n×m tasks on m nodes)

#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks per node
#SBATCH --ntasks-per-node=n
# Run on m nodes
#SBATCH --nodes=m

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

ProTip: Try to keep the number of nodes as small as possible. If n×m ≤ 128, --nodes=1 is always the best choice, because the latency of intra-node MPI communication (shared memory) is about two orders of magnitude lower than that of inter-node MPI communication (network/InfiniBand).
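The smallest node count for a given number of tasks can be estimated with an integer ceiling division. This is a minimal sketch that assumes one CPU-core per MPI task and 128 CPU-cores per node (the figure implied by the tip above); min_nodes is a hypothetical helper name.

```shell
# min_nodes TASKS CORES_PER_NODE: smallest number of nodes that can hold
# TASKS single-core MPI tasks (ceiling of TASKS / CORES_PER_NODE).
# Hypothetical helper for illustration, not part of Slurm.
min_nodes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

min_nodes 128 128   # prints 1
min_nodes 200 128   # prints 2
```

The result is a candidate value for --nodes; check the actual core count of the partition before relying on it.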

Pure OpenMP Jobs (n CPUs)

Some software, such as the linear algebra routines in NumPy and MATLAB, can use multiple CPU-cores via libraries written with shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node with each thread using one CPU-core.

Below is an appropriate Slurm script for a multithreaded job:

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

Setting the environment variable OMP_NUM_THREADS is essential. If it is omitted, your application might assume that all cores of a node should be used, which causes additional overhead and high load on the node.
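One way to set the variable defensively is to derive it from SLURM_CPUS_PER_TASK and fall back to 1 when that variable is not set (Slurm only sets it if --cpus-per-task was given). The sketch below shows the idea; omp_threads is a hypothetical helper name, and the fallback of 1 mirrors the Slurm default described above.

```shell
# omp_threads: number of OpenMP threads to use, taken from
# SLURM_CPUS_PER_TASK, or 1 if that variable is unset or empty.
# Hypothetical helper for illustration, not part of Slurm.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

OMP_NUM_THREADS="$(omp_threads)"
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Printing the value into the job log makes it easy to verify afterwards that the thread count matched the requested CPUs.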

For a multithreaded, single-node job, make sure that the product of ntasks and cpus-per-task is less than or equal to the number of CPU-cores on a node. Use the "snodes" command and look at the "CPUS" column to see the number of CPU-cores per node.
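The rule above can be checked with a few lines of shell arithmetic before submitting. In this sketch, fits_on_node is a hypothetical helper, and the 128 CPU-cores per node are an assumed example value; take the real number from the "CPUS" column.

```shell
# fits_on_node NTASKS CPUS_PER_TASK CORES_PER_NODE
# Succeeds (exit status 0) if NTASKS * CPUS_PER_TASK <= CORES_PER_NODE.
# Hypothetical helper for illustration, not part of Slurm.
fits_on_node() {
    [ $(( $1 * $2 )) -le "$3" ]
}

if fits_on_node 4 16 128; then echo "4 x 16 fits on one node"; fi
if ! fits_on_node 4 64 128; then echo "4 x 64 does not fit on one node"; fi
```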


Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead, doing so will waste resources and cause your next job submission to have a lower priority.

Hybrid MPI+OpenMP Jobs (n×m×p CPUs over m×p Tasks on p Nodes)

Many codes combine multithreading with multinode parallelism using a hybrid OpenMP/MPI approach. Below is a Slurm script appropriate for such a code:

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs per task
#SBATCH --cpus-per-task=n
# Request m tasks per node
#SBATCH --ntasks-per-node=m
# Run on p nodes
#SBATCH --nodes=p

# Load application module here if necessary

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# No need to pass number of tasks to srun
srun my_program

Make sure your code actually supports this mode of operation with combined MPI + OpenMP parallelism.
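To see how the three settings of the script above combine, the following sketch computes the total number of MPI tasks and CPUs from --cpus-per-task, --ntasks-per-node and --nodes. The helper name hybrid_layout and the example values are assumptions for illustration.

```shell
# hybrid_layout CPUS_PER_TASK TASKS_PER_NODE NODES
# Prints the total number of MPI tasks and the total number of CPUs.
# Hypothetical helper for illustration, not part of Slurm.
hybrid_layout() {
    echo "tasks=$(( $2 * $3 )) cpus=$(( $1 * $2 * $3 ))"
}

hybrid_layout 8 16 2   # prints tasks=32 cpus=256
```

With 8 threads per task and 16 tasks per node, each node runs 128 threads, so this example layout only makes sense on nodes with at least 128 CPU-cores.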

Environment variables for different MPI flavors

Intel-MPI (impi)
export I_MPI_PMI_LIBRARY=/hpc/gpfs2/sw/pmi2/current/lib/libpmi2.so
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export SLURM_MPI_TYPE=pmi2

# or more simply:

module load impi-envvars

OpenMPI (ompi)
export SLURM_MPI_TYPE=pmix_v4 # or pmix_v3 or pmix_v2 depending on what your self-compiled OpenMPI version supports

For modules provided by the HPC-Team these variables are most likely already set in the corresponding module definition.
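To check in a job script which of these variables are actually exported after loading a module, a small loop like the following can help. It is only a sketch: show_mpi_env is a hypothetical helper, and the list contains just the variable names mentioned in this section.

```shell
# show_mpi_env: print the exported value of each MPI-related variable
# named in this section, or "<unset>" if it is not exported.
# Hypothetical helper for illustration, not part of Slurm or any MPI.
show_mpi_env() {
    for name in SLURM_MPI_TYPE I_MPI_PMI_LIBRARY I_MPI_FABRICS FI_PROVIDER; do
        value=$(printenv "$name") || value="<unset>"
        echo "${name}=${value}"
    done
}

show_mpi_env
```

Calling it right before srun writes the effective settings into the job log, which helps when debugging MPI start-up problems.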