Much scientific software, as well as many codes and libraries, can be parallelized via MPI or OpenMP/Multiprocessing.

Avoid submitting inefficient Jobs!

If your code can only be parallelized partially (serial parts remaining), familiarize yourself with Amdahl's law and make sure your Job efficiency is still well above 50%.
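As an illustration (the parallel fraction and core count below are assumed values, not measurements from any particular code): if a fraction p of the runtime is parallelizable and the Job runs on N cores, Amdahl's law limits the speedup and the resulting efficiency to

S(N) = \frac{1}{(1 - p) + p/N}, \qquad E(N) = \frac{S(N)}{N}

For p = 0.95 and N = 32 this gives S ≈ 12.5 and E ≈ 39%, i.e. below the 50% target, so requesting fewer cores would be the more efficient choice.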

Default Values

Slurm parameters like --ntasks and --cpus-per-task default to 1 if omitted.
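A minimal sketch to check which values your Job actually received (job name and time limit are placeholders; Slurm may leave these variables unset when the corresponding option is omitted, hence the fallbacks to the default of 1):

#!/usr/bin/env bash

#SBATCH --job-name=defaults-check
#SBATCH --partition=epyc
#SBATCH --time=0-0:05

# These variables may be unset if --ntasks / --cpus-per-task were omitted;
# in that case the allocation falls back to the default of 1.
echo "Tasks:         ${SLURM_NTASKS:-1}"
echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-1}"
echo "CPUs on node:  ${SLURM_CPUS_ON_NODE}"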

Pure MPI Jobs (n tasks)

#!/usr/bin/env bash
 
#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks in total
#SBATCH --ntasks=n
# If possible, run all tasks on one node
#SBATCH --nodes=1

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

If --nodes=1 is omitted and all cluster nodes are almost full, Slurm might distribute the tasks across a varying number of nodes. Avoid this scenario by always requesting the smallest number of nodes that fits your Job via --nodes.

srun is the Slurm application launcher/job dispatcher for parallel MPI Jobs and (in this case) inherits all settings from sbatch. It is the preferred way to start your MPI-parallelized application.
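A quick way to verify how many tasks srun actually launches is to run a trivial command as the job step (a sketch; any command works the same way):

srun hostname

This prints one hostname per task, so with --ntasks=4 (an example value) you should see four lines.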

discouraged use of mpirun

The use of mpirun is strongly discouraged when submitting your Job via Slurm; use srun instead.

Pure MPI Jobs (n×m tasks on m nodes)

#!/usr/bin/env bash
 
#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks per node
#SBATCH --ntasks-per-node=n
# Run on m nodes
#SBATCH --nodes=m

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

ProTip: Try to keep the number of nodes as small as possible. If n×m ≤ 128, --nodes=1 is always the best choice, because the latency of intra-node MPI communication (shared memory) is about two orders of magnitude lower than that of inter-node MPI communication (network/InfiniBand).
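For example, assuming 128-core nodes as implied above: a Job with 256 MPI ranks cannot fit on a single node, but it should still be packed onto the minimum of two nodes rather than spread across more:

#SBATCH --ntasks-per-node=128
#SBATCH --nodes=2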

discouraged use of mpirun

The use of mpirun is strongly discouraged when submitting your Job via Slurm; use srun instead.

Pure OpenMP Jobs (n CPUs)

Some software, such as the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU-cores via libraries written using shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU-core.


Below is an appropriate Slurm script for a multithreaded job:

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

Setting the environment variable OMP_NUM_THREADS is essential. If it is omitted, your application might assume it should use all cores of the node, which causes additional overhead and a high load on the node.
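If your application uses a threaded BLAS backend (e.g. the NumPy or MATLAB case mentioned above) instead of, or in addition to, OpenMP, the thread count is often controlled by separate variables. Which ones apply depends on how your software was built, so treat the following as examples to verify for your application:

# common thread-count variables of BLAS backends (check which one your build honours)
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK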

For a multithreaded, single-node job, make sure that the product of ntasks and cpus-per-task is equal to or less than the number of CPU-cores on a node. Use the "snodes" command and look at the "CPUS" column to see the number of CPU-cores per node.

Important

Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead, doing so will waste resources and cause your next job submission to have a lower priority.

Hybrid MPI+OpenMP Jobs (n×m×p CPUs over m×p Tasks on p Nodes)

Many codes combine multithreading with multinode parallelism using a hybrid OpenMP/MPI approach. Below is a Slurm script appropriate for such a code:

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs per task
#SBATCH --cpus-per-task=n
# Request m tasks per node
#SBATCH --ntasks-per-node=m
# Run on p nodes
#SBATCH --nodes=p

# Load application module here if necessary

# set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# No need to pass number of tasks to srun
srun my_program

Make sure your code actually supports this mode of operation, i.e. combined MPI + OpenMP parallelism.
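As a concrete example (the values are chosen for illustration and assume 128-core nodes):

#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=2

This starts 16×2 = 32 MPI tasks, each running 8 OpenMP threads, i.e. 8×16×2 = 256 CPUs in total, filling both nodes completely.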

discouraged use of mpirun

The use of mpirun is strongly discouraged when submitting your Job via Slurm; use srun instead.

Environment variables for different MPI flavors

Intel-MPI (impi)
export I_MPI_PMI_LIBRARY=/hpc/gpfs2/sw/pmi2/current/lib/libpmi2.so
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export SLURM_MPI_TYPE=pmi2

# or more simply:

module load impi-envvars

Open MPI (ompi)
export SLURM_MPI_TYPE=pmix_v4 # or pmix_v3 or pmix_v2 depending on what your self-compiled OpenMPI version supports

For modules provided by the HPC-Team, these variables are most likely already set in the corresponding module definition.
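To check which PMI plugin types your Slurm installation actually supports (the exact output differs between clusters), you can run

srun --mpi=list

and pick a matching value for SLURM_MPI_TYPE, or pass it directly via srun --mpi=<type>.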