Much scientific software, whether standalone codes or libraries, can be parallelized via MPI, OpenMP, or multiprocessing.

Avoid submitting inefficient Jobs!

If your code can only be partially parallelized (serial parts remain), familiarize yourself with Amdahl's law and make sure your Job efficiency stays well above 50%.
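Amdahl's law bounds the achievable speedup of a partially serial code: with serial fraction s, speedup(n) = 1 / (s + (1 − s)/n), and efficiency(n) = speedup(n) / n. The efficiency for a given CPU count can be estimated before submitting, as in this sketch (the serial fraction 0.1 is an assumed example value):

```shell
#!/usr/bin/env bash
# Predicted parallel efficiency from Amdahl's law:
#   speedup(n)    = 1 / (s + (1 - s)/n)   with serial fraction s
#   efficiency(n) = speedup(n) / n  =  1 / (s*n + 1 - s)
# Usage: amdahl_efficiency SERIAL_FRACTION NCPUS  -> efficiency in percent
amdahl_efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 / (s * n + 1 - s) }'
}

amdahl_efficiency 0.1 8    # prints 58.8 -- still above 50%
amdahl_efficiency 0.1 16   # prints 40.0 -- inefficient, request fewer CPUs
```

Note how even a 10% serial part pushes efficiency below 50% at 16 CPUs: more CPUs are not always better.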

Default Values

Slurm parameters like --ntasks and --cpus-per-task default to 1 if omitted.

However, when these Slurm parameters are omitted, their corresponding environment variables SLURM_NTASKS and SLURM_CPUS_PER_TASK will not be populated. For this reason most job templates contain export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}, which sets OMP_NUM_THREADS=1 if SLURM_CPUS_PER_TASK is not defined.
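The `${VAR:-default}` expansion used in that line is plain bash and can be verified without Slurm:

```shell
#!/usr/bin/env bash
# ${VAR:-default} expands to the default when VAR is unset (or empty),
# and to VAR's value otherwise.
unset SLURM_CPUS_PER_TASK
echo "${SLURM_CPUS_PER_TASK:-1}"    # prints 1

SLURM_CPUS_PER_TASK=8
echo "${SLURM_CPUS_PER_TASK:-1}"    # prints 8
```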

Pure MPI Jobs (n tasks)

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks
#SBATCH --ntasks=n
# Run all tasks on one node
#SBATCH --nodes=1

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

If --nodes=1 is omitted and all cluster nodes are almost full, Slurm might distribute the tasks over a variable number of nodes. Avoid this scenario by always setting a fixed number of nodes via --nodes=a or a range via --nodes=a-b with a ≤ b.

srun is the Slurm application launcher/job dispatcher for parallel MPI Jobs and (in this case) inherits all the settings from sbatch. This is the preferred way to start your MPI-parallelized application.

discouraged use of mpirun

The use of mpirun is heavily discouraged when queuing your Job via Slurm.

Ensure MPI capability of your application

If your application does not support MPI and you set --ntasks=n with n > 1, your application is simply started n times, all instances needlessly doing the same work.
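One quick (if rough) way to check whether a dynamically linked binary is MPI-capable is to look for an MPI library among its shared-library dependencies. A sketch, with "my_program" as a placeholder for your binary:

```shell
#!/usr/bin/env bash
# Rough MPI-capability check: a dynamically linked MPI application
# normally links against an MPI library (libmpi*), which ldd reveals.
# This misses statically linked binaries and interpreted codes.
is_mpi_program() {
    ldd "$1" 2>/dev/null | grep -qi 'libmpi'
}

if is_mpi_program ./my_program; then
    echo "MPI library found -- requesting --ntasks > 1 makes sense"
else
    echo "no MPI library found -- do not request --ntasks > 1"
fi
```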


Pure MPI Jobs (n×m tasks on m nodes)

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n tasks per node
#SBATCH --ntasks-per-node=n
# Run on m nodes
#SBATCH --nodes=m

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

ProTip: Try to keep the number of nodes as small as possible. If n×m ≤ 128, --nodes=1 is always the best choice. This is because the latency of intra-node MPI communication (shared memory) is about two orders of magnitude lower than that of inter-node MPI communication (network/InfiniBand).
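Following that rule of thumb, the minimum node count for a given task total can be computed with a one-liner (assuming 128 cores per node, as implied by the threshold above):

```shell
#!/usr/bin/env bash
# Smallest number of 128-core nodes that fits a given number of tasks
# (ceiling division). Usage: min_nodes NTASKS
min_nodes() {
    awk -v t="$1" -v c=128 'BEGIN { print int((t + c - 1) / c) }'
}

min_nodes 128   # prints 1
min_nodes 300   # prints 3
```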

discouraged use of mpirun

The use of mpirun is heavily discouraged when queuing your Job via Slurm.

Pure OpenMP Jobs (n CPUs)

Some software, like the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU-cores via libraries written with shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU-core.


Below is an appropriate Slurm script for a multithreaded job:

#!/usr/bin/env bash

#SBATCH --job-name=multithreading-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# set number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

Setting the environment variable OMP_NUM_THREADS is essential. If it is omitted, your application might assume that all cores of the node should be used, which causes additional overhead and high load on the node.

For a multithreaded, single-node job, make sure that --cpus-per-task is equal to or less than the number of CPU-cores on a node; otherwise Slurm will refuse to accept your job submission.

Important

Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of --cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead, doing so will waste resources and cause your next job submission to have a lower priority.

Pure Multiprocessing Jobs (n CPUs)

Sometimes software handles parallel computation by forking a certain number of worker processes (processes created by the main process) that will do the work. Unlike MPI, where Slurm is responsible for launching the tasks, these processes are controlled by your application. Examples are Python (multiprocessing, joblib), R (doParallel), Julia (Distributed) and many others. Such jobs are almost always restricted to a single node, because launching processes on other nodes would have to involve Slurm (at least for the main process).


Below is an appropriate Slurm script for a multiprocessing job:

#!/usr/bin/env bash

#SBATCH --job-name=multiprocessing-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# Limit number of OpenMP threads to 1 to avoid multiple levels of parallelism
export OMP_NUM_THREADS=1

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program


Setting the environment variable OMP_NUM_THREADS=1 is essential. If it is omitted and the program also supports multithreading, each process might assume that all cores of the node should be used, which causes additional overhead and high load on the node due to multiple levels of parallelism (up to n×128 busy threads on a 128-core node).

Since your application controls how many processes are created, make sure (in your program code or input file) that exactly n processes are created. A mismatch will cause either wasted resources or additional overhead due to over-subscription.
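One way to keep the job script and the application in agreement is to derive the process count from the allocation rather than hard-coding it. A sketch of the relevant job-script fragment ("--processes" is a hypothetical flag; use whatever option or input-file setting your application actually provides):

```shell
#!/usr/bin/env bash
# Derive the worker-process count from the Slurm allocation so that the
# application always matches the requested --cpus-per-task.
# "--processes" is a hypothetical application flag.
NPROCS="${SLURM_CPUS_PER_TASK:-1}"
echo "starting my_program with $NPROCS worker processes"
srun my_program --processes "$NPROCS"
```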

Hybrid MPI+OpenMP Jobs (n×m×p CPUs over m×p Tasks on p Nodes)

Many codes combine multithreading with multinode parallelism using a hybrid OpenMP/MPI approach. Below is a Slurm script appropriate for such a code:

#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Request m tasks per node
#SBATCH --ntasks-per-node=m
# Run on p nodes
#SBATCH --nodes=p

# Load application module here if necessary

# set number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# No need to pass number of tasks to srun
srun my_program

Make sure your code actually supports this combined MPI + OpenMP mode of operation.
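With three parameters in play, it is easy to over-subscribe a node by accident. A sanity check like the following sketch can be added to the job script (assuming 128 cores per node; Slurm populates SLURM_CPUS_PER_TASK and SLURM_NTASKS_PER_NODE when the corresponding options are set, and the fallbacks cover the omitted case):

```shell
#!/usr/bin/env bash
# CPUs requested per node = cpus-per-task (n) * ntasks-per-node (m).
cpus_per_node() {
    echo $(( ${SLURM_CPUS_PER_TASK:-1} * ${SLURM_NTASKS_PER_NODE:-1} ))
}

# Abort early if the request cannot fit on a 128-core node.
if [ "$(cpus_per_node)" -gt 128 ]; then
    echo "error: more than 128 CPUs requested per node" >&2
    exit 1
fi
```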

discouraged use of mpirun

The use of mpirun is heavily discouraged when queuing your Job via Slurm.