Much scientific software, including many codes and libraries, can be parallelized via MPI or OpenMP/multiprocessing.

Avoid submitting inefficient Jobs!

If your code can be parallelized only partially (serial parts remain), familiarize yourself with Amdahl's law and make sure your job efficiency still stays well above 50%.
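As a worked example of Amdahl's law (the 90% parallel fraction and 16 cores below are illustrative numbers, not cluster limits), the speedup on n cores is S(n) = 1 / ((1 - p) + p/n) for parallel fraction p, and the efficiency is S(n)/n:

```shell
# Amdahl's law: speedup S(n) = 1 / ((1 - p) + p/n), efficiency = S(n)/n.
# Illustrative values: p = 0.9 (90% of the runtime parallelizes), n = 16 cores.
awk 'BEGIN {
  p = 0.9; n = 16
  s = 1 / ((1 - p) + p / n)                                 # speedup
  printf "speedup %.2f, efficiency %.0f%%\n", s, 100 * s / n
}'
# prints: speedup 6.40, efficiency 40%
```

Even a code that is 90% parallel drops to 40% efficiency on 16 cores, so requesting more CPUs than the parallel fraction can use quickly wastes resources.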

Default Values

Slurm parameters like --ntasks and --cpus-per-task default to 1 if omitted.

However, when these Slurm parameters are omitted, their corresponding environment variables SLURM_NTASKS and SLURM_CPUS_PER_TASK are not populated. For this reason you will find export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} in most job templates, which sets OMP_NUM_THREADS=1 if SLURM_CPUS_PER_TASK is not defined.
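The ${VAR:-default} shell expansion used in the templates behaves as follows:

```shell
# ${VAR:-default} expands to the default when VAR is unset or empty:
unset SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "$OMP_NUM_THREADS"    # prints 1

# ...and to the variable's value when it is set (as Slurm does for your job):
SLURM_CPUS_PER_TASK=8
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "$OMP_NUM_THREADS"    # prints 8
```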

Pure OpenMP Jobs (n CPUs)

Some software, like the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU-cores via libraries written with shared-memory parallel programming models such as OpenMP, pthreads, or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU-core.


Below is an appropriate Slurm script for a multithreaded job:

#!/usr/bin/env bash

#SBATCH --job-name=multithreading-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# set number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program

Setting the environment variable OMP_NUM_THREADS is essential. If it is omitted, your application might assume that all cores of the node should be used, causing additional overhead and high load on the node.

For a multithreaded, single-node job, make sure that --cpus-per-task is less than or equal to the number of CPU-cores on a node; otherwise, Slurm will refuse to accept your job submission.

Important

Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of --cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead, doing so will waste resources and cause your next job submission to have a lower priority.

Pure Multiprocessing Jobs (n CPUs)

Sometimes software handles parallel computation by forking a certain number of worker processes from the main process. In contrast to MPI, where Slurm is responsible for launching processes, these processes are controlled entirely by your application. Examples are Python (multiprocessing, joblib), R (doParallel), Julia (Distributed), and many others. Such jobs are almost always restricted to a single node, because launching processes on other nodes would require involving Slurm (at least for the main process).
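The fork/join pattern such applications use internally can be sketched in shell (my_work is a placeholder for one unit of work; real applications implement this in their own language):

```shell
# Application-managed parallelism: the main process forks workers and waits.
NPROCS=${SLURM_CPUS_PER_TASK:-1}   # derive the worker count from Slurm
my_work() { echo "worker $1 done"; }

for i in $(seq 1 "$NPROCS"); do
  my_work "$i" &    # fork one worker process per allocated CPU
done
wait                # the main process joins all workers
```

Slurm only sees the single main process it launched; the workers are its children and must all fit on the same node.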


Below is an appropriate Slurm script for a multiprocessing job:

#!/usr/bin/env bash

#SBATCH --job-name=multiprocessing-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n

# Limit number of OpenMP threads to 1 to avoid multiple levels of parallelism
export OMP_NUM_THREADS=1

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program


Setting the environment variable OMP_NUM_THREADS=1 is essential. If it is omitted and the program also supports multithreading, each process might assume that all cores of the node should be used, causing additional overhead and high load due to multiple levels of parallelism (n × 128 CPU-cores).

Since your application controls how many processes are created, make sure (in your program code or input file) that exactly n processes are created. A mismatch will either waste resources or add overhead due to over-subscription.
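One way to avoid such a mismatch is to derive the process count from Slurm's allocation instead of hard-coding it in the job script (a sketch; my_program and its --processes flag are hypothetical placeholders for your application's own way of setting the worker count):

```shell
# Example: Slurm sets SLURM_CPUS_PER_TASK to the allocated CPU count (4 here).
SLURM_CPUS_PER_TASK=4
NPROCS=${SLURM_CPUS_PER_TASK:-1}
echo "launching $NPROCS worker processes"
# srun my_program --processes "$NPROCS"
```

This keeps the number of workers in sync with --cpus-per-task even when you change the request later.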