A lot of scientific software, codes, and libraries can be parallelized via MPI or OpenMP/multiprocessing. Avoid submitting inefficient jobs! If your code can be parallelized only partially (serial parts remain), familiarize yourself with Amdahl's law and make sure your job efficiency is still well above 50%.

### Default Values

Slurm parameters like `--ntasks` and `--cpus-per-task` default to 1 if omitted. However, when omitting these Slurm parameters, their corresponding environment variables `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` will not be populated. For this reason you will find `export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}` in most job templates, which sets `OMP_NUM_THREADS=1` if `SLURM_CPUS_PER_TASK` is not defined.
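Amdahl's law lets you estimate job efficiency before requesting CPUs. The following sketch (plain Python, function names are illustrative) computes the theoretical speedup and parallel efficiency for a given serial fraction of the runtime:

```python
# Estimate parallel speedup and efficiency with Amdahl's law.
# serial_fraction: share of the runtime that cannot be parallelized.

def amdahl_speedup(serial_fraction: float, n_cpus: int) -> float:
    """Theoretical speedup S(N) = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

def parallel_efficiency(serial_fraction: float, n_cpus: int) -> float:
    """Efficiency = speedup divided by the number of CPUs used."""
    return amdahl_speedup(serial_fraction, n_cpus) / n_cpus

# Example: a code whose runtime is 10% serial.
for n in (4, 8, 16):
    print(f"{n:2d} CPUs: efficiency {parallel_efficiency(0.1, n):.0%}")
```

For a 10% serial fraction, efficiency is about 77% on 4 CPUs, 59% on 8 CPUs, but only 40% on 16 CPUs, so requesting 16 CPUs would already violate the 50% target above.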
Some software, like the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU cores via libraries written with shared-memory parallel programming models such as OpenMP, pthreads, or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU core.

For a multithreaded, single-node job, make sure that `cpus-per-task` is equal to or less than the number of CPU cores on a node, otherwise Slurm will refuse to accept your job submission.

**Important:** Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU cores. Using a value of `cpus-per-task` greater than 1 for a code that has not been parallelized will not improve its performance. Instead, doing so will waste resources and cause your next job submission to have a lower priority.

Below is an appropriate Slurm script for a multithreaded job:

### Pure OpenMP Jobs (n CPUs)

```bash
#!/usr/bin/env bash
#SBATCH --job-name=multithreading-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0
# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Set the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Load application module here if necessary
# No need to pass the number of tasks to srun
srun my_program
```

Setting the environment variable `OMP_NUM_THREADS` is essential. If omitted, your application might assume that all cores of the node should be used, which causes additional overhead and high load on the node.

### Pure Multiprocessing Jobs (n CPUs)

Some software handles parallel computation by forking, i.e. the main process creates a certain number of worker processes that do the work. Contrary to MPI, where Slurm is responsible for launching the processes, these processes are controlled by your application. Examples are Python (multiprocessing, joblib), R (doParallel), Julia (Distributed), and many others. Such jobs are almost always restricted to a single node, because launching processes on other nodes would have to involve Slurm (at least for the main process). Since your application controls how many processes are created, make sure (in your program code or input file) that only n processes will be created. Below is an appropriate Slurm script for a multiprocessing job:
```bash
#!/usr/bin/env bash
#SBATCH --job-name=multiprocessing-example
#SBATCH --partition=epyc
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0
# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Limit the number of OpenMP threads to 1 to avoid multiple levels of parallelism
export OMP_NUM_THREADS=1
# Load application module here if necessary
# No need to pass the number of tasks to srun
srun my_program
```
Setting `OMP_NUM_THREADS=1` is essential. If omitted, and the program also supports multithreading, each process might assume that all cores of the node should be used. This causes additional overhead and high load on the node due to multiple levels of parallelism (n × 128 CPU cores). Also make sure that exactly n processes will be created: a mismatch will create either a waste of resources or additional overhead due to over-subscription.
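For a Python application, one way to guarantee that the number of worker processes matches the Slurm request is to read `SLURM_CPUS_PER_TASK` inside the program itself. A minimal sketch (the `work` function is purely illustrative):

```python
import os
from multiprocessing import Pool

# Fall back to 1 if SLURM_CPUS_PER_TASK is not set, mirroring the
# ${SLURM_CPUS_PER_TASK:-1} idiom used in the job templates above.
n_procs = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

def work(x: int) -> int:
    # Placeholder for the real per-item computation.
    return x * x

if __name__ == "__main__":
    # Create exactly as many worker processes as CPUs requested from Slurm,
    # so the process count never mismatches --cpus-per-task.
    with Pool(processes=n_procs) as pool:
        results = pool.map(work, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Reading the CPU count from the environment keeps the job script and the application in sync: changing `--cpus-per-task` then requires no change to the program code.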