Requesting a single GPU

GPUs are requested through Slurm by selecting the epyc-gpu partition and adding parameters that request GPU resources.

  • GPU resources can only be requested in combination with CPU resources.
  • In most cases not every part of an application can run on the GPU, and significant parts have to run on the CPU as well. The CPU part is then usually parallelized with OpenMP, but this is not strictly required.
  • Slurm allows up to X CPUs to be requested per GPU.
  • There are 3 GPUs available per node.


Requesting single-GPU resources
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Request GPU resources (model:number)
#SBATCH --gpus=a100:1

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program


For applications requiring both intensive GPU and CPU usage, it is recommended not to request more than 32 CPUs/OpenMP threads per job (see the sketch after this list), for two reasons:

  1. The GPU nodes (partition epyc-gpu) have the same CPU configuration (2 sockets, 64 CPUs per socket) as the pure CPU nodes (partition epyc). Requesting more than 32 CPUs significantly increases the probability (especially when the nodes are already partially occupied) that not all of them can be placed on the same CPU socket. Communication between CPUs on different sockets has an almost 3-fold latency compared to communication within the same socket. For some applications using OpenMP threading we have observed a slowdown of about 2x when using 64 CPUs (distributed across two sockets) compared to 32 CPUs (all on one socket).
  2. With 3 GPUs but only 2 CPU sockets per node, one GPU becomes unusable when, for example, two jobs each requesting 64 CPUs per GPU occupy the node.
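
As a sketch of this recommendation (only the 32-CPU limit is taken from the advice above; the GPU request and the program call follow the single-GPU template, and the --cpu-bind=verbose flag is optional), the relevant part of such a job script could look like this:

# Request at most 32 CPUs so that they can be placed on a single socket
#SBATCH --cpus-per-task=32
#SBATCH --gpus=a100:1

# Optionally let srun report the actual CPU binding in the job output
srun --cpu-bind=verbose my_program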


Make sure that your program can actually make use of a GPU. Otherwise the GPU stays idle and GPU resources are blocked for other users.
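
One simple way to check this is to log the GPU utilization with nvidia-smi in the background while your program runs and to inspect the log afterwards (a sketch; the log file name gpu_util.log and the 60-second interval are arbitrary choices):

# Log GPU utilization and memory usage every 60 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_util.log &
NVSMI_PID=$!

srun my_program

# Stop the monitoring once the program has finished
kill $NVSMI_PID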

Requesting multiple GPUs

At this time, multi-GPU jobs spanning multiple nodes should not be requested if intensive direct GPU-GPU communication is needed: the inter-node communication bandwidth is low and the latency is high compared to GPU-GPU communication within a single node.

If you require intensive direct GPU-GPU communication, the nodes of the epyc-gpu-sxm partition are recommended, because their GPUs are connected to one another via NVLink.

When running multi-GPU jobs, it is mandatory to run small benchmarks in order to validate the scaling efficiency! A sketch is given below.
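
A minimal way to do this is to submit the same benchmark with an increasing number of GPUs and compare the runtimes reported in the job output (a sketch: scaling.sh stands for your own batch script, the GPU counts are examples, and 4 GPUs on a single node are only available on the epyc-gpu-sxm partition):

#!/usr/bin/env bash

# Submit the same MPI benchmark with 1, 2 and 4 tasks (one GPU each);
# options given on the command line override the #SBATCH directives in scaling.sh
for n in 1 2 4; do
    sbatch --job-name="scaling-${n}gpu" --ntasks="${n}" --gpus-per-task=a100:1 scaling.sh
done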

Usage of --gpus=n with n > 1

When using --gpus=2 it can happen that the two GPUs end up on different nodes, even with --ntasks=1. This cannot be handled by applications that do not support MPI (and even with MPI, it would be a very odd setup).

Another example: requesting --gpus=a100:5 together with --ntasks=5 is not very explicit either. Slurm might schedule the tasks and GPUs in other than a 1:1 relationship, for example 2+3 GPUs for 3+2 tasks when two nodes with 3 GPUs each are involved.
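
To check how Slurm actually distributed the tasks and GPUs in such a job, a small diagnostic step can be run before the real program (a sketch; it assumes that Slurm exports CUDA_VISIBLE_DEVICES for the allocated GPUs, which it typically does):

# Print, for each task, the node it runs on and the GPUs it can see
srun bash -c 'echo "task $SLURM_PROCID on $(hostname): GPUs $CUDA_VISIBLE_DEVICES"'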

It is better to use one of the following options:

  1. --ntasks=1 (implicit or explicit) and all requested GPUs fit on one node; either
    • set --nodes=1, or
    • set --gpus-per-task=n (explicit --ntasks=1 needed as well)
  2. --ntasks=n ( > 1, MPI mandatory!)
    • set --gpus-per-task=m (our nodes support 1 ≤ m ≤ 4 at the moment). Each group of m GPU(s) will be bound to its task. Use --gpu-bind=none to lift the binding if needed.
    • set --gpus-per-node=m and --ntasks-per-node=k, and replace --ntasks=n with --nodes=n to request n × m GPUs over n nodes (the restriction m ≤ 4 applies on partition epyc-gpu-sxm and m ≤ 3 on epyc-gpu). Note that the tasks will "see" all m GPUs on their respective node; a sketch is given after the MPI example below.
Requesting multi-GPU resources (2 GPUs) without MPI
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Explicitly request one task
#SBATCH --ntasks=1
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Request GPU resources per task (model:number)
#SBATCH --gpus-per-task=a100:2

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program
Requesting multi-GPU resources (2 GPUs) with MPI
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Explicitly request the number of tasks
#SBATCH --ntasks=2
# Request n CPUs per task.
#SBATCH --cpus-per-task=n
# Request GPU resources per task (model:number)
#SBATCH --gpus-per-task=a100:1

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program
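
For completeness, the --gpus-per-node variant from option 2 above is sketched below. The concrete numbers (2 nodes with 3 tasks and 3 GPUs each on epyc-gpu) are placeholders, and the inter-node communication caveat from above still applies; without per-task binding, every task sees all GPUs of its node.
Requesting multi-GPU resources with --gpus-per-node (sketch, MPI mandatory)
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=<e-mail address>
#SBATCH --time=1-0

# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request 2 nodes with 3 tasks each (6 MPI tasks in total)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
# Request n CPUs per task.
#SBATCH --cpus-per-task=n
# Request 3 GPUs on each node (model:number); every task sees all 3 GPUs of its node
#SBATCH --gpus-per-node=a100:3

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application module here if necessary

# No need to pass number of tasks to srun
srun my_program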

For better performance


Requesting GPU resources with enforced CPU binding
#SBATCH --gres-flags=enforce-binding 

If a job specifies --gres-flags=enforce-binding, only the CPU cores identified for a GPU in the cluster's GRES configuration can be allocated together with that GPU. This tends to improve job performance, but may delay the allocation of resources. If a job is not submitted with --gres-flags=enforce-binding, the identified cores are merely preferred when scheduling each generic resource.


If --gres-flags=disable-binding is specified, any core can be used with the resources, which also speeds up Slurm's scheduling algorithm but can degrade application performance. The --gres-flags=disable-binding option is currently required to use more CPUs than are bound to a GRES (i.e. if a GPU is bound to the CPUs of one socket, but CPUs from more than one socket are required to run the job). On the configuration side, if any core can be used effectively with a resource, no cores should be specified for it in the GRES configuration, which improves the speed of the Slurm scheduling logic; a restart of slurmctld is needed for changes to the Cores option to take effect.
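
As a sketch of the case described above, where more CPUs are needed than are bound to the GPU (the single a100 GPU and the 64-CPU count are only an example), the relevant directives of such a job would be:
Using --gres-flags=disable-binding (sketch)
# Allow CPUs from both sockets to be used together with one GPU
#SBATCH --gres-flags=disable-binding
#SBATCH --gpus=a100:1
#SBATCH --cpus-per-task=64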