Requesting a single GPU
GPUs may be requested via Slurm through the epyc-gpu partition together with additional parameters that request GPU resources.
- GPU resources can only be requested in combination with CPU resources.
- Most of the time, not all parts of an application can run on the GPU; significant parts have to run on the CPU as well. The CPU part is then usually OpenMP-parallelized, but this is not strictly required.
- Slurm allows up to X CPUs to be requested per GPU.
- There are 3 GPUs available per node.
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0
# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Request GPU resources (model:number)
#SBATCH --gpus=a100:1

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application modules here if necessary

# No need to pass number of tasks to srun
srun my_program
For applications requiring both intensive GPU and CPU usage, it is recommended not to request more than 32 CPUs/OpenMP threads per job (a minimal sketch of the relevant directives follows the list below). There are two reasons:
- The GPU nodes (partition epyc-gpu) have the same CPU configuration (2 sockets, 64 CPUs per socket) as the pure CPU nodes (partition epyc). Requesting more than 32 CPUs significantly increases the probability (especially when the nodes are already partially occupied) that not all of them can be placed on the same CPU socket. Inter-CPU (different socket) communication has an almost 3-fold latency compared to intra-CPU (same socket) communication. For some applications using OpenMP threading we have observed a slowdown of about 2x when using 64 CPUs (distributed across two sockets) vs. 32 CPUs (on one socket).
- With 3 GPUs and 2 CPU sockets per node, one GPU becomes unusable when, for example, two jobs with 64 CPUs per GPU are requested.
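In practice, the relevant directives look like the following minimal sketch (the GPU model a100 matches the example above; the CPU count is an assumption to adapt to your application):

# At most 32 CPUs per GPU, so that all of them can sit on one socket
#SBATCH --partition=epyc-gpu
#SBATCH --cpus-per-task=32
#SBATCH --gpus=a100:1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK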
Make sure that your program can actually make use of a GPU. Otherwise the GPU stays idle and GPU resources are blocked for others.
Requesting multiple GPUs
Multiple GPUs spanning multiple nodes should not be requested at this time if intensive direct GPU-GPU communication is needed: the inter-node communication bandwidth is low and the latency is high compared to GPU-GPU communication within a node.
If you require intensive direct GPU-GPU communication, the nodes of the epyc-gpu-sxm partition are recommended, because their GPUs are connected to one another via NVLink.
When running multi-GPU jobs, it is mandatory to run small benchmarks in order to validate scaling efficiency!
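A minimal sketch of a single-node multi-GPU request suitable for such a scaling test, assuming the GPUs on epyc-gpu-sxm are also requested via the a100 model string (adjust model, counts and runtime to what the partition actually provides, e.g. as reported by sinfo -o '%P %G'):

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu-sxm
# Keep all GPUs on one node so they communicate via NVLink
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-2
# Repeat the benchmark with 1, 2 and 4 GPUs and compare runtimes
#SBATCH --gpus=a100:2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun my_program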
Usage of --gpus=n with n > 1
When using --gpus=2 it can happen that the two GPUs are on different nodes, even when --ntasks=1. This cannot be handled by applications that do not support MPI (and even then, it would be a very odd setup).
Another example: running a job with --gpus=a100:5 and --ntasks=5 is not very explicit; Slurm might schedule the tasks and GPUs in something other than a 1:1 relationship, for example 2+3 GPUs on 3+2 CPU tasks, when two nodes with 3 GPUs each are involved.
It is better to use one of the following options (minimal sketches follow the list):
- --ntasks=1 (implicit or explicit) and all requested GPUs fit on one node; then either
  - set --nnodes=1, or
  - set --gpus-per-task=n (an explicit --ntasks=1 is needed as well).
- --ntasks=n (n > 1, MPI mandatory!); then either
  - set --gpus-per-task=m (our nodes support 1 ≤ m ≤ 4 at the moment). Each group of m GPU(s) will be bound to its task; use --gpu-bind=none to lift the binding if needed, or
  - set --gpus-per-node=m and --ntasks-per-node=k, and replace --ntasks=n with --nnodes=n, to request n × m GPUs and n × k tasks over n nodes (the restriction m ≤ 4 applies on partition epyc-gpu-sxm and m ≤ 3 on epyc-gpu). Note that the tasks will "see" all m GPUs on their respective node.
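The two patterns can be combined into complete job scripts roughly as follows. These are minimal sketches only; the GPU model (a100), CPU counts and runtimes are assumptions to adapt to your application.

Variant 1: one task, all GPUs guaranteed to be on the same node:

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
# Both GPUs land on the single requested node
#SBATCH --gpus=a100:2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun my_program

Variant 2: MPI job with one GPU bound to each task:

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
# One GPU per MPI rank; each rank only sees its bound GPU
#SBATCH --gpus-per-task=a100:1

# srun starts one MPI rank per task
srun my_mpi_program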
Enforce resource binding for better performance
#SBATCH --gres-flags=enforce-binding
If a job specifies --gres-flags=enforce-binding, then only the CPU cores identified for a generic resource (GRES) can be allocated together with that resource. This tends to improve job performance, but may delay the allocation of resources. If a job is not submitted with --gres-flags=enforce-binding, the identified cores are merely preferred when scheduling each generic resource.
If --gres-flags=disable-binding is specified, then any core can be used with the resources, which also increases the speed of Slurm's scheduling algorithm but can degrade application performance. This option is currently required to use more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs of one socket, but CPUs on more than one socket are required to run the job).
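A sketch of how the directive is combined with a GPU request in a job script (the surrounding values mirror the single-GPU example above and are placeholders):

#SBATCH --partition=epyc-gpu
#SBATCH --cpus-per-task=16
#SBATCH --gpus=a100:1
# Only allocate CPU cores that are bound to the requested GPU
#SBATCH --gres-flags=enforce-binding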
CUDA Multi-Process Service for maximum efficiency
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically but not necessarily MPI jobs. MPS can increase job efficiency and throughput manyfold when the GPU compute capacity is not fully saturated by a single application process (see also when to use MPS).
Real example: Python script working with CuPy and small matrices (~100x100)
| Tasks per GPU | Total Time per Job (without MPS) | Efficiency (without MPS) | Total Time per Job (with MPS) | Efficiency (with MPS) |
|---|---|---|---|---|
| 1 | 13h09m | 100% (reference) | 13h09m | 100% (reference) |
| 4 | 1d2h20m | 200% | 13h55m | 378% |
| 5 | 1d10h38m | 192% | 13h58m | 471% |
| 10 | 2d23h05m | 185% | 21h24m | 615% |
A module cuda-mps is provided which starts the MPS server when the module is loaded and stops it when the module is unloaded. This also works with multi-node MPI jobs; in this case one MPS daemon is started per node.
...

# Start the MPS daemon
module load cuda-mps

# Do the work...

# Stop the MPS daemon
module unload cuda-mps
Stopping the MPS daemon with module unload cuda-mps is mandatory; otherwise the job will run indefinitely until a timeout occurs.
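Putting the pieces together, a minimal sketch of a job that shares a single GPU among several tasks via MPS (task count, GPU model and program name are placeholders; the cuda-mps module behaves as described above):

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
# Several processes share the single GPU through MPS
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
#SBATCH --gpus=a100:1

# Start the MPS daemon
module load cuda-mps

# All tasks submit their CUDA work through the MPS server
srun python my_cupy_script.py

# Stop the MPS daemon (mandatory, see above)
module unload cuda-mps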