Requesting a single GPU
GPUs may be requested via Slurm through the epyc-gpu partition together with additional parameters that request GPU resources.
- GPU resources can only be requested in combination with CPU resources.
- Most of the time, not all parts of an application can run on the GPU; significant parts have to run on the CPU as well. The CPU part is then usually OpenMP-parallelized, but this is not strictly required.
- Slurm allows up to X CPUs to be requested per GPU.
- There are 3 GPUs available per node.
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=epyc-gpu
#SBATCH --mail-type=END,INVALID_DEPEND
#SBATCH --mail-user=noreply@uni-a.de
#SBATCH --time=1-0
# Request memory per CPU
#SBATCH --mem-per-cpu=1G
# Request n CPUs for your task.
#SBATCH --cpus-per-task=n
# Request GPU resources (model:number)
#SBATCH --gpus=a100:1

# Set number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Load application modules here if necessary

# No need to pass number of tasks to srun
srun my_program
For applications requiring both intensive GPU and CPU usage, it is recommended not to request more than 32 CPUs/OpenMP threads per job (a minimal sketch of the relevant directives follows the list below). There are two reasons:
- The GPU nodes (partition epyc-gpu) have the same CPU configuration (2 sockets, 64 CPUs per socket) as the pure CPU nodes (partition epyc). Requesting more than 32 CPUs significantly increases the probability (especially when the nodes are already partially occupied) that not all of them can be placed on the same CPU socket. Inter-CPU (different socket) communication has an almost 3-fold latency compared to intra-CPU (same socket) communication. For some applications using OpenMP threading we have observed a slowdown of about 2x when using 64 CPUs (distributed across two sockets) vs. 32 CPUs (on one socket).
- With 3 GPUs and 2 CPU sockets per node, one GPU becomes unusable when, for example, two jobs with 64 CPUs per GPU are requested.
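In practice, the relevant directives look like the following minimal sketch (the GPU model a100 matches the example above; the CPU count is an assumption to adapt to your application):

# At most 32 CPUs per GPU, so that all of them can sit on one socket
#SBATCH --partition=epyc-gpu
#SBATCH --cpus-per-task=32
#SBATCH --gpus=a100:1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK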
Make sure that your program can actually make use of a GPU. Otherwise the GPU stays idle and GPU resources are blocked for others.
Requesting multiple GPUs
Multiple GPUs spanning multiple nodes should not be requested at this time if intensive direct GPU-GPU communication is needed: the inter-node communication bandwidth is low and the latency is high compared to GPU-GPU communication within a node.
If you require intensive direct GPU-GPU communication, the nodes of the epyc-gpu-sxm partition are recommended, because their GPUs are connected to one another via NVLink.
When running multi-GPU jobs, it is mandatory to run small benchmarks in order to validate scaling efficiency!
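A minimal sketch of a single-node multi-GPU request suitable for such a scaling test, assuming the GPUs on epyc-gpu-sxm are also requested via the a100 model string (adjust model, counts and runtime to what the partition actually provides, e.g. as reported by sinfo -o '%P %G'):

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu-sxm
# Keep all GPUs on one node so they communicate via NVLink
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-2
# Repeat the benchmark with 1, 2 and 4 GPUs and compare runtimes
#SBATCH --gpus=a100:2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun my_program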
Usage of --gpus=n with n > 1
When using --gpus=2 it can happen that the two GPUs are on different nodes, even when --ntasks=1. This cannot be handled by applications that do not support MPI (and even then, it would be a very odd setup).
Another example: running a job with --gpus=a100:5 and --ntasks=5 is not very explicit; Slurm might schedule the tasks and GPUs in something other than a 1:1 relationship, for example 2+3 GPUs on 3+2 CPU tasks, when two nodes with 3 GPUs each are involved.
It is better to use one of the following options (minimal sketches follow the list):
- --ntasks=1 (implicit or explicit) and all requested GPUs fit on one node; then either
  - set --nnodes=1, or
  - set --gpus-per-task=n (an explicit --ntasks=1 is needed as well).
- --ntasks=n (n > 1, MPI mandatory!); then either
  - set --gpus-per-task=m (our nodes support 1 ≤ m ≤ 4 at the moment). Each group of m GPU(s) will be bound to its task; use --gpu-bind=none to lift the binding if needed, or
  - set --gpus-per-node=m and --ntasks-per-node=k, and replace --ntasks=n with --nnodes=n, to request n × m GPUs and n × k tasks over n nodes (the restriction m ≤ 4 applies on partition epyc-gpu-sxm and m ≤ 3 on epyc-gpu). Note that the tasks will "see" all m GPUs on their respective node.
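The two patterns can be combined into complete job scripts roughly as follows. These are minimal sketches only; the GPU model (a100), CPU counts and runtimes are assumptions to adapt to your application.

Variant 1: one task, all GPUs guaranteed to be on the same node:

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
# Both GPUs land on the single requested node
#SBATCH --gpus=a100:2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun my_program

Variant 2: MPI job with one GPU bound to each task:

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
# One GPU per MPI rank; each rank only sees its bound GPU
#SBATCH --gpus-per-task=a100:1

# srun starts one MPI rank per task
srun my_mpi_program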
Enforce resource binding for better performance
#SBATCH --gres-flags=enforce-binding
If a job specifies --gres-flags=enforce-binding, then only the CPU cores identified for a generic resource (GRES) can be allocated together with that resource. This tends to improve job performance, but may delay the allocation of resources. If a job is not submitted with --gres-flags=enforce-binding, the identified cores are merely preferred when scheduling each generic resource.
If --gres-flags=disable-binding is specified, then any core can be used with the resources, which also increases the speed of Slurm's scheduling algorithm but can degrade application performance. This option is currently required to use more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs of one socket, but CPUs on more than one socket are required to run the job).
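A sketch of how the directive is combined with a GPU request in a job script (the surrounding values mirror the single-GPU example above and are placeholders):

#SBATCH --partition=epyc-gpu
#SBATCH --cpus-per-task=16
#SBATCH --gpus=a100:1
# Only allocate CPU cores that are bound to the requested GPU
#SBATCH --gres-flags=enforce-binding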
CUDA Multi-Process Service for maximum efficiency
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically but not necessarily MPI jobs. MPS can increase job efficiency and throughput manyfold when the GPU compute capacity is not fully saturated by a single application process (see also when to use MPS).
Real example: Python script working with CuPy and small matrices (~100x100)
| Tasks per GPU | Total Time per Job (without MPS) | Efficiency (without MPS) | Total Time per Job (with MPS) | Efficiency (with MPS) |
|---|---|---|---|---|
| 1 | 13h09m | 100% (reference) | 13h09m | 100% (reference) |
| 4 | 1d2h20m | 200% | 13h55m | 378% |
| 5 | 1d10h38m | 192% | 13h58m | 471% |
| 10 | 2d23h05m | 185% | 21h24m | 615% |
A module cuda-mps is provided which starts the MPS server when the module is loaded and stops it when the module is unloaded. This also works with multi-node MPI jobs; in this case one MPS daemon is started per node.
...

# Start the MPS daemon
module load cuda-mps

# Do the work...

# Stop the MPS daemon
module unload cuda-mps
Stopping the MPS daemon with module unload cuda-mps is mandatory; otherwise the job will run indefinitely until a timeout occurs.
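Putting the pieces together, a minimal sketch of a job that shares a single GPU among several tasks via MPS (task count, GPU model and program name are placeholders; the cuda-mps module behaves as described above):

#!/usr/bin/env bash
#SBATCH --partition=epyc-gpu
# Several processes share the single GPU through MPS
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-0
#SBATCH --gpus=a100:1

# Start the MPS daemon
module load cuda-mps

# All tasks submit their CUDA work through the MPS server
srun python my_cupy_script.py

# Stop the MPS daemon (mandatory, see above)
module unload cuda-mps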