5.3 Slurm Batch Jobs - Multi-GPU
Some jobs, such as parallel machine learning training, require the use of multiple GPUs. To request several GPUs within an allocation, use the --gres argument (e.g., --gres=gpu:4).
Many frameworks, especially those relying on MPI for parallelization (such as Horovod), require one process per GPU on each node. To meet this requirement, specify both the number of GPUs with --gres and the number of processes per node with --ntasks-per-node, ensuring a one-to-one mapping between processes and GPUs. Other launchers, such as torchrun, require only a single process per node regardless of the number of GPUs; a complete example is given in the batch-job section below.
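To make the one-to-one mapping concrete, here is a minimal preamble sketch for an MPI-style framework such as Horovod. The script name train_horovod.py is a hypothetical placeholder, and the sketch assumes the framework is available in the job environment:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # one process per GPU
#SBATCH --gres=gpu:4          # four GPUs on the node
#SBATCH --time=00:15:00

# srun starts four processes on the node, one per GPU
srun python train_horovod.py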
This becomes particularly important when the job spans multiple nodes. In such cases, --ntasks-per-node ensures that each node runs the correct number of processes. Without this setting, the process-to-GPU mapping may be inconsistent across nodes, leading to resource underutilization or misconfiguration.
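As a sketch of the multi-node case (same assumptions as above), requesting two nodes with four GPUs and four tasks each yields eight processes in total, with the same per-node mapping on every node; note that --gres counts GPUs per node, not per job:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # four processes per node, eight in total
#SBATCH --gres=gpu:4          # four GPUs on each node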
Examples
Interactive Jobs
For example, to request 4 GPUs on a node of the lrz-dgx-1-v100x8 partition, use the following command:
salloc -p lrz-dgx-1-v100x8 --ntasks-per-node=4 --gres=gpu:4
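Once the allocation is granted, commands launched with srun run inside it. As a quick sanity check (overriding the task count so the command runs only once), the allocated GPUs can be listed:

srun --ntasks=1 nvidia-smi -L   # should list the four allocated GPUs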
Using Batch Jobs and Torchrun
The --gres argument should be included in the script preamble using the #SBATCH directive to request the desired number of GPUs.
The srun command launches the container inside which torchrun is executed. Setting --ntasks-per-node=1 (equivalently, --ntasks=1 for a single-node job, as in the script below) is necessary in order to launch only one container and one torchrun process per node. The torchrun command then handles the parallelization of the PyTorch script, spawning one worker per GPU.
Here’s an example SLURM script to launch a distributed job across 2 GPUs on a single node of the lrz-dgx-1-v100x8 partition.
#!/bin/bash
#SBATCH --job-name=multi-gpu-single-node
#SBATCH --output=log-%j.out
#SBATCH --error=log-%j.err
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --partition=lrz-dgx-1-v100x8
#SBATCH --time=00:15:00
# One container per node; torchrun launches one worker per GPU on the node
# Replace the placeholders with your own values: your_container_image.sqsh, your_project_folder, your_code.py
srun --ntasks=1 \
     --container-image="$HOME/your_container_image.sqsh" \
     --container-mounts="$HOME/your_project_folder:/workspace" \
     torchrun --standalone --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
     /workspace/your_code.py
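The same pattern extends across nodes: request --ntasks-per-node=1 so that each node gets exactly one container and one torchrun process, and replace --standalone with a shared rendezvous endpoint. The following is a minimal sketch under the same placeholder assumptions; port 29500 is an arbitrary choice and must simply be free on the first node:

#!/bin/bash
#SBATCH --job-name=multi-gpu-multi-node
#SBATCH --output=log-%j.out
#SBATCH --error=log-%j.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1          # one container and one torchrun per node
#SBATCH --gres=gpu:8                 # all 8 GPUs on each node
#SBATCH --partition=lrz-dgx-1-v100x8
#SBATCH --time=00:15:00

# The first node in the allocation acts as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun --container-image="$HOME/your_container_image.sqsh" \
     --container-mounts="$HOME/your_project_folder:/workspace" \
     torchrun --nnodes="$SLURM_NNODES" \
              --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
              --rdzv_backend=c10d \
              --rdzv_endpoint="$MASTER_ADDR:29500" \
              --rdzv_id="$SLURM_JOB_ID" \
              /workspace/your_code.py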