8. Multi-GPU Jobs on the LRZ AI Systems
Some jobs require more than a single GPU to perform their computations. A typical example is parallel machine learning training. The --gres argument can be used to request several GPUs within an allocation.
Interactive Jobs
For example, if you need all 8 GPUs of a node in the lrz-dgx-1-v100x8 partition, use the following command:
$ salloc -p lrz-dgx-1-v100x8 --gres=gpu:8
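Once the allocation is granted, you can check that all requested GPUs are indeed visible. A quick sketch (assuming the nvidia-smi tool is available on the allocated node) is:

$ srun nvidia-smi -L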
Additionally, some frameworks, especially those relying on MPI for parallelisation, require starting as many processes per node as GPUs are to be used. An example of such a framework is Horovod. In this case, in addition to indicating the number of GPUs via the --gres argument, the --ntasks-per-node argument must be used to specify the number of processes to be started per node in the allocation:
$ salloc -p lrz-dgx-1-v100x8 --ntasks-per-node=8 --gres=gpu:8
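To see how these processes are laid out, a minimal check (a sketch using only standard Slurm environment variables, not part of the original example) is to let each task print its rank:

$ srun bash -c 'echo "task ${SLURM_PROCID} of ${SLURM_NTASKS} on $(hostname)"'

With the allocation above, this should print 8 lines, one per started process.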
Batch Jobs
The situation is similar for batch jobs. The --gres argument needs to be added to the script preamble, preceded by the #SBATCH label. The --ntasks-per-node argument can then be used within the srun command as indicated above. An example is as follows:
#!/bin/bash
#SBATCH -p lrz-dgx-1-v100x8
#SBATCH --gres=gpu:8
#SBATCH -o enroot_test.out
#SBATCH -e enroot_test.err

srun --mpi=pmi2 --ntasks-per-node=8 --container-mounts=./data-test:/mnt/data-test \
     --container-image='horovod/horovod+0.16.4-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5' \
     python script.py --epochs 55 --batch-size 512
Notice that in this case the --mpi=pmi2 argument is added to the srun command to enable the use of MPI by that job. If your job does not use MPI, this argument is not needed.
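Assuming the batch script above is saved as multi_gpu_job.sbatch (the file name is only illustrative), it is submitted in the usual way:

$ sbatch multi_gpu_job.sbatch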