7. Running Applications as Batch Jobs on the LRZ AI Systems

Batch jobs are the preferred way of using the LRZ AI Systems. In a batch job, the allocation of resources and the job submission are done in a single step. If no resources are available, the job waits in the queue until the requested allocation is possible. Batch jobs are therefore non-interactive jobs.

Laying some foundation: parallel and non-parallel batch jobs with SLURM

The sbatch command submits jobs that are described in a file with a special format. This file is usually referred to as an "sbatch script" or just "batch script". Once the script is created, it is submitted as follows:

$ sbatch enroot_test.sbatch 
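If the submission succeeds, sbatch replies with the ID assigned to the queued job (e.g. "Submitted batch job 123456"; the number here is only illustrative). The state of the job (pending or running) can then be checked with the usual Slurm tools:

$ squeue -u $USER
$ scancel 123456    # cancels the job if it is no longer needed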

An example of a batch script is depicted next. 

#!/bin/bash
#SBATCH -p lrz-v100x2 
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

command1
srun command2    


The first part of the example batch script is the preamble (the lines starting with #! and #SBATCH). It describes the resources needed for executing the job (the allocation), using Slurm options comparable to the ones used for interactive jobs. Two additional arguments are needed in sbatch scripts: where the standard output and the error messages of the job should be redirected to. As the job is not interactive, there is no terminal/shell it could write to, so these options name the files in which output and errors are saved while the job executes (file_to_redirect_the_std_output.out and file_to_redirect_the_std_err.err, respectively, in this example).

After the preamble, the job to be executed is described. Our example batch script contains two commands. The first one is not preceded by srun (or mpiexec, or mpirun). This command creates a job step that runs only on the first node of the allocation. The second command is preceded by srun (or mpirun, or mpiexec). In this case, a parallel job step is launched, running (by default) on all nodes of the allocation. If the allocation contains only a single node, a parallel job that runs on a single node is created, in contrast to the case of command1. The latter initialises some MPI-related environment variables, which influence the computation: for example, every process created by the job on that node is assigned a local rank and a global rank.
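As a minimal illustration of these variables (not LRZ-specific; the output file names are just examples), the following script prints, for every task started by srun, the Slurm variables that frameworks typically map to the global and local rank:

#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o ranks.out
#SBATCH -e ranks.err

# Each task launched by srun gets its own values of these variables.
srun bash -c 'echo "host: $(hostname), global rank: $SLURM_PROCID, local rank: $SLURM_LOCALID"'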

Batch jobs and Enroot containers

Non-parallel jobs

The way to run containerised non-parallel jobs with SLURM is by starting the job within an already existing container. This implies performing two separate steps within the batch script: 

  1. creating the container out of the container image, and 
  2. starting the command within the created container.

The following script shows an example. 

#!/bin/bash
#SBATCH -p lrz-v100x2 
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

# the squashfs file passed to enroot create is assumed to have been obtained beforehand with enroot import
enroot create --name job-cont nvcr.io+nvidia+pytorch+22.12-py3.sqsh
enroot start job-cont command1
 


The line starting with enroot create creates a container named job-cont on the first node of the allocation (in this case also the only one) out of the image 'nvcr.io#nvidia/pytorch:22.12-py3', i.e. out of the squashfs file obtained by importing that image. 
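For this to work, the squashfs file must have been created beforehand, for instance on the login node or within an interactive job. A sketch of that step, assuming the default output file name chosen by enroot import:

$ enroot import docker://nvcr.io#nvidia/pytorch:22.12-py3

This produces nvcr.io+nvidia+pytorch+22.12-py3.sqsh in the current directory, which is the file referenced by enroot create above.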

The line enroot start job-cont command1 executes command1 on the first node of the allocation (in this case also the only one) within the job-cont container. 

As of Ubuntu 22.04, using the Enroot command line interface to start a job without previously creating the container is not possible.

Parallel jobs with different containers per job step

For containerised parallel jobs, even when allocating only a single node, we recommend relying on the capabilities of the Pyxis plugin (https://github.com/NVIDIA/pyxis). The following example executes command1 within a container created out of the image 'nvcr.io#nvidia/pytorch:22.12-py3' on each of the allocated nodes. After completion of command1, the script executes command2 within a container created out of the image 'nvcr.io#nvidia/tensorflow:22.12-py3' on each of the allocated nodes. After completion of command2, the script executes command3 followed by command4 in the same container, again created out of the image 'nvcr.io#nvidia/tensorflow:22.12-py3', on each of the allocated nodes. Notice that between the different calls to srun the containers are different (even when created from the same image).

#!/bin/bash
#SBATCH -p lrz-v100x2 
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

srun --container-image=nvcr.io#nvidia/pytorch:22.12-py3 command1 
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 command2 
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 bash -c "command3 ; command4"  
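
If, instead, the same container should be kept across several srun calls (so that, for example, files written by command3 remain visible to a later job step), Pyxis offers the --container-name option. A sketch (the name step-cont is arbitrary; only the first step needs --container-image):

srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 --container-name=step-cont command3
srun --container-name=step-cont command4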

Parallel jobs with the same container for all job steps

For reusing a single container for all your commands, you can use --container-image in the batch script preamble. Although srun does not explicitly precede the commands in this script, all of them are executed as part of a parallel job; note that calling srun explicitly within this batch script will fail (srun is not available within the scope of an already parallel job).

#!/bin/bash
#SBATCH -p lrz-v100x2 
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err
#SBATCH --container-image="docker://nvcr.io#nvidia/tensorflow:23.12-tf2-py3"

command1
command2
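
Depending on where your data is stored, host directories may additionally have to be made visible inside the container. Pyxis provides the --container-mounts option for this purpose; assuming your installation also accepts it as a preamble directive (as with --container-image above), it would look like:

#SBATCH --container-mounts=/path/on/host:/path/in/container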