7. Running Applications as Batch Jobs on the LRZ AI Systems
Batch jobs are the preferred way of using the LRZ AI Systems. In a batch job, the allocation of resources and the job submission are done in a single step. If no resources are available, the job waits in the queue until the requested allocation becomes possible. Batch jobs are therefore non-interactive jobs.
Laying some foundation: parallel and non-parallel batch jobs with SLURM
The sbatch command submits jobs described in a file with a special format. This file is usually referred to as an "sbatch script", or just "batch script". Once the script is created, it is submitted as:
$ sbatch enroot_test.sbatch
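As a side note (not part of the example itself), a submitted job can be monitored and, if needed, cancelled with the standard SLURM commands; the job ID is the one reported by sbatch:
$ squeue -u $USER
$ scancel <job_id>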
An example of a batch script is depicted next.
#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

command1
srun command2
The first part of the example batch script is the preamble (the lines starting with #! and #SBATCH). There, the resources needed for executing the job (the allocation) are described, using Slurm options comparable to the ones used for interactive jobs. Two additional arguments are required in sbatch scripts: where the standard output and the error messages of the job are to be redirected. As the job is not interactive, there is no terminal/shell for it to write to, so these files indicate where the output and errors are saved when the job executes (file_to_redirect_the_std_output.out and file_to_redirect_the_std_err.err, respectively, in this example).
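For reference, the short preamble options above have equivalent long forms (this is plain SLURM syntax, not specific to the AI Systems); the following preamble requests the same allocation and redirections:

#!/bin/bash
#SBATCH --partition=lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH --output=file_to_redirect_the_std_output.out
#SBATCH --error=file_to_redirect_the_std_err.err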
After the preamble, the job to be executed is described. Our example batch script contains two commands. The first one is not preceded by srun (or mpiexec, or mpirun). This command creates a job that runs on the first node of the allocation. The second command is preceded by srun (or mpirun, or mpiexec). In this case, the command launches a parallel job, running (by default) in parallel on all nodes of the allocation. If the allocation contains only a single node, a parallel job that runs on a single node is created, contrary to the case of command1. The latter initialises some MPI-related environment variables, which influence the computation. For example, every process created by the job on that node is assigned a local_rank and a global_rank.
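As a minimal illustration of this difference (a sketch, not part of the original example; it assumes the partition allows a multi-node allocation requested via --nodes=2), the following script prints the host name once from the first node, and then once per task on every allocated node; srun also exports variables such as SLURM_PROCID, which corresponds to the global rank of each task:

#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH --nodes=2
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

# Runs only once, on the first node of the allocation.
hostname

# Runs in parallel, once per task on every allocated node;
# each task sees its own SLURM_PROCID (its global rank).
srun bash -c 'echo "rank ${SLURM_PROCID} on $(hostname)"'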
Batch jobs and Enroot containers
Non-parallel jobs
The way to run containerised non-parallel jobs with SLURM is by starting the job within an already existing container. This implies performing two separate steps within the batch script:
- creating the container out of the container image, and
- starting the command within the created container
The following script shows an example.
#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

enroot create --name job-cont nvcr.io+nvidia+pytorch+22.12-py3.sqsh
enroot start job-cont command1
The line starting with enroot create creates a container named job-cont on the first node of the allocation (in this case also the only one) out of the image 'nvcr.io#nvidia/pytorch:22.12-py3', referenced here through its previously imported squashfs file. The line enroot start job-cont command1 executes command1 on the first node of the allocation (again the only one here) within the job-cont container.
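As a side remark, the squashfs file referenced by enroot create is assumed to have been imported beforehand (for example on a login node or within an interactive job), e.g. with:

$ enroot import docker://nvcr.io#nvidia/pytorch:22.12-py3

which should produce a file named nvcr.io+nvidia+pytorch+22.12-py3.sqsh in the current working directory.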
Parallel jobs with different containers per job step
For containerised parallel jobs, even when allocating only one node, we recommend relying on the capabilities of the Pyxis plugin (https://github.com/NVIDIA/pyxis). The following example executes command1 within a container created out of the image 'nvcr.io#nvidia/pytorch:22.12-py3' on each of the allocated nodes. After completion of command1, the script executes command2 within a container created out of the image 'nvcr.io#nvidia/tensorflow:22.12-py3' on each of the allocated nodes. After completion of command2, the script executes command3 followed by command4 in the same container, created out of the image 'nvcr.io#nvidia/tensorflow:22.12-py3', on each of the allocated nodes. Notice that the containers of the different calls to srun are different (even when created from the same image).
#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err

srun --container-image=nvcr.io#nvidia/pytorch:22.12-py3 command1
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 command2
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 bash -c "command3 ; command4"
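To make the placeholders more tangible, a possible concrete job step could look as follows (a sketch only: the mounted directory, target path and training script are hypothetical; --container-mounts is the Pyxis option for making host directories visible inside the container):

srun --container-image=nvcr.io#nvidia/pytorch:22.12-py3 \
     --container-mounts=$HOME/my_project:/workspace/my_project \
     python /workspace/my_project/train.py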
Parallel jobs with the same container for all job steps
For reusing a single container for all your commands, you can benefit from using --container-image in the batch script preamble. Although srun does not explicitly precede the commands in this script, all of them are executed as a parallel job; notice that using srun explicitly within this batch script will fail (as srun is not available within the scope of an already parallel job).
#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err
#SBATCH --container-image="docker://nvcr.io#nvidia/tensorflow:23.12-tf2-py3"

command1
command2
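A concrete sketch of this variant (the two commands are chosen purely for illustration): the first command lists the GPUs visible inside the container, the second prints the TensorFlow version shipped with the image; both run inside the same container instance.

#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o file_to_redirect_the_std_output.out
#SBATCH -e file_to_redirect_the_std_err.err
#SBATCH --container-image="docker://nvcr.io#nvidia/tensorflow:23.12-tf2-py3"

nvidia-smi
python -c "import tensorflow as tf; print(tf.__version__)"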