6. Running Applications as Interactive Jobs on the LRZ AI Systems
Interactive jobs can be executed in an existing allocation of resources. Use the sinfo command for an overview of the available resources (partitions and current node states). Resources can be allocated with the salloc command.
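The -p (or --partition) option of sinfo restricts the overview to a single partition, which is handy for checking node availability before requesting an allocation. For example (using the partition from the example below):
$ sinfo -p lrz-v100x2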
For example, if you want an allocation within the lrz-v100x2 partition, you would need to type the following command:
$ salloc -p lrz-v100x2 --gres=gpu:1
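salloc also accepts further standard SLURM options if the defaults do not fit, e.g., a wall-clock limit or more than one GPU. A sketch (the values are illustrative; check the limits of the partition before requesting resources):
$ salloc -p lrz-v100x2 --gres=gpu:2 --time=02:00:00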
The --gres=gpu:1 argument above indicates that a single GPU is needed in that allocation. It is a required argument for all the partitions described in 1. General Description and Resources except the lrz-cpu one. To use lrz-cpu resources interactively, you must instead provide the --qos=cpu (or -q cpu) option.
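A CPU-only interactive allocation might then look as follows (a sketch; the CPU and memory requests are illustrative, not prescribed values):
$ salloc -p lrz-cpu -q cpu --cpus-per-task=4 --mem=16G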
Interactive jobs are submitted to an existing allocation of resources using the srun command. The following example executes the command bash on the allocated node.
$ srun --pty bash
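From within that interactive shell you can check that the allocation is as expected, e.g., with nvidia-smi on a GPU partition (assuming a GPU was requested as shown above):
$ nvidia-smi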
Additionally, the command can be executed within an Enroot container. The SLURM installation on the LRZ AI Systems allows this via a plugin called pyxis (see https://github.com/NVIDIA/pyxis for documentation and the extra options it adds to srun). The recommended approach is to find a container image (from Docker Hub, NGC, or locally stored) that provides all the required libraries. If such an image does not exist, you can create your own by extending an existing one as described in our guide 9. Creating and Reusing a Custom Enroot Container Image. Once you have the image location (a URI on a container registry or the path to a locally stored image), provide it to srun via the --container-image argument. SLURM takes care of transparently creating the container from that image and executing the command of your choice within it. The following example runs bash, but this time within a container created from an image that provides pytorch and comes from the nvcr.io docker registry.
$ srun --pty --container-mounts=./data-test:/mnt/data-test \
       --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
       bash
The --container-mounts option in the previous example indicates how to mount a folder from outside the container into the container. In the example, we are mounting a folder called data-test in the current directory to the folder /mnt/data-test within the container.
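Several mounts can be given as a comma-separated list of SRC:DST pairs (a sketch; the second folder name is illustrative):
$ srun --pty --container-mounts=./data-test:/mnt/data-test,./code:/mnt/code \
       --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
       bash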
Additionally, there is the argument --container-name that allows tagging and reusing a container during the same job allocation (i.e., in the scope of a single salloc). The --container-name option is not intended to take effect across job allocations (see https://github.com/NVIDIA/pyxis/issues/30#issuecomment-717654607 for details).
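For example, within a single allocation the container can be created once under a name and then reused by later job steps (a sketch; the name pytorch-ctr is arbitrary, and true is just a no-op command used to create the container):
$ srun --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
       --container-name=pytorch-ctr true
$ srun --pty --container-name=pytorch-ctr bash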