5. Slurm

Overview

Slurm is the open-source resource manager and job scheduler used on the AI Systems. It allocates computing resources, launches and monitors jobs, and manages job queues, with optional plugins for advanced scheduling and accounting.

The following pages expand on the two types of jobs you can submit to Slurm.

Interactive jobs are best suited for developing, testing, and setting up a suitable container environment for your workloads. To avoid extensive waiting in the queue, it is recommended to allocate only a minimal number of GPUs for a short time.
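As a sketch of such a minimal interactive request (the GPU count and time limit here are illustrative values, not site requirements beyond what the Key Points below state):

```shell
# Request an interactive allocation with 1 GPU per node for 30 minutes.
# --gres is mandatory on the AI Systems (see Key Points).
salloc --gres=gpu:1 --time=00:30:00

# Inside the allocation, start an interactive shell on the compute node.
srun --pty bash
```

Keeping the GPU count and time limit small helps the scheduler start the session sooner.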

Sbatch jobs are best suited for potentially long-running and heavy workloads. It is recommended to test your work thoroughly with an interactive job before submitting an sbatch job.
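A minimal batch script might look like the following sketch; the job name, output path, and the `python train.py` command are placeholders, not site-specific values:

```shell
#!/bin/bash
#SBATCH --job-name=example-job      # placeholder job name
#SBATCH --gres=gpu:1                # mandatory on the AI Systems: GPUs per node
#SBATCH --time=01:00:00             # wall-time limit (default 1 h, max 2 days)
#SBATCH --output=slurm-%j.out       # %j expands to the job ID

# Launch the workload; replace with your actual command.
srun python train.py
```

Submitting the script with `sbatch` returns immediately and prints the assigned job ID, while the job itself waits in the queue until resources are free.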

S-Commands

The following are the most frequently used Slurm commands that users rely on for day-to-day work. They cover the essential steps of checking resources, submitting jobs, monitoring progress, and managing running or completed jobs.

  • sinfo - Show available partitions, nodes, and their status.
  • squeue - Display currently queued and running jobs.
  • srun - Launch a job or job step (interactively or inside a batch allocation).
  • sbatch - Submit a batch job script to the scheduler.
  • scancel - Cancel a running or pending job.
  • salloc - Allocate resources for an interactive job session.
  • sacct - View job accounting and usage information.

For detailed information, please have a look at the Slurm cheatsheet. It lists all common commands, their arguments, and environment variables. It can be found here: Slurm Cheatsheet
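A typical day-to-day workflow combining these commands might look like this (the script name `job.sh` is a placeholder, and `<jobid>` stands for the ID printed by sbatch):

```shell
sinfo                 # check which partitions and nodes are available
sbatch job.sh         # submit the batch script; prints the assigned job ID
squeue -u $USER       # list your own queued and running jobs
scancel <jobid>       # cancel the job if something went wrong
sacct -j <jobid>      # after completion, inspect accounting info for the job
```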

Key Points

The following key points describe important details specific to the Slurm setup on the AI Systems.
Make sure to follow these conventions when submitting or running jobs to ensure your workloads start and run correctly.

  • It's necessary to set gres when requesting resources, otherwise jobs won't start. This argument determines the number of GPUs requested per node. For example, one can request a single GPU per node by setting --gres=gpu:1 (more generally, --gres=gpu:<number_of_GPUs>). At most 8 GPUs can be requested per job. HGX nodes have 4 GPUs available, DGX nodes have 8.

  • For individual jobs and allocations, the default time limit is 1 hour, and the maximum time limit is 2 days.
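Put together, a request that respects both conventions might look like this (the GPU counts and durations are chosen for illustration, and `job.sh` is a placeholder script name):

```shell
# 4 GPUs per node (the HGX maximum) for 12 hours, within the 2-day limit
sbatch --gres=gpu:4 --time=12:00:00 job.sh

# The same gres and time-limit rules apply to interactive allocations:
salloc --gres=gpu:1 --time=02:00:00
```

Omitting --time falls back to the 1-hour default; omitting --gres prevents the job from starting at all.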