Job Processing with SLURM on SuperMUC-NG
General
The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.
Submit hosts are usually login nodes that permit users to submit and manage batch jobs.
Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. With hyperthreading, you have to increase the number of MPI tasks per node from 48 to 96 in your job script. Please be aware that with 96 MPI tasks per node, each process gets only half of the memory by default. If you need more memory, you have to specify it in your job script and use the fat nodes (see example batch scripts).
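As a hedged illustration (job name, node count, and executable are placeholders; the slurm_setup module load reflects the usual LRZ convention), a script using both hardware threads of each core might look like this:

```bash
#!/bin/bash
#SBATCH -J ht_job                 # placeholder job name
#SBATCH --partition=micro
#SBATCH --account=insert_your_projectID_here
#SBATCH --time=02:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96      # 96 instead of 48: one task per hardware thread

module load slurm_setup           # assumption: usual LRZ setup step
mpiexec ./myprog.exe              # task count inherited from the SLURM settings
```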
List of relevant commands
Command | Purpose |
---|---|
sbatch | submit a job script |
scancel | delete or terminate a queued or running job |
squeue | print table of submitted jobs and their state. Note: non-privileged users can only see their own jobs. |
salloc | create an interactive SLURM shell |
srun | execute a command on the resources assigned to a job. Note: must be executed inside an active job (script or interactive environment). mpiexec is an alternative and is preferred on LRZ systems |
sstat | Display various status information of a running job/step. |
sinfo | provide overview of cluster status |
scontrol | query and modify SLURM state |
sacct | is not available for users. |
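For illustration, a typical round trip with these commands (the job ID 123456 is an example value, job.sh a placeholder script):

```bash
$ sbatch job.sh        # submit; prints "Submitted batch job 123456"
$ squeue -u $USER      # inspect the state of your own jobs
$ scancel 123456       # delete or terminate the job again
```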
SLURM partitions (Queues) and their limits
- Batch queues are called partitions in SLURM.
- The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
- Scheduling and prioritization is based on a multifactor scheme including wait time, job size, partition, and required quality of service.
The following partitions are available. Check with sinfo for more details:
partition | min-max nodes per job | max usable memory | cores per node | max run time (hours) | max running jobs per user | max submitted jobs per user (qos) | base job processing priority |
---|---|---|---|---|---|---|---|
test | 1-16 | 90 GB | 48 | 0.5 | 1 | 3 | (dedicated nodes) |
micro | 1-16 | 90 GB | 48 | 48 | 20 | 40 | low (*) |
general | 17-768 | 90 GB | 48 | 48 | 10 | 30 | medium |
large | 769-3168 (approx. half of system) | 90 GB | 48 | 24 | 2 | 5 | high |
fat | 1-128 | 740 GB | 48 | 48 | 2 | 10 | |
(*) Remark: "micro" jobs are frequently executed via SLURM's backfilling algorithm when a larger job from the "general" or "large" queue terminates earlier than expected and leaves an unoccupied time slot in SLURM's scheduling matrix. Such a slot is necessarily shorter than 48 hours (shorter than 24 hours for a terminating "large" job). It can therefore be helpful to specify a maximum execution time for "micro" jobs well below the allowed maximum of 48 (24) hours, because slots in SLURM's processor-time scheduling matrix usable for backfilling usually last less than 48 hours. A "micro" job that specifies the maximum allowed limit of 48 hours cannot qualify for backfilling at all, and its queue waiting time will consequently be longer (if not maximal).
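For example (the six-hour value is illustrative), a "micro" job stating a realistic limit well below the maximum has a much better chance of being backfilled:

```bash
#SBATCH --partition=micro
#SBATCH --time=06:00:00   # realistic estimate, well below the 48 h maximum
```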
srun and mpiexec
With the SLURM srun command, users can spawn any kind of application, process, or task inside a job allocation, or directly start a parallel job (implicitly asking SLURM to create the appropriate allocation). This can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are defined with srun, the options from sbatch or salloc are inherited.
Note: mpiexec is the preferred and only supported way to start applications. srun might fail (particularly for hyperthreaded applications).
salloc / srun for interactive processing
- allocate nodes
- then execute one or more commands in this allocation
salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!
"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by the salloc.
There is no advantage by using "salloc" over "sbatch --partition=test" in terms of wait time.
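A minimal interactive sketch (project ID and executable are placeholders):

```bash
$ salloc --account=pr12ab34 --partition=test --nodes=2 --time=00:30:00
$ mpiexec -n 96 ./myprog.exe   # 2 nodes x 48 tasks, run on the allocated nodes
$ exit                         # terminate the allocation
```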
sbatch Command / #SBATCH option
Batch job options and resources can be given as command line flags to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as comment lines of the form #SBATCH --option=value.
For a very simple job to test your setup see the following:
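A hedged sketch of such a minimal test job (the project ID is a placeholder; the slurm_setup module load reflects the usual LRZ convention):

```bash
#!/bin/bash
#SBATCH -J setup_test
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
#SBATCH -D ./
#SBATCH --partition=test
#SBATCH --account=insert_your_projectID_here
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --export=NONE
#SBATCH --get-user-env

module load slurm_setup   # assumption: usual LRZ setup step
mpiexec hostname          # one "hostname" per task, to verify the setup
```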
Batch Job Examples
General options applicable for all jobs:

```bash
#!/bin/bash
# Job Name and Files (also --job-name)
#SBATCH -J jobname
# Output and error (also --output, --error):
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
# Initial working directory (also --chdir):
#SBATCH -D ./
# Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=insert_your_email_here
# Wall clock limit:
#SBATCH --time=24:00:00
#SBATCH --no-requeue
# Setup of execution environment
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --account=insert_your_projectID_here
#SBATCH --partition=insert test, micro, general, large or fat
# Constraints are optional
##SBATCH --constraint="scratch&work"

# <insert the specific options for resources and execution from below here>
```

Hints and explanations for these options cover: replacement patterns in filenames, notification types, requeue/no-requeue, the execution environment (--export, --get-user-env), the account, the partition (request a specific partition ("queue") for the resource allocation), and the optional constraint.

Options for resources and execution (select the appropriate case and merge it with the general options above):

- nodes=<minnodes[-maxnodes]>
- ntasks
- ntasks-per-node
- ntasks-per-core
- cpus-per-task
- switches=<number>[@waittime hh:mm:ss]
- array (job arrays)
- mpiexec: if SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters
- ear (energy aware runtime)
- pinning
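For instance (node count and executable are placeholders), a pure MPI job on 4 thin nodes would merge the following with the general options above:

```bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=48

module load slurm_setup   # assumption: usual LRZ setup step
mpiexec ./myprog.exe      # task count inherited from the SLURM settings
```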
Submitting several jobs with dependencies
Use the sbatch option --dependency=<dependency_list> to defer the start of a job until the specified dependencies have been satisfied. <dependency_list> is of the form <type:job_id[:job_id][,type:job_id[:job_id]]>:
- after:job_id[:jobid...] job can begin execution after the specified jobs have begun execution.
- afterany:job_id[:jobid...] job can begin execution after the specified jobs have terminated.
- afternotok:job_id[:jobid...] job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
- afterok:job_id[:jobid...] job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
Chained jobs will be set on hold (Dependency). Please note that a job in the chain only starts accruing priority once its dependency has been released; thus, there is no guarantee that the next job in the chain starts right after its predecessor finishes. The maximum length of the job chain is determined by the max submitted jobs per user for the queue (see table at the top of the page). Chaining can be used to execute workflows with a minimum of supervision in a long-lasting campaign.
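A minimal sketch of a chain submission, using sbatch --parsable to capture each job ID (the script name is a placeholder):

```bash
#!/bin/bash
# Submit a chain of three jobs; each starts only if its predecessor
# completed successfully (afterok).
JOBID=$(sbatch --parsable step.slurm)
for i in 2 3; do
    JOBID=$(sbatch --parsable --dependency=afterok:${JOBID} step.slurm)
done
echo "Last job in chain: ${JOBID}"
```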
Input Environment Variables
Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables override any options set in a batch script, and command line options override any environment variables. Some that you may want to set in $HOME/.profile:
Variable | Option |
---|---|
SBATCH_ACCOUNT | --account |
SBATCH_JOB_NAME | --job-name |
SBATCH_REQUEUE | --requeue |
SBATCH_NOREQUEUE | --no-requeue |
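For example, a default account could be set once in $HOME/.profile (the project ID below is a placeholder); a --account flag on the command line still overrides it:

```bash
# In $HOME/.profile: default project for all sbatch submissions
export SBATCH_ACCOUNT=pr12ab34   # placeholder project ID
```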
Output Environment Variables
The SLURM controller sets the following variables in the environment of the batch script:
Variable | Option |
---|---|
SLURM_JOB_ID SLURM_JOBID | Both variants return the SLURM JobID |
SLURM_JOB_ACCOUNT | Account name associated with the job allocation |
SLURM_JOB_NUM_NODES | Number of nodes. |
SLURM_JOB_NODELIST | List of nodes allocated to the job. To convert the SLURM compressed format into a full list: scontrol show hostnames $SLURM_JOB_NODELIST |
SLURM_NTASKS | Number of tasks. Example of usage: mpiexec -n $SLURM_NTASKS |
SLURM_NTASKS_PER_NODE SLURM_NTASKS_PER_CORE | These variables are only set if the corresponding sbatch option was given. |
SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on each node. Returned value looks like "96(x128)". |
SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. Returned value looks like "8(x128)". |
SLURM_PROCID | The MPI rank (or relative process ID) of the current process. Can be used in wrapper scripts, e.g. if [ $SLURM_PROCID -eq 0 ]; then ...; fi |
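As a sketch (the executable names are placeholders), a wrapper started on every task under SLURM control could branch on the rank like this:

```bash
#!/bin/bash
# wrapper.sh: run a different program on rank 0 than on all other ranks
if [ "$SLURM_PROCID" -eq 0 ]; then
    exec ./master.exe     # rank 0 becomes the master
else
    exec ./worker.exe     # all other ranks become workers
fi
```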
File Patterns
sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter.
Example: #SBATCH -o ./%x.%j.out
Pattern | Expansion |
---|---|
%j | Job ID of the running job |
%J | jobid.stepid of the running job, e.g. "128.0" |
%a | Job array index (task ID) number |
%u | User name |
%x | Job Name |
%t | task identifier (rank) relative to current job. This will create a separate IO file per task. |
Useful commands
Show the estimated start time of a job: squeue --start [-u <userID>]
Guidelines for resource selection
Processing Mode
- Jobs that only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
- Multiple independent tasks can be bundled into one job, using one or more nodes.
Run time limits
- Please note that all job classes impose a maximum run time limit, which can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, the more realistic your run time limit, the better the throughput your job may achieve.
Number of Islands/Switches
- This defines the maximum count of switches (= islands of SuperMUC-NG) desired for the job allocation, and optionally the maximum time to wait for that number of switches. If SLURM finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the wait time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against shorter wait time in the queue. Also use the minimum number of switches when you need good reproducibility for profiling or benchmarking, as in the example below.
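For example (the values are illustrative), to request that all nodes are placed on a single island but wait at most 24 hours for such a placement:

```bash
#SBATCH --switches=1@24:00:00   # at most 1 switch/island; give up waiting after 24 h
```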
Energy aware runtime
- Switch the dynamic frequency adjustment off when you need good reproducibility for profiling; see the sketch below.
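Assuming the system exposes EAR's --ear batch option (please verify against the LRZ EAR documentation), this could look like:

```bash
#SBATCH --ear=off   # assumed EAR plugin flag: disable dynamic frequency adjustment
```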
Memory Requirements
- The total memory available in user space for the set of nodes requested by the job must not be exceeded.
- On each individual node, the combined memory used by all tasks running there must not exceed the node's available memory.
- Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while workers can share a node. LRZ provides the "mixed" partition for using thin and fat nodes concurrently; one possible layout is sketched below.
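A hedged sketch of one way to express such a master-worker layout (the "mixed" partition comes from the paragraph above; executables, counts, and the exact placement of the master on the fat node are assumptions to verify with LRZ):

```bash
#!/bin/bash
#SBATCH --partition=mixed          # thin and fat nodes concurrently (see above)
#SBATCH --nodes=5
#SBATCH --account=insert_your_projectID_here
#SBATCH --time=08:00:00

module load slurm_setup            # assumption: usual LRZ setup step
# MPMD launch: 1 memory-hungry master task, 192 worker tasks (4 nodes x 48)
mpiexec -n 1 ./master.exe : -n 192 ./worker.exe
```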
Disk and I/O Requirements
- Disk and I/O requirements are not controlled by the batch scheduling system; jobs rely on parallel shared file systems, which provide system-global services with respect to bandwidth. This means the total I/O bandwidth is shared between all users, so all I/O may be significantly slowed down if heavily used by multiple users at the same time, or even, for large-scale parallel jobs, by a single user. At present, LRZ cannot make any quality-of-service assurance for I/O bandwidth.
- The appropriate usage of the parallel file systems is essential.
- Please consult File Systems of SuperMUC-NG for more detailed technical information.
Licences
- Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
- There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
- LRZ is currently not able to manage license contingents, since this would require significant additional effort, not only in the configuration of SLURM but also in how the license servers are managed. This implies that a job will fail if the usage limit of a licensed software product is exceeded when the job starts.
Conversion of scripts from LoadLeveler and other workload managers
- see: List of the most common commands, environment variables, and job specification options used by the major workload management systems
Resource usage of jobs
For currently running jobs, queries can be done via the sstat command, for example
sstat --fields=MaxRSS%30,MaxRSSnode%30 --jobs=123456
would supply the maximum resident set size of all tasks in the job with ID 123456, as well as the node on which this value was reached. Note that this will only work if the executable is appropriately executed under SLURM control, i.e. via the mpiexec or srun commands.
For jobs that are already completed, you need to contact the servicedesk to obtain such information. We currently cannot expose the sacct interface to regular users.
Specific Topics (jobfarming, constraints)
SLURM Documentation
- SLURM Workload Manager at LRZ
- Command/option Summary (two pages)
- Documentation for SLURM at SchedMD
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sstat(1), sview(1)