Job farming with SLURM
Decision Matrix
| Workload \ Method | Job Arrays | Multiple serial workers with srun --multi-prog | Multiple serial workers with mpiexec | Multiple (parallel) workers with srun | Multiple (parallel) workers with mpiexec (on one node) | pexec | redis/doRedis |
|---|---|---|---|---|---|---|---|
| Long-running parallel jobs with a large number of nodes per worker | yes | no | no | no | no | no | no |
| Parallel workers | yes | no | no | yes | yes | no | yes, with redisexec |
| Serial workers | no | yes | yes | yes | yes | yes | yes (R or Python) |
| Number of workers = number of allocated cores | no | yes | yes | yes | yes | yes | yes |
| Number of workers > number of allocated cores | no | no | no | yes | no | yes | yes |
| Number of workers >> number of allocated cores | no | no | no | no | no | yes | yes |
| (Very) unbalanced workers | yes | no | no | yes | no | yes | yes |
Task identifier
Depending on the method, the following environment variables can be used to distinguish between tasks (a minimal usage sketch follows the list):
- SLURM_ARRAY_TASK_ID: the task index for array jobs
- SLURM_PROCID: the MPI rank (or relative process ID) of the current process (with srun)
- SLURM_LOCALID: the node-local task ID for the process within a job (with srun)
- SLURM_STEPID: the step ID of the current job (with srun)
- PMI_RANK: the MPI rank (or relative process ID) of the current process with Intel MPI (with mpiexec)
- SUBJOB: the task identifier within pexec
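As an illustration, a wrapper script can read whichever of these variables is set to select its own input and output files. The following is only a minimal sketch; myprog and the file names are placeholders.

```
#!/bin/bash
# Hypothetical wrapper: take the task identifier from whichever variable the
# launch method (job array, srun, Intel MPI mpiexec, or pexec) has set.
ID=${SLURM_ARRAY_TASK_ID:-${SLURM_PROCID:-${PMI_RANK:-${SUBJOB:-0}}}}

# Use the identifier to pick task-specific input and output files.
./myprog <input.$ID >output.$ID
```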
Methods
Job Arrays
Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with many tasks can be submitted in milliseconds. All jobs must have the same initial options (e.g. size, time limit, etc.). Job array tasks have additional environment variables set:
- SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
- SLURM_ARRAY_TASK_ID will be set to the job array index value.
- SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array.
```
#SBATCH --nodes=100
#SBATCH --ntasks=4800
#SBATCH --array=1-10

# same program but different input data
mpiexec -n $SLURM_NTASKS ./myprog <in.$SLURM_ARRAY_TASK_ID
```
Combining job arrays with the other methods below is possible (e.g. with "Multiple parallel workers with srun" or "srun --multi-prog").
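For example, each array task can itself farm out several parallel job steps with srun and distinguish its work units by combining SLURM_ARRAY_TASK_ID with a local counter. The following is only a sketch; the node counts, file names, and ./wrapper are placeholders.

```
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --array=1-10

# Each array task runs two 2-node job steps in parallel (hypothetical wrapper).
for J in 1 2 ; do
    srun -N 2 -n 16 -J subjob.$SLURM_ARRAY_TASK_ID.$J \
        ./wrapper <in.$SLURM_ARRAY_TASK_ID.$J >out.$SLURM_ARRAY_TASK_ID.$J &
done
wait
```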
Multiple serial workers with srun --multi-prog
Run a job with different programs and different arguments for each task. In this case, the executable program specified to srun is actually a configuration file that lists the executable and the arguments for each task. The number of worker tasks is limited by the number of SLURM tasks. Each line of the configuration file contains:
- Task rank: One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a '-' with the smaller number first (e.g. "0-4" and not "4-0"). To indicate all tasks not otherwise specified, specify a rank of '*' as the last line of the file.
- Executable: The name of the program to execute
- Arguments: The expression "%t" will be replaced with the task's number. The expression "%o" will be replaced with the task's offset within its range (e.g. a configured task rank value of "1-5" would have offset values of "0-4"). Single quotes may be used to avoid having the enclosed values interpreted.
```
cat example.conf
###################################################################
4-6     hostname
1,7     echo task:%t
2-3     echo offset:%o
0       bash myscript <Input.%t

srun -n 8 --multi-prog example.conf
```
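Embedded in a batch script this could look like the following sketch (the node and task counts are assumptions):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8

# One SLURM task per rank configured in example.conf
srun -n $SLURM_NTASKS --multi-prog example.conf
```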
Multiple serial workers with mpiexec
a) only a few commands
```
mpiexec -n 1 ./first-script : -n 2 ./second-script : -n 3 ./third-script
```
b) many commands using PMI_RANK
```
cat workers
#!/bin/bash
if [ $PMI_RANK -lt 10 ]
then
    ./firstten <input.$PMI_RANK >output.$PMI_RANK
else
    ./manyothers <input.$PMI_RANK >output.$PMI_RANK
fi

mpiexec workers
```
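Within a batch script this could look like the following sketch (the node and task counts are assumptions; the workers file must be executable):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=48

chmod +x workers
# One serial worker per MPI rank; PMI_RANK is set by Intel MPI.
mpiexec -n $SLURM_NTASKS ./workers
```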
Multiple parallel workers with srun
srun can be used as a resource manager, which also works with OpenMP threading. The following script runs multiple job steps in parallel within an allocated set of nodes. Currently, we recommend issuing a short sleep between the submissions of the tasks. The sum of the nodes involved in the job steps should not exceed the number of nodes allocated to the job. Here we provide an example script in which 128 work units have to be performed and up to 10 workers run in parallel, each submitting one sub-job of 2 nodes per unit at a time; therefore we need to allocate 20 nodes for the whole job. In this example, each sub-job uses 8 tasks per node (thus 16 tasks per worker in total) and 6 cpus-per-task for OpenMP threading. For that we export, as usual, the value of OMP_NUM_THREADS and invoke srun with the -c $SLURM_CPUS_PER_TASK option. Remove both (or set cpus-per-task to 1) and adjust ntasks-per-node to run without OpenMP.
```
...
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
...
# Fill just the following two
MY_NODES_PER_UNIT=2
MY_WORK_UNITS=128

# Just algebra, these need no changes
MY_WORKERS=$(( $SLURM_JOB_NUM_NODES / $MY_NODES_PER_UNIT ))
MY_TASKS_PER_WORKER=$(( $SLURM_NTASKS_PER_NODE * $MY_NODES_PER_UNIT ))

for (( I=1; I<=$MY_WORK_UNITS; I++ )) ; do   # Count the work units that started
    while true ; do                          # Scan for a free worker
        SUBJOBS=`jobs -r | wc -l`            # detect how many subjobs are already running
        if [ $SUBJOBS -lt $MY_WORKERS ] ; then   # submit only if at least one worker is free
            sleep 4                          # wait before any submission
            # wrapper could also be an MPI program, `-c $SLURM_CPUS_PER_TASK` is only for OpenMP
            srun -N $MY_NODES_PER_UNIT -n $MY_TASKS_PER_WORKER -c $SLURM_CPUS_PER_TASK -J subjob.$I ./wrapper $I >OUT.$I &
            break                            # So "I" will get +1
        fi
    done
done
wait   # for the last pool of work units
exit
```
The important points to note here are that we background each srun with "&" and then tell the shell to wait for all child processes to finish before exiting.
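The ./wrapper called above is not part of SLURM; a minimal, hypothetical version could look like the following sketch (myprog and the file names are placeholders). Note that srun starts one copy of the wrapper per task of the job step.

```
#!/bin/bash
# Hypothetical wrapper: $1 is the work-unit index passed by the surrounding loop.
# srun starts one copy per task, so SLURM_PROCID can additionally be used to
# distinguish the tasks within one work unit.
UNIT=$1

./myprog <input.$UNIT >result.$UNIT.$SLURM_PROCID
```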
If the sub-jobs are well balanced you can, of course, do the following:
```
srun -N 2 -n 16 -c 6 -J subjob.1 ./wrapper 1 >OUT.1 &
srun -N 2 -n 16 -c 6 -J subjob.2 ./wrapper 2 >OUT.2 &
srun -N 2 -n 16 -c 6 -J subjob.3 ./wrapper 3 >OUT.3 &
...
srun -N 2 -n 16 -c 6 -J subjob.10 ./wrapper 10 >OUT.10 &
wait
# now do the next batch of jobs
srun -N 2 -n 16 -c 6 -J subjob.11 ./wrapper 11 >OUT.11 &
...
```
Multiple parallel workers with mpiexec (within a node)
It is not possible to run more than one MPI program concurrently using the normal startup. However, within a node you can start an arbitrary number of mpiexec instances that communicate via shared memory. Typically you have to specify the processor list for pinning so that the individual programs do not overlap (although you can let them overlap if you need to).
```
# First create a hostfile with the entry localhost
for I in `seq 0 95`
do
  echo localhost
done >localhost

# prepare the MPI environment
unset I_MPI_HYDRA_BOOTSTRAP
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
export I_MPI_FABRICS=shm
export I_MPI_HYDRA_HOST_FILE=localhost

# now you can start multiple mpiexecs within the node
# load a test program
module load lrztools
I_MPI_PIN_PROCESSOR_LIST=0-9   mpiexec -n 10 placementtest-mpi.intel >OUT.1 &
I_MPI_PIN_PROCESSOR_LIST=10-19 mpiexec -n 10 placementtest-mpi.intel >OUT.2 &
I_MPI_PIN_PROCESSOR_LIST=20-29 mpiexec -n 10 placementtest-mpi.intel >OUT.3 &
I_MPI_PIN_PROCESSOR_LIST=30-39 mpiexec -n 10 placementtest-mpi.intel >OUT.4 &
I_MPI_PIN_PROCESSOR_LIST=40-47 mpiexec -n 8  placementtest-mpi.intel >OUT.5 &
wait
```
Again, the important points to note are that we background each mpiexec with "&" and then tell the shell to wait for all child processes to finish before exiting.
If the tasks are not run in the background, they will execute one after the other; and if the memory is not divided between them, the first call will take the entire allocation, preventing the others from starting, which again forces the calls to mpiexec to run sequentially.
If you want to run these mpiexecs on several nodes, you can pack the second part of the script above into a shell script, make it executable, and execute it on each of your allocated nodes with srun (as described in the previous section).
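A hedged sketch of that approach: put the environment setup and the backgrounded mpiexec calls into a script (here called pernode.sh, a name chosen only for this illustration) and start one instance per allocated node; details such as CPU binding of the sub-steps may need to be adapted to the system.

```
#!/bin/bash
#SBATCH --nodes=4

# pernode.sh contains the hostfile creation, the I_MPI_* settings and the
# backgrounded mpiexec calls shown above; it must be executable. Including
# the hostname or $SLURM_NODEID in its output file names avoids collisions
# between nodes.
srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node=1 ./pernode.sh
```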
Serial commands with pexec from lrztools
pexec takes a configuration file with serial commands. The number of worker tasks may be much larger than the number of allocated cores. Within the script or wrapper, the environment variable $SUBJOB may be used to distinguish between tasks. The next free core takes the next task in the list.
```
module load lrztools

cat wrapper
#!/bin/bash
./serial <input.$SUBJOB >output.$SUBJOB

cat tasklist
./serial-script <input.1 >output.1
./serial-exe <input.2 >output.2
...
./serial-command <input.1000 >output.1000
./wrapper
./wrapper

mpiexec -n 150 pexec tasklist
```
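A long task list does not have to be written by hand; a small loop such as the following sketch can generate it (the command and file names are placeholders):

```
#!/bin/bash
# Generate a task list with 1000 entries of the same serial command,
# each reading and writing its own file.
for I in `seq 1 1000` ; do
    echo "./serial-command <input.$I >output.$I"
done >tasklist
```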
Using R and redis
A simple worker queue for R functions, using redis as the database.
Redisexec can be used in single-node mode (the default) or in MPI mode (used if 'nodespertask' > 1 or 'forcempi' = 1). In MPI mode, redisexec automatically splits up the SLURM host file to create MPI groups of size 'nodespertask' (one of redisexec's arguments). Currently, only Intel MPI is supported.
For more documentation, please see the GitHub page.