Job Farming with GNU parallel

What is GNU parallel

"GNU parallel is a shell tool for executing jobs in parallel using one or more computers." (documentation)

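For a first impression, a minimal illustrative invocation could look as follows; it runs three echo commands with at most two of them in parallel (the arguments are just placeholders):

parallel -P 2 echo ::: task_A task_B task_C
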
General Considerations

As with all job farming concepts, some statistics about each task's requirements (runtime, number of CPUs, memory) are needed here as well, in order to assess the resource requirements of the farming job as a whole. The declared goal of job farming is to achieve a high throughput, but 100% is most probably not attainable. The remaining tasks that failed (for one reason or another) must then be handled separately.

GNU parallel allows for a lot of input options. Please consult the documentation, specifically the tutorial.
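
As a rough sketch, the most common ways to feed input to GNU parallel are the following (args.txt is just a placeholder for an argument file of your own):

parallel echo ::: a b c          # arguments given on the command line
parallel echo :::: args.txt      # arguments read from a file, one per line
seq 1 5 | parallel echo          # arguments read from standard input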

Currently, GNU parallel does not directly support task dependencies. For those, other tools or some hand-made solution must be used.

Slurm and GNU parallel Example

There are many examples on the documentation page of GNU parallel. Here we just illustrate a very common case.

A very simple mode of operation could be as follows.

jobfarm.slurm
#!/bin/bash
#SBATCH -J jobfarm_test
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D ./
#SBATCH --mail-type=NONE
#SBATCH --time=T.B.A.
#SBATCH --export=NONE
#SBATCH --clusters=...
#SBATCH --partition=...
#SBATCH --nodes=2
#SBATCH --ntasks=4               # needed for SLURM_NTASKS; see -P option parallel below

module load slurm_setup
module load parallel

export OMP_NUM_THREADS=14
export MY_SLURM_PARAMS="-N 1 -n 1 -c 28 --mem=27G --exact --export=ALL --cpu_bind=verbose,cores --mpi=none"

export MYEXEC=/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only
export PARAMS=("-d 20" "-d 10" "-d 20" "-d 10" "-d 20" "-d 10" "-d 20" "-d 10")

task() {
   echo "srun $MY_SLURM_PARAMS $MYEXEC $2 &> log2.$1"
   srun $MY_SLURM_PARAMS $MYEXEC $2 &> log2.$1
}
export -f task

parallel -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"

The Slurm SBATCH header is kept quite general: it only specifies the number of nodes and the number of srun tasks that can run in parallel. The actual resource requirement per task is specified in the srun parameter list.

The bash function task performs the actual workflow of a task. A task is defined here as one execution of placement-test (just a dummy test program; replace it by your own program) with different input parameters, given here by the bash array PARAMS. Instead of a bash function, a bash script or any other program could be used. The first parameter task expects is the task sequence number ({#}); we use it to give each task an ID and to parameterize the log files, but other uses are conceivable. PARAMS can, for instance, also contain working directories, which can be passed to srun (via its -D option). It may be more convenient to read PARAMS from a file using readarray; GNU parallel also allows direct file input, as sketched below.
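
A minimal sketch of both variants, assuming a file params.txt (the name is just an example) with one parameter set per line:

# read the parameter sets into the bash array PARAMS ...
readarray -t PARAMS < params.txt
parallel -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"

# ... or let GNU parallel read the file directly
parallel -P $SLURM_NTASKS task {#} {} :::: params.txt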

All this allows for a rather flexible setup of general job or task farming workflows. However, for this to work, the correct resource requirements of each task must be specified. We cannot give examples of all possible variants here, because MPI programs or hybrid MPI-OpenMP programs can also run as tasks. To obtain the correct placement of ranks and threads on the CPUs, we recommend starting with dummies (like placement-test), or using the --verbose option of srun, in order to see the CPU masks explicitly. Using sacct, the job steps created by srun can be scrutinized for where and when they ran, for how long, and how much (peak) memory was used. To gain some confidence in how this works, we recommend playing with some dummy tasks on the login nodes (then without srun) or in the interactive queue. GNU parallel also has a --dryrun option for checking that the workflow is at least in principle correct. Still, expect that other things can go wrong!
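
For instance, one could first check the generated commands without executing anything via --dryrun, and inspect the finished job steps afterwards with sacct (replace <jobid> by the Slurm job ID; the format fields are just one possible selection):

parallel --dryrun -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,NodeList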

Although GNU parallel can in principle also spawn tasks on other nodes directly via SSH, we discourage this. Slurm is the default resource manager and scheduler on our systems; using srun as illustrated is therefore strongly recommended.

Nice features of GNU parallel

GNU parallel has options like --eta for a (very rough) estimate of the time to completion, as well as --joblog, --resume, --resume-failed, and --retry-failed. The latter allow for a simple but effective way of bookkeeping the tasks. For this to work, task must return a success (0) or failure (non-zero) exit code; using set -e inside task may be a good idea. The resume flags are great for cases where a farming job failed (e.g. due to a node failure) or simply timed out (Slurm wall clock limits apply on our queues). Such a job can simply be restarted by adding one of these flags.
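
A minimal sketch of such a restart, assuming the original run already used --joblog my_joblog:

# rerun only the tasks that failed or did not run in the previous attempt
parallel --joblog my_joblog --resume-failed -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"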

GNU parallel also knows the option --results, which can be given either a directory name (the task arguments must then be unique) or a file name, e.g. ending in .csv or .json. These then contain the stdout, stderr and task sequence number.
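
Two illustrative variants (directory vs. CSV file; the names are just examples):

parallel --results outdir        -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"
parallel --results all_tasks.csv -P $SLURM_NTASKS task {#} {} ::: "${PARAMS[@]}"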

Task Dependency Graph

The semantics here is that some tasks may depend on the successful execution of one or more earlier tasks. Such task dependencies can be illustrated in a task dependency graph (in most cases it is just a tree). For complicated task dependencies, dedicated workflow tools are available (each with its strengths and weaknesses; please also consider RADICAL-Cybertools).

However, for simple task dependency chains, GNU parallel can be used as a simple workflow tool. The rough idea is that the tasks are placed in a file. Some of these tasks append, upon successful completion, their follow-up task(s) to the same file. Using tail -f (essentially), these updates to the file are fed to GNU parallel.

Another option is a combination of writing to the task file (appending tasks) and restarting parallel with --joblog and --resume. For instance:

jobfarm.slurm
#!/bin/bash
#SBATCH -J jobfarm_test
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D ./
#SBATCH --mail-type=NONE
#SBATCH --time=00:05:00
#SBATCH --export=NONE
#SBATCH --clusters=...
#SBATCH --partition=...
#SBATCH --nodes=2
#SBATCH --ntasks=4               # needed for SLURM_NTASKS on parallel

module load slurm_setup
module load parallel

export OMP_NUM_THREADS=14
export MY_SLURM_PARAMS="-N 1 -n 1 -c 28 --mem=27G --exact --export=ALL --cpu_bind=verbose,cores --mpi=none"
export MYEXEC=/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only
export CMD="srun $MY_SLURM_PARAMS $MYEXEC"

# create a task file
task_db=task.db
if [ ! -f $task_db ]; then
   cat > $task_db << EOF
$CMD -d 20
$CMD -d 10
$CMD -d 15
$CMD -d 30 ; echo $CMD -d 10 >> $task_db
$CMD -d 20 ; echo $CMD -d 15 >> $task_db
$CMD -d 10 ; echo $CMD -d 20 >> $task_db
$CMD -d 10 ; echo $CMD -d 10 >> $task_db
EOF
fi

task() {
   set -e
   echo "$2 &> log.$1"
   eval $2 &> log.$1
}
export -f task

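# keep restarting parallel until no new tasks have been appended to the task file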
num_jobs=""
while [ "$num_jobs" != "$(wc -l < $task_db)" ]; do
  num_jobs=$(wc -l < $task_db)
  parallel --joblog my_joblog --resume --verbose -P $SLURM_NTASKS task {#} {} :::: $task_db
done

Important to know: on CoolMUC-2, 28 CPUs per node are available (without hyperthreading). Other node types offer different resources, and the placement of tasks onto these resources must be adapted accordingly.
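
For example (the numbers are purely illustrative), on nodes with 80 physical cores and 4 concurrent tasks per node, the settings above might be adapted roughly like this:

export OMP_NUM_THREADS=20
export MY_SLURM_PARAMS="-N 1 -n 1 -c 20 --mem=20G --exact --export=ALL --cpu_bind=verbose,cores --mpi=none"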


Another possibility could look as follows.

jobfarm.slurm
#!/bin/bash
#SBATCH -J jobfarm_test
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D ./
#SBATCH --mail-type=NONE
#SBATCH --time=00:05:00
#SBATCH --export=NONE
#SBATCH --clusters=...
#SBATCH --partition=...
#SBATCH --nodes=2
#SBATCH --ntasks=4               # needed for SLURM_NTASKS on parallel

module load slurm_setup
module load parallel

# create a monitor
stop_monitor() {
  # wait until jobs started
  while [ "$(ps aux | grep tail | grep -v grep)" == "" ]; do
    sleep 2;
  done
  PID=$(ps aux | grep tail | grep -v grep | awk '{print $2;}')
  echo ">>> tail PID >>> $PID"
  sleep 20
  # wait until all srun processes stopped
  while [ "$(ps aux | grep srun | grep -v grep | wc -l)" != "0" ]; do
    sleep 10;
    echo ">>> active srun processes >>> $(ps aux | grep srun | grep -v grep | wc -l)"
  done
  kill -9 $PID
}

export OMP_NUM_THREADS=14
export MY_SLURM_PARAMS="-N 1 -n 1 -c 28 --mem=27G --exact --export=ALL --cpu_bind=verbose,cores --mpi=none"
export MYEXEC=/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only
export CMD="srun $MY_SLURM_PARAMS $MYEXEC"

# create a task file
task_db=task.db
if [ ! -f $task_db ]; then
   cat > $task_db << EOF
$CMD -d 20
$CMD -d 10
$CMD -d 15
$CMD -d 30 ; echo $CMD -d 10 >> $task_db
$CMD -d 20 ; echo $CMD -d 15 >> $task_db
$CMD -d 10 ; echo $CMD -d 20 >> $task_db
$CMD -d 10 ; echo $CMD -d 10 >> $task_db
EOF
fi

task() {
   set -e
   echo "*** task *** $2"
   eval $2    # no redirection to log.$1 here; stdout/stderr are collected via --results
}
export -f task

stop_monitor &
tail -n+0 -f $task_db | parallel --joblog my_joblog --results output.csv --verbose -P $SLURM_NTASKS task {#} {}

The output of each task is written to the file output.csv. Note that a restart via --resume will overwrite this file, so be careful!
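
One possible workaround (a sketch, not a tested recipe) is to keep the joblog for --resume but write the results of each (re)start to a separate file, e.g. by including the Slurm job ID in the name:

tail -n+0 -f $task_db | parallel --joblog my_joblog --resume --results output.$SLURM_JOB_ID.csv \
    --verbose -P $SLURM_NTASKS task {#} {}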

The stop_monitor function just keeps an eye on the srun processes that are still running. Without it, tail would keep running (hanging) even when no entries are left in the task database.