Flux Framework - Flux in Slurm

Getting started ...

What is it?

Flux Framework is a task scheduling and resource management framework, much like Slurm. However, it can be run entirely in user space. We describe it here as an alternative to Slurm's srun task-farming capabilities.

Flux is rather versatile, but also quite complex, and still under very active development. For all the details left out here, we refer you to the Flux documentation.

Using LRZ Module

> module av flux-core
------------------ /lrz/sys/share/modules/files_sles15/tools -------------------------
flux-core/0.63.0    flux-core/0.64.0
> module load flux-core

Own Installation

The simplest installation is probably via conda.

> conda create -n my_flux -c conda-forge flux-core flux-sched
> conda activate my_flux
(my_flux) > flux version
commands:    		0.64.0
libflux-core:		0.64.0
build-options:		+hwloc==2.8.0+zmq==4.3.5

If you need a more up-to-date version of Flux, you will probably have to build it from source (https://github.com/flux-framework/). Spack may help to simplify that process.

Another option for installing flux-core is Spack (user_spack). However, in order to get the latest version, manual changes to the Spack package may be necessary.
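
A minimal sketch of the Spack route (assuming the user_spack module provides a working spack command; versions and module names may differ on your system):

> module load user_spack        # assumption: makes the spack command available
> spack install flux-core       # optionally add flux-sched for the Fluxion scheduler
> spack load flux-core
> flux version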

flux-sched is not strictly necessary; a simple scheduler is always built in. The Fluxion scheduler (flux-sched) is supposed to be superior, but we could not find application scenarios that convinced us of that.

Interactive Workflows

Truly interactive work with Flux is probably not very practical. But for testing purposes, and as a starting point, let us take a short look at it. We start from a login node.

login > module load flux-core                                                          # or, activate the flux conda environment
(my_flux) login > srun -N 2 -M inter -p cm2_inter --pty flux start                     # allocate resources (on the cluster/partition of your choice)
i22r07c05s05 > flux uptime                                                             # basic info about the running flux instance
 14:11:57 run 7.9s,  owner ⼌⼌⼌⼌⼌⼌⼌,  depth 0,  size 2
i22r07c05s05 > flux resource info                                                      # basic info about the resources managed by the flux instance
2 Nodes, 56 Cores, 0 GPUs
i22r07c05s05 > flux run --label-io -N2 hostname                                        # run a task (here, on each node one)
0: i22r07c05s05
1: i22r07c05s08
i22r07c05s05 > flux bulksubmit --output=log.{{id}} -n 1 -c 7 /lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -t 7 -d 20 ::: $(seq 0 100)
ƒCF6D7Bu                                                                               # flux job IDs
[...]
i22r07c05s05 > flux jobs -a
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
[...]
    ƒCL2LiaU ⼌⼌⼌⼌⼌⼌⼌  placement+  S      1      -        - 
    ƒCGVkRgt ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   8.580s i22r07c05s05
    ƒCGVkRgs ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   10.15s i22r07c05s11
    ƒCGUGSQa ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.45s i22r07c05s11
    ƒCGUGSQZ ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.45s i22r07c05s11
    ƒCGUGSQY ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.79s i22r07c05s05
    ƒCGUGSQX ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   13.35s i22r07c05s11
    ƒCGSnT8C ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   14.15s i22r07c05s05
    ƒCGSnT8B ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   17.15s i22r07c05s05
    ƒCG62dBP ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   23.41s i22r07c05s05
    ƒCG62dBQ ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   19.54s i22r07c05s11
    ƒCG62dBM ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   20.68s i22r07c05s11
[...]
i22r07c05s05 > exit

flux has an elaborate built-in help system. Use flux help and flux help <command> to get information or a quick reminder.

flux submit/bulksubmit, flux cancel <job ID> and flux jobs -a can be used much like sbatch, scancel and squeue under Slurm. flux cancelall -f may also come in handy during first tests.
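
For orientation, a rough correspondence could look as follows (my_app is a placeholder executable):

> flux submit -n 1 -c 7 ./my_app       # cf. sbatch: submit a job and return immediately
> flux jobs -a                         # cf. squeue: list jobs, including finished ones
> flux cancel <job ID>                 # cf. scancel: cancel a single job
> flux cancelall -f                    # cancel all jobs of this instance, without asking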

Non-Interactive Workflows

The far more common way to use Flux is probably to bundle a set of tasks within a Slurm job. The scope of possible workflows is vast and cannot be covered here; an example should illustrate the basic principle.

test.sh
#!/bin/bash
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D . 
#SBATCH -J flux_test
#SBATCH --get-user-env 
#SBATCH -M inter
#SBATCH -p cm2_inter
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --mail-type=none 
#SBATCH --export=NONE 
#SBATCH --time=00:02:00 

module load slurm_setup
module load flux-core          # or, conda activate my_flux

cat > workflow.sh << EOT
flux uptime
flux resource info
flux run --label-io -N2 hostname
# 101 tasks (IDs 0..100), each with 7 CPUs
flux bulksubmit --wait --output=log.{{id}} -n 1 -c 7 /lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -t 7 -d 20 ::: \$(seq 0 100)
# 1 MPI job on 2 nodes, with 8 ranks and 7 threads (CPUs) per rank
flux run --output=log.mpi -N 2 -n 8 -c 7 -o cpu-affinity=per-task /lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -t 7
EOT
chmod u+x workflow.sh

srun --export=all --mpi=none flux start ./workflow.sh

With srun, the Flux instance is started (one process per node) and handed a script, workflow.sh, which contains the actual Flux workflow description. We use dummy programs here that report the rank/thread-to-CPU placement; it is generally a good idea to check the correctness of that placement.

This Slurm script is to be submitted as usual via sbatch.

NB: We tested this with Intel MPI, where flux run handles the rank/thread placement remarkably well.

Remark: We found that the srun option --mpi=none worked on one cluster, while on another it needed to be --mpi=pmi2. We could not pin down the exact reason, but we suspect subtle differences in the Slurm version or configuration. Please try out what works, or ask for help at our Service Desk.
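
If --mpi=none does not work on your cluster, the last line of test.sh would then read, for example:

srun --export=all --mpi=pmi2 flux start ./workflow.sh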

Waitable Jobs

In general, flux submit submits a job and returns to the shell immediately. For mass submissions within a Slurm job, the workflow script above would therefore exit right after submitting the last Flux job. To handle this, flux submit offers the option --flags=waitable. Together with a subsequent flux job wait --all, this gives an idiom similar to srun ... & followed by wait, as used for Slurm job farming. The Flux documentation claims, however, that flux job wait is much more lightweight than bash's wait.
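
A minimal sketch of this idiom inside a workflow script (my_app is a placeholder executable):

flux submit --flags=waitable -n 1 -c 7 ./my_app     # submit as waitable, return immediately
flux submit --flags=waitable -n 1 -c 7 ./my_app
flux job wait --all                                 # block until all waitable jobs have completed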

Dependency Trees

flux submit also supports job dependencies via the --dependency=... option, where ... can for instance be afterok:JOBID. This is semantically equivalent to Slurm's sbatch job dependencies.
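
A minimal sketch (pre_process and post_process are placeholder executables); flux submit prints the job ID, which can be captured in the shell:

JOBID=$(flux submit -n 1 ./pre_process)
flux submit --dependency=afterok:${JOBID} -n 1 ./post_process   # starts only if the first job completed successfully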

After the Slurm Job Stops

Flux does not seem to offer job bookkeeping that persists beyond the Slurm job, so an automatic restart from a given state of progress is probably not directly possible.
However, flux queue and flux dump offer some capabilities to document/archive the status of the Flux queue. Please check the cheat sheet below.

# Stop the queue, wait for running jobs to finish, and dump an archive.
flux queue stop
flux queue idle
flux dump ./archive.tar.gz

In order to execute this reliably within a Slurm job, a bash trap ... EXIT may be necessary (where ... is some cleanup function), as sketched below.
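
A minimal sketch of such a trap, placed near the top of workflow.sh (the cleanup function simply runs the commands from the cheat sheet above):

cleanup () {
    flux queue stop                 # stop accepting new jobs
    flux queue idle                 # wait for running jobs to finish
    flux dump ./archive.tar.gz      # archive the queue/job state
}
trap cleanup EXIT                   # run cleanup when the script exits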

Also, the following idiom can output valuable information about the Flux jobs (use flux jobs -o 'help', or check the documentation, for the available fields):

flux job wait --all
flux jobs -a -o '{id} {username} {ncores} {nnodes} {nodelist} {t_run} {t_cleanup} {runtime}'

Further Reading

Flux Framework comes with a vast amount of documentation, user guides and tutorials. We suggest that beginners start with the Learning Guide.

To embed Flux within a Slurm framework, please consult the corresponding documentation.

For a good overview, the Cheat Sheet is of tremendous help.