Nextflow on HPC Systems (Test Operation)
Please read this through completely before starting on our cluster systems!!
Disclaimer: We are neither Nextflow users nor developers. We can help with the integration of Nextflow into the LRZ HPC/Slurm workflows. Everything beyond that we must refer back to the Nextflow user forum or the Nextflow developers, through you as the user interested in this tool!
What is Nextflow?
Nextflow is a tool for complex workflow control and task farming, meant to simplify and standardize such work for its users.
Starting from such high ideals, reality is usually somewhat more nuanced.
Nextflow can interface with various resource managers and schedulers, as well as with conda and container environments. The workflow tree is described in a domain-specific language (DSL) based on Groovy.
This makes it very versatile and flexible. But it also somewhat obscures the underlying complexity of handling schedulers, environments/containers, and the interaction of tasks within an HPC compute node. All of this still matters if one wants to be successful.
Conceptual Use on LRZ Clusters
Basic interactive Usage
We do not want Nextflow production runs on login nodes!!
The simple reason is that Nextflow easily overwhelms a system when it is configured incorrectly.
Either the login node is flooded with local tasks, which disturbs other users; login nodes are shared nodes! Or, even if Nextflow is configured correctly, it produces many, often very small, tasks that flood the Slurm server, which is not really how contemporary HPC should look.
Therefore, we offer interactive nodes where Nextflow workflows can be tested. If you are running on just a single node (often sufficient for tests), please use the following recipe:
login-node> salloc -N 1
compute-node> module use /lrz/sys/share/modules/extfiles/   # for the duration of test operation
compute-node> module load nextflow                          # possibly load other modules here, or set up your work environment
compute-node> nextflow run test.nf
Make sure to use the 'local' executor here (it is usually the default)!
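For instance, a nextflow.config could pin this explicitly (a minimal sketch; usually not needed, since 'local' is the default):

// nextflow.config -- sketch only
process {
    executor = 'local'   // run all tasks directly on the allocated compute node
}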
For testing on more than one node, a scheduler is needed. Nextflow does offer Slurm support, but unfortunately only with sbatch as the script submission tool. For security reasons, the compute nodes at LRZ are NOT Slurm submit hosts, so sbatch does not work on them!
Therefore, we conceived a workaround via the flux-framework. Nextflow does not support our setup natively, so we modified the existing flux executor to make it work with our flux installation.
login-node> module use /lrz/sys/share/modules/extfiles/   # for the duration of test operation
login-node> module load nextflow/24.04.2                  # java and flux-core are loaded as prerequisites
login-node> srun -N 2 -M inter -p cm2_inter --pty flux start
compute-node> flux resource info
2 Nodes, 56 Cores, 0 GPUs
compute-node> nextflow run test.nf
N E X T F L O W   ~  version 24.04.2
Launching `my_script.nf` [astonishing_hypatia] DSL2 - revision: 3b790bbc15
executor >  flux (9)
...
compute-node> <Ctrl+D>
Please use exactly this Nextflow version, 24.04.2! The other installed version was not modified for this purpose!!
There are different ways to set the executor to flux. In a Nextflow process, one can set executor 'flux'. Or you can create a nextflow.config (per project) or a ~/.nextflow/config (global). Please consult Nextflow's documentation on that!
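As an illustration, a project-level nextflow.config might simply contain the following (a minimal sketch; adapt it to your own needs):

// nextflow.config -- sketch only
process {
    executor = 'flux'   // use the (modified) flux executor for all processes
}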
As a complete example, we also show a file test.nf, which indeed sets executor 'flux'! Furthermore, we decided to use 8 CPUs for the foo process; this value is passed to the program via ${task.cpus}, so that it knows how many threads to create.
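The following is only a sketch of what such a test.nf could look like (the command inside the process is a placeholder for the placement test or your own program; the nine tasks merely mirror the "flux (9)" output shown above):

// test.nf -- sketch only (DSL2)
process foo {
    executor 'flux'
    cpus 8

    input:
    val x

    output:
    stdout

    script:
    """
    # placeholder command; replace it by the placement test or your own executable
    echo "task ${x}: running with ${task.cpus} threads on \$(hostname)"
    """
}

workflow {
    Channel.of(1..9) | foo | view
}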
As usual, the results of the Nextflow run are located inside the work directory.
Usage in non-interactive Batch Scripts
Having prepared a Nextflow file as above, the integration into a Slurm job script is usually simple. To reuse the example from above:
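The following is a sketch of such a job script (cluster, partition, and wallclock values are placeholders and must be adapted; the HERE document just writes the test.nf shown above):

#!/bin/bash
#SBATCH --job-name=nextflow_flux
#SBATCH --clusters=cm2            # placeholder: choose your cluster
#SBATCH --partition=cm2_std       # placeholder: choose your partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00           # placeholder: adapt to your workflow

module use /lrz/sys/share/modules/extfiles/   # for the duration of test operation
module load nextflow/24.04.2                  # java and flux-core are loaded as prerequisites

# write test.nf on the fly (the quoted EOF prevents shell expansion of ${task.cpus});
# the file could just as well be prepared before job submission
cat > test.nf <<'EOF'
// ... content of the test.nf shown above ...
EOF

# start flux on all allocated nodes and run nextflow inside it;
# nextflow then creates its tasks via 'flux submit' under the hood
srun flux start nextflow run test.nf -resume   # -resume only matters on resubmission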
That script should be fully functional and can be used as a starting point for testing your own workflows.
The first section is just the normal Slurm header. When using flux, only --nodes=... and --ntasks-per-node=1 need to be specified. Next, mandatory and optional modules are loaded in order to set up the environment.
In the last line (srun ...), flux is started with a workflow command or script (this form is prescribed by the flux-framework). Within it, we use Nextflow, which then creates its tasks via flux submit under the hood.
The actual Nextflow file can, of course, be prepared before job submission. We use a HERE document here only for a self-contained representation (please consult any bash manual about that).
Remarks:
- -resume is only necessary when the job is resubmitted/requeued. As with all HPC jobs, a Nextflow job may fail (e.g. due to a node failure) or simply run into the wallclock limit. No reason for panic! On resubmission, Nextflow can resume (approximately) from where it stopped before.
- Warning! Some tasks may have excessive resource requirements, e.g. in runtime or memory. This can prevent essentially all parallelism, or even make the whole Slurm job fail! Therefore: test and know your tasks well before production!! Probably not everything can be handled on our systems!
- flux has no means yet (as of June '24) to express the memory requirements of a task. One can, however, achieve something similar by specifying cpus in a process' directive section, as shown below. Knowing that CoolMUC-2 provides 2 GB per CPU, one can request several CPUs to obtain the total amount of memory needed. If the task's program does not parallelize over that many threads, start it with fewer threads (or just one, if it is serial). Correct resource management is the user's responsibility! Flux only supports the scheduling and distribution of processes onto the hardware; the Nextflow user must get the parallelism of the tasks right on their own.
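For illustration (a sketch only; process and program names are hypothetical), a serial task needing roughly 16 GB on CoolMUC-2 could be written as:

// sketch only: process and program names are hypothetical
process bigMemoryTask {
    executor 'flux'
    cpus 8    // 8 CPUs x 2 GB/CPU on CoolMUC-2 ~= 16 GB reserved for this task

    script:
    """
    # the program is serial, so it runs with a single thread;
    # the extra CPUs are requested only to obtain enough memory
    my_serial_program input.dat
    """
}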
Troubleshooting and FAQs
- Q: How do I see that all the simultaneous tasks are really running on different CPUs? (What if I see strong performance drops?)
A: That is definitely not easy, and one usually has to rely on flux doing a good job here. However, the placement test mentioned several times shows you its own CPU occupation and also runtime time stamps. It is not meant as a permanent monitor, though, but only for tests! Just replace your task's executable by the placement test, let it run, and check its output. Some day, flux may be able to provide functionality similar to Slurm's sacct.
- Q: Many of my tasks fail! What should I do?
A: When things go wrong, one needs to debug. Look for errors in the task's output! Run the task independently! Instrument the task with monitors (e.g. wrap it in \time -v ... in order to see excessive memory use, etc.).
- Q: Conda/Container Usage?
A: Conda environments and containers (currently, only Charliecloud and Singularity are supported on the LRZ HPC clusters) are convenient means to install prebuilt software. Be aware that such software generally does not exploit the cluster hardware's performance features (e.g. vectorization), and may therefore not perform very efficiently!
Nextflow wraps the installation and execution of conda environments and containers, which can be a challenge to debug on its own should something go wrong.
Some executables may require MPI parallelism, and the conda installer/container builder may install MPI implementations that are not immediately easy to use on our clusters. Intel MPI and OpenMPI are supported (but not every version!).
Our general recommendation is currently to avoid conda environments/containers altogether if possible! Better to install the software natively. If help is needed, please ask us via the Service Desk!
Still, we are also working on improvements here.
Documentation
Final Plea
If, in your opinion, this documentation requires changes, because it is for instance unclear, ambiguous, plainly wrong, or in any other way not working or acceptable, please help us to improve it! We are grateful for any remarks, wishes, or comments from your side, and we depend here on your active collaboration.