nf-core experience report


Plot

The goal of this analysis was to survey different options for running nextflow/nf-core workflows on our HPC clusters, and to collect experience on how to achieve high and efficient production throughput. It was not only about feasibility, but also about figuring out the technical details and, if possible, generic workflows that users can easily copy for fast and successful analyses of their own data.

Niklas Schandry provided an nf-core profile, which is intended to support future users. It will, of course, need to be adapted over time to match the changes our clusters will necessarily go through. In a long and, in my opinion, fruitful discussion with Niklas, we crystallized a set of reasonably generic workflows that future users can hopefully reuse and adapt to their own needs.

As an example, we chose the nf-core/rnaseq pipeline with some sample data (see "Technical Details"). We tested the "local", "flux" and "hq" executors. All of them were encapsulated in and confined to Slurm jobs in order to spare the login nodes (automatic job generation and status polling put too much load on our Slurm infrastructure). The "slurm" executor is therefore not an option - also because our compute nodes are not submit hosts. That is the deeper reason for us to scrutinize other options.

We performed this analysis on the CoolMUC-4 cluster "cm4", as it represents by far the largest compute resource of this LRZ HPC cluster. (There are special high- and large-memory systems, but they are rare, highly occupied, and meant for special purposes.)


Some words of caution: Nextflow is a workflow manager that works through a more or less complex task-dependency graph, which represents the analysis workflow intended for the input data. The need for such a workflow manager seems to result from the sometimes extreme variance in the resource requirements (mostly number of CPUs, memory, and runtime) of the individual tasks. It is thus a tough job for a scheduler to place these tasks onto the available resources optimally or even efficiently. Unfortunately, a task's requirements (foremost memory and runtime) can differ by orders of magnitude DEPENDING ON THE INPUT DATA.
As a consequence, the present analysis might not match a user's individual experience at all. It is therefore essential to check, each time with new input data or a new analysis pipeline, whether a given workflow setup and scheduler still works reasonably!

We try to give some recommendations for a reasonable approach. But we are afraid that these cannot be followed as systematically as would be desirable. This is partly related to the capabilities - or rather, the missing ones - of the schedulers used. But not only: the nf-core pipelines' task descriptions (the nextflow process resource directives) might not always be sufficient, especially on a cluster system with special node resource properties. Flux, for instance, in the way we use it at the moment, cannot take memory requirements into account (let alone enforce them - but enforcement would anyway be counterproductive for the proposed approach). This can lead to situations where several concurrently running tasks together consume more memory than is available on the node. The result is out-of-memory events, in which some tasks are killed. In such cases, the nextflow job must be resubmitted with the -resume flag. Even this does not guarantee success; a reasonable survey of the individual tasks' requirements, and possibly some fine-tuning, is necessary. The provided nf-core profile therefore activates all kinds of reporting and tracing.

Overview and Conclusions

Cases and Time Results

The following table gives an overview of the different workflow options we used. The nextflow config and the data samples were identical; only the schedulers and the resources used differ in one way or another.

Case Name      | Executor | Nodes | HWT *) | Total Runtime | Remarks                             | Slurm Job Script
nf_loc         | local    | 1     | no     | 13:42:51      | 1x OOM; 1x restart ***)             |
nf_loc_hwt     | local    | 1     | yes    | 11:06:57      |                                     |
nf_flux        | flux     | 1     | no     | -             | 2x OOM; 1x restart (not finished)   |
nf_flux_hwt    | flux     | 1     | yes    | 08:15:11      |                                     |
nf_flux_hwt_2N | flux     | 2     | yes    | 04:45:39      | **)                                 |
nf_flux_hwt_4N | flux     | 4     | yes    | 03:36:09      | **)                                 |
nf_hq          | hq       | 1     | no     | 10:51:05      | hwt was presumably still used by HQ |
nf_hq_hwt      | hq       | 1     | yes    | 10:52:50      |                                     |
nf_hq_hwt_2N   | hq       | 2     | yes    | 05:42:20      |                                     |
nf_hq_hwt_4N   | hq       | 4     | yes    | 04:00:21      |                                     |

Each case was run in a separate directory. Each directory only contained the config file, lrz_cm4.config, the data description file, samplesheet.csv, and the respective Slurm job script, which was submitted from that directory via sbatch <job script>.
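For illustration, a single-node flux case could be driven by a Slurm job script along the following lines. This is only a sketch: the job name, time limit, and omitted partition/account directives are assumptions that must be adapted to the actual allocation.

```shell
#!/bin/bash
#SBATCH -J nf_flux_hwt        # hypothetical job name
#SBATCH --nodes=1
#SBATCH --time=12:00:00
# Partition, cluster, and account directives are omitted; adapt to your setup.

module load micromamba
micromamba activate $SCRATCH_DSS/env_nfcore

# Starting nextflow inside a flux instance sets FLUX_URI, so the profile
# (lrz_cm4.config) automatically selects the 'flux' executor.
flux start nextflow run nf-core/rnaseq \
    -c lrz_cm4.config \
    --input samplesheet.csv \
    --outdir results
```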


*) HWT means that we activated the use of all logical CPU cores (hardware threads). The default on HPC systems is usually that each process runs alone on a physical CPU core, which here comprises two logical CPUs that share the L1 and L2 caches (and some other resources). Less efficient codes, however, can benefit from this hardware threading.
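How hardware threads are requested is site-dependent; one common Slurm directive for this (an assumption here - check the site documentation for the recommended way) is:

```shell
#SBATCH --hint=multithread    # expose all logical CPUs (hardware threads) to the job
```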

For comparability of the run-times, we did NOT use NXF_APPTAINER_CACHEDIR. In each primary run, the containers were thus downloaded during the job, which mostly is not too significant a contribution to the run-time.
In production, however, it is highly recommended to set this environment variable and benefit from container reuse!
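Setting up the container cache for production could look like this (the cache path is just an example):

```shell
# Reuse downloaded containers across runs (path is an example, adapt as needed):
export NXF_APPTAINER_CACHEDIR=$SCRATCH_DSS/apptainer_cache
mkdir -p "$NXF_APPTAINER_CACHEDIR"
```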

**) Slurm registered these runs as FAILED. But the Slurm log indicates a successful nextflow termination. It seems that some flux processes (srun tasks) kept hanging; the reason is not yet clear. Nextflow itself definitely finished successfully.

***) OOM (out-of-memory) events are, in principle, not problematic. They mostly kill only some tasks of the nextflow pipeline. A re-submission of the Slurm job, with the nextflow command extended by -resume, has a good chance to succeed in the second attempt.
Please keep in mind that a perfect pass-through is not necessary. The major goal of the job-farming approach using nextflow is to get the majority of tasks through. Single tasks, or pipelines with single input files, can be processed separately if necessary.
High throughput is the goal!
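A restart after such an event only requires adding -resume to the nextflow command inside the job script before resubmitting it. A sketch, matching the setup of this report:

```shell
# Re-run from the same case directory; completed tasks are taken from the
# cache and only failed or pending tasks are executed again:
nextflow run nf-core/rnaseq -c lrz_cm4.config --input samplesheet.csv -resume
```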

Conclusions

  1. Hardware-threading (i.e. using all available logical CPUs) is beneficial.
  2. Using the Flux scheduler seems to work best.
  3. Flux appears to scale well to more nodes, provided there is sufficient workload (no task-dependency bottlenecks, and sufficiently fine-grained tasks).
    Speed-up on 2 nodes: 1.73 (parallel efficiency: 86.7%)
    Speed-up on 4 nodes: 2.29 (parallel efficiency: 57.3%) ... we suppose that there were not enough tasks that could run concurrently.

We generally consider a parallel efficiency below 70% too wasteful. But for workflows similar to the present one, or with different input files (number of files, different sizes, different content), it is nearly impossible to estimate the total resource requirements in advance.
It is therefore probably advisable to start the analysis on a single node and observe its progress, and only from that judge the total resource requirements. If you estimate needing more than 100,000 CPU-hours, the Linux Cluster is most probably too small. (If you find optimization potential to accelerate the execution of single tasks, you could re-evaluate the situation. But as nf-core users rarely develop the analysis tools themselves, or even compile them, we do not see many possibilities for optimization.)

Additional Explanations:

Speed-up: T1 is the run-time for nf_flux_hwt, T2 that for nf_flux_hwt_2N, T4 that for nf_flux_hwt_4N (say, in seconds).
                   S2 = T1/T2 = 1.73, S4 = T1/T4 = 2.29

Parallel Efficiency: P2 = 100% * S2/2 = 86.7%, P4 = 100% * S4/4 = 57.3%
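These numbers can be reproduced directly from the runtimes in the table above, for instance with a small shell helper (to_sec is a hypothetical name introduced here):

```shell
# Convert HH:MM:SS to seconds, then compute speed-up and parallel efficiency.
to_sec() { echo "$1" | awk -F: '{print $1*3600 + $2*60 + $3}'; }

T1=$(to_sec 08:15:11)   # nf_flux_hwt
T2=$(to_sec 04:45:39)   # nf_flux_hwt_2N
T4=$(to_sec 03:36:09)   # nf_flux_hwt_4N

awk -v t1="$T1" -v t2="$T2" -v t4="$T4" 'BEGIN {
    s2 = t1 / t2; s4 = t1 / t4
    printf "S2 = %.2f, P2 = %.1f%%\n", s2, 100 * s2 / 2   # S2 = 1.73, P2 = 86.7%
    printf "S4 = %.2f, P4 = %.1f%%\n", s4, 100 * s4 / 4   # S4 = 2.29, P4 = 57.3%
}'
```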

Technical Details

The Setup (nextflow, flux, hq, apptainer)

The simplest approach appears to be a conda environment. We created one in order to provide the latest versions of the required software. Everything was placed into a SCRATCH directory in order to spare the HOME file system, but that is not mandatory.

mkdir $SCRATCH_DSS && cd $SCRATCH_DSS
module load micromamba
micromamba create -p $PWD/env_nfcore -c conda-forge -c bioconda nextflow nf-core apptainer flux-core flux-sched

This environment was then activated via micromamba activate $SCRATCH_DSS/env_nfcore whenever needed.

The Config (Profile) Used

lrz_cm4.config
/* ----------------------------------------------------
 * Nextflow config file for the LRZ cm4 cluster
 * ----------------------------------------------------
 */

manifest {
    name = 'LRZ CM4 Configuration'
    author = 'Amit Fenn, Niklas Schandry, Frederik Dröst'
    homePage = 'plantmicrobe.de'
    description = 'Configuration for LRZ CM4 cluster'
}

params {
    // Configuration metadata
    config_profile_name = 'LRZ CM4'
    config_profile_description = 'LRZ CM4 configuration'
    config_profile_contact = 'Amit Fenn (@amitfenn); Niklas Schandry(@nschan)'
    config_profile_url = 'https://doku.lrz.de/job-processing-on-the-linux-cluster-10745970.html/'
//    config_version = '1.0.0'
    // Default output directory (relative to launch directory)
    outdir = 'results'
}

apptainer {
    enabled = true
    autoMounts = true
}

process {

    executor = 
        System.getenv("FLUX_URI") ? // If this is set we are in a flux-in-slurm situation
            'flux' : // Since we only support flux and local approaches, the alternative is local
            'local' 

    resourceLimits = [
        cpus:   System.getenv("SLURM_CPUS_ON_NODE") ? // for <1 node, we use slurm and can use this var
                    System.getenv("SLURM_CPUS_ON_NODE").toInteger() :  // for > 1 node, we use flux, the maximum we can allocate to one job 112 CPU
                    112,
        memory: System.getenv("SLURM_CPUS_ON_NODE") ? // if we are in a slurm job, we assume that MEM-per-CPU is 4.5GB
                    (System.getenv("SLURM_CPUS_ON_NODE").toInteger() * 4500.MB) : 
                    480.GB // if we are not in a slurm job, we are in a node-spanning flux job, and one job can use up to 480.GB (488GB available per node)
    ]
}


trace {
    enabled = true
    overwrite = true
}

report {
    enabled = true
    overwrite = true
}

timeline {
    enabled = true
    overwrite = true
}

dag {
    enabled = true
    overwrite = true
}

params {
    genome = 'GRCh37'
    pseudo_aligner = 'salmon'
}

Please note that, in the future, nextflow/nf-core can simply be started with the -profile lrz_cm4 option; once the profile is published, a download is no longer necessary.

This profile also contained a module load apptainer. This had to be commented out, as it overrode the conda environment's apptainer and caused problems.

The Hyperqueue (hq) executor is not included in this profile. It can be activated like any other executor via -process.executor=hq on nextflow's command line. HQ is also not available as a conda package, but installation is rather simple: download and unpack an archive that contains a single executable.
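The installation could look roughly like this. The version number and download URL are assumptions; check the HyperQueue release page for the current release before copying.

```shell
# Sketch: fetch and unpack the single-binary hq release (version/URL assumed).
cd $SCRATCH_DSS
wget https://github.com/It4innovations/hyperqueue/releases/download/v0.19.0/hq-v0.19.0-linux-x64.tar.gz
tar xzf hq-v0.19.0-linux-x64.tar.gz    # yields the single executable 'hq'
export PATH="$PWD:$PATH"
hq --version
```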

The params section with the genome and pseudo_aligner entries is not part of the profile. These params should instead be specified on nextflow's command line, to select the desired analysis. Please check nextflow's documentation on that.
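A complete invocation matching this report's setup could then look as follows (a sketch; adapt paths and parameters to the analysis at hand):

```shell
nextflow run nf-core/rnaseq \
    -c lrz_cm4.config \
    --input samplesheet.csv \
    --genome GRCh37 \
    --pseudo_aligner salmon \
    --outdir results
```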

The Data Sample Used

samplesheet.csv
sample,fastq_1,fastq_2,strandedness
GM12878_REP1,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_2.fastq.gz,reverse
GM12878_REP2,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_2.fastq.gz,reverse
K562_REP1,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_2.fastq.gz,reverse
K562_REP2,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_2.fastq.gz,reverse
MCF7_REP1,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_2.fastq.gz,reverse
MCF7_REP2,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_2.fastq.gz,reverse
H1_REP1,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_2.fastq.gz,reverse
H1_REP2,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_2.fastq.gz,reverse
GM12878_REP3,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_2.fastq.gz,reverse
GM12878_REP4,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_2.fastq.gz,reverse
K562_REP3,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_2.fastq.gz,reverse
K562_REP4,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_2.fastq.gz,reverse
MCF7_REP3,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_2.fastq.gz,reverse
MCF7_REP4,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_2.fastq.gz,reverse
H1_REP3,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_2.fastq.gz,reverse
H1_REP4,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_2.fastq.gz,reverse
GM12878_REP5,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_2.fastq.gz,reverse
GM12878_REP6,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_2.fastq.gz,reverse
K562_REP5,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_2.fastq.gz,reverse
K562_REP6,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_2.fastq.gz,reverse
MCF7_REP5,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_2.fastq.gz,reverse
MCF7_REP6,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_2.fastq.gz,reverse
H1_REP5,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_2.fastq.gz,reverse
H1_REP6,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_1.fastq.gz,s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_2.fastq.gz,reverse