General Information
Quantum ESPRESSO (QE) is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.
Discover available versions of Quantum ESPRESSO
```bash
ml av qe
```
Running Quantum ESPRESSO
QE is compiled with the Intel Classic compiler and linked against the libraries
- wannier90
- HDF5
- libXC
and can be parallelized using OpenMP, MPI, or a combination of the two.
Note that OpenMP threading tends to perform worse than MPI for QE. Prefer an MPI-heavy setup and add OpenMP threads only with care: they can easily make a run very inefficient and only rarely give a marginal benefit, see the benchmarks below.
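If you nevertheless want to experiment with hybrid MPI/OpenMP, the essential step is to match OMP_NUM_THREADS to the Slurm allocation. A minimal sketch of the relevant lines (the values mirror the 32-tasks/2-threads benchmark row below and are not a general recommendation; the input file job.in is the same placeholder as in the sample job):

```bash
# Hybrid MPI/OpenMP sketch: 32 MPI tasks with 2 OpenMP threads each on one node.
#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=2

# Let each MPI task spawn as many OpenMP threads as CPUs were allocated to it.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun pw.x < job.in > job-${SLURM_JOB_ID}.out
```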
Tips & Tricks
- Familiarize yourself with QE's levels of parallelization (see the example after this list).
- When the memory of a single node is a concern (while using all available CPU cores), try increasing the number of nodes. QE automatically distributes calculations and data structures across all tasks, so the memory per task decreases as the number of tasks grows.
- Pay attention to notes and warnings regarding parallelization at the beginning of the QE output. If the number of tasks or nodes is not appropriate, QE will emit warnings. Do not run such ill-parallelized calculations! Typical warnings are:
- WARNING: too many processors for an effective parallelization!
- suboptimal parallelization: some nodes have no k-points!
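As an illustration of these parallelization levels, k-point pools and the ScaLAPACK diagonalization group can be requested directly on the pw.x command line. The values below are placeholders, not recommendations; choose them according to your system and the QE documentation:

```bash
# Sketch: split 64 MPI tasks into 8 k-point pools (-nk) of 8 tasks each and
# use a 4-task ScaLAPACK group (-nd) for the diagonalization.
# The total number of tasks must be divisible by -nk; -nd must be a square number.
srun pw.x -nk 8 -nd 4 < job.in > job-${SLURM_JOB_ID}.out
```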
Sample Slurm Job
For QE the recommendation is to request relative resources (*-per-*=) and to first increase --tasks-per-node up to 128 before increasing --nodes, in order to avoid inter-node communication, which is slower than intra-node communication.
```bash
#!/usr/bin/env bash
#SBATCH --job-name=qe
#SBATCH --partition=epyc
#SBATCH --nodes=1
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1    # recommended
#SBATCH --mem-per-cpu=4G
#SBATCH --mail-type=END,INVALID_DEPEND,TIME_LIMIT
# replace the email with your personal one in order to receive mail notifications:
#SBATCH --mail-user=noreply@physik.uni-augsburg.de
#SBATCH --time=1-0

ml purge
ml load qe/7.3

srun pw.x < job.in > job-${SLURM_JOB_ID}.out
```
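Save the script under a name of your choice (e.g. job.slurm, a placeholder) and submit it with sbatch job.slurm. The input file job.in is expected in the submission directory, as in the srun line above.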
Running Quantum ESPRESSO on GPU
A separate module (look out for the red built for GPU (g) flag when using ml av qe) provides a version of QE compiled with the NVHPC compiler, which allows running most (but not all) functionalities of QE on GPUs.
Note on GPU efficiency vs CPU
According to the QE developers, running QE on GPUs can reduce the computational time by a factor of 2-3, so don't expect too much benefit. According to our own measurements (see below), one A100 GPU is about three times faster than a single 64-core CPU, i.e. one A100 roughly equals 1.5 CPU nodes. In contrast, scaling to more GPUs has proven inefficient (a waste of resources), while scaling to more CPUs works much better.
Not all parts of QE are ported to GPUs. Running such not-yet-ported parts of QE on GPU nodes is not allowed. Please check the GPU efficiency after running small test calculations.
```bash
#!/usr/bin/env bash
#SBATCH --job-name=qe
#SBATCH --partition=epyc-gpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=2   # (between 1-3 for epyc-gpu nodes)
#SBATCH --cpus-per-task=1    # recommended
#SBATCH --gpus-per-task=1    # recommended
#SBATCH --mem-per-cpu=4G
#SBATCH --mail-type=END,INVALID_DEPEND,TIME_LIMIT
# replace the email with your personal one in order to receive mail notifications:
#SBATCH --mail-user=noreply@physik.uni-augsburg.de
#SBATCH --time=1-0

ml purge
ml load qe/7.3-ompi4.1-nvhpc24.1

srun pw.x < job.in > job-${SLURM_JOB_ID}.out
```
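One way to spot-check the GPU utilization of a running job is to attach a monitoring step to its allocation. This assumes a Slurm version that supports --overlap and that nvidia-smi is available on the GPU node; <JOBID> is a placeholder for your job ID:

```bash
# Run nvidia-smi inside the allocation of the running job to see GPU load and memory use.
srun --jobid=<JOBID> --overlap nvidia-smi
```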
Pseudopotentials
All potentials of the SSSP library have been made available via the environment variable PSEUDO_DIR as part of the modules. Several different versions of the SSSP potentials are located in their respective $SUBDIR folders:
Name | $SUBDIR |
---|---|
SSSP PBE Efficiency v1.3.0 | SSSP_1.3.0_PBE_efficiency |
SSSP PBE Precision v1.3.0 | SSSP_1.3.0_PBE_precision |
SSSP PBEsol Efficiency v1.3.0 | SSSP_1.3.0_PBEsol_efficiency |
SSSP PBEsol Precision v1.3.0 | SSSP_1.3.0_PBEsol_precision |
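To use one of these sets in a calculation, the job script can, for example, point QE's default pseudopotential search path at the corresponding subdirectory. This relies on PSEUDO_DIR being set by the module as described above; ESPRESSO_PSEUDO is QE's standard environment variable for the default pseudo_dir, and an explicit pseudo_dir in the &CONTROL namelist of the input would override it:

```bash
# After loading the qe module, select e.g. the SSSP PBE Efficiency set:
export ESPRESSO_PSEUDO="${PSEUDO_DIR}/SSSP_1.3.0_PBE_efficiency"
```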
Benchmarks
AUSURF112 (CPU, small size test case)
Nodes | Tasks-per-Node | OMP-Threads | Time | Efficiency* | Comment |
---|---|---|---|---|---|
1 | 1 | 1 | 1h 9m | 234% | |
1 | 2 | 1 | 35m57.86s | 224% | |
1 | 4 | 1 | 19m22.61s | 208% | |
1 | 8 | 1 | 10m57.45s | 184% | |
1 | 16 | 1 | 6m21.78s | 158% | |
1 | 32 | 1 | 3m43.01s | 136% | |
1 | 64 | 1 | 2m31.15s | 100% | |
1 | 1 | 64 | 10m5.11s | 25% | Don't do it! |
1 | 128 | 1 | 1m26.30s | 88% | |
2 | 128 | 1 | 1m 2.14s | 61% |
*Normalized to a full socket (64 cores). In runs using fewer cores the remaining cores were idle and the CPU clock was higher, which yields apparent efficiencies of around 200% and more for a serial calculation. Since this effect is hard to separate, the reference (100%) is a full socket.
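For reference, the efficiency values in the CPU tables are consistent with normalizing the consumed core time to the 64-core (one socket) reference run:

$$\text{Efficiency} = \frac{T_{64}\cdot 64}{T\cdot N_\text{cores}}, \qquad N_\text{cores} = \text{nodes}\times\text{tasks-per-node}\times\text{OMP-threads}$$

For example, for the 128-task row: (151.15 s × 64) / (86.30 s × 128) ≈ 88%.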
AUSURF112 (GPU, small size test case)
Nodes | Tasks-per-Node | GPU-sbatch-Line | Time | Efficiency | Comment |
---|---|---|---|---|---|
1 | 1 | --gpus-per-task=1 | 30.11s | 100% | |
1 | 1 | --gpus-per-task=2 | 29.10s | <50% | Not worth it. |
1 | 2 | --gpus-per-task=1 | 29.55s | <50% | Not worth it. |
1 | 2 | --gpus-per-node=2 | 29.52s | <50% | Not worth it. |
GRIR443 (CPU, medium size test case)
Nodes | Tasks-per-Node | OMP-Threads | Time | Efficiency | Comment |
---|---|---|---|---|---|
1 | 64 | 1 | 44m33.31s | 100% | |
1 | 32 | 2 | 42m22.94s | 105% | |
1 | 16 | 4 | 47m24.97s | 94% | |
1 | 128 | 1 | 25m15.45s | 88% | |
1 | 64 | 2 | 26m35.25s | 84% | |
1 | 32 | 4 | 27m53.03s | 80% | |
2 | 128 | 1 | 12m25.88s | 90% | |
2 | 64 | 2 | 13m11.59s | 84% | |
2 | 32 | 4 | 41m 8.73s | 27% | no WARNING but very inefficient |
4 | 128 | 1 | 6m 5.27s | 91% | |
4 | 64 | 2 | 6m40.83s | 83% | |
4 | 32 | 4 | 23m 7.58s | 24% | no WARNING but very inefficient |
8 | 128 | 1 | 3m35.34s | 78% | |
8 | 64 | 2 | 3m21.95s | 83% | |
8 | 32 | 4 | 11m16.31s | 25% | no WARNING but very inefficient |
16 | 128 | 1 | 8m17.43s | 17% | WARNINGS |
GRIR443 (GPU, medium size test case)
Nodes | Tasks-per-Node | GPU-sbatch-Line | Time | Efficiency | Comment |
---|---|---|---|---|---|
1 | 1 | --gpus-per-task=1 | - | - | Out of GPU-Memory |
1 | 2 | --gpus-per-task=1 | 8m18.45s | 100% | |
1 | 1 | --gpus-per-task=2 | - | - | Out of GPU-Memory |
1 | 2 | --gpus-per-node=2 | 8m13.33s | 101% | Each MPI Task will see 2 GPUs, hardly a benefit. |
2 | 2 | --gpus-per-task=1 | 5m32.41s | 75% | Not worth it. |
4 | 2 | --gpus-per-task=1 | 4m43.30s | 44% | Not worth it. |
Support
If you have any problems with Quantum ESPRESSO, please contact the IT-Physik team (preferred) or the HPC-Servicedesk.