Slurm
- Slurm 101
- Slurm Queues
- Submitting Serial Jobs
- Submitting Interactive Jobs
- Submitting Parallel Jobs (MPI/OpenMP)
- Submitting GPU Jobs
- Submitting Array Jobs and Chain Jobs
- Handling Jobs running into TIMEOUT
- Accessing Web Interfaces (e.g. JupyterLab, Ray) via SSH Tunnels
- Exclusive jobs for benchmarking
- Controlling the environment of a Job
FAQ and Troubleshooting
- How do I register myself to use the HPC resources?
- How do I get access to the LiCCA or ALCC resources?
- What kind of resources are available on LiCCA?
- What kind of resources are available on ALCC?
- How do I acknowledge the usage of HPC resources on LiCCA in publications?
- How do I acknowledge the usage of HPC resources on ALCC in publications?
- What Slurm Partitions (Queues) are available on LiCCA?
- What Slurm Partitions (Queues) are available on ALCC?
- What is Slurm?
- How do I use the Slurm batch system?
- How do I submit serial calculations?
- How do I run multithreaded calculations?
- How do I run parallel calculations on several nodes?
- How do I run GPU based calculations?
- How do I check the current Slurm schedule and queue?
- Is there some kind of Remote Desktop for the cluster?
- What if I have a question which is not listed here?
- What if I want to report a problem?
- Which version of Python can be used?
- Which should I use: Anaconda, Miniconda, Miniforge, or Micromamba?
- How do I monitor live CPU/GPU/memory/disk utilization?
- How do I check my GPFS filesystem usage and quota situation?
As you know, Docker containers cannot be run natively in HPC environments, but they can easily be converted and executed using Apptainer (Documentation).
Currently, an outdated Apptainer installation is available without loading any module.
This installation will be removed on 30 November!
Please change your workflows to use the more recent apptainer modules!
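The conversion is a single pull. A minimal job-script sketch (the partition name and container image are placeholders, not cluster defaults; `apptainer pull` and `apptainer exec` are standard Apptainer CLI commands):

```shell
#!/bin/bash
#SBATCH --job-name=apptainer-demo
#SBATCH --partition=epyc          # placeholder: use a partition you have access to
#SBATCH --time=00:15:00

# Use a current apptainer module instead of the system-wide
# installation, which is scheduled for removal.
module load apptainer/1.3.5

# Convert (pull) a Docker image into a SIF file, then run a command in it.
apptainer pull --force ubuntu.sif docker://ubuntu:22.04
apptainer exec ubuntu.sif cat /etc/os-release
```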
The November module updates and deprecations have been rolled out today. Please look out for deprecation warnings in your Slurm output.
Most notable new modules:
Common
aocc/5.0.0
aocl/ilp64/5.0.0
aocl/lp64/5.0.0
anaconda/2024.10
apptainer/1.3.5
cudnn/cu11x/9.5.1.17
cudnn/cu12x/9.5.1.17
micromamba/2.0.3
openjdk/8.u432-b06
openjdk/11.0.25+9
openjdk/17.0.13+11
openjdk/21.0.5+11
Scientific
gromacs/2024.4-ompi5.0-gcc13.2-mkl2023.2-cuda12.6
orca/6.0.1
siesta/5.2.0-ompi4.1-cf
The September module updates and deprecations have been rolled out yesterday. Please look out for deprecation warnings in your Slurm output.
Most notable new modules:
- cuda-compat: modules for better compatibility when using CUDA toolkits > 12.2 (the cuda module needs to be loaded first), see also https://collab.dvb.bayern/x/3PxdFw#NvidiaCUDAToolkit-CUDAToolkitinteroperability
- cuda-mps: module for automatically starting and stopping the CUDA Multi-Process Service (highly recommended for code that cannot saturate A100 GPUs!), see also https://collab.dvb.bayern/x/m-xdFw#SubmittingGPUJobs-CUDAMulti-Process-Serverformaximumefficiency
- intel/2024.2.1 modules have been installed; intel/2024.1.0 is now deprecated, intel/2023.2.1 is still the default (see also here)
- gcc/11.5.0 modules have been installed; gcc/11.4.0 is now deprecated
- orca/6.0.0 is available now, see also https://collab.dvb.bayern/x/yfxdFw
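For GPU jobs, the two new CUDA helper modules can be combined in a batch script. A sketch, assuming the LiCCA epyc-gpu partition; the GPU request syntax and application name are placeholders, and the cuda module is loaded first as the notes above require:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-mps-demo
#SBATCH --partition=epyc-gpu      # GPU partition on LiCCA
#SBATCH --gres=gpu:a100:1         # placeholder: request one A100 GPU
#SBATCH --time=02:00:00

module load cuda                  # the CUDA toolkit must be loaded first
module load cuda-compat           # compatibility for CUDA toolkits > 12.2
module load cuda-mps              # starts/stops CUDA Multi-Process Service automatically

srun ./my_gpu_program             # placeholder for your application
```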
Since Friday, 30.08., a quota notification system has been active. Users will get an e-mail message when their quota is exceeded. More information in our Knowledge Base: Quota regulations
On LiCCA a separate partition epyc-gpu-test has been created, and node licca047 with its 3 A100 GPUs has been moved to this partition. The TimeLimit in this partition is 6 hours, giving users the possibility to test with short job runs while the bigger epyc-gpu partition is loaded with longer-running jobs. If the partition sees little use, we will move the GPUs (partially) back. All projects and users with GPU resources are automatically granted access to this partition.
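For short interactive tests on the new partition, something like the following should work (the GPU request syntax is an assumption; `--pty` is standard srun usage):

```shell
# Request one GPU on the test partition for 30 minutes and open a shell
srun --partition=epyc-gpu-test --gres=gpu:1 --time=00:30:00 --pty bash

# Inside the allocation, verify the GPU is visible
nvidia-smi
```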
COMSOL 6.2 has been installed in the HPC cluster filesystem.
login node shell> ml load comsol/6.2
Loading Comsol 6.2
6.1 will still be the default for a few days; please give feedback if the 6.2 installation still has problems. 6.2 will become the default in 1 or 2 weeks.
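Once the module is loaded, COMSOL can also be run non-interactively in a batch job. A sketch (file names and the resource request are placeholders; `comsol batch -inputfile ... -outputfile ...` is the standard COMSOL batch invocation):

```shell
#!/bin/bash
#SBATCH --job-name=comsol-demo
#SBATCH --cpus-per-task=8         # placeholder resource request
#SBATCH --time=04:00:00

ml load comsol/6.2

# Solve model.mph and write the result to out.mph, using the allocated cores
comsol batch -np ${SLURM_CPUS_PER_TASK} -inputfile model.mph -outputfile out.mph
```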
The HPC clusters LiCCA and ALCC are being used with increasing intensity, and the load on the power distribution is rising. Stronger power cables have been laid for the power infrastructure supporting the HPC clusters.
A shutdown of all compute nodes is scheduled for:
Tuesday, May 14, 7:00
We will start to drain the queues on Saturday, May 11.
To the best of our knowledge the work on the
electrical system will be finished the same day,
so the clusters should be back on Wednesday, May 15.
We are migrating the Slurm database instance (serving both the ALCC and the LiCCA cluster) to a different system starting today.
Slurm operation is planned to stay up during this time.
This should speed up Slurm operations after the migration, and is also needed in preparation for the ALCC upgrade from Ubuntu 20.04 to 22.04, which will happen in the next few weeks.
We are proud to announce the availability of LiCCA, a compute resource focused on research, open to members of the University of Augsburg.
Access to LiCCA is possible after registering your chair or working group for an HPC project. The complete application workflow is described in the HPC Knowledge Base, as well as the cluster hardware and setup.
Questions and problems that are not solved by the HPC Knowledge Base can be addressed to the Service Desk at the Service- & Supportportal or by e-mail.
Happy computing, the RZ HPC team