News

ALCC and LiCCA maintenance

Both clusters ALCC and LiCCA are back online.

We announced a maintenance window for both clusters
ALCC and LiCCA to update Slurm to version 25.11.
One of the main reasons is the improved
GPU allocation for Slurm jobs,
which is broken in the currently installed version 25.05.

We might still have to adjust the Slurm configuration
for GPU job handling in the days following the update,
which may require draining partitions and restarting
the Slurm daemons again.

We will, at least temporarily, lower the TimeLimit
in the GPU partitions from 3 to 2 days.
This might cause some inconvenience for users running long jobs,
but it is a better alternative than cancelling/killing jobs
because of required restarts of the system.
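
With the lowered limit, jobs in the GPU partitions should request at most two days of walltime. A minimal batch script sketch (the partition name, GPU count and program are placeholders, not the actual configuration of ALCC or LiCCA):

    #!/bin/bash
    #SBATCH --job-name=gpu-job
    #SBATCH --partition=<gpu-partition>   # placeholder: use the GPU partition name shown by sinfo
    #SBATCH --gres=gpu:1                  # request one GPU
    #SBATCH --time=2-00:00:00             # stay within the temporarily lowered 2-day TimeLimit
    #SBATCH --cpus-per-task=8

    module load cuda                      # resolves to the default cuda module
    srun ./my_gpu_program                 # placeholder executable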

Since the last major upgrade of both clusters ALCC
and LiCCA in July, we have observed problems with
Slurm jobs allocating GPUs and with our Slurm accounting
database. The recent Slurm update (version 25.11) should
fix these problems.

Maintenance schedule:

- Friday, 28 November, 9:00: all partitions are set to drain
- Monday, 1 December, 9:00: start of the Slurm update
-- GPU partitions drained
-- CPU partitions draining; running jobs continue, but job survival is not guaranteed
- Monday, 1 December: we plan to resume all partitions by 18:00
- Login nodes will not be available to users until the maintenance is finished.
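
During the draining phase, the usual Slurm commands show whether a partition still accepts jobs and what state the nodes are in, for example:

    sinfo                    # overview of partitions and node states (idle, alloc, drain, drng, ...)
    sinfo -p <partition>     # state of a single partition (placeholder name)
    squeue -u $USER          # your own running and pending jobs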

The July module updates and deprecations have been rolled out today. Please look out for deprecation warnings in your Slurm output.
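
The deprecation warnings end up in the job output file (by default slurm-<jobid>.out in the submit directory), so a simple grep finds affected jobs, for example:

    grep -il deprecat slurm-*.out    # list output files that contain a deprecation notice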

After the maintenance and the upgrade to Ubuntu 24.04, there are two major changes:

- The default CUDA version is now v12.8, since this is what the NVIDIA driver natively supports.
- The intel/2023 compilers need a compatible GNU compiler. Unfortunately, the intel/2023 compilers are not compatible with gcc v13, which is the new Ubuntu default. When loading intel/2023, the gcc v11.5 compilers are therefore loaded as well (without overriding the CC, CXX or FC environment variables); see the short sketch below.
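
A short sketch of what this looks like in a shell session (module names as listed below; the exact output of the commands may differ):

    module load intel/2023    # also loads the gcc v11.5 compilers as the compatible GNU toolchain
    module list               # should now show both the intel/2023 and a gcc/11.5 module
    gcc --version             # reports 11.5; CC, CXX and FC (if set) are not changed by this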

Also, please be aware of the following module changes, which have just been deployed (if default appears at the end of an entry, that version is the new default!). An example of pinning a specific version and of loading compiler-dependent modules follows after the lists.

New/updated scientific Modules:

cp2k/2025.2-ompi5.0-cuda12.8-gcc13.2
cp2k/2025.2-ompi5.0-gcc13.2 (default)
comsol/6.3.0.335 (default)
elk/10.5.16-impi2021.10-intel2023.2 (default)
gromacs/2025.2-ompi5.0-gcc13.2-mkl2023.2-cuda12.9
lammps/20240829.4-ompi5.0-cuda12.9-gcc13.2
lammps/20240829.4-ompi5.0-gcc13.2 (default)
lammps/20250722.0-ompi5.0-cuda12.9-gcc13.2
lammps/20250722.0-ompi5.0-gcc13.2
mathematica/14.2.1 (default)
orca/6.1.0 (default)
qchem/6.3.0 (default)
qe/7.4.1-impi2021.10-intel2023.2 (default)
qe/7.4.1-ompi4.1-nvhpc24.1
siesta/5.4.0-ompi5.0-cf (default)
vasp6/6.5.1-impi2021.10-intel2023.2 (default)
vasp6/6.5.1-cuda12.3-ompi4.1-nvhpc24.1
vasp6/python3.12/6.5.1-impi2021.10-intel2023.2

New/updated common Modules:

cmake/3.31.8 (default)
cmake/4.0.3
cuda-compat/12.9.1
cuda/12.8.1 (default, in line with the CUDA level of the Nvidia driver)
cuda/12.9.1
emacs/30.1 (default)
gdrcopy/2.5
meson/1.8.3 (default)
micromamba/2.3.0 (default)
nccl/cu12.8/2.26.2
ninja/1.13.2 (default)
parallel/20250622 (default)
pmix/5.0.7 (default)
R/4.4.3-cf (default)
ucc/cu11x/1.4.4 (default)
ucc/cu12x/1.4.4 (default)
ucx/cu11x/1.19.0 (default)
ucx/cu12x/1.19.0 (default)

New/updated library Modules:

hdf5/1.14.6 (for compilers gcc/9.5, gcc/11.5, gcc/13.2, intel/2021.4, intel/2023.2, intel/2024.2, nvhpc/24.1) (default)
libxc/7.0.0 (for compilers gcc/13.2, intel/2023.2, intel/2024.2) (default)
openblas/lp64/0.3.30 (for compilers gcc/9.5, gcc/11.5, gcc/13.2) (default)
openblas/ilp64/0.3.30 (for compilers gcc/9.5, gcc/11.5, gcc/13.2)
gmp/6.3.0 (for compilers gcc/13.2) (default)
sqlite3/3.50.4 (for compilers gcc/9.5, gcc/11.5, gcc/13.2) (default)
tblite/0.4.0 (for compilers gcc/13.2) (default)

New/updated MPI Modules:

openmpi/4.1.8 (for compilers gcc/9.5, gcc/11.5, gcc/13.2, intel/2021.4, intel/2023.2, intel/2024.2) (default)
openmpi/5.0.8 (for compilers gcc/9.5, gcc/11.5, gcc/13.2, intel/2021.4, intel/2023.2, intel/2024.2) (default)
hdf5/1.14.6 (for compilers gcc/9.5, gcc/11.5, gcc/13.2, intel/2021.4, intel/2023.2, intel/2024.2, nvhpc/24.1) (default)
netcdf/c/4.9.3 (for compilers gcc/13, intel/2023.2, intel/2024.2) (default)
netcdf/fortran/4.6.2 (for compilers gcc/13, intel/2023.2, intel/2024.2) (default)
pnetcdf/1.14.0 (for compilers gcc/13, intel/2023.2, intel/2024.2) (default)
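
As noted above, loading a module without a version resolves to the new default, while giving an explicit version pins it; compiler- (and MPI-) dependent modules such as hdf5 or openmpi follow the compiler and MPI modules loaded first. A sketch, assuming this hierarchical layout:

    module load cmake                 # resolves to the new default cmake/3.31.8
    # module load cmake/4.0.3         # or pin a specific version instead of the default

    module load gcc/13.2              # compiler first ...
    module load openmpi/5.0.8         # ... then a matching MPI ...
    module load hdf5/1.14.6           # ... then compiler/MPI-dependent libraries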


If you experience problems with any module, please let us know! 

The clusters will be undergoing a one-week maintenance shutdown from July 7th to July 11th. During this time, the system will be completely unavailable. We kindly ask that you take this into consideration when planning your computational tasks, and we appreciate your understanding and cooperation.

Reason for the Maintenance:

The reason for this planned maintenance is to implement several critical upgrades and improvements to the clusters' infrastructure. These updates are designed to enhance both the stability and security of the system. The maintenance tasks are outlined below:

1. Update to Ubuntu 24.04 on Compute and Management Nodes

To maintain compatibility with the latest software and to benefit from long-term support, all compute and management nodes of both clusters will be upgraded to Ubuntu 24.04 LTS. This update will include new features, security patches, and enhancements that improve the stability and performance of the clusters. The upgrade will also ensure the system remains in a supported state, with access to the latest bug fixes and software optimizations.

2. HPC Data Cluster Update

The HPC data cluster file system will be upgraded to the latest supported version. This upgrade includes several improvements in performance, security, and overall system efficiency. By updating the cluster, we ensure that it remains in line with current best practices for HPC environments, enhancing both reliability and compatibility with the latest applications and workloads.

3. Upgrade to Slurm 25.05

We will also be upgrading the Slurm workload manager to version 25.05. This new version includes several significant improvements, including performance enhancements, new features for job scheduling, and bug fixes that address known issues. The upgrade will help improve the overall efficiency of job queuing and resource management within the cluster, enabling you to achieve better performance and usability.
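
Once the clusters are back, the installed Slurm version can be checked directly from a login node, e.g.:

    sinfo --version    # should report slurm 25.05.x after the upgrade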

4. Re-cabling and Adjusting of the Power Supply

Since the addition of new high-power consumption nodes to the cluster, it has become necessary to balance the overall power usage more efficiently. This requires a complete re-cabling and re-plugging of the power supply lines to ensure a more stable distribution of power across the system. As a result, the cluster will need to be shut down to perform this task safely. This operation is essential for optimizing the system’s power management and preventing potential overloads.

5. Network Isolation for Enhanced Security

As part of our ongoing efforts to enhance the security of the HPC cluster, we will be implementing additional isolation within the cluster's data network. This step is critical to protect sensitive data and improve the overall integrity of the network infrastructure. To achieve this, we will be performing re-cabling and reconfiguration of the network setup. This new network architecture will provide better isolation between internal and external traffic, mitigating any potential security risks.

Expected Downtime and Impact:

Please be aware that the maintenance period will result in complete downtime for the entire cluster. No jobs or tasks will be able to run during this time, and access to the system will be temporarily disabled. We recommend that you complete any critical tasks or jobs before the maintenance window begins.

Planned schedule:

- Sunday, July 6th, 12:00: Slurm stops accepting new jobs
- Monday, July 7th, 10:00: both clusters will be shut down; running jobs will be terminated, pending jobs will be removed
- Monday, July 7th, 10:30: start of maintenance
- Friday, July 11th, 17:00: end of maintenance, clusters are back in working mode

As soon as the maintenance is finished successfully, you will be informed directly by email via this mailing list.

What you need to do:

  • If you have any data or active jobs on the cluster, please ensure that they are saved or completed before Monday, July 7th, 10:00 (a short command sketch follows after this list).
  • Please refrain from submitting new jobs or launching tasks after Sunday, July 6th, 12:00.
  • If you need additional support or have specific questions about the maintenance process, please reach out to us at your earliest convenience.
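
A few standard Slurm commands that help with this clean-up (job IDs are placeholders):

    squeue -u $USER      # list your running and pending jobs
    scancel <jobid>      # cancel a single job that cannot finish before the shutdown
    scancel -u $USER     # or cancel all of your remaining jobs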

We understand that planned downtime can cause some disruption, and we apologize for any inconvenience this may cause. Our team is committed to completing this work as efficiently as possible to minimize the impact on your research and work.