We have announced a maintenance window for both clusters,
ALCC and LiCCA, to update Slurm to version 25.11.
One of the main reasons is the improved
GPU allocation for Slurm jobs,
which is broken in the current version 25.05.
We might still have to adjust the Slurm configuration
for GPU job handling in the days following the update,
which may mean draining the partitions and restarting
the Slurm daemons again.
We will, at least temporarily, lower the TimeLimit
in the GPU partitions from 3 to 2 days.
This might be inconvenient for users with long-running jobs,
but it is a better alternative to cancelling or killing jobs
when the system has to be restarted.
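For users who want to check the time requests in their GPU job scripts against the lowered limit, here is a minimal sketch in Python. It assumes the new limit is exactly 2 days and that requests use the time formats documented for the sbatch --time option; the function names slurm_time_to_minutes and fits_gpu_limit are hypothetical helpers, not part of Slurm.

```python
def slurm_time_to_minutes(spec: str) -> float:
    """Convert a Slurm time string to minutes.

    Accepted forms, as documented for sbatch --time:
    "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes", "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in spec:
        # With a day prefix the remainder reads hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unsupported time format: {spec!r}")
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unsupported time format: {spec!r}")
    return days * 24 * 60 + hours * 60 + minutes + seconds / 60


def fits_gpu_limit(requested: str, limit: str = "2-00:00:00") -> bool:
    """Return True if the requested time does not exceed the limit.

    The default limit of 2 days is an assumption taken from this
    announcement; the actual partition setting may change again.
    """
    return slurm_time_to_minutes(requested) <= slurm_time_to_minutes(limit)


if __name__ == "__main__":
    for request in ("3-00:00:00", "48:00:00", "2-12", "90"):
        verdict = "ok" if fits_gpu_limit(request) else "exceeds limit"
        print(f"--time={request}: {verdict}")
```

A job script that still requests 3 days (for example --time=3-00:00:00) would exceed the lowered limit, so the request needs to be reduced to 2-00:00:00 or less.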
Since the last major upgrade of both clusters, ALCC
and LiCCA, in July, we have observed problems with
Slurm jobs allocating GPUs and with our Slurm accounting
database. The recent Slurm release (version 25.11) should
fix these problems.
Maintenance schedule:
- Friday, 28 November, 9:00: all partitions are set to drain
- Monday, 1 December, 9:00: start of the Slurm update
-- GPU partitions drained
-- CPU partitions draining; running jobs continue,
but job survival is not guaranteed
- Monday, 1 December: we plan to resume all partitions by 18:00
- Login nodes will not be available to users until
the maintenance is finished.