High Performance Computing
Forgot your password? click here
Add a new user (SuperMUC-NG only)? click here
Add a new IP address (SuperMUC-NG only)? click here
How to write good LRZ Service Requests? click here
How to set up two-factor authentication (2FA) on HPC systems? click here
End of Life: CoolMUC-2 and CoolMUC-3 will be switched off on Friday, December 13th
New: Virtual "HPC Lounge" to ask questions and get advice. Every Wednesday, 2:00pm - 3:00pm
For details and Zoom Link see: HPC Lounge
System Status (see also: Access and Overview of HPC Systems)
GREEN = fully operational, YELLOW = operational with restrictions (see messages below), RED = not available (see messages below)
Höchstleistungsrechner (SuperMUC-NG)
login nodes: skx.supermuc.lrz.de (LOGIN)
archive nodes: skx-arch.supermuc.lrz.de (ARCHIVE)
File Systems
Partitions/Queues: FAT TEST
Detailed node status

Details:
Submit an Incident Ticket for the SuperMUC-NG
Add new user? click here
Add new IP? click here
Questions about 2FA on SuperMUC-NG? click here
Linux Cluster

| Cluster segment | Component | Status |
|---|---|---|
| CoolMUC-2 (see messages below) | lxlogin(1,2,3,4).lrz.de | UP |
| | serial partition serial_std | MOSTLY UP |
| | serial partition serial_long | MOSTLY UP |
| | parallel partitions cm2_(std,large) | DEAD FOR GOOD |
| | cluster cm2_tiny | UP |
| | interactive partition cm2_inter | DEAD FOR GOOD |
| | c2pap | UP |
| | C2PAP work filesystem /gpfs/work | READ-ONLY |
| CoolMUC-3 | lxlogin(8,9).lrz.de | NO ACCESS |
| | parallel partition mpp3_batch | MOSTLY UP |
| | interactive partition mpp3_inter | UP |
| CoolMUC-4 | lxlogin5.lrz.de | UP |
| | interactive partition cm4_inter_large_mem | MOSTLY UP |
| Others | teramem_inter | UP |
| | kcs | PARTIALLY UP |
| | biohpc | MOSTLY UP |
| | hpda | UP |
| File Systems | HOME | ISSUES |
Compute Cloud and other HPC Systems

| System | Status |
|---|---|
| Compute Cloud (https://cc.lrz.de), detailed status: Status | UP |
| LRZ AI Systems | UP |
DSS Storage systems

For the status overview of the Data Science Storage please go to https://doku.lrz.de/display/PUBLIC/Data+Science+Storage+Statuspage
Messages
see also: Aktuelle LRZ-Informationen / News from LRZ
Messages for all HPC Systems

A new software stack (spack/23.1.0) is available on CoolMUC-2 and SuperMUC-NG. See the Release Notes of the Spack/23.1.0 Software Stack.
Messages for SuperMUC-NG

Maintenance finished. The system is back in operation.
Messages for Linux Clusters

lxlogin8 no longer accessible for users
The login node lxlogin8 is no longer accessible to users. CoolMUC-3 SLURM jobs can be submitted from any functional login node (lxlogin1,2,3,4) for the remaining lifetime of CoolMUC-3, until 13th December 2024 at the latest.
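For illustration, a minimal sketch of submitting such a batch job from one of the functional login nodes. The job script name is a placeholder, the partition name is taken from the status table above, and any additional cluster flags required by the LRZ SLURM setup are not shown here.

```python
# Minimal sketch, not an official LRZ recipe: submit a CoolMUC-3 batch job
# via sbatch from a login node. "my_job.slurm" is a placeholder script name;
# mpp3_batch is the parallel partition listed in the status table above.
import subprocess

def submit_coolmuc3_job(script_path: str = "my_job.slurm") -> str:
    result = subprocess.run(
        ["sbatch", "--partition=mpp3_batch", script_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # On success, sbatch prints a line like "Submitted batch job 123456".
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_coolmuc3_job())
```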
Cluster maintenance from Nov 11th 2024 until Nov 15th 2024
Update: The maintenance will be finished on Nov 20. Affected cluster segments will be back in operation. Please note: the latest LRZ software stack spack/23.1.0 is set as default on the CoolMUC-4 partitions! The old software stack spack/22.2.1 (commonly used on cm2 and cm4_inter_large_mem nodes in the past) is still available via the corresponding module.
Update: The maintenance needs to be prolonged until Nov 19th!
Original announcement: Due to work on the power grid infrastructure and security-relevant system updates, all denoted cluster segments are in maintenance from Monday, Nov 11th 2024 at 06:30am until Friday, Nov 15th 2024 at approx. 6:00pm:
CoolMUC-3 Cluster:
CoolMUC-4 Cluster:
This means that neither scripted batch jobs nor "salloc" style interactive jobs will execute.
cm2/cm2_inter are gone forever
Due to a further hardware failure, the complete island 22 went out of operation. This also affects housing clusters attached to the same network. Affected customers have been informed by mail.
9:30 a.m.: Outage of SCRATCH_DSS
The infrastructure maintenance affected the SCRATCH_DSS filesystem and led to an outage. We are working to resolve the problem.
CoolMUC-2/-3: Due to degradation of the cluster communication network, the CM-2 queues are open for single-node jobs only; SLURM restrictions apply. On CM-3, multi-node jobs can be submitted again. Please refrain from submitting tickets about software modernization requests on both systems. The systems are provided "as is" for their remaining lifetime (see below).
Legacy SCRATCH File System of CoolMUC-2/3 Broken - Data Recovery
Severe hardware failures occurred on the CoolMUC clusters (SCRATCH filesystem, switches). As a mitigation, until end-of-life of CoolMUC-2/3, we have mapped the SCRATCH variable to SCRATCH_DSS (/dss/lxclscratch/.../$USER), which is now also accessible on CoolMUC-2.
Update: Our administrators managed to bring the legacy filesystem back up in read-only mode. Please do not use the $SCRATCH environment variable but absolute paths, e.g., /gpfs/scratch/<project-id>/<user-id>. We cannot guarantee data integrity or completeness. Please save all relevant files as soon as possible. The filesystem was unmounted on November 9.
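As an illustration of the absolute-path recommendation above, a minimal sketch that copies files out of the read-only SCRATCH area. The project and user identifiers are placeholders to be replaced with your own values, and the destination directory under $HOME is only an example.

```python
# Minimal sketch, assuming placeholder project/user ids; replace them with
# your own values. The destination under $HOME is an example, not a rule.
import shutil
from pathlib import Path

project_id = "pr12ab"   # placeholder project id
user_id = "di12abc"     # placeholder user id

# Absolute path as recommended above, instead of the $SCRATCH variable.
scratch = Path(f"/gpfs/scratch/{project_id}/{user_id}")
backup = Path.home() / "scratch_rescue"

for src in scratch.rglob("*"):
    if src.is_file():
        dst = backup / src.relative_to(scratch)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy only; the source filesystem is read-only
```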
End-of-Life Announcement for CoolMUC-2
After 9 years of operation, the hardware of CoolMUC-2 can no longer offer reliable service. The system is targeted to be turned off on Friday, December 13th at the latest. Due to network degradation we can only support single-node jobs on a best-effort basis until then. In case of further hardware problems, the shutdown date might be much earlier.
End-of-Life Announcement of CoolMUC-3
Hardware and software support for the Knights Landing nodes and the Omni-Path network on CoolMUC-3 (mpp3_batch) ended several years ago, so the system needs to be decommissioned. It is targeted to be turned off on the same Friday along with CoolMUC-2. In case of further hardware problems, the shutdown date might be earlier.
New Cluster Segment CoolMUC-4
Hardware for a new cluster system, CoolMUC-4, has been delivered and is currently being installed and tested. The cluster comprises approximately 12,000 cores based on Intel® Xeon® Platinum 8480+ (Sapphire Rapids) processors. We expect user operation to start at the beginning of December 2024.
Messages for Compute Cloud and other HPC Systems
The AI Systems will be affected by an infrastructure power cut scheduled in November 2024. The following system partitions will become unavailable for 3 days during the specified time frame. We apologise for the inconvenience.
Affected time frame: Calendar Week 46, 2024-11-11 to 2024-11-13.
The AI Systems (including the MCML system segment) are under maintenance between September 30th and October 2nd, 2024. On these days, the system will not be available to users. Normal user operation is expected to resume during the course of Wednesday, October 2nd. The previously announced scheduled downtime between 2024-09-16 and 2024-09-27 (Calendar Weeks 38 & 39) has been postponed until further notice. The system will remain in user operation up to the scheduled maintenance at the end of September.
HPC Services
Attended Cloud Housing