Job Processing on the Linux-Cluster
- For details, please also read the Linux-Cluster subchapters!
- Status of Linux Cluster: High Performance Computing
Overview of cluster specifications and limits
Modified job specifications after the CoolMUC-2 hardware failure (see High Performance Computing):

Slurm cluster | Slurm partition | Nodes | Node range per job | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GB) |
---|---|---|---|---|---|---|
Cluster system: CoolMUC-2 (28-way Haswell-EP nodes with Infiniband FDR14 interconnect and 2 hardware threads per physical core) ||||||
cm2 | cm2_large | 404 (overlapping partitions) | DISABLED | | | 56 per node |
 | cm2_std | | 1 - 1 | 72 | | |
cm2_tiny | cm2_tiny | 288 | 1 - 1 | 72 | | |
serial | serial_std | 96 (overlapping partitions) | 1 - 1 | 96 | | |
 | serial_long | | 1 - 1 | 480 | | |
inter | cm2_inter | 12 | 1 - 1 | 2 | | |
 | cm2_inter_large_mem | 6 | 1 - 1 | 4 | | 120 per node |
Cluster system: CoolMUC-4 (80-way Ice Lake nodes, 2 hardware threads per physical core) ||||||
inter | cm4_inter_large_mem | 9 | 1 - 1 | 96 | 1 (2) | 1000 per node |
Cluster system: Teramem (single-node shared-memory system, 4 x Intel Xeon Platinum 8360HL, in total 96 physical cores, 2 hyperthreads per physical core, 6 TB memory) ||||||
inter | teramem_inter | 1 | 1 - 1 (up to 64 logical cores) | 240 | 1 (2) | approx. 60 |
Cluster system: CoolMUC-3 (64-way Knights Landing 7210F nodes with Intel Omnipath 100 interconnect and 4 hardware threads per physical core) ||||||
mpp3 | mpp3_batch | 145 | 1 - 32 | 48 | 50 (dynamically adjusted) | approx. 90 DDR plus 16 HBM per node |
inter | mpp3_inter | 3 | 1 - 3 | 2 | 1 (2) | |
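For orientation, the following is a minimal sketch of a batch script that targets the cm2_tiny partition within the limits listed above. The job name, output path and executable are placeholders, and any site-specific environment setup (e.g. required modules) is omitted; please consult the SLURM documentation linked at the end of this page.

```bash
#!/bin/bash
#SBATCH -J example_job          # placeholder job name
#SBATCH -o ./%x.%j.out          # stdout/stderr file (job name and job ID)
#SBATCH -D ./                   # working directory
#SBATCH --clusters=cm2_tiny     # Slurm cluster, see table above
#SBATCH --partition=cm2_tiny    # Slurm partition, see table above
#SBATCH --nodes=1               # node range of cm2_tiny is 1 - 1
#SBATCH --ntasks-per-node=28    # CoolMUC-2 nodes are 28-way
#SBATCH --time=08:00:00         # must stay below the 72 hour partition limit

srun ./my_program               # placeholder executable
```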
Overview of job processing
Slurm partition | Cluster-/partition-specific Slurm job settings | Typical job type | Recommended submit hosts (login nodes) | Common/exemplary Slurm commands for job management, e.g. squeue (show waiting/running jobs) |
---|---|---|---|---|
cm2_large | --clusters=cm2 | | lxlogin1, lxlogin2, lxlogin3, lxlogin4 | squeue -M cm2 -u $USER |
cm2_std | --clusters=cm2 | | | |
cm2_tiny | --clusters=cm2_tiny | | | squeue -M cm2_tiny -u $USER |
serial_std | --clusters=serial | Shared use of compute nodes among users! | | squeue -M serial -u $USER |
serial_long | --clusters=serial | | | |
cm2_inter | --clusters=inter | Do not run production jobs! | | squeue -M inter -u $USER |
cm2_inter_large_mem | --clusters=inter | | | |
cm4_inter_large_mem | --clusters=inter | | lxlogin5 | |
teramem_inter | --clusters=inter | | lxlogin[1...4], lxlogin8 | |
mpp3_inter | --clusters=inter | Do not run production jobs! | lxlogin8 | |
mpp3_batch | --clusters=mpp3 | | | squeue -M mpp3 -u $USER |
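The cluster selection via -M (short for --clusters) applies to the other Slurm management commands as well. A short sketch, using the job ID 12345 as a placeholder:

```bash
# Show your waiting/running jobs on the cm2 cluster
squeue -M cm2 -u $USER

# Cancel one of your jobs; the cluster must be specified as well
scancel -M cm2 12345

# Show accounting data of your jobs on the serial cluster (today's jobs by default)
sacct -M serial -u $USER

# Request a small interactive allocation on the cm2_inter partition
# (interactive/test use only, see the limits above)
salloc --clusters=inter --partition=cm2_inter --nodes=1 --time=00:30:00
```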
Submit hosts (login nodes)
Submit hosts are usually login nodes that allow you to submit and manage batch jobs.
Cluster segment | Submit hosts | Remarks |
---|---|---|
CoolMUC-2 | lxlogin1, lxlogin2, lxlogin3, lxlogin4 | |
CoolMUC-3 | lxlogin8, lxlogin9 | lxlogin9 is accessible from lxlogin8 via ssh mpp3-login9. lxlogin9 has KNL architecture and can therefore be used to build software for CoolMUC-3. |
CoolMUC-4 | lxlogin5 | |
Teramem | lxlogin8 |
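For illustration, a login sketch; xxyyyzz is a placeholder for your user ID, and the fully qualified lrz.de host name is an assumption here:

```bash
# Log in to a CoolMUC-2 submit host (xxyyyzz is a placeholder user ID)
ssh xxyyyzz@lxlogin1.lrz.de

# lxlogin9 is reached from lxlogin8 (see the remark in the table above)
ssh mpp3-login9
```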
Note that cross-submission of jobs to other cluster segments is also possible. However, different cluster segments support different instruction sets, so you need to make sure that your software build produces a binary that can execute on the targeted cluster segment.
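As an example of such cross-submission, the sketch below submits a job to the serial cluster from a CoolMUC-2 login node; the script name is a placeholder, and the compiler invocation only illustrates targeting the KNL instruction set of CoolMUC-3 (adjust compiler and flags to your toolchain):

```bash
# Submit to the serial cluster from any CoolMUC-2 login node (lxlogin1-4)
sbatch --clusters=serial --partition=serial_std my_serial_job.sh   # placeholder script

# Build a binary for the KNL instruction set of CoolMUC-3, e.g. with the Intel compiler
icc -xMIC-AVX512 -O2 -o my_knl_program my_program.c
```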
Please do not run compute jobs on the login nodes! Instead, choose the cluster and partition that fit your needs.
Documentation of SLURM
- SLURM Workload Manager (commands and links to examples).