3.0 Specifics for MCML Members
Access Management for MCML PIs
MCML PIs who want members of their research group to access the MCML partition (as part of the LRZ AI Systems) should open a service request with the LRZ Servicedesk here and choose/add
- Type: Service Request
- Description: "Access to MCML Segment @ LRZ AI Systems"
- Details:
  - Please specify the name of the MCML PI
  - If applicable (i.e. if the working group already has one), please provide an LRZ Master User/Linux Cluster project ID
Further instructions will follow via the service request ticket.
Once a suitable LRZ Master User/Linux Cluster project has been created/set up as an "MCML project", any account within this project which is assigned Linux Cluster permissions (by the Master Users) will automatically also be granted access to the LRZ AI Systems, including the MCML partition. Similarly, removing Linux Cluster permissions from individual accounts will also revoke LRZ AI Systems permissions, including access to the MCML partition.
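Once access has been granted, a quick way to verify it is to log in to the LRZ AI Systems and query the MCML partitions with standard Slurm commands; the partition names below are taken from the hardware table in the next section.
$ sinfo -p mcml-hgx-h100-92x4,mcml-hgx-a100-80x4,mcml-dgx-a100-40x8   # show the state of the MCML partitions
$ squeue -u $USER -p mcml-hgx-a100-80x4                               # show your own jobs in one MCML partition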
Dedicated Compute Hardware
In addition to the generally available resources listed on General Description and Resources, the user IDs associated with the LRZ projects of MCML research groups are entitled to use the following hardware. Access will be granted automatically (for dedicated MCML LRZ projects) or upon request (for eligible accounts of pre-existing LRZ Master User projects). The allocation time limit for individual jobs is 2 days (2-00:00:00).
Architecture | Slurm Partition | Number of nodes | CPUs per node | Memory per node | GPUs per node | Memory per GPU
---|---|---|---|---|---|---
HGX H100 | mcml-hgx-h100-92x4 | 21 | 96 | 768 GB | 4 NVIDIA H100 | 94 GB
HGX A100 | mcml-hgx-a100-80x4 | 21 | 96 | 1 TB | 4 NVIDIA A100 | 80 GB
DGX A100 | mcml-dgx-a100-40x8 | 8 | 256 | 1 TB | 8 NVIDIA A100 | 40 GB
In order to use resources of the mcml-dgx-a100-40x8 partition, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.
$ salloc -p mcml-dgx-a100-40x8 -q mcml -n 8 --gres=gpu:8                     # short form options, where available
$ salloc --partition=mcml-dgx-a100-40x8 --qos=mcml --ntasks=8 --gres=gpu:8   # long form options
In the same way, to use resources of the mcml-hgx-a100-80x4 or mcml-hgx-h100-92x4 partitions, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.
$ salloc -p mcml-hgx-a100-80x4 -q mcml -n 4 --gres=gpu:4                     # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4 --qos=mcml --ntasks=4 --gres=gpu:4   # long form options
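For non-interactive work, the same partition and QoS options can be set in a batch script. The following is a minimal sketch: the script name mcml-job.sbatch and the workload train.py are placeholders, not part of this page; the partition, QoS, GPU count, and the 2-day limit are taken from the tables above.
#!/bin/bash
#SBATCH --partition=mcml-hgx-a100-80x4   # any MCML partition from the table above
#SBATCH --qos=mcml                       # required QoS for the MCML partitions
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4                     # all four GPUs of an HGX A100 node
#SBATCH --time=2-00:00:00                # maximum allocation time for individual jobs
srun python train.py                     # hypothetical workload
$ sbatch mcml-job.sbatch                 # submit the batch script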
Smaller Scale Resources / Multi-Instance GPU Mode
Additionally, NVIDIA's Multi-Instance GPU (MIG) mode (see the NVIDIA Multi-Instance GPU User Guide) provides a number of smaller-scale resources ("virtual GPU instances"). Some A100 GPUs have been partitioned into slices which are offered as virtual GPU instances. MIG divides each card into seven slices, which can be combined in different ways. The following table indicates how these slices are combined for the different nodes in the mcml-hgx-a100-80x4-mig partition.
Slurm Partition | Number of nodes | GPUs per node | MIG instances per GPU (slices each)
---|---|---|---
mcml-hgx-a100-80x4-mig | 5 | 4 NVIDIA A100 | 3 / 2 / 2
mcml-hgx-a100-80x4-mig | 2 | 4 NVIDIA A100 | 3 / 2 / 1 / 1
The table above is read as follows. The first row indicates that there are five nodes whose GPUs are partitioned into three virtual GPU instances each: one with three slices out of seven, and two with two slices each. The second row indicates that there are two nodes whose GPUs are partitioned into four virtual GPU instances each: one with three slices, one with two slices, and two with one slice each.
In case you want to allocate one instance with three slices (i.e., three sevenths of the capacity of a full A100), the following code block shows an example.
$ salloc -p mcml-hgx-a100-80x4-mig -q mcml --gres=gpu:3g               # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4-mig --qos=mcml --gres=gpu:3g   # long form options
Please be aware that only a single virtual GPU instance can be used by an individual job. MIG mode does not support 'multi-GPU' computing.
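Within a running allocation, it can be useful to confirm which virtual GPU instance the job actually sees. A minimal check, assuming nvidia-smi is available on the compute node:
$ srun nvidia-smi -L   # lists visible devices; a 3-slice instance appears as a MIG device, e.g. "MIG 3g.40gb"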
Dedicated Storage Options
As indicated on Storage on the LRZ AI Systems, there is a dedicated MCML DSS system (/dssmcmlfs01) available to eligible users. This is high-performance, SSD-based network storage, intended for the high-bandwidth, low-latency I/O operations demanded by modern-day AI applications.
The Master Users of eligible LRZ projects (i.e. MCML research groups) are welcome to submit a service request to the LRZ Servicedesk asking for storage on the MCML DSS system. Using this link opens a ticket for the appropriate service: select "AI topics", then "Master user only: Application for project storage space (DSS AI)", and confirm. Fill in the form, making sure to note that the request is for MCML DSS, and submit it.
A quota of up to 5 TB, 10,000,000 files, and a maximum of 5 DSS containers (commonly a single container) will typically be assigned; a quota of up to 10 TB and 20,000,000 files can be granted upon explicit request and proof of demand. Once granted, the Master User will act as DSS Data Curator and manage the assigned storage quotas for their project.
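Once storage has been provisioned, a quick sanity check from a login node is shown below; the path follows the /dssmcmlfs01 prefix mentioned above, while the exact container directory names depend on your project.
$ df -h /dssmcmlfs01   # confirm the MCML DSS file system is mounted and check overall usage
$ ls /dssmcmlfs01      # list the DSS container directories visible to your account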