99. AI Systems Announcements

Latest Announcement

Infrastructure Power Cut

NOTICE: The AI Systems will be affected by an infrastructure power cut scheduled for November 2024. The following system partitions will be unavailable for three days during the time frame given below. We apologise for the inconvenience.

Calendar Week 46, 2024-11-11 - 2024-11-13

  • lrz-v100x2
  • lrz-hpe-p100x4
  • lrz-dgx-1-p100x8
  • lrz-dgx-1-v100x8
  • lrz-cpu (partly)
  • test-v100x2
  • lrz-hgx-a100-80x4
  • mcml-hgx-a100-80x4
  • mcml-hgx-a100-80x4-mig

Previous Announcements

Maintenance 2024-03 Changelog

Various system components have been updated during the maintenance procedure between September 30th and October 2nd, 2024:

Added:

  • New HGX-based nodes with H100 GPUs have been prepared, but are not yet user accessible - watch out for additional announcements

Changed:

  • Slurm Workload Manager has been updated to release 24.05.3
  • Configuration options affecting job submission limits, wall time and the fair share mechanism have been optimized and harmonized to facilitate higher throughput and reduced wait times (subject to ongoing adjustments)
  • Jupyter Notebook / JupyterLab container images have been updated and provide a new PyTorch version
  • The operating system kernel and packages of all AI Systems nodes have been updated to recent point releases providing stability and security fixes
  • <MCML> Previous MIG nodes have been reconfigured and reallocated for general usage (for details see 3.0 Specifics for MCML Members)

Maintenance 2024-02 Changelog

Various system components have been updated during the maintenance procedure on July 1st-3rd, 2024:

Added:

  • Recent additions of HGX-based nodes with A100 GPUs have been finalized (lrz-hgx-a100-80x4 and mcml-hgx-a100-80x4 partitions)
  • Some additional CPU resources have been made available as part of the lrz-cpu partition

Changed:

  • The Enroot container runtime has been updated to release 3.5.0
  • The web-based frontend, Open OnDemand, has been updated to version 3.1.7
  • Jupyter Notebook / JupyterLab container images have been updated
  • RStudio Server container images have been updated and provide new R / RStudio Server versions
  • The operating system kernel and packages of all AI Systems nodes as well as the Nvidia drivers and GPFS storage applications have been updated to recent point releases providing stability and security fixes

Removed:

  • RStudio Server container images with R versions prior to 4.4.0 have been removed; if absolutely necessary, these can still be provided and used as custom container images


Maintenance 2024-01 Changelog

Multiple system components were updated and various user-facing changes were introduced during the maintenance procedure on March 11th-14th, 2024:

Breaking:

  • enroot start currently cannot be used directly with a sqsh container image. Instead, it requires an existing container. The following commands show an example of how to create a container and use enroot start:
    enroot import <container-tag>  # when importing from a registry; skip if local image file is available
    enroot create --name <container-name> <image-file>  # short form of --name: -n; this step may have been skipped previously
    enroot start <container-name>
    Alternatively, use the Pyxis --container-image option when using srun or in the preamble of your batch script (for additional details see Removed section below).
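
The three enroot steps above can be wrapped in a small shell helper. The following is a minimal sketch, not an official LRZ tool: the function names run and enroot_start_from_image, the DRY_RUN switch and the argument order are hypothetical. It assumes that enroot is available on the PATH and uses the -o option of enroot import to choose the name of the image file.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names and the DRY_RUN switch are hypothetical,
# they are not part of Enroot or of the LRZ setup.

# Run a command, or only print it when DRY_RUN=1 (handy for checking the steps).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: enroot_start_from_image <image-uri> <image-file> <container-name>
enroot_start_from_image() {
    local image_uri="$1" image_file="$2" container_name="$3"

    # Import from a registry only if no local image file is available yet.
    if [ ! -f "$image_file" ]; then
        run enroot import -o "$image_file" "$image_uri" || return 1
    fi

    # Create the container from the image file, then start it by name.
    run enroot create --name "$container_name" "$image_file" || return 1
    run enroot start "$container_name"
}
```

For example, with DRY_RUN=1 set, calling enroot_start_from_image docker://ubuntu ubuntu.sqsh demo only prints the commands it would run. Note that the create step is expected to fail if a container named demo already exists; in that case enroot start demo can be called directly.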

Added:

  • A "Globus" button has been added to the file manager application of the web-based frontend and provides direct access to the active directory within the Globus research management portal. This allows for improved file management and data transfer capabilities (for further details see Using DSS world wide via Globus Online)

Changed:

  • The operating system of all AI Systems nodes has been updated to Ubuntu 22.04 LTS / Nvidia DGX OS 6; various DGX firmware components have been updated
  • The Nvidia drivers have been updated to version R535
  • The login infrastructure has been reworked and fully virtualized to provide increased stability, redundancy and future-proof flexibility
  • The web-based frontend, Open OnDemand, has been updated to release 3.1.1
  • Jupyter Notebook / JupyterLab container images have been updated and provide new PyTorch and TensorFlow versions
  • RStudio Server container images have been updated and provide new R / RStudio Server versions (older versions have been thinned out)

Removed:

  • The fuse-overlayfs package had to be removed due to a bug in the version shipped with Ubuntu 22.04. As a result, container images can no longer be started directly without first creating a container, which had become possible with recent kernel versions (see Enroot's documentation and the Breaking section above). We are evaluating options for a future course of action.
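
Until direct image starts are available again, the Pyxis route mentioned in the Breaking section avoids the manual create step. The sketch below generates a batch script that runs a single command inside a container via srun --container-image. The function name make_pyxis_job is hypothetical, and the resource lines (--gres, --time) are illustrative assumptions rather than recommended settings.

```shell
#!/usr/bin/env bash
# Sketch only: make_pyxis_job is a hypothetical helper; the resource
# requests below are placeholders and need to be adapted.

# Print a batch script that runs one command inside a container via Pyxis.
# Usage: make_pyxis_job <partition> <container-image> <command...>
make_pyxis_job() {
    local partition="$1" image="$2"
    shift 2
    cat <<EOF
#!/bin/bash
#SBATCH --partition=${partition}
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00

srun --container-image=${image} $*
EOF
}
```

A possible use is make_pyxis_job lrz-v100x2 ubuntu cat /etc/os-release > job.sbatch, followed by sbatch job.sbatch. The container image can be a registry reference or the path to a local sqsh file.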

Maintenance 2023-04 Changelog

Various system components have been updated during the maintenance procedure on December 4th-6th, 2023:

  • General OS updates and system firmware updates
  • Slurm Workload Manager has been updated to release 23.02.6
  • The web-based frontend, Open OnDemand, has been updated to release 3.0.3

Maintenance 2023-03 Changelog

The following list of user-facing changes was introduced during the maintenance procedure between July 24th and 25th, 2023:

  • The primary address of the web-based frontend has been changed. Use login.ai.lrz.de for all connections to the LRZ AI Systems. All previous addresses may still be functional, but are going to be removed in the future (deprecation notice).
  • The available CPU options for interactive applications in the web-based frontend have been adjusted for some cases/usage combinations.
  • Jupyter Notebook/JupyterLab container images have been updated and provide a new PyTorch version.

Maintenance 2023-02 Changelog

The following list of user-facing changes was introduced during the maintenance procedure between June 5th and 7th, 2023:

  • The primary address of the SSH login node has been changed. Use login.ai.lrz.de for all SSH connections to the LRZ AI Systems. The previous address is not functional anymore (see deprecation notice below).
  • The NVIDIA drivers have been updated to version R525 for full compatibility with the recently released CUDA 12
  • The software component providing the web-based frontend, Open OnDemand, has been updated to release 3.0.1
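
To avoid typing the new address for every connection, an alias can be added to the local SSH client configuration. The following is a minimal sketch: the function name add_lrz_ai_host and the alias lrz-ai are hypothetical, and the default configuration path ~/.ssh/config is an assumption about the local setup.

```shell
#!/usr/bin/env bash
# Sketch only: add_lrz_ai_host and the alias "lrz-ai" are hypothetical names.

# Append a Host block for the LRZ AI Systems login node unless one exists.
# Usage: add_lrz_ai_host <lrz-account> [ssh-config-file]
add_lrz_ai_host() {
    local account="$1"
    local config="${2:-$HOME/.ssh/config}"

    # Do nothing if the alias is already defined in the file.
    if [ -f "$config" ] && grep -qx 'Host lrz-ai' "$config"; then
        return 0
    fi

    mkdir -p "$(dirname "$config")"
    cat >>"$config" <<EOF

Host lrz-ai
    HostName login.ai.lrz.de
    User ${account}
EOF
}
```

After running add_lrz_ai_host <lrz-account> once, ssh lrz-ai connects to login.ai.lrz.de with the given account.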

Maintenance 2023-01 Changelog

Please note the following list of user-facing changes introduced during the maintenance procedure between March 13th and 15th, 2023:

  • The primary address of the SSH login node has been changed. Going forward, please use login.ai.lrz.de for all SSH connections to the LRZ AI Systems. The previous address is still functional, but will be removed in the future (deprecation notice).
  • TensorBoard has been added as a new application to the available web servers of https://datalab3.srv.lrz.de

Maintenance 2022-04 Changelog

The resource selection for the OnDemand-based interactive apps (Jupyter Notebook, JupyterLab, RStudio Server) has been updated and unified. It now allows single GPUs to be allocated (in addition to combinations of CPU cores and RAM size) with all of these front ends.

Reminder October 2022

The previous LRZ AI Systems home directories, accessible from the LRZ AI Systems login nodes under /home/<lrz-account> (read-only since the latest maintenance), were decommissioned as of 2022-10-31.

Maintenance 2022-03 Changelog

Please note the following user-facing changes to the LRZ AI Systems, which took effect during the latest maintenance:

Most importantly, the availability of storage options on the LRZ AI (and MCML) Systems has changed. The previous home directories have been superseded by the default Linux Cluster home directories. The same files and data can now be accessed in the default home directory directly after login, regardless of whether a Linux Cluster or an AI Systems login node is used; that is, the LRZ AI Systems and the LRZ Linux Cluster now provide unified home directories.

    • The previous LRZ AI Systems home directories are still accessible from the LRZ AI Systems login nodes under /home/<lrz-account> (read-only). These directories will be decommissioned by 2022-10-31, so make sure to copy your files into the new unified home directories as soon as possible!
    • In addition, the full offer of Data Science Storage (DSS) systems and containers can now directly be accessed from all LRZ AI Systems login and compute nodes.
    • Use the command dssusrinfo all on the login nodes to get an overview of all individually accessible DSS containers and their utilization.

For further details see Storage on the LRZ AI Systems