DAOS File System
Introduction
DAOS (Distributed Asynchronous Object Storage) is the high-performance storage tier of SuperMUC-NG Phase 2. Unlike traditional parallel file systems, it is built entirely on NVMe SSDs and Intel Optane Persistent Memory; the installation at LRZ consists of 42 servers and delivers high bandwidth and high IOPS.
- Official Documentation: DAOS User Guide
- Status: Access is available upon request for I/O-intensive projects.
How DAOS is Different
If you are used to standard file systems like GPFS or Lustre, DAOS requires a slight shift in thinking.
1. Reserved vs. Shared Performance
Traditional file systems have a shared queue for metadata. If one user creates millions of files, everyone slows down.
DAOS uses NVMe SSDs and Intel Optane Persistent Memory to provide independent, high-bandwidth pathways.
- For Biologists/AI: It excels at handling massive numbers of small files (a metadata-heavy workload) without "choking."
- For Physicists: It provides massive bandwidth for checkpoints, minimizing the time jobs wait on I/O.
2. Storage Hierarchy: Pool vs. Container
You cannot just "mkdir" in the void. You need a specific allocation structure:
- The Pool ("Virtual Hardware"): A dedicated slice of storage capacity (NVMe + SCM) reserved for your project.
- Policy: At LRZ, all Pools are created with Redundancy Factor 2 (RF2) by default. This ensures data survives two simultaneous hardware failures.
- Allocation: There is no thin provisioning. If you request 50TB, it is fully reserved immediately.
- The Container ("The Filesystem"): A namespace inside your Pool where your files live.
- Inheritance: Because the Pool is RF2, any container you create inherits this protection automatically.
- Types: You usually create "POSIX" containers, which behave like standard Linux folders.
Access and Provisioning
Due to limited capacity (1.3 PB raw NVMe), DAOS is not available to all users by default.
- Requesting Access: Open a ticket at the Service Desk. Space is allocated per-user or per-project.
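Once a pool has been assigned to your project, you can check its total and used capacity from the command line. A minimal sketch using the standard DAOS CLI (the pool label is a placeholder communicated by LRZ):
# Show capacity and current usage of your assigned pool
daos pool query <pool_label>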
Creating Containers (Data Protection)
Once you have a pool, creating containers is an end-user task.
LRZ Policy: Data Protection (RF2)
By default, your Pool enforces Redundancy Factor 2 (RF2).
- Default Behavior: You do not need to specify redundancy options. The system automatically mirrors/stripes data to tolerate 2 hardware faults.
- Higher Protection: If you need RF3+, you can specify it at container creation (see the sketch after the standard example below).
- Lower Protection (Advanced): You cannot lower protection (e.g., to RF0) inside an RF2 pool. If you require raw speed with no data safety guarantees (risk of data loss), please consult the admins.
Creating a Standard POSIX Container:
# Syntax: daos cont create --type posix <POOL_LABEL> <CONTAINER_NAME>
# Note: RF2 is applied automatically by the pool default.
daos cont create --type posix <pool_label> ${USER}_cont01
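For projects that need RF3, a possible invocation is sketched below. It assumes the redundancy factor is set via the DAOS `rd_fac` container property; confirm the exact property name, and whether your pool permits it, with the LRZ admins.
# Sketch: request Redundancy Factor 3 at container creation (rd_fac property assumed)
daos cont create --type posix --properties rd_fac:3 <pool_label> ${USER}_cont_rf3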
Interactive Usage on Login Nodes
You can interactively mount DAOS containers on Login Nodes to check files, compile code, or reorganize directories. This requires the `dfuse` command.
CRITICAL: Mount Point Location
Never mount a DAOS container inside a GPFS directory (e.g., your `$HOME` or `$DSS` directories). This can cause severe performance degradation or system deadlocks.
Always create your mount point in `/tmp`.
How to Mount and Use
# 1. Create a local mount point in /tmp
mkdir -p /tmp/${USER}_daos
# 2. Mount the container (Interactive Dfuse)
# Syntax: dfuse -m <mount_point> --pool <pool_label> --cont <container_label>
dfuse -m /tmp/${USER}_daos --pool <pool_label> --cont ${USER}_cont01
# 3. Use standard commands
cd /tmp/${USER}_daos
ls -l
# You can now cp, mv, or edit files here.
How to Unmount
When you are finished, you must unmount the directory to clean up connections.
cd ~ # Move out of the directory first
fusermount -u /tmp/${USER}_daos
Data Movement
To benefit from DAOS performance, move data from GPFS (HOME/DSS) to DAOS before computation. The recommended tool is mpiFileUtils (dcp).
# Load necessary modules
module load mpifileutils/0.11.1-intel24-impi-daos
# Copy from GPFS to DAOS
# Use 16MB blocksize for best performance with GPFS
# Note the daos:// prefix used to bypass dfuse mounting
mpirun -np 8 dcp --bufsize 16MB --chunksize 16MB /gpfs/path/to/data daos://<pool>/<cont>/path
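The same tool works in the opposite direction for staging results back to GPFS after a job; a sketch with placeholder paths:
# Copy results from DAOS back to GPFS after computation
mpirun -np 8 dcp --bufsize 16MB --chunksize 16MB daos://<pool>/<cont>/results /gpfs/path/to/results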
Usage Modes and Interception
There are three ways to access data on compute nodes:
| Mode | Description | Performance |
|---|---|---|
| Dfuse | Mounts DAOS as a standard filesystem. Required for standard POSIX access (`ls`, `cp`). | Low/Moderate |
| Interception | Uses the `libpil4dfs` library (via `LD_PRELOAD`) to redirect POSIX I/O past the kernel. | High |
| Native | Uses the DAOS API directly (MPI-IO, HDF5, mpiFileUtils). | Highest |
What is Interception?
By default, standard file operations (like read or write) must pass through the Linux operating system kernel. This adds significant overhead, especially for high-speed parallel storage.
Interception uses a special library (libpil4dfs) that sits between your application and the OS. It "intercepts" your file commands and redirects them straight to the DAOS network path, bypassing the slow kernel layers entirely. This allows standard apps to run much faster without code changes.
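In practice, interception only requires preloading the library before the application starts and pointing the application at a dfuse mount. A minimal sketch reusing the interactive mount point from above (the application name is a placeholder; the library path matches the one used in the job template below):
# Preload the interception library, then run an unmodified POSIX application
export LD_PRELOAD="/usr/lib64/libpil4dfs.so"
./my_posix_app /tmp/${USER}_daos/input_file
# Remove the preload when finished so other commands are unaffected
unset LD_PRELOAD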
Native MPI-IO Support
DAOS includes an optimized ROMIO ADIO driver for MPI-IO, supported by the Intel MPI library. This allows MPI applications to achieve high performance by writing directly to DAOS objects.
- Official Documentation: MPI-IO Support
Native HDF5 Support
DAOS provides a "Virtual Object Layer" (VOL) connector for HDF5. This allows HDF5 applications to store data directly as DAOS objects (bypassing POSIX entirely) for maximum performance.
- Documentation: Using HDF5 with DAOS
- Usage: Requires linking against the DAOS VOL connector provided in the environment (see the sketch below).
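The exact setup depends on the modules you load, but with HDF5 1.12+ a VOL connector can usually be selected at runtime via environment variables; a sketch with placeholder paths (check the LRZ environment for the actual connector location):
# Point HDF5 at the DAOS VOL connector (plugin path is a placeholder)
export HDF5_PLUGIN_PATH=/path/to/daos-vol-plugin
export HDF5_VOL_CONNECTOR="daos"
# An unmodified HDF5 application then stores its data directly as DAOS objects
./my_hdf5_app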
Slurm Job Scripts
DAOS runs in userspace; you must mount it (Dfuse) inside your job script.
LRZ provides utility scripts to assist with mounting/unmounting via mpiexec.
Job Template
#!/bin/bash
#SBATCH --job-name=daos_job
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --partition=general
#SBATCH --account=your_project_id
# 1. Environment
module switch stack stack/24.5.0
module load intel-toolkit/2025.0.1
# 2. Load Helper Functions
source /lrz/sys/tools/daos/daos-utils.sh
# 3. Configure Targets
# Replace <pool_label> with your assigned pool name
export MY_POOL="<pool_label>"
export MY_CONT="${USER}_cont01"
export MY_MOUNT="/tmp/${MY_POOL}/${MY_CONT}"
# 4. Mount (Dfuse)
# Required for Option A, optional for Option B if mixed access is needed
daos_mount $MY_MOUNT $MY_POOL $MY_CONT
echo "DAOS mounted at $MY_MOUNT"
# 5. Run Application
# --- OPTION A: POSIX with Interception (Standard Apps) ---
# Uses LD_PRELOAD to accelerate standard file I/O
export LD_PRELOAD="/usr/lib64/libpil4dfs.so"
mpiexec -n $SLURM_NTASKS ./my_posix_app $MY_MOUNT/input_file
# --- OPTION B: Native MPI-IO (High Performance) ---
# Bypasses the mount point completely.
# Requires an application compiled with MPI-IO support.
# Syntax: daos://<pool>/<container>/<path>
# mpiexec -n $SLURM_NTASKS ./my_mpi_app daos://${MY_POOL}/${MY_CONT}/input_file
# 6. Cleanup
daos_umount $MY_MOUNT
echo "Done."
Important Notes
- Unmounting: Always ensure `daos_umount` is called. The Slurm epilog tries to clean up, but manual verification helps avoid "????" permissions.
- Working Directory: Do not submit jobs from a DAOS directory. Submit from GPFS, mount DAOS, then change directory.