File Systems of SuperMUC-NG
See also:
- Backup and Archive on SuperMUC-NG
- Best Practices, Hints and Optimizations for IO
- Data Science Storage for SuperMUC
- Data Science Archive for SuperMUC-NG
- Data Transfer Options on SuperMUC-NG
- SuperMUC-NG Archive Usage (TSM-based)
Technology
SuperMUC-NG uses Lenovo DSS-G building blocks with IBM Spectrum Scale (formerly GPFS) for its storage. They serve both the long-term storage and the high-performance parallel file systems.
File System Characteristics
Area | Purpose | Total Capacity | Aggregate Bandwidth |
---|---|---|---|
Home | Storage for user's source, input data, and small and important result files. | 256 TiB | ~25 GiB/s (SSD Tier) ~6 GiB/s (HDD Tier) |
Work | Large datasets that need to be kept on-disk medium or long term. Globally accessible from login and compute nodes. | 34 PiB | ~300 GiB/s |
Scratch | Temporary storage for large datasets (usually restart files, files to be pre-/postprocessed). Globally accessible from login and compute nodes. | 16 PiB | ~200 GiB/s |
DSS | Data Science Storage. Long-term near-line storage for the project's purposes and/or the science community. Worldwide access/transfer of this data via high-performance, WAN-optimized transfer protocols, using a simple graphical user interface in the web. Share data in a similar way to LRZ Sync+Share, Dropbox, or Google Drive. | 20 PiB | ~70 GiB/s |
DSA | Data Science Archive. Long-term offline storage for the project's purposes and/or the science community. Worldwide access/transfer of this data via high-performance, WAN-optimized transfer protocols, using a simple graphical user interface in the web. | 260 PiB | ~10 GiB/s |
Node-local | /tmp on login and compute nodes. Resides in memory on compute nodes. Locally accessible only. Please do not use paths to this area explicitly (e.g. in scripts). TMPDIR (see below) can be used and will automatically be set to an appropriate value. | Small. A completely filled /tmp causes the node to become unusable. | varies |
File system access and policies
Upon login to the system and inside batch jobs, the environment module tempdir is loaded; it supplies the necessary variable settings for all file systems except HOME.
Area | Environment Variable | Path pattern | Quota | Lifetime of Data | Data Safety/Integrity Measures |
---|---|---|---|---|---|
Home | $HOME | /dss/home/<hash>/<user> | 100 GB/user | Expiration of all projects an account is associated with | Nightly snapshots, kept for the last 7 days. Replication to secondary storage plus daily backup to tape |
Work | $WORK_<project> | $WORK_<project> | In accordance with the project grant¹. Project-level quota only. | End of the specified project | None. See the section below on archiving important data. |
Scratch | $SCRATCH | /hppfs/scratch/<hash>/<user> | 1 PB/user (safety measure) | Usually 3-4 weeks. Execution of the deletion procedure depends on the file system fill level. | None. See the section below on archiving important data. |
DSS | - | /dss/dssfs0[23]/<data-project>/<container> | Per data project and container¹ | End of the data project | Per-container policy regarding backup to the tape archive: NONE, BACKUP_WEEKLY, BACKUP_DAILY (costs may arise for the user!) |
DSA | - | /dss/dsafs01/<bucket>/<container> | Limitation of the number of files per data project and container¹ | End of the data project | Data replicated on two tapes at two different sites. Metadata backed up daily. |
temporary | $TMPDIR | Depends on the availability of file systems; usually a subfolder of SCRATCH. /tmp is only used as a last resort. | Depends on the target file system | Depends on the target file system | Depends on the target file system |
¹ The supplied value can be increased upon request. Please contact the Service Desk.
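To see which of these variables are set in your current session, you can simply print them; this is only a quick check, and which variables are defined will differ per user and project:
# inspect the file system variables supplied by the tempdir module
echo "HOME:      $HOME"
echo "SCRATCH:   $SCRATCH"
echo "WORK_LIST: $WORK_LIST"
echo "TMPDIR:    $TMPDIR"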
File system usage
WORK and SCRATCH usage
With great power comes great responsibility! WORK and SCRATCH have a rather large block size of 16 MB, which is necessary for efficient IO on file systems of petabyte scale. This in turn means that many small files represent an inefficient use of such file systems.
Please adapt your workflows accordingly (bundle your small files and directory hierarchies, e.g. with mpifileutils), for example as sketched below.
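A minimal sketch of such bundling, using the dtar tool from mpifileutils; the module name, the number of MPI ranks, and the paths are assumptions chosen purely for illustration:
module load mpifileutils
# pack a directory tree full of small files into a single tar archive, in parallel
mpiexec -n 8 dtar -c -f $WORK/RESULTS/run_smallfiles.tar $SCRATCH/run/smallfiles
# extract again in parallel if needed:
# mpiexec -n 8 dtar -x -f $WORK/RESULTS/run_smallfiles.tar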
While there is a quota on WORK, there is none on SCRATCH yet. We explicitly do not want to limit the users' ambitions, but this in turn requires some understanding and discipline from the users.
If you need temporarily unprecedented resources, such as more than a petabyte of SCRATCH or more than 10 million inodes (please count both files and directories!), possibly deviating from the intention in your project application, where we try to filter out inappropriate workflows at an early stage, please inform us via the Service Desk. Specifically, the number of available inodes on SCRATCH is necessarily limited for performance reasons; exceeding it will disrupt system operation.
Although there is a sliding deletion policy, please clean up to a reasonable level as soon as possible, especially if you temporarily occupied a lot of resources. SCRATCH is there for all users of the system! Please respect this and try to keep your resource consumption reasonably low.
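To get a rough idea of your own consumption, the number of inodes (files plus directories) and the space used can be checked as follows; this is only a sketch, and scanning a very large tree can itself take considerable time:
# count files and directories (= inodes) under your SCRATCH area
find $SCRATCH | wc -l
# show the space used per top-level directory
du -sh $SCRATCH/*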
User's responsibility for saving important data
With (parallel) file systems of several tens of petabytes, it is technically impossible (or too expensive) to back up these data automatically. Although the disks are protected by RAID mechanisms, other severe incidents might destroy the data. In most cases, however, it is the users themselves who accidentally delete or overwrite files. It is therefore the responsibility of the user to transfer data to safer/secondary places and/or to archive them to tape. Due to the long off-line times for dumping and restoring data, LRZ might not be able to recover data after any kind of outage or inconsistency of the SCRATCH or WORK file systems. The name WORK and the intended storage period until the end of your project must not be misread as an indication of data safety!
There is no automatic backup for SCRATCH and WORK. Besides automatic deletion, severe technical problems might destroy your data. It is your obligation to copy, transfer, or archive the files you want to keep!
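As a starting point, archiving to tape is done with the TSM/Spectrum Protect client as described in Backup and Archive on SuperMUC-NG and SuperMUC-NG Archive Usage (TSM-based); the invocation below is only a sketch, assumes dsmc is already configured for your account, and uses a path chosen purely as an example:
# archive a results directory (including subdirectories) to the tape archive
dsmc archive -subdir=yes -description="Experiment1 results" "$WORK/RESULTS/Experiment1/*"
# retrieve it again later:
# dsmc retrieve -subdir=yes "$WORK/RESULTS/Experiment1/*"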
Data after the end of project
Data will be deleted one year after the end of the project. However, for data in DSS, DSA and the legacy archive, the project manager can request that the project be converted into a data-only project to retain access to the archived data. Additionally, the project manager is warned by email after the project end that the data will be deleted.
Dos and don'ts, best practices, and notes on optimizations
The WORK and SCRATCH file systems are tuned for high bandwidth, but they are not optimal for handling large quantities of small files located in a single directory with parallel accesses. In particular, creating more than ca. 1000 files per directory at approximately the same time, from either a parallel program or from simultaneously running jobs, will probably cause your application(s) to experience I/O errors (due to timeouts) and crashes. If you require this usage pattern, please generate a directory hierarchy with at most a few hundred files per subdirectory, for example as sketched below. See also: Best Practices, Hints and Optimizations for IO.
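A minimal sketch of such a hierarchy; the fan-out of 100 subdirectories and the file names are arbitrary examples:
# create 100 bucket subdirectories instead of one flat directory
outdir=$WORK/RESULTS/Experiment1
mkdir -p "$outdir"/{00..99}
# a task with index $i then writes into the bucket $((i % 100)):
i=4711
printf -v sub "%02d" $((i % 100))
echo "output of task $i" > "$outdir/$sub/out_$i.dat"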
Temporary filesystem
Please use the environment variable $SCRATCH to access the temporary file system. This variable points to the location where the underlying file system delivers optimal IO performance. Do not use /tmp or $TMPDIR for storing temporary files! /tmp resides in memory and is very small; files stored there will be regularly deleted by automatic procedures or by system administrators.
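A minimal sketch of a job-private temporary directory on SCRATCH inside a Slurm job script; the directory name is an arbitrary choice:
# create a job-private temporary directory on the parallel scratch file system
MYTMP=$SCRATCH/tmp_$SLURM_JOB_ID
mkdir -p "$MYTMP"
# ... run your application, directing temporary files to $MYTMP ...
# clean up at the end of the job
rm -rf "$MYTMP"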
Coping with high watermark deletion in $SCRATCH
The high watermark deletion mechanism may remove files which are only a few days old if the file system is used heavily. In order to cope with this situation, please note:
- The normal tar -x command preserves the modification time of the original files and not the time when the archive was unpacked. Files unpacked from an older archive are therefore among the first candidates for deletion. To prevent this, use tar -xm to unpack your files, which gives them the current date (see the example after this list).
- Please use the Backup and Archive system on SuperMUC-NG to archive/retrieve files from/to SCRATCH to/from the tape archive.
- Please always use $WORK or $SCRATCH for files which are considerably larger than 1 GB.
- Please remove any files which are not needed any more as soon as possible. The high watermark deletion procedure is then less likely to be triggered.
- More information about the fill level of the file systems and about the oldest files will be made available on a website in the near future.
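A short example of the tar invocation mentioned in the first item above; the archive name is hypothetical:
# unpack with -m so that the extracted files get the current timestamp
# instead of the old modification times stored in the archive
tar -xmf old_results.tar
# a plain "tar -xf old_results.tar" would restore the old timestamps,
# making the files early candidates for high watermark deletion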
Selecting the $WORK directory
Each project on SuperMUC-NG has a separate WORK directory with a shared quota for all users in this project. Users can select a specific WORK directory by setting the appropriate project ID, e.g.,
export WORK=$WORK_<project>
in their scripts or in their .profile.
A colon-separated list of all WORK directories a user has access to is stored in the environment variable $WORK_LIST:
echo $WORK_LIST
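For readability, the entries can be printed one per line (a minimal sketch):
# print each accessible WORK directory on its own line
echo "$WORK_LIST" | tr ':' '\n'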
Sharing files with other users
Backup and Archive
Transferring files from/to other systems
We provide several options to move data from/to SuperMUC-NG. All of them have in common that the IP address of the remote machine must first be enabled in the SuperMUC-NG firewall.
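Once the remote IP address has been enabled, a simple transfer from your local machine can look like the following sketch; the user name and target paths are placeholders, the login node name is given as an example, and Data Transfer Options on SuperMUC-NG describes the full set of supported mechanisms:
# copy a local directory into your WORK area via rsync over ssh
rsync -av ./input_data/ myuser@skx.supermuc.lrz.de:/path/to/your/WORK/input_data/
# or a single file via scp
scp results.tar myuser@skx.supermuc.lrz.de:/path/to/your/WORK/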
Quotas and Access
To display your quota, use the following commands, since the usual "quota" command does not work on the high-performance parallel file systems.
- budget_and_quota or fullquota
For information about the accessible DSS file systems and containers, use the following command on a login node. However, do not use it in a batch job, since it may block.
- dssusrinfo all
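For example, on a login node (the output format may vary):
# show the compute budget and file system quotas of your projects
budget_and_quota
# detailed quota information
fullquota
# list the DSS file systems and containers you can access (login nodes only)
dssusrinfo all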
Parallel copy and rsync
Sometimes it is necessary to copy or sync large amounts of data (terabytes), for example from SCRATCH to WORK. Hint: use msrsync, prsync, or pexec to distribute the work over more than one process or over many cores.
Examples:
module load lrztools
# use 96 processes on one node
msrsync -p 96 $SCRATCH/mydata $WORK/RESULTS/Experiment1
# use all processes within a parallel job
# generate the commands, make the directory structure, copy the data
prsync -f $SCRATCH/mydata -t $WORK/RESULTS/Experiment1
source $HOME/.lrz_parallel_rsync/MKDIR
mpiexec -n 256 pexec $HOME/.lrz_parallel_rsync/RSYNCS
# execute many copies in parallel
cat copylist
cp -r $SCRATCH/mydata/Exp1 $WORK/RESULTS
cp -r $SCRATCH/mydata/Exp2 $WORK/RESULTS
...
cp -r $SCRATCH/mydata/Exp2000 $WORK/RESULTS
mpiexec -n 256 pexec copylist
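The copylist file itself can be generated with a simple loop, for example (the Exp* directory names follow the hypothetical pattern above):
# write one copy command per experiment directory into copylist
for d in $SCRATCH/mydata/Exp*; do
  echo "cp -r $d $WORK/RESULTS"
done > copylist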
Conversion of a SuperMUC project into a Data-Only Project (after project end)
Data in the tape archive will be deleted one year after the project end if the project is not converted into a data-only project. However, the project manager can request that the project be converted into a data-only project to retain access to the archived data. The project manager is warned by email after the project end that the data will be deleted.
On request, it is possible to convert a SuperMUC project into a Data-Only project. Within such a Data-Only project, the project manager can continue to retain and access the data archived on tape, thus using the tape archive as safe and reliable long-term storage for the data generated by a SuperMUC project.
Data can then be accessed via the gateway node "tsmgw.abs.lrz.de" using the SuperMUC username and password of the project manager. Access to the server is possible via SSH with no restrictions on the IP address. However, access to SuperMUC itself is not possible after the end of a project. Currently, the server is equipped with 37 TB of local disk storage (/tsmtrans) to buffer the data retrieved from tape. There is a directory /tsmtrans/<username> where you can store the data and transfer it via scp.
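A minimal sketch of pulling such buffered data to a local machine via scp; the user name and directory names are placeholders:
# copy data staged in /tsmtrans/<username> on the gateway node to the local machine
scp -r myuser@tsmgw.abs.lrz.de:/tsmtrans/myuser/Experiment1 ./Experiment1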
The project manager can access all data of the project that is stored in the tape archive. Note that the password for accessing the tape archive (TSM node) is not stored on the gateway node and must be set and remembered by the project manager.
When a SuperMUC project ends, the project manager will receive a reminder e-mail explaining the steps necessary to convert the project.
Further information
- Backup and Archive on SuperMUC-NG
- Best Practices, Hints and Optimizations for IO
- Data Science Storage for SuperMUC
- Data Science Archive for SuperMUC-NG
- Data Transfer Options on SuperMUC-NG
- SuperMUC-NG Archive Usage (TSM-based)
- Detailed doc for Data Science Storage
- Richtlinien zur Nutzung des Archiv- und Backupsystems
- Richtlinien für Data Science Storage
- Richtlinien zur Nutzung der Filesysteme und des Tapearchives an den Hoch- und Höchstleistungsrechnern