Parallel File System (GPFS)
Overview
Every HPC node of both LiCCA and ALCC has access to the same network filesystem /hpc/gpfs2
which is a shared resource.
This filesystem contains the following folders, which currently share the same performance characteristics:
- User home directory
/hpc/gpfs2/home/u/$USER
- User scratch directory
/hpc/gpfs2/scratch/u/$USER
- Group home directory
/hpc/gpfs2/home/g/$HPC-Projekt/
- Group scratch directory
/hpc/gpfs2/scratch/g/$HPC-Projekt/
Backup
All content of /hpc/gpfs2/home
is backed up once a day to the tape library of the Rechenzentrum. Store all important data (e.g. results of calculations, user-maintained software) in the User home or Group home directories.
Pro Tip: Store all data that can easily be recreated (e.g. temporary files, Python environments) in the User scratch directory, which is not part of the backup.
Default Permissions and Ownerships for User and Group directories
Once Project and Cluster access have been approved, default permissions as well as user and group ownerships are applied to the four directories listed above. Permissions and ownerships of existing files and folders in these directories remain untouched.
User directories
- Owner: personal user account name
- Group: generic user group with only the above owner as a member
- Permissions: 0750
- No additional ACL (Access Control Lists)
These directories can only be accessed by the owner and nobody else (except the root user). Default permissions of newly created files and folders are 0644
and 0755
, respectively, due to the default umask setting of 0022
. This does not mean that other cluster users can access your files: no regular user can get past your personal home and scratch directories, which act as gatekeepers.
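The relation between the umask and the resulting default modes can be checked directly. The following is a minimal sketch, assuming a Linux system with GNU coreutils (`stat -c`); the function name `show_default_modes` is made up for illustration. New files start from 0666 and new folders from 0777, and the umask 0022 clears the write bit for group and other.

```shell
#!/bin/sh
# Illustration only: show_default_modes is a hypothetical helper name.
# Assumes GNU coreutils (stat -c), as found on typical Linux systems.
show_default_modes() {
    demo_dir=$(mktemp -d) || return 1
    (
        umask 0022                 # the default umask described above
        touch "$demo_dir/file"     # 0666 masked by 0022 -> 0644
        mkdir "$demo_dir/folder"   # 0777 masked by 0022 -> 0755
        stat -c '%a' "$demo_dir/file" "$demo_dir/folder"
    )
    rm -rf "$demo_dir"
}
```

Calling `show_default_modes` prints the mode of the new file followed by the mode of the new folder. The umask is changed inside a subshell only, so the setting of your login shell stays untouched.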
Group/Project directories
- Owner: root
- Group: root
- Permissions: 0750
- Additional ACL
These directories can be accessed and modified only by group members. Files and directories created by one member can be arbitrarily modified or removed by any other group member.
Note that user-created files and folders in group directories won't have an ACL entry for the special group
and other
(everyone) permissions. Therefore the last two mode bits (e.g. 700) and the corresponding output of ls -l
(e.g. -rwx------
) are completely meaningless.
DO NOT attempt to "fix" file and folder permissions in group directories. Especially DO NOT run any kind of recursive chmod
or chown in group folders (e.g. chmod -R or chown -R
), even if you know what you are doing, because it is not necessary at all and will allocate useless extra metadata for every single file and folder.
Due to the nature of these ACLs on group home and scratch directories, all files are marked as executable, and the output of ls
may show all files with green color. Again, no need to fix this.
Granting Access to User and Group directories
User directories
DO NOT make your home or scratch folder world writable (e.g. chmod 777
). This is explicitly forbidden and users doing so will receive a formal warning.
To grant read-only access to your home and/or scratch directory for a specific group:
The IdM group of choice should contain as few people as possible, because all members of this group will have read access to your personal home or scratch space this way. Recommendation: the respective rzhpc-*
group of your project.
To grant read-only access to your home and/or scratch directory for a specific user:
Group/Project directories
You cannot modify the ACL of home and scratch group/project directories. To get access to another group's home or scratch folder you have to apply for Access to the Project Membership.
Quota regulations and management
The GPFS filesystem is operated with quota enabled for the user and group directories in home and scratch. Users can check their current GPFS filesystem usage and quota situation on the login nodes with the command list-quota
(/usr/local/bin/list-quota
):
list-quota
with output of the following form
johndoe@licca001:~$ list-quota
user quota: johndoe
                        Block Limits                      | File
Filesystem Fileset type blocks quota limit in_doubt grace | files quota   limit    in_doubt grace
gpfs2      home    USR  0      512G  1.5T  80M      none  | 5     2000000 6000000  38       none
gpfs2      scratch USR  0      1T    3T    0        none  | 1     4000000 12000000 0        none
Quotas are set on block storage usage and also on inode usage (number of files and directories).
- If none is stated in the grace column, everything is fine.
- The current usage is listed under blocks (storage space) and under files (number of files).
- If you have used more resources than listed in the quota column but less than the hard limit (limit column), the grace column shows the time remaining before writes are blocked (for example: 28 days).
Quotas are also set on the HPC-project-group directories in home and scratch.
list-quota -g
additionally shows the quota for all HPC-project-group directories of which the user is a member.
johndoe@licca001:~$ list-quota -g
user quota: johndoe
                        Block Limits                      | File
Filesystem Fileset type blocks quota limit in_doubt grace | files quota   limit    in_doubt grace
gpfs2      home    USR  0      512G  1.5T  80M      none  | 5     2000000 6000000  38       none
gpfs2      scratch USR  0      1T    3T    0        none  | 1     4000000 12000000 0        none
group home fileset: home.g.test
                   Block Limits                        | File
Filesystem type    blocks quota  limit  in_doubt grace | files quota    limit    in_doubt grace
gpfs2      FILESET 0      2.5T   7.5T   0        none  | 1     10000000 30000000 0        none
group scratch fileset: scratch.g.test
                   Block Limits                        | File
Filesystem type    blocks quota  limit  in_doubt grace | files quota    limit    in_doubt grace
gpfs2      FILESET 0      5.039T 15.12T 0        none  | 1     20000000 60000000 0        none
You can exceed the quota for some time (the grace time) up to a hard limit (your quota times three). The grace time (for blocks and inodes) is set to 30 days.
A quota monitoring service sends you a one-time "warning" message as soon as you exceed any of your quotas. You will get a second message ("critical") when less than one week of grace time remains.
Please clean up your directories at this point at the latest. Open a ticket with our Service Desk if this is a problem.
After the end of the grace time, no further writes are possible!
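When the file (inode) quota is the problem, it helps to know which subdirectories hold the most entries. The following is a minimal sketch, not a site-provided tool: `count_inodes` and `inode_report` are hypothetical helper names, and the sketch assumes a `find` that supports `-mindepth` and `-maxdepth` (GNU or BSD find) and file names without newline characters.

```shell
#!/bin/sh
# Illustration only: count_inodes and inode_report are hypothetical helper names.

# Print the number of files and directories at or below DIR (first argument).
# Assumes that no file name contains a newline character.
count_inodes() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Print "<count> <path>" for every direct subdirectory of DIR, largest first.
# Hidden subdirectories (names starting with a dot) are included.
inode_report() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r sub; do
        printf '%s %s\n' "$(count_inodes "$sub")" "$sub"
    done | sort -rn
}
```

For example, `inode_report /hpc/gpfs2/home/u/$USER | head` lists the ten subdirectories of your home with the most entries. Walking a large directory tree creates metadata load on the filesystem itself, so run such a scan sparingly.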
Local Node Filesystem
Every node provides a local temporary directory /tmp
of about 800G (capacity shared between all Jobs on the node), backed by an enterprise-grade local SSD. No quota is currently enforced on this drive.
This directory is private: it is only visible to your Job.
Avoid using all of its space at once and allow other users to make use of the Local Node Filesystem as well.
Data retention policy
Data in /tmp
will be deleted right after your Job terminates! Make sure that you copy back important files before your Job ends.
A typical Job using the Local Node Filesystem has at least three steps:
- Copy necessary data from GPFS to /tmp.
- Run your calculation there.
- Move results from the Node back to the GPFS.
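The three steps above can be sketched as a job script. This is a minimal sketch under stated assumptions, not the cluster's reference workflow: the directories `input`, `results` and `/tmp/work`, the `#SBATCH` values, and the program name `my_program` are placeholders, and the function names are made up for illustration. An explicit input directory is copied instead of a wildcard, so SLURM logfiles in the submit directory are never staged.

```shell
#!/bin/bash
#SBATCH --job-name=local-tmp-demo
#SBATCH --time=01:00:00

# Hypothetical locations; replace them with your own directories.
INPUT_DIR="${INPUT_DIR:-/hpc/gpfs2/scratch/u/$USER/input}"
RESULT_DIR="${RESULT_DIR:-/hpc/gpfs2/scratch/u/$USER/results}"
WORK_DIR="${WORK_DIR:-/tmp/work}"

# Step 1: copy the input data from GPFS to the local SSD.
stage_in() {
    mkdir -p "$WORK_DIR/input" "$WORK_DIR/output" &&
    cp -r "$INPUT_DIR"/. "$WORK_DIR/input/"
}

# Step 2: run the given command inside the local working directory.
run_calculation() {
    ( cd "$WORK_DIR" && "$@" )
}

# Step 3: copy the results back to GPFS before the Job ends.
stage_out() {
    mkdir -p "$RESULT_DIR" &&
    cp -r "$WORK_DIR/output"/. "$RESULT_DIR/"
}

# Only run the workflow inside a SLURM Job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # my_program is a placeholder that reads ./input and writes to ./output.
    stage_in && run_calculation my_program input output && stage_out
fi
```

Keeping the steps in separate functions makes it easy to call `stage_out` from more than one place, for example from a signal handler when the time limit approaches.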
Take care that SLURM logfiles are not copied to the Local Node Filesystem. In the worst case this can crash the Job, and the logfiles on GPFS will always be overwritten when the copies are moved back. Never use cp * $TMP
!
Handling Timelimit-situations for Jobs using the Local Node Filesystem.
If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /tmp
directory will be deleted right at the end (timeout or not) of a Job.
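One possible mechanism is to let SLURM send a signal to the batch script shortly before the time limit and to copy the data back in a signal handler. The following is a minimal sketch, not a site-recommended recipe: it uses the `--signal` option of `sbatch`, and the 300 second lead time, the directories, the function name `rescue_results` and the program name `my_program` are assumptions. The program is started in the background because the shell only runs a trap handler while it is waiting, not while a foreground command is running.

```shell
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell about 300 s before the time limit

# Hypothetical locations; replace them with your own directories.
RESULT_DIR="${RESULT_DIR:-/hpc/gpfs2/scratch/u/$USER/results}"
WORK_DIR="${WORK_DIR:-/tmp/work}"

# Copy whatever has been written so far back to GPFS.
rescue_results() {
    mkdir -p "$RESULT_DIR" &&
    cp -r "$WORK_DIR"/. "$RESULT_DIR/"
}

# When the warning signal arrives: save the data, then end the Job script.
trap 'rescue_results; exit 0' USR1

# Only run the workflow inside a SLURM Job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    mkdir -p "$WORK_DIR" && cd "$WORK_DIR" || exit 1
    my_program &      # placeholder; runs in the background so the trap can fire
    wait "$!"         # returns early if SIGUSR1 arrives
    rescue_results    # normal end of the program: copy the final results
fi
```

Files that the program is still writing when the signal arrives may be copied in an incomplete state, so this works best with applications that write self-contained checkpoint files. Choose the lead time generously enough for the copy to finish.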
RAM disk (tmpfs)
Every Job can make use of a local RAM disk located at /dev/shm
, which offers significantly higher performance (both I/O operations per second and bandwidth) than the filesystem on the local SSD. Usage is similar to the local SSD storage (see above). Unlike disk storage, RAM disk usage counts towards the memory of your Job, so it has to be added to the requested amount of RAM. The maximum size of the RAM disk is limited to approx. 50% of the total amount of RAM per node, i.e. 500G for the epyc
and epyc-gpu
nodes, and 2T for epyc-mem
nodes. For example, a Job whose application memory and RAM disk usage together add up to 12G has to request
#SBATCH --mem=12G
of RAM. Failure to do so will result in your Job being terminated by the OOM (Out-Of-Memory) killer.
This directory is private: it is only visible to your Job.
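Because RAM disk usage counts towards the memory request, it can help to check the size of the input before copying it to /dev/shm. The following is a minimal sketch under stated assumptions: the split of the 12G request into 8G for the application and 4G for the RAM disk is hypothetical, as are the directories and the function names `fits_in_budget` and `stage_to_ramdisk`. The size is only an estimate based on `du -sk`.

```shell
#!/bin/bash
#SBATCH --mem=12G   # hypothetical split: 8G for the application plus 4G for the RAM disk

# Hypothetical values; replace them with your own.
INPUT_DIR="${INPUT_DIR:-/hpc/gpfs2/scratch/u/$USER/input}"
RAMDISK_DIR="${RAMDISK_DIR:-/dev/shm/work}"
RAMDISK_BUDGET_KB="${RAMDISK_BUDGET_KB:-4194304}"   # 4G expressed in kilobytes

# Succeed only if DIR (first argument) uses at most BUDGET_KB (second argument) kilobytes.
fits_in_budget() {
    used_kb=$(du -sk "$1" | cut -f1)
    [ "$used_kb" -le "$2" ]
}

# Copy the input to the RAM disk only if it fits into the planned budget.
stage_to_ramdisk() {
    fits_in_budget "$INPUT_DIR" "$RAMDISK_BUDGET_KB" || {
        echo "input exceeds the RAM disk budget" >&2
        return 1
    }
    mkdir -p "$RAMDISK_DIR" && cp -r "$INPUT_DIR"/. "$RAMDISK_DIR/"
}

# Only stage data inside a SLURM Job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    stage_to_ramdisk || exit 1
fi
```

The check covers the input only. Output written to the RAM disk during the calculation also counts towards the requested memory and has to be included in the budget.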
Handling Timelimit-situations for Jobs using the RAM disk.
If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /dev/shm
directory will be deleted right at the end (timeout or not) of a Job.
Do not submit Jobs with significantly more than 8G per CPU core on the epyc
partition. Use the epyc-mem
partition for high memory applications instead.
Performance
The GPFS performs best with sequential read and write patterns (typically large files). Avoid random and high-frequency access patterns (typically small files), and avoid creating large numbers of small files (>1000) in a single directory. Because the GPFS is a network filesystem, every I/O operation incurs a small latency during which your calculation sits idle. Since the GPFS is a shared resource, a heavy load created by a single user, either globally or within a single node, can noticeably degrade performance for all other users.
If you cannot avoid highly frequent I/O operations, it is almost always much more efficient to use the local node filesystem or the RAM disk (see above).
To help you choose the optimal storage for your use case, we ran a couple of benchmarks.
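One way to follow this advice is to write many small files to the local node filesystem and transfer them to the GPFS as a single archive. The following is a minimal sketch; `pack_results` is a hypothetical helper name and the paths in the usage note are placeholders.

```shell
#!/bin/sh
# Illustration only: pack_results is a hypothetical helper name.
# Bundle everything below SRC_DIR (first argument) into one tar archive
# (second argument), so the GPFS receives one large sequential write
# instead of many small files.
pack_results() {
    tar -cf "$2" -C "$1" .
}
```

For example, `pack_results /tmp/work/output /hpc/gpfs2/scratch/u/$USER/results.tar` stores the whole output directory in one file, and `tar -tf results.tar` lists its contents later without unpacking.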