Parallel File System (GPFS)

Overview

Every HPC node of both LiCCA and ALCC has access to the same network filesystem, /hpc/gpfs2, which is a shared resource.

This filesystem contains the following folders, which currently share the same performance characteristics:

  • User home directory /hpc/gpfs2/home/u/$USER
  • User scratch directory /hpc/gpfs2/scratch/u/$USER
  • Group home directory /hpc/gpfs2/home/g/$HPC-Projekt/
  • Group scratch directory /hpc/gpfs2/scratch/g/$HPC-Projekt/

Backup

All content of /hpc/gpfs2/home is backed up once a day to the tape library of the Rechenzentrum. Store all important data (e.g. results of calculations, user-maintained software, etc.) in the user home or group home directories.

Pro Tip: All data that can easily be recreated (e.g. temporary files, Python environments, etc.) should be stored in the user scratch directory, which is not part of the backup.

Default Permissions and Ownerships for User and Group directories

Once Project and Cluster access have been approved, default permissions as well as user and group ownerships are applied to the four directories listed above. Permissions and ownerships of existing files and folders in these directories remain untouched.

User directories

  • Owner: personal user account name
  • Group: generic user group with only the above owner as a member
  • Permissions: 0750
  • No additional ACL (Access Control Lists)

These directories can only be accessed by the owner and nobody else (except the root user). Default permissions of newly created files and folders are 0644 and 0755, respectively, due to the default umask setting of 0022. This does not mean that other cluster users may access your files, because no regular user can get past your personal home and scratch directories, which act as gatekeepers.
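The effect of the umask on new files and folders can be checked with a small sketch like the following. It assumes GNU stat (as found on typical Linux nodes) and works in a throwaway directory created with mktemp; the function name show_default_modes is only illustrative.

```shell
#!/usr/bin/env bash
# Show the modes that new files and folders receive under umask 0022.
show_default_modes() {
    local demo_dir
    demo_dir=$(mktemp -d) || return 1
    (
        cd "$demo_dir" || exit 1
        umask 0022            # the default umask stated above
        touch file.txt        # files start from 0666, minus the umask bits
        mkdir folder          # folders start from 0777, minus the umask bits
        stat -c '%a %n' file.txt folder   # GNU stat: octal mode and name
    )
    rm -rf "$demo_dir"
}

show_default_modes
# prints:
# 644 file.txt
# 755 folder
```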

Group/Project directories

  • Owner: root
  • Group: root
  • Permissions: 0750
  • Additional ACL
ACL
#NFSv4 ACL
#owner:root
#group:root
special:owner@:rwxc:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     \
 (-)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    |
                                                                                                          |
special:group@:r-x-:allow                                                                                 |
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |\ Standard 0750 permissions
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED    |/ for the root user
                                                                                                          |
special:everyone@:----:allow                                                                              |
 (-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED    /

special:owner@:rwxc:allow:FileInherit:DirInherit:InheritOnly                                              \
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (-)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    |
                                                                                                          |> ACL inherited by user created files and folders
group:rzhpc-<group>:rwxc:allow:FileInherit:DirInherit                                                     |  (does not apply to the group folder itself)
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (X)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    /

These directories can be accessed and modified only by the group members. Files and directories created by one member can be arbitrarily modified or removed by any other group member.

Note that user-created files and folders in group directories won't have an ACL entry for the special group (group@) and other (everyone@) permissions. Therefore the last two mode digits (e.g. the 00 in 700) and the corresponding output of ls -l (e.g. -rwx------) are completely meaningless.

DO NOT attempt to "fix" file and folder permissions in group directories. Especially DO NOT run any kind of recursive chmod or chown in group folders (e.g. chmod -R or chown -R), even if you know what you are doing, because it is not necessary at all and will allocate useless extra metadata for every single file and folder.

Due to the nature of these ACLs on group home and scratch directories, all files are marked as executable, and ls may show all files in green. Again, there is no need to fix this.

Granting Access to User and Group directories

User directories

DO NOT make your home or scratch folder world writable (e.g. chmod 777 ). This is explicitly forbidden and users doing so will receive a formal warning.

To grant read-only access to your home and/or scratch directory to a specific group:

Add an ACL entry for a rzhpc-* group
mmeditacl /hpc/gpfs2/home/u/$USER
- or -
mmeditacl /hpc/gpfs2/scratch/u/$USER

# Then append the following content and replace IDMGROUP with an existing rzhpc-* group:
group:IDMGROUP:r-x-:allow
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED

The IdM group of choice should contain as few people as possible, because all members of this group will have read access to your personal home or scratch space this way. Recommendation: the respective rzhpc-* group of your project.

To grant read-only access to your home and/or scratch directory to a specific user:

Add an ACL entry for a single user
mmeditacl /hpc/gpfs2/home/u/$USER
- or -
mmeditacl /hpc/gpfs2/scratch/u/$USER

# Then append the following content and replace RZBK with the actual RZ user ID:
user:RZBK:r-x-:allow
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED
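
To avoid typos when typing the entry by hand, the read-only ACL entries shown above could be generated with a small helper and then pasted into the editor opened by mmeditacl. This is only a sketch: the function name acl_readonly_entry and the group name rzhpc-myproject are hypothetical, and the helper only prints text, it does not change any ACL.

```shell
#!/usr/bin/env bash
# acl_readonly_entry TYPE NAME
# Print a read-only NFSv4 ACL entry in the format shown above.
# TYPE is "user" or "group"; NAME is the RZ user ID or the rzhpc-* group.
acl_readonly_entry() {
    local type=$1 name=$2
    case $type in
        user|group) ;;
        *) echo "type must be 'user' or 'group'" >&2; return 1 ;;
    esac
    printf '%s:%s:r-x-:allow\n' "$type" "$name"
    printf ' (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED\n'
    printf ' (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED\n'
}

# Example with a hypothetical group name:
acl_readonly_entry group rzhpc-myproject
```

Copy the three printed lines and append them in the mmeditacl session for the directory you want to share.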

Group/Project directories

You cannot modify the ACL of home and scratch group/project directories. To get access to another group's home or scratch folder, you have to apply for membership in the corresponding project.

Quota regulations and management

The GPFS filesystem is operated with quota enabled for the user and group directories in home and scratch. Users can check their current GPFS filesystem usage and quota situation on the login nodes with the command list-quota (/usr/local/bin/list-quota):

on LiCCA or ALCC login node
list-quota

with output of the following form

johndoe@licca001:~$ list-quota 
user quota: johndoe
                         Block Limits                                               |     File 
Filesystem Fileset    type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  
gpfs2      home       USR               0       512G       1.5T        80M     none |        5 2000000  6000000       38     none 
gpfs2      scratch    USR               0         1T         3T          0     none |        1 4000000 12000000        0     none 

Quotas are set on both the used block storage and the inode usage (number of files and directories).

  • if none is stated under the column grace, everything is fine,
  • the current usage is listed for blocks (storage space) under blocks and for the number of files under files,
  • if the user has used more resources than those listed in the column quota but less than the hard limit (column limit), the remaining grace time is stated under the column grace (for example: 28 days).
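
The rules above can be turned into a quick check. The following sketch reads the output of list-quota on standard input and prints every gpfs2 line whose block or file grace column is not none. It assumes the column layout of the example output shown above; the function name over_quota is hypothetical.

```shell
#!/usr/bin/env bash
# over_quota: print the list-quota lines where a grace period is running.
# Each data line is split at "|" into a block half and a file half;
# the last word of each half is compared with "none".
over_quota() {
    awk -F'|' '
        $1 ~ /^gpfs2/ && NF == 2 {
            nb = split($1, blk, " ")   # words of the block half
            nf = split($2, fil, " ")   # words of the file half
            if (blk[nb] != "none" || fil[nf] != "none")
                print $0
        }'
}

# Usage on a login node:
#   list-quota -g | over_quota
```

No output means that none is stated in every grace column.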

Quotas are also set on the HPC project group directories in home and scratch.

on LiCCA or ALCC login node
list-quota -g

additionally shows the quota for all HPC-project-group directories where the user is a member.

johndoe@licca001:~$ list-quota -g
user quota: johndoe
                         Block Limits                                               |     File 
Filesystem Fileset    type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  
gpfs2      home       USR               0       512G       1.5T        80M     none |        5 2000000  6000000       38     none 
gpfs2      scratch    USR               0         1T         3T          0     none |        1 4000000 12000000        0     none 

group home fileset: home.g.test

                         Block Limits                                    |     File 
Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  
gpfs2      FILESET           0       2.5T       7.5T          0     none |        1 10000000 30000000        0     none 

group scratch fileset: scratch.g.test

                         Block Limits                                    |     File 
Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  
gpfs2      FILESET           0     5.039T     15.12T          0     none |        1 20000000 60000000        0     none 

You can exceed the quota for some time (the grace time) up to a hard limit (three times your quota). The grace time (for blocks and inodes) is set to 30 days.
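
As a worked example of the factor three, the hard limits in the sample output above follow directly from the quota values. The helper name hard_limit_gb is hypothetical, and values are given in G to keep the arithmetic in integers.

```shell
#!/usr/bin/env bash
# hard_limit_gb QUOTA_GB: hard limit = three times the quota
hard_limit_gb() {
    echo $(( $1 * 3 ))
}

hard_limit_gb 512     # prints 1536 (G), i.e. the 1.5T limit of the user home
hard_limit_gb 1024    # prints 3072 (G), i.e. the 3T limit of the user scratch
```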

A quota monitoring service sends you a one-time "warning" message once you exceed any of your quotas. You will get a second message ("critical") when less than one week of grace time remains.

Please try to clean up your directories at this point at the latest. Open a ticket with our Service-desk, if this is a problem.

After the end of the grace time, no further writes are possible!


Local Node Filesystem

Every node provides a local temporary directory /tmp of about 800G in size, backed by an enterprise-grade local SSD. The space is shared among all Jobs running on the node. There is no quota enforced on this drive at the moment.

This directory is private: its contents are only visible to your Job.

Avoid using all of its space at once and allow other users to make use of the Local Node Filesystem as well.

Data retention policy

Data in /tmp will be deleted right after your Job terminates! Make sure that you copy back important files before your Job ends.

A typical Job using the Local Node Filesystem has at least three steps:

  1. Copy necessary data from GPFS to /tmp.
  2. Run your calculation there.
  3. Move results from the Node back to the GPFS.
Example (Script part only)
#!/usr/bin/env bash

#SBATCH options ...

# Step 1: copy only the required input files to the node-local /tmp
TMP=/tmp
cp job.inp job.dat "$TMP"/

# Step 2: run the calculation inside $TMP
pushd "$TMP" || exit 1
srun your_application
### change back to the starting dir
popd || exit 1

# Step 3: move the results back to the GPFS,
### for example 'job_result.out' to your home directory ~/
mv "$TMP/job_result.out" ~/

Take care not to copy SLURM log files to the Local Node Filesystem. In the worst case this crashes the Job, and the log files will always be overwritten when copied back. Never use cp * $TMP !!

Handling Timelimit-situations for Jobs using the Local Node Filesystem.

If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /tmp directory will be deleted right at the end (timeout or not) of a Job.
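
One possible mechanism, sketched under assumptions: sbatch can send a signal to the batch shell some time before the timelimit (option --signal), and a trap in the script can then copy data back. The 300 seconds lead time, the function names rescue_results and run_with_rescue, and the destination directory are assumptions to adapt. The application has to run in the background, because bash only runs a trap handler once the foreground command has returned.

```shell
#!/usr/bin/env bash
# Assumption: send USR1 to the batch shell 300 s before the timelimit.
#SBATCH --signal=B:USR1@300

# Hypothetical locations; adjust them to your Job.
RESCUE_SRC=${RESCUE_SRC:-/tmp}
RESCUE_DEST=${RESCUE_DEST:-$HOME/rescued_results}

# Copy everything from the private node-local directory back to the GPFS.
rescue_results() {
    mkdir -p "$RESCUE_DEST"
    cp -a "$RESCUE_SRC"/. "$RESCUE_DEST"/
}

# run_with_rescue CMD...: run CMD in the background and wait for it,
# so that the shell can react to USR1 while CMD is still running.
run_with_rescue() {
    local pid status=0
    trap rescue_results USR1
    "$@" &
    pid=$!
    # wait returns early when the trapped signal arrives; keep waiting
    # until the command itself has finished.
    while kill -0 "$pid" 2>/dev/null; do
        wait "$pid" && status=0 || status=$?
    done
    trap - USR1
    return "$status"
}

# Usage inside the batch script:
#   run_with_rescue srun your_application
```

The handler only helps if the application writes its intermediate results to files regularly; it copies whatever is on disk when the signal arrives.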

RAM disk (tmpfs)

Every Job can make use of a local RAM disk located at /dev/shm, which has significantly higher performance (both I/O operations per second and bandwidth) than the filesystem on the local SSD. The usage is similar to the local SSD storage (see above). Contrary to disk storage, RAM disk storage requirements have to be added to the requested amount of RAM. The maximum size of the RAM disk is limited to approx. 50% of the total amount of RAM per node, i.e. 500G for epyc and epyc-gpu nodes, and 2T for epyc-mem nodes.

Given that your application requires 4G of RAM, and up to 8G of RAM disk storage will be used, you need to request at least #SBATCH --mem=12G of RAM. Failure to do so will result in your Job being terminated by the OOM (Out-Of-Memory) killer.
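
The memory request from this example can be computed as follows. The variable names are only illustrative; the values are those from the paragraph above.

```shell
#!/usr/bin/env bash
app_mem_gb=4      # RAM needed by the application itself
ramdisk_gb=8      # maximum amount of data placed in /dev/shm
total_gb=$((app_mem_gb + ramdisk_gb))

echo "#SBATCH --mem=${total_gb}G"
# prints: #SBATCH --mem=12G
```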

This directory is private: its contents are only visible to your Job.

Handling Timelimit-situations for Jobs using the RAM disk.

If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /dev/shm directory will be deleted right at the end (timeout or not) of a Job.

Do not submit Jobs with significantly more than 8G per CPU core on the epyc partition. Use the epyc-mem partition for high memory applications instead.

Performance

The GPFS shows optimal performance with sequential read and write patterns (typically large files). Avoid random and high-frequency access patterns (typically small files). Avoid the creation of large numbers of small files (>1000) in a single directory. Because it is a network filesystem, every I/O operation involves a small latency during which your calculation remains idle. Since the GPFS is a shared resource, the performance seen by all other users may vary and depends strongly on the filesystem load created by a single user, either globally or within a single node.

If you cannot avoid high-frequency I/O operations, it is almost always much more efficient to use the local node filesystem or the RAM disk (see above).
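
One common way to follow this advice for data sets made of many small files is to keep them on the GPFS as a single tar archive and to unpack that archive onto the local node filesystem inside the Job. The following is a minimal sketch; the function names pack_dir and unpack_to and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# pack_dir SRC ARCHIVE: bundle a directory with many small files into
# one archive, which the GPFS can read sequentially.
pack_dir() {
    local src=$1 archive=$2
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# unpack_to ARCHIVE DEST: extract the archive into DEST,
# e.g. /tmp or /dev/shm inside a Job.
unpack_to() {
    local archive=$1 dest=$2
    mkdir -p "$dest"
    tar -xf "$archive" -C "$dest"
}

# Usage (hypothetical paths):
#   pack_dir  /hpc/gpfs2/scratch/u/$USER/dataset /hpc/gpfs2/scratch/u/$USER/dataset.tar
#   unpack_to /hpc/gpfs2/scratch/u/$USER/dataset.tar /tmp
```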

To help you choose the optimal storage for your use case, we ran a couple of benchmarks.



Benchmark commands using fio
fio --rw=read --name=/hpc/gpfs2/u/$USER/test --size=50G
fio --rw=read --name=/hpc/gpfs2/u/$USER/test --size=50G --bs=4M
fio --rw=read --name=/home/ltmp/test --size=50G
fio --rw=read --name=/home/ltmp/test --size=50G --bs=4M
fio --rw=read --name=/dev/shm/test --size=50G
fio --rw=read --name=/dev/shm/test --size=50G --bs=4M

fio --rw=write --name=/hpc/gpfs2/u/$USER/test --size=50G
fio --rw=write --name=/hpc/gpfs2/u/$USER/test --size=50G --bs=4M
fio --rw=write --name=/home/ltmp/test --size=50G
fio --rw=write --name=/home/ltmp/test --size=50G --bs=4M
fio --rw=write --name=/dev/shm/test --size=50G
fio --rw=write --name=/dev/shm/test --size=50G --bs=4M

fio --rw=randread --name=/hpc/gpfs2/u/$USER/test --size=1G
fio --rw=randread --name=/hpc/gpfs2/u/$USER/test --size=1G --bs=4M
fio --rw=randread --name=/home/ltmp/test --size=5G
fio --rw=randread --name=/home/ltmp/test --size=5G --bs=4M
fio --rw=randread --name=/dev/shm/test --size=50G
fio --rw=randread --name=/dev/shm/test --size=50G --bs=4M

fio --rw=randwrite --name=/hpc/gpfs2/u/$USER/test --size=1G
fio --rw=randwrite --name=/hpc/gpfs2/u/$USER/test --size=1G --bs=4M
fio --rw=randwrite --name=/home/ltmp/test --size=5G
fio --rw=randwrite --name=/home/ltmp/test --size=5G --bs=4M
fio --rw=randwrite --name=/dev/shm/test --size=50G
fio --rw=randwrite --name=/dev/shm/test --size=50G --bs=4M