Parallel File System (GPFS)

Overview

Every HPC node of both LiCCA and ALCC has access to the same network filesystem /hpc/gpfs2, which is a shared resource.

This filesystem contains the following folders, which currently share the same performance characteristics:

  • User home directory /hpc/gpfs2/home/u/$USER
  • User scratch directory /hpc/gpfs2/scratch/u/$USER
  • Group home directory /hpc/gpfs2/home/g/$HPC-Projekt/
  • Group scratch directory /hpc/gpfs2/scratch/g/$HPC-Projekt/

Backup

All content of /hpc/gpfs2/home is backed up once a day to the Tape Library of the Rechenzentrum. It is recommended to store all important data (e.g. results of calculations, user-maintained software, etc.) in the User home or Group home directory.

Pro Tip: All data that can easily be recreated (e.g. temporary files, Python environments, etc.) should be stored in the User scratch directory, which is not part of the Backup.
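For example, a Python environment can live in scratch because it can be recreated at any time. A minimal sketch, assuming the scratch path from this page; the environment name demo is made up, and the snippet falls back to a temporary directory when run off-cluster:

```shell
# Keep a recreatable Python environment in scratch instead of the
# backed-up home; 'demo' is a made-up name.
SCRATCH="/hpc/gpfs2/scratch/u/$USER"
[ -d "$SCRATCH" ] || SCRATCH=$(mktemp -d)           # off-cluster fallback
python3 -m venv --without-pip "$SCRATCH/envs/demo"  # cheap to delete and recreate
. "$SCRATCH/envs/demo/bin/activate"
```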

Default Permissions and Ownerships for User and Group directories

Once Project and Cluster access have been approved, default permissions as well as user and group ownerships are applied to the four directories listed above. Permissions and ownerships of existing files and folders in these directories remain untouched.

User directories

  • Owner: personal user account name
  • Group: generic user group with only the above owner as a member
  • Permissions: 0750
  • No additional ACL (Access Control Lists)

These directories can only be accessed by the owner and nobody else (except the root user). Due to the default umask setting of 0022, newly created files and folders get permissions 0644 and 0755, respectively. This does not mean that other cluster users may access your files: no regular user can get past your personal home and scratch directories, which act as gatekeepers.
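The umask arithmetic can be verified anywhere; this sketch sets 0022 explicitly (in a subshell) so it is reproducible off-cluster as well:

```shell
# 0666 - 0022 = 0644 for files, 0777 - 0022 = 0755 for directories.
DEMO=$(mktemp -d)
(
  umask 0022
  touch "$DEMO/file"
  mkdir "$DEMO/dir"
)
stat -c '%a %n' "$DEMO/file" "$DEMO/dir"   # 644 and 755
```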

Group/Project directories

  • Owner: root
  • Group: root
  • Permissions: 0750
  • Additional ACL
ACL
#NFSv4 ACL
#owner:root
#group:root
special:owner@:rwxc:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     \
 (-)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    |
                                                                                                          |
special:group@:r-x-:allow                                                                                 |
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |\ Standard 0750 permissions
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED    |/ for the root user
                                                                                                          |
special:everyone@:----:allow                                                                              |
 (-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED    /

special:owner@:rwxc:allow:FileInherit:DirInherit:InheritOnly                                              \
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (-)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    |
                                                                                                          |> ACL inherited by user created files and folders
group:rzhpc-<group>:rwxc:allow:FileInherit:DirInherit                                                     |  (does not apply to the group folder itself)
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED     |
 (X)DELETE    (X)DELETE_CHILD (X)CHOWN        (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED    /

These directories can (only) be accessed and modified by all group members. Files and directories created by one member can be arbitrarily modified or removed by any other group member.

Note that files and folders created by users in group directories won't have ACL entries for the special group and other (everyone) permissions. The last two mode bits (e.g. 700) and the corresponding output of ls -l (e.g. -rwx------ ) are therefore completely meaningless.

DO NOT attempt to "fix" file and folder permissions in group directories. In particular, DO NOT run any kind of recursive chmod in group folders (e.g. chmod -R ), even if you know what you are doing: it is not necessary at all and will allocate useless extra metadata for every single file and folder.

Due to the nature of these ACLs on group home and scratch directories, all files are marked as executable, and ls may show all files in green. Again, there is no need to fix this.

Granting Access to User and Group directories

User directories

DO NOT make your home or scratch folder world writable (e.g. chmod 777 ). This is explicitly forbidden and users doing so will receive a formal warning.

To grant readonly access for your home and/or scratch directory to a specific group:

Add an ACL entry for a rzhpc-* group
mmeditacl /hpc/gpfs2/home/u/$USER
- or -
mmeditacl /hpc/gpfs2/scratch/u/$USER

# Then append the following content and replace IDMGROUP with an existing rzhpc-* group:
group:IDMGROUP:r-x-:allow
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED

The IdM group of choice should contain as few people as possible, because all members of this group will gain read access to your personal home or scratch space. Recommendation: the respective rzhpc-* group of your project.

To grant readonly access for your home and/or scratch directory to a specific user:

Add an ACL entry for a single user
mmeditacl /hpc/gpfs2/home/u/$USER
- or -
mmeditacl /hpc/gpfs2/scratch/u/$USER

# Then append the following content and replace RZBK with the actual RZ user ID:
user:RZBK:r-x-:allow
 (X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL  (X)READ_ATTR  (X)READ_NAMED
 (-)DELETE    (-)DELETE_CHILD (-)CHOWN        (X)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED

Group/Project directories

You cannot modify the ACL of group/project home and scratch directories. To get access to another group's home or scratch folder, you have to apply for membership in that project.


Local Node Filesystem

Every node provides a local temporary directory /tmp, about 800G in size (shared among all Jobs on the node), backed by an enterprise-grade local SSD drive. No quota is currently enforced on this drive.

This is a private directory; it is only visible to your Job.

Avoid using all of its space at once and allow other users to make use of the Local Node Filesystem as well.
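You can check how much space is currently available before writing large amounts of data; /tmp exists on any Linux system, so this runs anywhere:

```shell
# Show available space on the node-local /tmp before filling it up;
# leave room for other users' Jobs on the same node.
df -h /tmp
```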

Data retention policy

Data in /tmp will be deleted right after your Job terminates! Make sure that you copy back important files before your Job ends.

A typical Job using the Local Node Filesystem has at least three steps:

  1. Copy necessary data from GPFS to /tmp.
  2. Run your calculation there.
  3. Move results from the Node back to the GPFS.
Example (Script part only)
#!/usr/bin/env bash

#SBATCH options ...

# Step 1
# Step 1
TMP=/tmp
cp job.inp job.dat "$TMP"

# Step 2
### change to dir $TMP
pushd "$TMP"
srun your_application
### change back to the starting dir
popd

# Step 3
### move the results back, for example 'job_result.out' to your home directory ~/
mv "$TMP/job_result.out" ~/

Take care that SLURM logfiles are not copied to the Local Node Filesystem. In the worst case this can crash your Job, and the logfile will always be overwritten when copied back. Never use cp * $TMP !
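Instead of a bare glob, copy input files explicitly or filter out the SLURM logs (the default SLURM log name pattern is slurm-<jobid>.out). A sketch using stand-in temporary directories so it runs anywhere:

```shell
# SUBMIT stands in for the submit directory, WORK for $TMP on the node.
SUBMIT=$(mktemp -d); WORK=$(mktemp -d)
touch "$SUBMIT/job.inp" "$SUBMIT/slurm-12345.out"   # demo files
for f in "$SUBMIT"/*; do
    case "$(basename "$f")" in
        slurm-*.out) continue ;;    # never copy SLURM logfiles
    esac
    cp "$f" "$WORK"/
done
ls "$WORK"    # job.inp only
```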

Handling time-limit situations for Jobs using the Local Node Filesystem

If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /tmp directory will be deleted right at the end (timeout or not) of a Job.
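One possible mechanism, shown here as a sketch rather than a drop-in script: ask SLURM to send a signal shortly before the time limit and trap it to rescue intermediate results. The --signal value, the checkpoint file pattern, and the rescue destination are illustrative assumptions:

```shell
#!/usr/bin/env bash
#SBATCH --signal=B:USR1@300   # deliver USR1 to this script 300 s before timeout

TMP=${TMP:-/tmp}
RESCUE_DIR=${RESCUE_DIR:-$HOME}

rescue() {
    # Copy intermediate results back to GPFS before /tmp is deleted.
    cp "$TMP"/checkpoint_*.dat "$RESCUE_DIR"/ 2>/dev/null
    exit 1
}
trap rescue USR1

# Run the application in the background so the trap can fire immediately:
# srun your_application &
# wait
```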

RAM disk (tmpfs)

Every Job can make use of a local RAM disk located at /dev/shm , which offers significantly higher performance (both I/O operations per second and bandwidth) than the filesystem on the local SSD disk. Usage is similar to the local SSD storage (see above). Contrary to disk storage, RAM disk storage requirements have to be added to the requested amount of RAM. The maximum size of the RAM disk is limited to approx. 50% of the total amount of RAM per node, i.e. 500G for epyc and epyc-gpu nodes, and 2T for epyc-mem nodes.

Given that your application requires 4G of RAM, and up to 8G of RAM disk storage will be used, you need to request at least #SBATCH --mem=12G of RAM. Failure to do so will result in your Job being terminated by the OOM (Out-Of-Memory) killer.
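The arithmetic from the example above, as it would appear when preparing a job script (the 4G and 8G values are the example's assumptions):

```shell
# Application RAM + RAM disk usage = memory to request from SLURM.
APP_MEM_G=4      # what the application itself needs
RAMDISK_G=8      # what will be written to /dev/shm
TOTAL_G=$(( APP_MEM_G + RAMDISK_G ))
echo "#SBATCH --mem=${TOTAL_G}G"   # prints: #SBATCH --mem=12G
```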

This is a private directory; it is only visible to your Job.

Handling time-limit situations for Jobs using the RAM disk

If you are unsure how long your Job will take, it might run into the timelimit. Make sure you implement a mechanism to copy back important intermediate results in this case, because the private /dev/shm directory will be deleted right at the end (timeout or not) of a Job.

Do not submit Jobs with significantly more than 8G per CPU core on the epyc partition. Use the epyc-mem partition for high-memory applications instead.

Performance

The GPFS shows optimal performance with sequential read and write patterns (typically large files). Avoid random and high-frequency access patterns (typically small files), and avoid creating large numbers of small files (>1000) in a single directory. Being a network filesystem, every I/O operation involves a small latency during which your calculation remains idle. Since the GPFS is a shared resource, performance may vary for all users and strongly depends on the filesystem load created by individual users, either globally or within a single node.
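When a workload unavoidably involves many small files, bundling them into a single archive turns many small I/O operations on GPFS into one sequential transfer. A sketch with stand-in temporary directories so it runs anywhere:

```shell
# GPFS_DIR stands in for a GPFS path, NODE_DIR for /tmp on the node.
GPFS_DIR=$(mktemp -d); NODE_DIR=$(mktemp -d)
mkdir -p "$GPFS_DIR/inputs"
for i in 1 2 3; do echo "data $i" > "$GPFS_DIR/inputs/part_$i.dat"; done

# One sequential read from GPFS instead of thousands of small ones:
tar -C "$GPFS_DIR" -cf "$NODE_DIR/inputs.tar" inputs
tar -C "$NODE_DIR" -xf "$NODE_DIR/inputs.tar"
ls "$NODE_DIR/inputs"
```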

If you cannot avoid highly frequent I/O operations, it is almost always much more efficient to use the local node filesystem or the RAM disk (see above).

To help you choose the optimal storage for your use case, we ran a couple of benchmarks.



Benchmark commands using fio
fio --rw=read --name=/hpc/gpfs2/u/$USER/test --size=50G
fio --rw=read --name=/hpc/gpfs2/u/$USER/test --size=50G --bs=4M
fio --rw=read --name=/home/ltmp/test --size=50G
fio --rw=read --name=/home/ltmp/test --size=50G --bs=4M
fio --rw=read --name=/dev/shm/test --size=50G
fio --rw=read --name=/dev/shm/test --size=50G --bs=4M

fio --rw=write --name=/hpc/gpfs2/u/$USER/test --size=50G
fio --rw=write --name=/hpc/gpfs2/u/$USER/test --size=50G --bs=4M
fio --rw=write --name=/home/ltmp/test --size=50G
fio --rw=write --name=/home/ltmp/test --size=50G --bs=4M
fio --rw=write --name=/dev/shm/test --size=50G
fio --rw=write --name=/dev/shm/test --size=50G --bs=4M

fio --rw=randread --name=/hpc/gpfs2/u/$USER/test --size=1G
fio --rw=randread --name=/hpc/gpfs2/u/$USER/test --size=1G --bs=4M
fio --rw=randread --name=/home/ltmp/test --size=5G
fio --rw=randread --name=/home/ltmp/test --size=5G --bs=4M
fio --rw=randread --name=/dev/shm/test --size=50G
fio --rw=randread --name=/dev/shm/test --size=50G --bs=4M

fio --rw=randwrite --name=/hpc/gpfs2/u/$USER/test --size=1G
fio --rw=randwrite --name=/hpc/gpfs2/u/$USER/test --size=1G --bs=4M
fio --rw=randwrite --name=/home/ltmp/test --size=5G
fio --rw=randwrite --name=/home/ltmp/test --size=5G --bs=4M
fio --rw=randwrite --name=/dev/shm/test --size=50G
fio --rw=randwrite --name=/dev/shm/test --size=50G --bs=4M