PRACE Course: HPC Code Optimisation Workshop 2022
Contents
With the ever-growing complexity of computer architectures, code optimisation has become the main route to keeping pace with hardware advancements and to making effective use of current and upcoming High Performance Computing systems.
Have you ever asked yourself:
- Where are the performance bottlenecks of my application?
- What is the maximum speed-up achievable on the architecture I am using?
- Does my code scale well across multiple machines?
- Does my implementation match my HPC objectives?
In this workshop, we will discuss these questions and provide a unique opportunity to learn techniques, methods and solutions for improving code, enabling new hardware features and visualising the potential benefits of an optimisation process.
We will describe the latest micro-processor architectures and how developers can efficiently use modern HPC hardware, including SIMD vector units and the memory hierarchy. We will also touch upon exploiting intra-node and inter-node parallelism.
Attendees will be guided along the optimisation process through the incremental improvement of an example application. Through hands-on exercises they will learn how to enable vectorisation using simple pragmas and more effective techniques like changing data layout and alignment.
The work is guided by hints from compiler reports and by profiling tools such as Intel® Advisor, Intel® VTune™ Amplifier, Intel® Application Performance Snapshot and LIKWID, which are used to investigate and improve the performance of an HPC application.
You can ask the lecturers in the Q&A session about how to optimise your code. Please provide a description of your code in the registration form.
Learning Goals
Through a sequence of simple, guided examples of code modernisation, attendees will develop an awareness of the features of multi- and many-core architectures that are crucial for writing modern, portable and efficient applications.
A special focus will be dedicated to scalar and vector optimisations for the Intel® Xeon® Scalable processor, code-named Skylake, utilised in the SuperMUC-NG machine at LRZ.
The workshop interleaves lecture and practical sessions.
Preliminary Agenda
Session | Topic (Lecturer)
1st day morning | Intro (Volker Weinberg)
1st day afternoon | HPC Architecture, Vectorisation
2nd day morning | Profiling: Code Instrumentation, Roofline Model, Intel Advisor (Jonathan Coles)
2nd day afternoon | Debuggers (Gerald Mathias)
3rd day morning | LIKWID (Carla Guillen / Thomas Gruber)
3rd day afternoon | Optimisation highlights by LRZ (CXS Group, LRZ)
The workshop is a PRACE training event organised by LRZ in cooperation with NHR@FAU.
Lecturers
Dr. Patrick Böhl, Dr. Jonathan Coles, Dr. Gerald Mathias, Dr. Carla Guillen, Nisarg Patel, Dr. Josef Weidendorfer (LRZ)
Thomas Gruber (NHR@FAU)
Slides and Exercises
Recommended Access Tools
- Exercises will be done on the CoolMUC-2 cluster @ LRZ with 28-way Haswell-based nodes and an FDR14 InfiniBand interconnect.
- Please use your own laptop or PC with X11 support and an ssh client installed for the hands-on sessions.
Under Windows
- Install and run the Xming X11 server for Windows: https://sourceforge.net/projects/xming/ and then install and run the terminal software PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
- Alternatively, we recommend installing MobaXterm (https://mobaxterm.mobatek.net/download-home-edition.html), which already includes an X11 server.
Under macOS
- Install X11 support for macOS via XQuartz: https://www.xquartz.org/
Under Linux
- ssh and X11 support come with all distributions.
Login under Windows:
- Start Xming and after that PuTTY.
- Enter the host name lxlogin1.lrz.de into the PuTTY host field and click Open.
- Accept & save the host key [only on first login].
- Enter the user name and password (provided by LRZ staff) into the opened console.
Login under Mac:
- Install X11 support for macOS via XQuartz: https://www.xquartz.org/
- Open Terminal
- ssh -Y lxlogin1.lrz.de -l username
- Use user name and password (provided by LRZ staff)
Login under Linux:
- Open xterm
- ssh -Y lxlogin1.lrz.de -l username
- Use user name and password (provided by LRZ staff)
How to use the CoolMUC-2 System
Login nodes: lxlogin1.lrz.de (see the login instructions above).
The reservation is only valid during the workshop; for general use of our Linux Cluster, remove the "--reservation=hcow1s22" option.
- Submit a job:
sbatch --reservation=hcow1s22 job.sh
- List own jobs:
squeue -M cm2
- Cancel jobs:
scancel -M cm2 jobid
- Show reservations:
sinfo -M cm2 --reservation
- Interactive Access:
salloc -M cm2 --time=00:30:00 --reservation=hcow1s22 --partition=cm2_std
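For batch jobs, a typical sequence during the hands-on sessions might look as follows (123456 is only a placeholder for the job ID that sbatch reports):
sbatch --reservation=hcow1s22 job.sh
squeue -M cm2
scancel -M cm2 123456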
Details: https://doku.lrz.de/display/PUBLIC/Running+parallel+jobs+on+the+Linux-Cluster
Examples: https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster
Resource limits: https://doku.lrz.de/display/PUBLIC/Resource+limits+for+parallel+jobs+on+Linux+Cluster
Example OpenMP Batch File
#!/bin/bash
# Standard output file; %j expands to the job ID, %N to the node name
#SBATCH -o /dss/dsshome1/0D/hpckurs99/test.%j.%N.out
# Working directory of the job
#SBATCH -D /dss/dsshome1/0D/hpckurs99
#SBATCH -J test
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=1
#SBATCH --qos=unlimitnodes
# One CoolMUC-2 node provides 28 cores
#SBATCH --cpus-per-task=28
#SBATCH --get-user-env
# Workshop reservation; remove for general use
#SBATCH --reservation=hcow1s22
#SBATCH --time=02:00:00
module load slurm_setup
# Use one OpenMP thread per allocated core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Replace with the actual OpenMP program to run
echo hello, world
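If the script is saved as job.sh, it can be submitted and its output inspected as shown below (the concrete job ID and node name in the output file name are filled in by Slurm):
sbatch --reservation=hcow1s22 job.sh
ls /dss/dsshome1/0D/hpckurs99/test.*.out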
Intel Software Stack
The Intel software stack is automatically loaded at login. The Intel compilers are called icc (for C), icpc (for C++) and ifort (for Fortran). They behave similarly to the GNU compiler suite (the option -help shows an option summary). For reasonable optimisation including SIMD vectorisation, use the options -O3 -xavx (you can use -O2 instead of -O3 and sometimes get better results, since the compiler occasionally tries to be overly smart and undoes hand-coded optimisations).
By default, OpenMP directives in your code are ignored. Use the -qopenmp option to activate OpenMP.
Use mpiexec -n #tasks to run MPI programs. The compiler wrappers' names follow the usual mpicc, mpifort, mpiCC pattern.
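For illustration, building and running an OpenMP code and an MPI code might look as follows; the source and executable names (saxpy.c, ring.c) are placeholders, not files provided by the course:
# OpenMP: compile with vectorisation and OpenMP enabled, then run with 28 threads
icc -O3 -xavx -qopenmp -o saxpy saxpy.c
export OMP_NUM_THREADS=28
./saxpy
# MPI: use the compiler wrapper and launch with mpiexec
mpicc -O3 -xavx -o ring ring.c
mpiexec -n 4 ./ring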
Intel OneAPI
The most recent version of the Intel software stack, "Intel OneAPI", can be loaded with
module load intel-oneapi
Upon loading the main intel-oneapi module, the default modules intel, intel-mpi and intel-mkl are unloaded and replaced by the corresponding intel-oneapi-* variants. Further intel-oneapi-* modules are available via the module command.
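For example, the standard module commands can be used to see which intel-oneapi-* variants are installed and which are currently loaded:
module avail intel-oneapi
module list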
PRACE Survey
Please fill out the PRACE online survey under
tbd.
This helps us and PRACE to increase the quality of the courses, to design the future training programme at LRZ and in Europe according to your needs and wishes, and to obtain future funding for training events.