MPI Offload + (OpenMP/SYCL/DPC++/Kokkos) - multi-node multi-GPU

Introduction

For applications that need to scale beyond a single GPU, there are several options. We have already shown here that OpenMP can address more than one GPU, with parallelism driven by several host threads (an OpenMP parallel region). Instead of offloading via omp target, other programming models such as SYCL/DPC++ or Kokkos can of course be used as well; the latter two also allow for device selection in one way or another.

But it might be more practical to simply use MPI as the primary parallelism paradigm and leave it to the runtime to assign a device to each rank. In the simple case illustrated here, each rank is assigned exactly one device (GPU/tile). If more than one device is assigned to a rank, a multi-GPU workflow per rank can be used, as shown e.g. for OpenMP. Be aware, however, that the complexity and management of the parallel environment will then grow substantially.

The other way around, running more ranks per node than there are GPUs/tiles available, e.g. in order to also exploit the host CPUs, introduces considerable complexity when it comes to distributing the workload in a load-balanced and efficient fashion. We will dwell on that elsewhere.

Let's keep it simple here for the moment: one rank, one device (GPU/tile).
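A common way to realize this mapping, assuming the launcher places ranks consecutively on each node, is to derive a node-local rank via MPI_Comm_split_type and use it as the device index. A minimal sketch (the function name get_local_rank is ours, for illustration):

```cpp
#include <mpi.h>

// Node-local rank of the calling process; used below as the device index
// for whichever offload model is chosen (OpenMP, SYCL/DPC++, Kokkos).
int get_local_rank(MPI_Comm comm)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);
    return local_rank;
}
```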

Realization in C++

Example OpenMP
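A minimal sketch of the one-rank-one-device pattern with OpenMP offload could look as follows; the daxpy-style kernel and all names are placeholders, and get_local_rank() is the helper sketched in the introduction.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int get_local_rank(MPI_Comm comm);   // helper from the introduction

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind this rank to "its" device: node-local rank modulo device count.
    const int num_dev   = omp_get_num_devices();
    const int device_id = (num_dev > 0)
                        ? get_local_rank(MPI_COMM_WORLD) % num_dev : 0;
    omp_set_default_device(device_id);

    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double* px = x.data();
    double* py = y.data();

    // Each rank offloads its own, independent chunk of work.
    #pragma omp target teams distribute parallel for \
                map(to: px[0:n]) map(tofrom: py[0:n])
    for (int i = 0; i < n; ++i)
        py[i] += 2.0 * px[i];

    // Combine the per-rank results.
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < n; ++i) local_sum += y[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```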

Example SYCL/DPC++
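With SYCL/DPC++ the same pattern boils down to constructing the queue on the GPU selected by the node-local rank. Again a sketch under the same assumptions:

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int get_local_rank(MPI_Comm comm);   // helper from the introduction

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Pick "our" GPU from the list of visible GPU devices.
    auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
    sycl::queue q(gpus[get_local_rank(MPI_COMM_WORLD) % gpus.size()]);

    const size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);

    {   // buffer scope: results are written back to y at the closing brace
        sycl::buffer<double> bx(x.data(), sycl::range<1>(n));
        sycl::buffer<double> by(y.data(), sycl::range<1>(n));

        q.submit([&](sycl::handler& h) {
            sycl::accessor ax(bx, h, sycl::read_only);
            sycl::accessor ay(by, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                ay[i] += 2.0 * ax[i];
            });
        });
    }

    double local_sum = 0.0, global_sum = 0.0;
    for (size_t i = 0; i < n; ++i) local_sum += y[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```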

Example Kokkos
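With Kokkos, the device is chosen at initialization time, e.g. via InitializationSettings in recent Kokkos versions (older versions use InitArguments or the --kokkos-device-id flag). A sketch under the same assumptions:

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <cstdio>

int get_local_rank(MPI_Comm comm);   // helper from the introduction

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hand the node-local rank to Kokkos as the device id.
    Kokkos::InitializationSettings settings;
    settings.set_device_id(get_local_rank(MPI_COMM_WORLD));
    Kokkos::initialize(settings);
    {
        const int n = 1 << 20;
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);

        // Each rank runs its kernel on its own device.
        Kokkos::parallel_for("daxpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += 2.0 * x(i);
        });

        double local_sum = 0.0;
        Kokkos::parallel_reduce("sum", n,
            KOKKOS_LAMBDA(const int i, double& s) { s += y(i); },
            local_sum);

        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("global sum = %f\n", global_sum);
    }
    Kokkos::finalize();
    MPI_Finalize();
    return 0;
}
```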

Compilation
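Compilation depends on the toolchain. With the Intel oneAPI compilers and Intel MPI, the following lines are a plausible starting point (wrapper names and flags are assumptions and may differ on your system); Kokkos is typically built via CMake with the desired backend enabled:

```bash
# MPI + OpenMP offload (Intel compiler, SPIR-V offload target)
mpiicpx -fiopenmp -fopenmp-targets=spir64 mpi_omp.cpp -o mpi_omp

# MPI + SYCL/DPC++
mpiicpx -fsycl mpi_sycl.cpp -o mpi_sycl

# MPI + Kokkos (SYCL backend), assuming a Kokkos build is found by CMake
cmake -B build -DCMAKE_CXX_COMPILER=icpx -DKokkos_ENABLE_SYCL=ON
cmake --build build
```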

Test Run
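A test run then simply launches one rank per device. The launcher options, rank counts, and environment variables below are assumptions for an Intel MPI setup with four GPUs/tiles per node:

```bash
# 2 nodes, 4 devices per node -> 8 ranks, one per device
mpirun -n 8 -ppn 4 ./mpi_omp

# same, with GPU-aware MPI transfers enabled (Intel MPI)
I_MPI_OFFLOAD=1 mpirun -n 8 -ppn 4 ./mpi_omp
```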

Performance Analysis
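For a first look at performance, it is often sufficient to time the offloaded region per rank and report the maximum over all ranks, since the slowest rank dominates; dedicated profilers (e.g. Intel VTune) can then fill in the details. A minimal sketch:

```cpp
// Per-rank timing of the offloaded region (sketch).
double t0 = MPI_Wtime();
// ... offloaded kernel(s) of this rank ...
double t_local = MPI_Wtime() - t0;

double t_max = 0.0;   // the slowest rank determines overall runtime
MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0) std::printf("max kernel time: %.6f s\n", t_max);
```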

Final Remark

MPI+OpenMP has the big advantage that it can also exploit a GPU-connecting network like XeLink (a.k.a. GPU-aware MPI), i.e. MPI can transfer data directly between device buffers. This could in principle also work with DPC++/SYCL and Kokkos.
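With OpenMP offload, the essential step is to hand the device address of a mapped buffer to the MPI call, e.g. via use_device_ptr, so that a GPU-aware MPI library can move the data directly between devices. A sketch, assuming a GPU-aware MPI build and placeholder variables sbuf, rbuf, n, and peer:

```cpp
// Exchange device-resident buffers directly between two ranks (sketch).
#pragma omp target data map(to: sbuf[0:n]) map(from: rbuf[0:n])
{
    #pragma omp target data use_device_ptr(sbuf, rbuf)
    {
        // sbuf/rbuf now carry device addresses; a GPU-aware MPI can
        // transfer them over XeLink without staging through the host.
        MPI_Sendrecv(sbuf, n, MPI_DOUBLE, peer, 0,
                     rbuf, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```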