Playbook for using the VTune tool on DevCloud or other clusters
---------------------------------------------------------------
Note: command lines start with the "$" prompt.
Steps 1-4 cover DevCloud usage.
Steps 5-7 show the test application nbody.
Step 8 covers APS.
Steps 9-12 cover VTune.

1. Log into DevCloud
--------------------
$ ssh devcloud

Alternative: open a Jupyter notebook and start a terminal.

2. Clone Samples GitHub
-----------------------
$ git clone https://github.com/oneapi-src/oneAPI-samples.git

3. Start interactive session on a node with GEN11 GPU
------------------------------------------------------
(People with NDA accounts may use an ATS-P GPU.)
It is better to compile on a compute node because the login node has
very limited memory and other resources.

$ qsub -I -l nodes=1:gen11:ppn=2

4. Check properties
-------------------
$ sycl-ls --verbose

Prints all backends (GPU device + low-level driver, level_zero or opencl).
Example output:

Platform [#3]:
    Version  : OpenCL 3.0
    Name     : Intel(R) OpenCL HD Graphics
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#2]:
        Type     : gpu
        Version  : 3.0
        Name     : Intel(R) UHD Graphics [0x9a60]
        Vendor   : Intel(R) Corporation
        Driver   : 22.23.23405
Platform [#4]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type     : gpu
        Version  : 1.3
        Name     : Intel(R) UHD Graphics [0x9a60]
        Vendor   : Intel(R) Corporation
        Driver   : 1.3.23405

$ clinfo

Provides more details for the opencl backend.

5. Build nbody code from oneAPI-samples
---------------------------------------
(Skip the clone if you already did it in step 2.)
$ git clone https://github.com/oneapi-src/oneAPI-samples.git
$ mkdir build
$ cd build
$ cmake ../oneAPI-samples/DirectProgramming/DPC++/N-BodyMethods/Nbody/
$ make

6. Run nbody
------------
$ ./src/nbody

Output should look like:

===============================
 Initialize Gravity Simulation
 nPart = 16000; nSteps = 10; dt = 0.1
------------------------------------------------
 s     dt     kenergy    time (s)     GFLOPS
------------------------------------------------
 1     0.1    26.405     0.19124      38.821
 2     0.2    313.77     0.006551     1133.3
 3     0.3    926.56     0.0066749    1112.3
 4     0.4    1866.4     0.0066208    1121.4
 5     0.5    3135.6     0.0065561    1132.4
 6     0.6    4737.6     0.0066551    1115.6
 7     0.7    6676.6     0.0066353    1118.9
 8     0.8    8957.7     0.0065615    1131.5
 9     0.9    11587      0.0066486    1116.7
 10    1      14572      0.006616     1122.2

# Total Time (s)      : 0.25087
# Average Performance : 1121.4 +- 6.799
===============================

7. (optional) Change number of particles to 256000
---------------------------------------------------
$ vi ../oneAPI-samples/DirectProgramming/DPC++/N-BodyMethods/Nbody/src/GSimulation.cpp

change line : set_npart(16000);
to line     : set_npart(256000);

$ make clean
$ make
$ ./src/nbody

Note: use
$ export VERBOSE=1
to see the build steps in detail.

8. Application Performance Snapshot (APS) usage
-----------------------------------------------
Help menu:
$ aps -help

Run with APS (from the build directory):
$ aps ./src/nbody

Generates ASCII output and an HTML report.
More options, e.g. for MPI scaling, are listed by
$ aps-report -help

APS shows high occupancy on the GPU. Its advice is to use VTune to find
out why the CPU is underutilized, but we did not intend to do any
computation on the CPU.

9. VTune on DevCloud
--------------------
Create an SSH tunnel to the compute node and start the vtune-backend
server there.

Allocate a node as described in step 3, e.g. a node with a DG1 GPU: s011-n001

Log in to DevCloud again and use port 55001 for a tunnel:
$ ssh -L 127.0.0.1:55001:127.0.0.1:55001 devcloud

Extend the tunnel to the allocated node:
$ ssh -L 127.0.0.1:55001:127.0.0.1:55001 s011-n001

Start the VTune backend server:
$ vtune-backend --web-port=55001 --enable-server-profiling

The server prints a line to open in your local web browser:
Serving GUI at https://127.0.0.1:55001

The first time, the line is longer because it includes a certificate.
Use the whole line!
The first time, you will also be asked for a passphrase. Choose a good
password and remember it.

The web browser now shows the VTune GUI.
Start an analysis with "Configure Analysis".

See also:
https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/using-vtune-server-with-vs-code-intel-devcloud.html

10. VTune command line
----------------------
Run the VTune command line and copy the result directories into the
default VTune projects directory:
$ ls $HOME/intel/vtune/projects

Create a new directory such as MY-NBODY inside projects and copy the
results into it.
Alternative: start vtune-backend with the parameter --data-directory

11. HPC Analysis
----------------
In the commands of steps 11 and 12 the application to profile is not
shown; append it after "--" (for this playbook, ./src/nbody).

$ vtune -c hpc-performance

Good for OpenMP analysis, but there is no OpenMP in nbody!

Add memory analysis:
$ vtune -c hpc-performance -knob collect-memory-bandwidth=true

12. GPU Hotspots
----------------
Plain first analysis:
$ vtune -c gpu-hotspots -r gh --

The result data is stored in the directory gh.

Full instruction-count instrumentation (very high overhead):
$ vtune -c gpu-hotspots -knob characterization-mode=instruction-count -r ghi --

GPU source analysis:
$ vtune -c gpu-hotspots -knob profiling-mode=source-analysis -r ghs --

Estimates the timing per source line (measures basic blocks).

$ vtune -c gpu-hotspots -knob profiling-mode=source-analysis -knob source-analysis=mem-latency -r ghl --

Shows latencies per source line.
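
The gpu-hotspots command lines in step 12 end in "--" without naming the application. As a minimal sketch, the helper below prints the same four command lines with an application appended. The function name gpu_hotspots_cmds is hypothetical (it is not part of VTune), and the application path ./src/nbody is an assumption taken from step 6. The helper only echoes the commands and does not start VTune.

```shell
#!/bin/sh
# Hypothetical helper (not part of VTune): prints the four gpu-hotspots
# command lines from step 12 with the application appended after "--".
# It only echoes the commands; nothing is profiled here.
gpu_hotspots_cmds() {
    app="$1"    # application to profile; ./src/nbody is assumed from step 6
    echo "vtune -c gpu-hotspots -r gh -- $app"
    echo "vtune -c gpu-hotspots -knob characterization-mode=instruction-count -r ghi -- $app"
    echo "vtune -c gpu-hotspots -knob profiling-mode=source-analysis -r ghs -- $app"
    echo "vtune -c gpu-hotspots -knob profiling-mode=source-analysis -knob source-analysis=mem-latency -r ghl -- $app"
}

gpu_hotspots_cmds ./src/nbody
```

Review the printed lines first. To actually run them on the compute node, pipe the output to sh (for example: gpu_hotspots_cmds ./src/nbody | sh). Each run writes to its own result directory (gh, ghi, ghs, ghl), so the four results can be copied into the projects directory separately.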