General

When running a Job via Slurm, it is sometimes useful to monitor the CPU and/or GPU utilization of your application while it is running. However, direct SSH login to the compute nodes is not allowed. This is where Slurm's srun utility comes to the rescue.

Attach an interactive bash session to a running Job (replace $JOBID with the ID of the Job to investigate)
srun --overlap --jobid $JOBID --pty bash
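
If you do not know the Job ID yet, you can list your running Jobs first; squeue is a standard Slurm command, though the exact output columns may look slightly different on your cluster.

List your own running Jobs to find the Job ID (first column)
squeue --user $USER --states=RUNNING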

This session inherits/shares all requested CPU/memory/GPU resources and filesystem namespaces. This means you can also access a Job's private /tmp or /ltmp folder this way.
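
For example, from within the attached session you can check how much temporary data your Job has written; ls and du are standard tools, and whether /tmp, /ltmp or both are actually used depends on your application.

Inspect the Job's private temporary folders from within the attached session
ls -lh /tmp /ltmp
du -sh /tmp /ltmp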

Do NOT run any compute- or memory-intensive commands using this technique, and do not run anything in a tight (bash) loop, because resources are shared with your Job, which might slow it down. Additionally, the load of the node might increase artificially (because of oversubscription).

If your Job runs on multiple nodes (typical for MPI Jobs), you also need to specify the node:

Attach an interactive bash session to a running Job on a specific node (replace $JOBID with the ID of the Job to investigate)
srun --overlap --jobid $JOBID --nodelist=licca020 --pty bash
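
To find out which nodes a Job is running on, you can query Slurm beforehand; squeue with the %N format specifier is standard Slurm, the node names themselves are site-specific.

Show the nodes allocated to a running Job (replace $JOBID with the ID of the Job to investigate)
squeue --jobs $JOBID --format="%N"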

Investigate live CPU/Memory/Disk usage

After establishing an interactive bash session with the above command, simply use, for example, htop to monitor the live CPU, memory, and/or disk usage of your application. Of course, you can also use other tools of your choice.

Restrict the processes in htop to your own processes
htop -u
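
If you prefer a non-interactive snapshot (e.g. to paste into a note or log), a plain ps call works as well; this is only a minimal sketch using standard GNU ps options, so adjust the columns to your needs.

Snapshot of your own processes, sorted by CPU usage
ps -u $USER -o pid,pcpu,pmem,rss,etime,comm --sort=-pcpu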


Investigate live GPU/GPU-memory usage

Again, establish an interactive bash session attached to your Job first. Then use nvidia-smi to check GPU utilization and/or GPU memory utilization.

Check GPU utilization at a single point in time
nvidia-smi
RZBK@licca048:/tmp$ nvidia-smi
Thu Dec 14 16:11:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:21:00.0 Off |                    0 |
| N/A   55C    P0             194W / 300W |   1729MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    323292      C   ...sw/gmx/2023.3-gcc9-cuda11.6/bin/gmx     1714MiB |
+---------------------------------------------------------------------------------------+


This is an expensive utility which interrupts your GPU calculation for a short time. Do NOT run nvidia-smi in a tight loop, as it will slow down your calculation significantly.
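
If you really need periodic readings, query only the fields you are interested in and sample at a long interval; --query-gpu, --format and --loop are standard nvidia-smi options, and the 60-second interval below is only a suggestion.

Sample GPU and GPU-memory utilization every 60 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used --format=csv --loop=60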