General
When running a Job via Slurm, it is sometimes useful to monitor the CPU and/or GPU utilization of your application while it is running. However, direct SSH login to the compute nodes is not allowed. This is where Slurm's srun utility comes to the rescue:
srun --overlap --jobid $JOBID --pty bash
This session inherits/shares all requested CPU/memory/GPU resources and filesystem namespaces. This means you can also access a Job's private /tmp or /ltmp folder this way.
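For example, a typical sequence could look like this (a minimal sketch; replace $JOBID with your actual job ID as shown by squeue):

# List your running Jobs to find the job ID
squeue -u $USER

# Attach an overlapping shell to the Job
srun --overlap --jobid $JOBID --pty bash

# Inside the session you share the Job's namespaces, e.g. its private /tmp
ls -lh /tmp
df -h /tmp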
Do NOT run any compute- or memory-intensive commands using this technique, and do not run anything in a tight (bash) loop: the resources are shared with your Job, which might be slowed down as a result. Additionally, the load of the node might increase artificially (because of oversubscription).
If your Job runs on multiple nodes (typically MPI Jobs), you also need to specify the node you want to attach to:
srun --overlap --jobid $JOBID --nodelist=licca020 --pty bash
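If you are not sure which nodes your Job is running on, you can ask Slurm first; licca020 above is only an example node name. A small sketch:

# Print the node list of a running Job
squeue --jobs $JOBID --noheader --format=%N

# Or show the full Job record, including the expanded node list
scontrol show job $JOBID | grep NodeList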
Investigate live CPU/Memory/Disk usage
Using the above command to establish an interactive bash session, simply use, for example, htop to monitor the live CPU and/or disk usage of your application (the -u option restricts the view to your own processes). Of course, you can also use other tools of your choice.
htop -u
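If you prefer a one-shot, scriptable snapshot over an interactive view, standard tools such as ps and du also work inside the attached session (a minimal sketch, not specific to any particular cluster setup):

# Snapshot of your own processes, sorted by CPU usage
ps -u $USER -o pid,pcpu,pmem,rss,comm --sort=-pcpu | head -n 15

# Disk usage of the Job's private /tmp and /ltmp folders
du -sh /tmp /ltmp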
Investigate live GPU/GPU-memory usage
Again, establish an interactive bash session attached to your Job first. Then use nvidia-smi to check GPU utilization and/or GPU memory utilization.
nvidia-smi
This is an expensive utility which interrupts your GPU calculation for a short time. Do NOT run nvidia-smi in a tight loop, as it will slow down your calculation significantly.
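If you only need a few numbers rather than the full nvidia-smi screen, a compact one-shot query keeps the overhead low; if you really need repeated samples, use a generous interval. A minimal sketch (the selected fields are just examples):

# One-shot query of GPU and GPU-memory utilization
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv

# Periodic sampling with a generous interval (every 60 seconds)
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv --loop=60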