Lightweight and Streaming Debugging

General Considerations

Debugging is hard work, and mostly frustrating, as one would actually rather be doing different things ... often under time pressure. A debugging guide and easy-to-use tools are therefore desirable, we believe.

Errors can occur at different levels of the workflow: compile-time errors, link-time errors, and run-time errors. Each of them has its own plurality of causes, impacts, and signatures. We cover here only run-time errors, and specifically the HPC cases, as those are the most time consuming for us. Of course, we assume that you compiled your software cautiously in a clean and consistent environment. Debugging is difficult enough for well-written and well-built software; introducing additional side effects might render any attempt to debug errors hopeless.

Debugging is about correctness! Optimization might be counter-productive for correctness! Sort your priorities!

1. Law of Debugging: Complexity is the enemy of Debugging!

Keep your cases (specifically for testing) simple!
Keep your environment as clean and simple as possible (try to live with system defaults)!
Keep errors reproducible! Specifically, it should always be the same error that you try to debug. Hitting moving targets is much more difficult!
Keep your test cases as small and short as possible! The higher the test frequency, the higher your chance of success (and of getting help).

2. Law of Debugging: Approach the problem systematically!

The fact that MPI reports an error does not at all mean that MPI is the cause!
Increase the level of verbosity! Switch on debugging output if available! Compile with the -g and -traceback options. More information is helpful!
Use tools to narrow down, temporally and spatially, where and when the error occurs, if possible!
Document your test environment! This is consistent with the requirement for reproducibility.
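As a minimal sketch of documenting the test environment (the log file name and the use of SLURM_JOB_ID are our assumptions), a job script might record:

```shell
# Record when and where the test ran, and with which environment.
# The log file name is illustrative only.
LOG="testenv.${SLURM_JOB_ID:-local}.log"
{
  date '+%Y-%m-%d %H:%M:%S'
  hostname
  # list loaded modules, if a module system is available in this shell
  type module > /dev/null 2>&1 && module list 2>&1
  env | sort
} > "$LOG"
```

Keeping one such log per job makes it much easier to tell later whether two runs were actually comparable.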

3. Law of Debugging: Use the right diagnosing tools for the right error!

Hunting memory problems with MPI communication analysis tools is bound to be fruitless.
Know your tools and their abilities! Look for simple tools that are easy to use, because complexity is ... (see the 1. Law).

4. Law of Debugging: Segmentation Faults are your friend!

Some errors, like race conditions, manifest as hanging and idling processes. Jobs quietly sleeping away with no further output might go unnoticed, wasting resources. Use timeouts for communications and operations!
Try to program, compile and run in such a way that clear points of failure appear. The earlier, the better!
Don't be afraid to use export MALLOC_CHECK_=3! Or use efence!
The closer a program aborts to the root cause, the easier it is to find that cause.


Of course, some errors appear only at larger scale. But debugging on the scale of hundreds or even thousands of nodes, or for long run-times, is very expensive. Try to avoid those scenarios!

User-Built Job/Process Monitor

Often, jobs do not fail with a hard crash, but starve somehow or enter some sort of "ill state". These are probably the most difficult scenarios to analyze on black-box systems without interactive access to the compute nodes.

Still, users can instrument their codes to include a health checker. This might include self-surveillance of memory consumption or other information from the /proc or /sys file systems. There are already some tools accomplishing this task. See the next section!
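A trivial shell-level sketch of such a health checker (function and file names are hypothetical) logs the resident memory of a given process from /proc:

```shell
# Hypothetical health checker: log the resident memory (VmRSS) of a PID
# at a fixed interval, as long as that process exists.
health_check () {        # usage: health_check <pid> <interval in s>
  while [ -d "/proc/$1" ]; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(grep VmRSS "/proc/$1/status")"
    sleep "$2"
  done
}

# Here we watch the current shell as a stand-in for a real rank process.
health_check $$ 1 > health.log &
MON_PID=$!
sleep 3                  # stand-in for the actual workload
kill "$MON_PID"
```

The same idea carries over to other /proc entries (open file descriptors, thread counts, etc.).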

If you need an adaptable monitor, a simple MPI wrapper script can do this (here with Intel MPI, where PMI_RANK is defined):

Example Script with rank-wise Monitor
#!/bin/bash
[...]                                                   # Slurm header
module load slurm_setup 

cat > mon_wrap.sh << EOT
#!/bin/bash
[ "\$PMI_RANK" == "0" ] && echo "[\$(date '+%Y-%m-%d %H:%M:%S')] Start"
[ "\$PMI_RANK" == "0" ] && echo "running \$*"
if [ "\$(echo \$PMI_RANK%\$SLURM_NTASKS_PER_NODE | bc)" == "0" ]; then
   env > env.\$(hostname).\$PMI_RANK
   top -b -d 5 -n 40 -u \$USER > mon.\$(hostname) &
fi
eval \$* 2>&1 | while IFS= read -r line; do printf '[%s] %s\n' "\$(date '+%Y-%m-%d %H:%M:%S')" "\$line"; done
[ "\$PMI_RANK" == "0" ] && echo "\$(date '+%Y-%m-%d %H:%M:%S') Finish"
EOT
chmod u+x mon_wrap.sh

mpiexec -l ./mon_wrap.sh <user program> <prog parameters>

But take care not to produce so much information that the monitor influences or even dominates the job's workload! Useful shell commands are top, ps, free, ...

If you need not a rank-wise but just a node-wise monitor, you can specify on which ranks the monitor should run (as shown above). See also the further examples below!

For instance, a node-wise memory monitor could be like

free -s 5 | while IFS= read -r line; do printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"; done &> mon.$(hostname) &

which produces an output file for each node, with node memory information every five seconds. The output may look like

free output
2024-03-24 09:04:20                total        used        free      shared  buff/cache   available
2024-03-24 09:04:20  Mem:      131166260    36430936    88130324      233072     6605000    93044136
2024-03-24 09:04:20  Swap:       8388604     8071664      316940

and can be analysed and plotted with any tool you deem suitable for this task. The above can be brought into a form digestible by Gnuplot or python/matplotlib, e.g. via

awk 'BEGIN {print "time,mem[GB],mem[%]"}  $3 == "Mem:" {printf("%s %s,%0.2f,%0.2f\n",$1,$2,$5/1024/1024,$5*100/$4)}' mon.file > monitor.csv

In Gnuplot, it may be as simple as

monitor.pl
set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
set terminal pdf
set output 'monitor.pdf'
set xlabel "time"
set ylabel "memory consumption"
plot "monitor.csv" u 1:3 w l t 'mem [%]'

Further Examples

Node-wise Memory Monitor attached to mpiexec Call
#!/bin/bash
# This example is for SNG, but it also applies to other clusters.
#SBATCH -o log.%N.%j_2N.out
#SBATCH -D .
#SBATCH -J mem_mon_test
#SBATCH --get-user-env 
#SBATCH -p test                      # -M <cluster> -p <partition>   on CoolMUC-4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12           # Check the cluster docu, consult Slurm
#SBATCH --mail-type=none
#SBATCH --time=00:05:00
#SBATCH --export=none
#SBATCH --account=<your project ID>  # skip on CoolMUC-4

module load slurm_setup 
module load intel-mpi     # required for placement-test

# create a wrapper script, which distributes the monitor with the mpiexec start-up rank-wise
cat > wrapper.sh << EOT
#!/bin/bash
monitor () {
  rm -f \$2
  totMem=\$(grep MemTotal /proc/meminfo | awk '{print \$2}')
  while :;  
  do
    date="\$(date '+%Y-%m-%d %H:%M:%S')"
    let usedMem=\$totMem-\$(grep MemFree /proc/meminfo | awk '{print \$2}')
    echo "\$date  \$usedMem" >> \$2
    sleep \$1
  done
}

# set monitor only on first rank on each node ... other ranks skip this
if [ "\$(echo \$PMI_RANK%\$SLURM_NTASKS_PER_NODE | bc)" == "0" ]; then
   echo "[\$(date '+%Y-%m-%d %H:%M:%S')] Start monitor on \$(hostname)"
   monitor 1 mon.\$(hostname) &
fi

# actual rank execution
eval \$*

# cleanup: stop the background monitor (caution: this kills all remaining user processes on the node)
killall -s 9 -u $USER
EOT
chmod u+x wrapper.sh

# placement-test is a dummy MPI/OpenMP program, and simulates MC workload for 20 seconds (-d 20); -t #OMP threads
mpiexec ./wrapper.sh /lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -d 20 -t $SLURM_CPUS_PER_TASK
Node-wise Process Monitor using independent and overlapping srun Calls
#!/bin/bash
#SBATCH -o log.%N.%j_2N.out
#SBATCH -D .
#SBATCH -J sng_mon_test
#SBATCH --get-user-env 
#SBATCH --partition=test              # -M cm4 -p cm4_tiny .... check cluster docu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12            # check cluster docu
#SBATCH --hint=nomultithread
#SBATCH --mail-type=none
#SBATCH --time=00:05:00
#SBATCH --export=none
#SBATCH --account=<project ID>        # not required on CM4

module load slurm_setup 
module load intel-mpi

# create monitor script (can also be done outside the script)
cat > monitor.sh << EOT
#!/bin/bash
rm -rf mon.\$(hostname)    # remove old monitor 
echo "[\$(date '+%Y-%m-%d %H:%M:%S')] Start monitor on \$(hostname)"
top -b -d 5 -n 40 -u \$USER >> mon.\$(hostname)
sleep 5
EOT
chmod u+x monitor.sh

# start monitor (one per node) with no MPI, with oversubscribing, overcommitment, overlap, no memory requirement, no CPU bindings
srun -N $SLURM_NNODES -n $SLURM_NNODES --ntasks-per-node=1 -c 1 --overlap -O -s --mpi=none --cpu-bind=no --mem=0 ./monitor.sh &

sleep 5
echo "Start actual program"

# start MPI program (mpiexec seems not to work, keeps hanging)
srun -N $SLURM_NNODES --ntasks-per-node=$SLURM_NTASKS_PER_NODE -c $SLURM_CPUS_PER_TASK --overlap --mpi=pmi2 /lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -d 20 -t $SLURM_CPUS_PER_TASK

LRZ Debugging Tool Modules

These modules are currently hidden as they are not often needed, thank goodness.

> module use /lrz/sys/share/modules/extfiles/
> module av debugging
----------------------- /lrz/sys/share/modules/extfiles ------------------------
debugging/efence/2.2        debugging/heaptrack/1.2.0  debugging/strace/5.9  
debugging/gperftools/2.9.1  debugging/ltrace/0.7.3
> module help debugging/strace
-------------------------------------------------------------------
Module Specific Help for /lrz/sys/share/modules/extfiles/debugging/strace/5.9:

modulefile "strace/5.9"
  Provides debugging tool ltrace
  Provides also MPI wrapper for MPI parallel tracing using strace,
  e.g. inside Slurm (backslashes are necessary):
    mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>
-------------------------------------------------------------------

strace and ltrace

Trace system/library calls. Easy to use, because it produces just ASCII output that can be scrutinized with any editor.

For MPI programs, we created a wrapper, which can be used as follows:

mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>

The reason for the wrapper is that hostname and PMI_RANK should be evaluated at task start on the respective node, which afterwards simplifies the assignment of the output to the rank ID. The strace options -ff and -t are for tracing threads and inserting time stamps into the output, respectively. Other ways are certainly also possible. strace_mpi_wrapper.sh is rather short:

#!/bin/sh
eval strace $*

It merely injects the strace command and ensures that each MPI rank writes to its own output file.
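Once the per-rank trace files exist, plain text tools suffice for a first analysis. A hedged sketch (the strace.* file pattern must match your actual output names; for illustration we fabricate one trace line in the format that strace -t emits) counts failing system calls:

```shell
# For illustration only: a minimal trace line in the format of "strace -t"
# (in practice these files come from the mpiexec call above).
printf '09:00:01 openat(AT_FDCWD, "/missing", O_RDONLY) = -1 ENOENT (No such file or directory)\n' > strace.nodeA.0

# Count system calls that returned -1, grouped by call name, over all trace files.
grep -h -- '= -1 ' strace.* \
  | sed 's/^[0-9:]* *//' \
  | cut -d'(' -f1 \
  | sort | uniq -c | sort -rn | head
```

A sudden pile-up of failing openat or stat calls, for instance, often points directly at a missing file or path problem.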

Similarly for ltrace.

heaptrack

Traces memory allocations and consumption. A lightweight surrogate for Valgrind's massif. It can be used to find memory leaks, out-of-memory events, and more, even in cases where the program aborts or is killed by the OOM killer.
Simply use it like this

mpiexec heaptrack <your-prog> <your-options>

It produces compressed data files with the name scheme heaptrack.<your-prog>.<HOSTNAME>.<PID>.zst (or some other compression suffix), which can be analyzed using heaptrack_print or heaptrack_gui (if available). The assignment to the MPI rank ID may be a bit cumbersome; check the rank-wise output for the respective file name.
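One way to ease that assignment (a sketch: heaptrack announces its output file name on startup, and mpiexec -l prefixes each output line with the rank ID) is a small hypothetical wrapper:

```shell
# Hypothetical wrapper: print rank and host before heaptrack starts, so that
# the rank-prefixed job output associates each heaptrack file with its rank.
cat > ht_wrap.sh << 'EOT'
#!/bin/bash
echo "rank ${PMI_RANK:-?} on $(hostname)"
exec heaptrack "$@"
EOT
chmod u+x ht_wrap.sh

# usage: mpiexec -l ./ht_wrap.sh <your-prog> <your-options>
```

The rank-prefixed lines then pair each rank with the heaptrack file name announced right after.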

time

Can be used to obtain a simple overview of resource consumption, e.g. run time and memory consumption.

As the bash built-in time offers slightly less functionality, you must invoke the external GNU time via

\time -v <prog-name>

or

env time -v <prog-name>

gdb

Not so lightweight anymore, but very powerful, is the GNU debugger gdb, for which we provide modules. gdb is a full-fledged debugger and can be used remarkably flexibly.

gdb does not provide MPI-awareness of its own. But as we are talking about "lightweight debugging", we must admit that interactive MPI-parallel debugging is far from efficient or "lightweight" anyway, especially at the scale of several hundred or even thousands of compute nodes, each with on the order of 50 to 100 MPI ranks. At this scale, interactive debugging is probably not only a waste of resources, but possibly even useless. So we explicitly discourage this approach here, and vote for searching for different methods.

Although there are certainly many more possibilities to employ gdb for debugging, we want to show a rather focused approach for debugging aborts like segmentation faults, floating-point errors, etc. It is therefore mandatory that the error appears deterministically and reproducibly at the same location in the program's workflow. You should compile your code with the -g option to include debugging symbols; this eases navigation through the problem code later. If you can reproduce the error also at lower levels of optimization, using those lower levels is recommended, too, as aggressive optimization can obscure the underlying problem.

In order not to work interactively, gdb offers a sort of "non-interactive" mode, in which one specifies the gdb commands on the command line. A very simple example may illustrate the usage of gdb.

~> cat > test.c
#include <stdio.h>
int main() {
   int arr[5];
   arr[0] = 1;
   printf(" arr[10] = %i\n", arr[10]);
   arr[0] /= 0;         
}

~> gcc -g test.c

~> gdb -q ./a.out -ex "set width 1000" -ex "thread apply all bt" -ex "run" -ex "bt" -ex "set confirm off" -ex quit

Reading symbols from ./a.out...
Starting program: ***********/a.out
 arr[10] = -140434755

Program received signal SIGFPE, Arithmetic exception.
0x0000000000400533 in main () at test.c:6
6	   arr[0] /= 0;
#0  0x0000000000400533 in main () at test.c:6

The gdb commands can be looked up in the gdb reference. In short: set width sets the output terminal width. thread apply all bt applies the backtrace command to all threads. run starts the program in the debugger. bt prints the backtrace (which is what we primarily need here). set confirm off switches off interactive requests for confirmation. And quit finishes the gdb session once the backtrace has been written.

gdb can analyse threaded programs natively. But in order to see something, one may need to toggle between threads; otherwise, all threads are shown, which is a little more difficult to follow. Other tools like ddt or totalview might be more comprehensible.

MPI-parallel programs can be run with the same philosophy. We recommend some sort of wrapper script (here for Intel MPI):

PROGRAM="./a.out"
PARAMS="-l"         # if any command line arguments and flags are required for the program to run

echo "gdb -q $PROGRAM -ex \"set width 1000\" -ex \"thread apply all bt\" -ex \"run $PARAMS\" -ex bt -ex \"set confirm off\" -ex quit > gdb.\$PMI_RANK 2>&1" > wrapper.sh
chmod u+x wrapper.sh

mpiexec ./wrapper.sh

Each MPI rank writes its gdb results into its own file gdb.<rank ID>, which can then be scrutinized individually. As these files might become large, we recommend writing them into SCRATCH.

If gdb introduces too large an overhead (causing load imbalances and thereby possibly secondary issues unrelated to the primary one), you can also focus on just a few ranks by filtering the gdb execution based on PMI_RANK.
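A sketch of such a filter (rank 0 is an arbitrary choice here; the wrapper otherwise mirrors the one above):

```shell
# Hypothetical rank filter: only rank 0 runs under gdb; all other ranks
# execute the program directly and undisturbed.
cat > wrapper.sh << 'EOT'
#!/bin/bash
if [ "${PMI_RANK:-0}" == "0" ]; then
  gdb -q "$1" -ex "set width 1000" -ex "thread apply all bt" \
      -ex "run ${*:2}" -ex bt -ex "set confirm off" -ex quit \
      > gdb.${PMI_RANK:-0} 2>&1
else
  exec "$@"
fi
EOT
chmod u+x wrapper.sh

# usage: mpiexec ./wrapper.sh ./a.out <prog parameters>
```

This keeps the gdb overhead confined to a single rank, while the rest of the job runs at normal speed.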