Lightweight and Streaming Debugging
General Considerations
Debugging is hard work, and often frustrating, as one would usually rather be doing something else ... often under time pressure. A debugging guide and easy-to-use tools are therefore desirable, we believe.
Errors can occur at different levels of the workflow: compile-time errors, link-time errors, run-time errors. Each of them can have a plurality of causes, impacts and signatures. We cover here only run-time errors, and specifically the HPC cases, as those are the most time-consuming ones for us. Of course, we assume that you compiled your software carefully in a clean and consistent environment. Debugging is difficult enough for well-written and well-built software; introducing additional side effects might render any attempt to debug errors hopeless.
Debugging is about correctness! Optimization might be counter-productive for correctness! Sort your priorities!
1st Law of Debugging: Complexity is the enemy of debugging!
Keep your cases (specifically for testing) simple!
Keep your environment as clean and simple as possible (try to live with system defaults)!
Keep errors reproducible! Specifically, it should always be the same error that you try to debug. Hitting moving targets is much more difficult!
Keep your test cases as small and short as possible! The higher the test frequency, the higher your chance of success (and also of getting help).
2nd Law of Debugging: Approach the problem systematically!
Just because MPI tells you that an error occurred does not at all mean that MPI is the cause!
Increase the level of verbosity! Switch on debugging output if available! Compile with the -g and -traceback options (see the example after this list). More information is helpful!
Use tools to narrow down, temporally and spatially, where and when the error occurs, if possible!
Document your test environment! This is consistent with the requirement for reproducibility.
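For instance (file names illustrative; -traceback is Intel-specific, a plain -g is the GCC counterpart):
icc -g -traceback -O0 test.c -o test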
3rd Law of Debugging: Use the right diagnostic tools for the right error!
Hunting memory problems with MPI communication analysis tools is bound to be fruitless.
Know your tools and their abilities! Look for simple, easy-to-use tools, because complexity is ... (see the 1st Law).
4th Law of Debugging: Segmentation faults are your friend!
Some errors, like race conditions, appear as hanging and idling processes. Jobs quietly sleeping away with no further output might go unnoticed while wasting resources. Use timeouts for communications and operations!
Try to program, compile and run in such a way that clear points of failure appear. The earlier, the better!
Don't be afraid to use export MALLOC_CHECK_=3! Or use efence! (See the sketch after this list.)
The closer to the root cause a program aborts, the easier it is to find that cause.
Of course, some errors appear only at larger scale. But debugging at the scale of hundreds or even thousands of nodes, or over long run times, is very expensive. Try to avoid such scenarios!
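A quick sketch of both options (the efence module appears in the module list further below; the LD_PRELOAD line assumes the module puts libefence.so onto the library search path):
# let glibc print diagnostics and abort on detected heap corruption
export MALLOC_CHECK_=3
./a.out
# or surround allocations with inaccessible guard pages via Electric Fence
module use /lrz/sys/share/modules/extfiles/
module load debugging/efence/2.2
LD_PRELOAD=libefence.so ./a.out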
User-Built Job/Process Monitor
Often, jobs may not fail with a hard crash, but starve somehow, or enter some sort of "ill state". These are probably the most difficult scenarios to analyze on black-box systems without interactive access to the compute nodes.
Still, users can instrument their codes to include a health checker. This might include self-surveillance of memory consumption, or other information from the /proc or /sys file systems. There are already some tools accomplishing this task. See the next section!
If you need an adaptable monitor, a simple MPI wrapper script can do this (here with Intel MPI, where PMI_RANK is defined). A minimal sketch, where the script name, the monitoring commands and the rank selection are merely illustrative, could look like:
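#!/bin/sh
# monitor_wrapper.sh (illustrative name): start a background monitor
# next to selected MPI ranks, then exec the actual program
if [ "${PMI_RANK:-0}" -eq 0 ]; then    # monitor only on rank 0; adapt as needed
    ( while true; do
          date '+%Y-%m-%d %H:%M:%S'
          free
          sleep 10
      done ) > mon.$(hostname).$PMI_RANK 2>&1 &
fi
exec "$@"
It is then started via mpiexec ./monitor_wrapper.sh <your-prog> <your-options>.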
But take care not to create so much information that the monitor influences or even dominates the job's workload! Useful shell commands might be top, ps, free, ...
If not a rank-wise but just a node-wise monitor is needed, you can specify on which ranks the monitor should run (as shown above). See also the further examples below!
For instance, a node-wise memory monitor could look like
free -s 5 | while IFS= read -r line; do printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"; done &> mon.$(hostname)
which produces an output file for each node, with node memory information every five seconds. The output may look like this (values illustrative):
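2024-05-13 12:00:00               total        used        free      shared  buff/cache   available
2024-05-13 12:00:00 Mem:       196498432    12345678   150123456      123456    34029298   180000000
2024-05-13 12:00:00 Swap:             0           0           0
...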
and can be analysed and plotted with any tool you deem suitable for this task. The above can be brought into a form digestible by Gnuplot or python/matplotlib, e.g. via
awk 'BEGIN {print "time,mem[GB],mem[%]"} $3 == "Mem:" {printf("%s %s,%0.2f,%0.2f\n",$1,$2,$5/1024/1024,$5*100/$4)}' mon.$(hostname) > monitor.csv
In Gnuplot (version 5 or newer, for the skip option), it may be as simple as the following sketch:
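gnuplot -persist -e "set datafile separator ','; set xdata time; set timefmt '%Y-%m-%d %H:%M:%S'; plot 'monitor.csv' skip 1 using 1:2 with lines title 'mem[GB]'"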
Further Examples
LRZ Debugging Tool Modules
These modules are currently hidden as they are not often needed, thank goodness.
> module use /lrz/sys/share/modules/extfiles/
> module av debugging
----------------------- /lrz/sys/share/modules/extfiles ------------------------
debugging/efence/2.2 debugging/heaptrack/1.2.0 debugging/strace/5.9
debugging/gperftools/2.9.1 debugging/ltrace/0.7.3
> module help debugging/strace
-------------------------------------------------------------------
Module Specific Help for /lrz/sys/share/modules/extfiles/debugging/strace/5.9:
modulefile "strace/5.9"
Provides debugging tool strace
Provides also MPI wrapper for MPI parallel tracing using strace,
e.g. inside Slurm (backslashes are necessary):
mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>
-------------------------------------------------------------------
strace and ltrace
These trace system calls (strace) and library calls (ltrace). They are easy to use because they create just ASCII output, a trace that can be scrutinized with any editor.
For MPI programs, we created a wrapper, which can be used as follows:
mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>
The reason for the wrapper is that hostname and PMI_RANK should be evaluated at task start on the respective node, which afterwards simplifies assigning the output files to the rank IDs. The strace options -ff and -t are for following forked processes and threads (one output file each) and for inserting time stamps into the output, respectively. Other ways are certainly also possible. strace_mpi_wrapper.sh is rather short:
#!/bin/sh
eval strace $*
It only injects the strace command; through the eval, the escaped hostname and PMI_RANK are expanded per task, so each MPI rank writes to its own output file.
Similarly for ltrace.
heaptrack
Traces memory allocations and consumption. A lightweight surrogate for Valgrind's massif. It can be used to find memory leaks, out-of-memory events, and more, even in cases where the program aborts or is killed by the OOM killer.
Simply use it like this:
mpiexec heaptrack <your-prog> <your-options>
It produces compressed data files with the name scheme heaptrack.<your-prog>.<HOSTNAME>.<PID>.zst (or some other compression suffix), which can be analyzed using heaptrack_print or heaptrack_gui (if available). Assigning files to MPI rank IDs may be a bit cumbersome; check the rank-wise output for the respective file name.
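For instance, a textual report of allocation hot spots, peak consumption and leaks can be produced like this (file name illustrative):
heaptrack_print heaptrack.a.out.node01.12345.zst > heaptrack.report.txt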
time
Can be used to obtain a simple overview of resource consumption, such as run time and memory consumption.
As bash wraps time with a built-in of slightly less functionality, you must invoke the external command via
\time -v <prog-name>
or
env time -v <prog-name>
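The most interesting lines of the output include, for instance (values illustrative):
Command being timed: "./a.out"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.25
Maximum resident set size (kbytes): 123456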
gdb
Not so lightweight anymore, but very powerful, is the GNU debugger gdb. We provide modules for it. gdb is a full-fledged debugger and can be used remarkably flexibly.
gdb has no MPI awareness of its own. But as we are talking about "lightweight debugging", we must admit that interactive MPI-parallel debugging is far from efficient or "lightweight" anyway, especially at the scale of several hundred or even thousands of compute nodes with on the order of 50 to 100 MPI ranks each. At this scale, interactive debugging is probably not only a waste of resources, but possibly even useless. So, we explicitly discourage this approach here, and vote for searching for different methods.
Although there are certainly many more ways to employ gdb for debugging, we want to show a rather focused approach for debugging aborts such as segmentation faults, floating-point exceptions, etc. It is therefore mandatory that the error appears deterministically and reproducibly at the same location in the program's workflow. You should compile your code with the -g option to include reasonable debugging symbols. This eases the navigation through the problem code later. If you can reproduce the error also with lower levels of optimization, using these lower levels is recommended, too, as aggressive optimization can definitely obscure the underlying problem.
In order not to work interactively, gdb offers a sort of "non-interactive" mode, in which one specifies the gdb commands on the command line. A very simple example may illustrate the usage of gdb.
~> cat > test.c
#include <stdio.h>
int main() {
    int arr[5];
    arr[0] = 1;
    printf(" arr[10] = %i\n", arr[10]);
    arr[0] /= 0;
}
~> gcc -g test.c
~> gdb -q ./a.out -ex "set width 1000" -ex "thread apply all bt" -ex "run" -ex "bt" -ex "set confirm off" -ex quit
Reading symbols from ./a.out...
Starting program: ***********/a.out
arr[10] = -140434755
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400533 in main () at test.c:6
6 arr[0] /= 0;
#0 0x0000000000400533 in main () at test.c:6
The gdb commands can be checked in the gdb reference. In short: set width sets the output terminal width; thread apply all bt prints a backtrace for every thread; run starts the program in the debugger; bt prints the backtrace (what we primarily need here); set confirm off switches off interactive requests for confirmation; and quit finishes the gdb session once the backtrace has been written.
gdb can analyse threaded programs natively. But in order to see something, one may need to toggle between threads; otherwise, all threads are shown, which is definitely a bit more difficult to follow. Other tools like ddt or totalview might be more convenient here.
MPI-parallel programs can be run with the same philosophy. We recommend some sort of wrapper script (here for Intel MPI):
PROGRAM="./a.out"
PARAMS="-l"    # command-line arguments and flags required by the program, if any
echo "gdb -q $PROGRAM -ex \"set width 1000\" -ex \"thread apply all bt\" -ex \"run $PARAMS\" -ex bt -ex \"set confirm off\" -ex quit > gdb.\$PMI_RANK 2>&1" > wrapper.sh
chmod u+x wrapper.sh
mpiexec ./wrapper.sh
Each MPI rank writes its gdb results into its own file gdb.<rank ID>, which can be scrutinized individually afterwards. As these files might be large, we recommend writing them to SCRATCH.
If gdb causes too large an overhead, leading to load imbalances and thereby possibly to secondary issues unrelated to the primary one, you can also focus on only a few ranks via filtered gdb execution based on PMI_RANK.
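A minimal sketch of such a filtered wrapper (rank 0 and the program name are illustrative):
#!/bin/sh
# run gdb only under MPI rank 0; all other ranks execute the program directly
if [ "$PMI_RANK" -eq 0 ]; then
    exec gdb -q ./a.out -ex "set width 1000" -ex run -ex bt -ex "set confirm off" -ex quit > gdb.$PMI_RANK 2>&1
else
    exec ./a.out
fi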