Debugger and profiler

Overview

Debuggers help find errors in programs. In parallel programs they can also reveal latent errors that have not shown up yet but could occur at some point. Profilers, in contrast, identify performance problems in programs. The University of Kassel provides the debugger TotalView and the profiler Vampir on its Linux cluster, supported by the HKHLR.

Debugger

Debugging a parallelized program is considerably more complex than debugging a sequential one. A crash, an infinite loop, or an incorrect result can occur not only in any of several threads running in parallel, but also on any of several computers (called "nodes" in the cluster).

The University of Kassel uses the debugger TotalView, supported by the HKHLR. An introductory TotalView workshop is offered regularly. Among other things, TotalView offers the following:

The program can be stepped through line by line, as in a sequential debugger, with the difference that the current source position is shown separately for each thread on each node. You can therefore inspect the values of all variables in all threads while debugging and pinpoint the thread responsible for an error.
With an additional compiler option, C, C++, and Fortran programs can be compiled with debug information. If such a program crashes, a so-called core dump file is created, which can be examined with TotalView. This file records the exact state of the program at the time of the crash, so that, among many other things, the source line at which the program stopped can be determined.
TotalView can also monitor a running program and automatically detect errors in the parallel code, such as race conditions, that the programmer may not be aware of.
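To make the term "race condition" from the list above concrete, here is a minimal sketch. It is written in Java only to match the Math.pow example in the Profiler section; TotalView itself is described above for C, C++, and Fortran programs, and the same kind of error occurs there. The class name, counter names, and thread counts are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceConditionSketch {
    // Shared counter without synchronization: "unsafeCounter++" is a
    // read-modify-write sequence, so concurrent updates can be lost.
    static int unsafeCounter = 0;

    // Atomic counter: each increment happens as one indivisible step.
    static final AtomicInteger safeCounter = new AtomicInteger();

    // Runs the given task in `threads` threads and waits for all of them.
    static void runInParallel(int threads, Runnable task) throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int threads = 4;
        final int incrementsPerThread = 100_000;

        runInParallel(threads, () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                unsafeCounter++;               // data race: updates may be lost
                safeCounter.incrementAndGet(); // no data race
            }
        });

        System.out.println("expected: " + (threads * incrementsPerThread));
        System.out.println("unsafe:   " + unsafeCounter); // may be smaller, varies per run
        System.out.println("safe:     " + safeCounter.get());
    }
}
```

The unsynchronized counter may well produce the expected value in a short test run and only lose updates under heavier load, which is exactly why such errors are easy to overlook without tool support.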

Starting TotalView on the cluster:

module load totalview/2017.1.21
totalview

Important: Even a program that works with small inputs can still contain errors that only take effect when calculations run for several days on many nodes in parallel. For example, no doctoral thesis should rest on figures from a self-written program that has never been checked with an automatic debugger.

Remember: "An incorrect program always has the right to do what the programmer expects." Tests, however, can only (with luck) reveal the presence of errors, never prove their absence.

Profiler

A profiler observes a program at runtime and detects performance problems. For example, it can show how much time the program spends in each method.
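As a rough illustration of what "time spent per method" means, the following sketch measures it by hand. This is not how Vampir works internally; it only shows the kind of information a profiler collects automatically and far more precisely. All names (ManualTimingSketch, timed, cheapStep, expensiveStep) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ManualTimingSketch {
    // Accumulated wall-clock time per label, in nanoseconds.
    static final Map<String, Long> nanosByLabel = new LinkedHashMap<>();

    // Result sink so that the loops below have an observable effect.
    static long sink;

    // Runs the task and adds its elapsed time to the entry for `label`.
    static void timed(String label, Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsed = System.nanoTime() - start;
        nanosByLabel.merge(label, elapsed, Long::sum);
    }

    static void cheapStep() {
        for (int i = 0; i < 1_000; i++) sink += i;
    }

    static void expensiveStep() {
        for (int i = 0; i < 5_000_000; i++) sink += i % 7;
    }

    public static void main(String[] args) {
        for (int round = 0; round < 10; round++) {
            timed("cheapStep", ManualTimingSketch::cheapStep);
            timed("expensiveStep", ManualTimingSketch::expensiveStep);
        }
        // Print the accumulated time per label in milliseconds.
        nanosByLabel.forEach((label, nanos) ->
            System.out.printf("%-14s %8.3f ms%n", label, nanos / 1e6));
    }
}
```

Instrumenting code by hand like this quickly becomes impractical for larger programs, which is where a real profiler comes in.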

An example: the profiler reports that the program spends a surprisingly long time in the method "Math.pow(x,y)", which computes x raised to the power of y. Closer inspection shows that the program only uses it to square int variables. Because the method works internally with double values and handles many general cases that are irrelevant when simply squaring an integer, the call can be replaced by a small method of your own that returns "x * x", which can make the program significantly faster.
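The replacement described above might look like the following sketch. The class and method names are invented for illustration; only Math.pow is a real library method.

```java
public class SquareSketch {
    // Variant flagged by the profiler: general-purpose power function on doubles.
    static int squareWithPow(int x) {
        return (int) Math.pow(x, 2);
    }

    // Replacement: plain integer multiplication.
    static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        // Both variants agree as long as the square fits into an int.
        for (int x = -1000; x <= 1000; x++) {
            if (squareWithPow(x) != square(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("square(12) = " + square(12)); // prints "square(12) = 144"
    }
}
```

Two caveats: the variants differ once the square no longer fits into an int (the cast of the double result saturates at Integer.MAX_VALUE, while "x * x" wraps around), and how large the speedup is depends on the JVM, so it should be confirmed by profiling again after the change.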

Without a profiler, such weaknesses in your own program are hard to find and fix. With a profiler, the chances are quite good.

The Linux cluster uses the profiler "Vampir", which, like TotalView, was introduced by the HKHLR. An introductory Vampir workshop is offered regularly.