Debugging has a long history in software engineering [60, 116, 118]. Reverse execution, or back-stepping, allows a debugger to step backwards through the program execution to a previous state, in addition to stepping forwards to the next state. Reverse execution is commonly achieved through the use of checkpointing [95, 281, 282], event/message logging [26, 161, 228, 292], or a combination of the two techniques [23, 150, 152, 201, 272].
When the two are used in combination, the parallel debugger restarts the program from a checkpoint and replays the execution up to the breakpoint. A less common implementation technique is the actual execution of the program code in reverse without the use of checkpoints [4]. This technique has difficulty handling complex logical program structures, which can interfere with end-user interaction and limit its applicability to certain programs. Though most of these techniques focus on transparently providing reverse execution, some have also explored language and compiler extensions to support such activities [289].
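The combined checkpoint-and-replay approach can be illustrated with a minimal sketch. It assumes a deterministic step function and a complete log of the non-deterministic inputs. The class `ReplayDebugger`, the `accumulate` step function, and the checkpoint interval are hypothetical names chosen for illustration and are not taken from any of the cited systems.

```python
import copy


class ReplayDebugger:
    """Toy reverse stepper: periodic checkpoints plus a log of inputs."""

    def __init__(self, step_fn, initial_state, inputs, interval=4):
        self.step_fn = step_fn        # deterministic: (state, event) -> new state
        self.inputs = list(inputs)    # logged non-deterministic events
        self.interval = interval      # take a checkpoint every `interval` steps
        self.state = copy.deepcopy(initial_state)
        self.pos = 0                  # number of events applied so far
        self.checkpoints = {0: copy.deepcopy(initial_state)}

    def step_forward(self):
        if self.pos >= len(self.inputs):
            raise IndexError("end of logged execution")
        self.state = self.step_fn(self.state, self.inputs[self.pos])
        self.pos += 1
        if self.pos % self.interval == 0:
            self.checkpoints[self.pos] = copy.deepcopy(self.state)

    def step_back(self):
        if self.pos == 0:
            raise IndexError("already at the start of execution")
        target = self.pos - 1
        # Restart from the latest checkpoint at or before the target ...
        base = max(p for p in self.checkpoints if p <= target)
        self.state = copy.deepcopy(self.checkpoints[base])
        # ... and replay the logged events up to the target position.
        for event in self.inputs[base:target]:
            self.state = self.step_fn(self.state, event)
        self.pos = target


def accumulate(state, event):
    """Hypothetical program step: add the incoming value to a running total."""
    return {"total": state["total"] + event}


dbg = ReplayDebugger(accumulate, {"total": 0}, [5, 3, 7, 1, 9, 2], interval=2)
for _ in range(5):
    dbg.step_forward()
dbg.step_back()   # lands exactly on the checkpoint taken after 4 events
dbg.step_back()   # restores the checkpoint after 2 events, replays 1 event
```

The checkpoint interval trades storage for replay time: a shorter interval means more saved states but fewer events to re-execute on each backward step.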
There are three predominant axes to consider when debugging large-scale applications [152]. The first axis is runtime, which becomes a factor when attempting to address a bug that only becomes apparent after hours of computation because of a race condition or changing computational state in the parallel program. The second axis is the number of processes involved, which becomes a significant factor in applications that dynamically adjust the algorithms employed in response to the scale at which the application is run. Finally, the third axis is the program size in terms of lines of code involved in the analysis.
Event logging is used to provide a deterministic re-execution of the program while debugging. Often this allows the debugger to reduce the number of processes involved in the debugging operation by simulating their presence through replaying events from the log. This is useful when debugging an application with a large number of processes. Event logging has also been used to allow the user to view a historical trace of program execution that can be inspected while debugging to trace the changes of a variable in reverse without re-execution [259, 260].
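As a rough illustration of how replayed events can stand in for absent processes, the following sketch records the messages delivered to one process during an original run and then re-runs only that process against the log. The channel classes, the `reduce_partial_sums` function, and the peer ranks are hypothetical. A real tool would intercept the message-passing layer rather than a Python callable.

```python
from collections import deque


class RecordingChannel:
    """Wraps a live receive function and logs every delivered message."""

    def __init__(self, live_recv):
        self.live_recv = live_recv
        self.log = []

    def recv(self):
        source, payload = self.live_recv()
        self.log.append((source, payload))
        return source, payload


class ReplayChannel:
    """Serves messages from a log, so the sending processes need not run."""

    def __init__(self, log):
        self.pending = deque(log)

    def recv(self):
        if not self.pending:
            raise RuntimeError("replay log exhausted")
        return self.pending.popleft()


def reduce_partial_sums(channel, expected_messages):
    """Process under debug: sums the values received from its peers."""
    total = 0
    for _ in range(expected_messages):
        _source, value = channel.recv()
        total += value
    return total


# Original run: three hypothetical peers (ranks 1 to 3) each send one value,
# arriving in a non-deterministic order that the recorder captures.
incoming = deque([(2, 10), (1, 4), (3, 6)])
recorder = RecordingChannel(incoming.popleft)
original = reduce_partial_sums(recorder, 3)

# Debug run: only the inspected process executes; its peers are simulated.
replayed = reduce_partial_sums(ReplayChannel(recorder.log), 3)
```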
Message logging is a sub-domain of event logging in which only messages are logged instead of all non-deterministic events that might influence the application. This has been implemented above the MPI interface [68] and within the implementation [26, 176] for
the explicit purpose of supporting debugging operations. There are two core techniques for event replay: contents-based replay and ordering-based replay [229]. In contents-based
replay, the traces include the values of the events received or variables read. This typically produces larger trace files, but does not require the participation of all processes during replay since values do not need to be recomputed. In ordering-based replay, the traces include the relative order in which the events occurred. This produces smaller trace files at the cost of recomputing the values every time. The relative partial ordering of events is based on Lamport clocks [158, 227]. Both [161] and [229] present algorithms for
Adaptive message logging and checkpointing techniques have also been explored to reduce the size of the message logs [189, 190, 258, 287].
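The ordering-based traces described above record when events happened relative to one another, not what values they carried. A minimal sketch of Lamport clock bookkeeping for two processes follows. The `LamportProcess` class and the event labels are hypothetical, and breaking timestamp ties by process name is an assumption that yields one valid total order among several.

```python
class LamportProcess:
    """Keeps a Lamport clock and an order-only trace (no message payloads)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.trace = []   # entries are (timestamp, process name, event label)

    def _record(self, label):
        self.trace.append((self.clock, self.name, label))

    def local_event(self, label):
        self.clock += 1
        self._record(label)

    def send(self, label):
        self.clock += 1
        self._record(label)
        return self.clock   # timestamp piggybacked on the outgoing message

    def receive(self, sent_timestamp, label):
        # A receive must be ordered after the matching send.
        self.clock = max(self.clock, sent_timestamp) + 1
        self._record(label)


p, q = LamportProcess("P"), LamportProcess("Q")
p.local_event("p:init")
ts1 = p.send("p:send m1")
q.local_event("q:init")
q.receive(ts1, "q:recv m1")
ts2 = q.send("q:send m2")
p.receive(ts2, "p:recv m2")

# Merging the per-process traces by timestamp gives an order that respects
# every send-before-receive relationship; replay would re-execute events in
# this order and recompute the values.
replay_order = [label for _, _, label in sorted(p.trace + q.trace)]
```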
Checkpoint/restart is used to return the debugging session to an intermediate point in the program execution without replaying from the beginning of execution. For programs that run for a long period of time before exhibiting a bug, checkpointing can be used to focus the debugging session on a smaller period of time closer to the bug. Checkpoint/restart techniques are also useful for program validation and verification techniques that may be allowed to run concurrently with the parallel program on smaller sections of the execution space [243]. [192] discusses how to achieve replaying without using message logging by
employing a technique similar to message-induced checkpointing (see Section 4.7).
In addition to reducing the amount of time and number of processes involved in the debugging process, program slicing is often used to reduce the amount of code that needs to be analyzed [280]. This is a useful technique when debugging large software systems
such as operating systems.
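To make the idea of slicing concrete, the sketch below computes a backward slice over a straight-line program using data dependences only. This is a deliberately simplified assumption: practical slicers also track control dependences, pointers, and procedure calls. The `backward_slice` function and the six-statement example program are hypothetical.

```python
def backward_slice(statements, criterion_index, variable):
    """Return indices of statements that can affect `variable` at the criterion.

    `statements` is a list of (defined_variable, used_variables) pairs that
    describe straight-line code; control flow is not modelled.
    """
    relevant = {variable}
    in_slice = []
    for index in range(criterion_index, -1, -1):
        defined, used = statements[index]
        if defined in relevant:
            in_slice.append(index)
            # This definition satisfies the need for `defined`; its own
            # operands now become relevant further up the program.
            relevant.discard(defined)
            relevant.update(used)
    return sorted(in_slice)


program = [
    ("a", []),           # 0: a = read()
    ("b", []),           # 1: b = read()
    ("c", ["a"]),        # 2: c = a * 2
    ("d", ["b"]),        # 3: d = b + 1
    ("e", ["c", "a"]),   # 4: e = c + a
    ("f", ["d"]),        # 5: f = d - 3
]

# Only statements 0, 2 and 4 can influence `e`; the rest can be set aside.
slice_for_e = backward_slice(program, 4, "e")
```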
Some of this work has focused on the automatic validation of message passing programs. Since all messages are traced, debuggers can apply algorithms to detect common parallel programming bugs such as race conditions and deadlocks [65, 188, 243, 287]. A
technique called flowback analysis assists debuggers in finding race conditions based on the causal relationship between events [179].
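One simple way a tool might flag a deadlock from traced communication is to look for a cycle in a wait-for graph. The sketch below assumes each blocked process waits on exactly one peer, as with a blocking receive from a named source. It is a generic illustration, not the algorithm of any cited work, and `find_deadlock` and the rank numbers are hypothetical.

```python
def find_deadlock(wait_for):
    """Return the processes forming a wait-for cycle, or [] if there is none.

    `wait_for` maps each blocked process to the single process it waits on.
    """
    for start in wait_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in wait_for and current not in seen:
            seen.append(current)
            current = wait_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return []


# Ranks 0, 1 and 2 each wait on the next one in a ring; rank 3 waits on 0.
blocked = {0: 1, 1: 2, 2: 0, 3: 0}
cycle = find_deadlock(blocked)

# A chain that ends at a running process (rank 2 is not blocked) is no deadlock.
no_cycle = find_deadlock({0: 1, 1: 2})
```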
For HPC applications, the MPI standard [178] has become the de facto standard message-passing programming interface. Even though some parallel debuggers support MPI applications, there is no official standard interface for the interaction between the parallel debugger and the MPI implementation. However, the MPI implementation community has informally adopted some consistent interfaces and behaviors for such interactions [58, 115].
The MPI Forum is discussing the inclusion of these interactions in a future MPI standard. Chapter 5 will discuss how Open MPI was extended to include a design that supports C/R-enabled parallel debugging.