Thus far, the systems we have described are either specific to a given application or not designed to deal with distributed environments. In distributed systems research, there has been much work on debugging and monitoring distributed applications as well as on recovering when those applications fail. A common theme in this research is generating a trace of execution, which can then be used either to determine what went wrong in an application or, if the trace is detailed enough, to restart the application after failure. Such a trace of execution could be used to determine the provenance of a digital object. Because these systems both support a trace of program execution and are designed for distributed environments, we review them now.
Hollingsworth and Tierney provide a survey of current monitoring and debugging frameworks and tools for Grids [96], which follows on from earlier work in distributed systems [103, 17]. One can divide the components of an end-to-end monitoring and debugging framework into three levels (cf. 20.1, p.322 [96]). At the first level is instrumentation, which is the integration of probes or sensors into software or hardware to measure their state. Sensors produce what is known as event data: data recording that a particular event, such as reading from the network, has occurred at a particular time. At the second level is presentation. Components at the presentation level gather event data produced by instrumentation, store it, and make it available for use. They are also responsible for archiving event data and managing the underlying sensors. Thus, the storage of event data is separated from the capturing and analysis of the data. The third level is analysis. Components at this level analyse monitoring data to, for example, find bugs, spot performance problems, and detect security breaches.
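The separation between the three levels can be illustrated with a minimal sketch. It is not taken from any of the surveyed frameworks: the names `Event`, `Sensor`, `EventStore` and `mean_interval` are hypothetical, and the analysis shown (the mean gap between occurrences of an event) is merely one simple example of what an analysis component might compute.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One item of event data: what happened, where, and when."""
    name: str
    host: str
    timestamp: float


@dataclass
class EventStore:
    """Presentation level: gathers event data, stores it, serves queries."""
    events: list = field(default_factory=list)

    def publish(self, event):
        self.events.append(event)

    def query(self, name):
        return [e for e in self.events if e.name == name]


class Sensor:
    """Instrumentation level: a probe that reports events to a store."""

    def __init__(self, host, store, clock=time.time):
        self.host = host
        self.store = store
        self.clock = clock  # injectable so that tests can control time

    def emit(self, name):
        self.store.publish(Event(name, self.host, self.clock()))


def mean_interval(store, name):
    """Analysis level: mean gap between successive occurrences of an event.

    Returns None when fewer than two occurrences have been recorded.
    """
    times = sorted(e.timestamp for e in store.query(name))
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else None
```

Note that the sensor only knows how to publish and the analysis function only knows how to query; neither depends on how the store keeps its data, which mirrors the separation of storage from capture and analysis described above.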
An example of such a framework is the NetLogger toolkit [95]. It provides libraries for instrumentation, tools for collecting and archiving event data, and a visualisation analysis tool. To work in heterogeneous systems, NetLogger specifies a common format for all event data. NetLogger was built to capture very detailed information in a high-performance setting. However, one drawback of its approach is its dependence on accurately synchronised clocks on all computers using NetLogger. While this is a reasonable assumption in an environment where all computers are controlled by one administrator, in a multi-institutional environment synchronised clocks are the exception, not the norm.
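To see why clock synchronisation matters, consider the following small sketch. It does not use the NetLogger API or its event format; the helpers `stamp` and `order_by_timestamp` and the skew value are assumptions made purely for illustration. When a receiving host's clock runs behind the sender's by more than the network delay, ordering events by their local timestamps places the receipt of a message before its transmission.

```python
def stamp(host, name, true_time, skew=0.0):
    """Record an event using the host's local clock (true time plus skew)."""
    return (host, name, true_time + skew)


def order_by_timestamp(events):
    """Return event names ordered by their recorded local timestamps."""
    return [name for _host, name, _ts in sorted(events, key=lambda e: e[2])]


# Host A sends at true time 10.0 and host B receives 0.2 seconds later.
synced = [stamp("A", "send", 10.0), stamp("B", "recv", 10.2)]

# The same two events, but B's clock is half a second behind A's.
skewed = [stamp("A", "send", 10.0), stamp("B", "recv", 10.2, skew=-0.5)]

print(order_by_timestamp(synced))
print(order_by_timestamp(skewed))
```

With the synchronised clocks the send is listed first; with the skewed clock the order is reversed, even though the underlying events are identical.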
Frameworks such as NetLogger and Ganglia [122] provide detailed logs of events that occur in large-scale distributed systems. Using this data, the connections between events can be inferred. For example, the NetLogger analysis tool can visualise sets of events on a “lifeline” by associating events that operate on the same data. Other systems infer the causal connections between events (i.e. that event A was caused by event B). For example, the DeWiz system uses event data from Grid-based logging systems to infer the causal connections between events, building an event graph model
Chapter 2 A Critical Analysis of Provenance Systems 26
[108]. DeWiz analyses, such as determining whether a message was lost in transmission, can be performed on the event graph model using the Grid. Instead of relying on an event-based log, Aguilera et al. view distributed systems as a set of connected “black boxes” and develop algorithms for inferring the causal paths between events from message-based logs. By visualising these causal paths, developers can identify performance bottlenecks [4]. The approach of tracking messages passed between components modelled as black boxes can also be used for mobile agent security [180]. These systems are interesting because they provide a trace of the execution of distributed applications. However, they are not data-focused and thus do not help in determining the provenance of data produced by applications.
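The idea of deriving causal connections from a log can be made concrete with a simplified sketch. It is neither the DeWiz event graph model nor the algorithms of Aguilera et al.; it assumes, for illustration only, that every log record carries a message identifier and that each component's records appear in that component's local order. The function names `build_event_graph` and `caused_by` are hypothetical.

```python
from collections import defaultdict


def build_event_graph(log):
    """Build causal edges from a log of (component, kind, msg_id) records.

    Events are identified by their position in the log. Returns the edge
    map and the sorted identifiers of messages that were sent but never
    received.
    """
    # First pass: locate every send, so that a receive can be matched
    # even if it happens to be listed before its send.
    sends = {msg: i for i, (_, kind, msg) in enumerate(log) if kind == "send"}

    edges = defaultdict(set)
    last_on = {}      # component -> index of its most recent event
    received = set()
    for i, (component, kind, msg_id) in enumerate(log):
        if component in last_on:
            edges[last_on[component]].add(i)   # order within one component
        last_on[component] = i
        if kind == "recv":
            received.add(msg_id)
            if msg_id in sends:
                edges[sends[msg_id]].add(i)    # a send causes its receive
    lost = sorted(msg for msg in sends if msg not in received)
    return edges, lost


def caused_by(edges, cause, effect):
    """True if a causal path leads from event `cause` to event `effect`."""
    stack, seen = [cause], set()
    while stack:
        node = stack.pop()
        if node == effect:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False


log = [
    ("A", "send", "m1"),   # event 0
    ("B", "recv", "m1"),   # event 1
    ("B", "send", "m2"),   # event 2
    ("C", "recv", "m2"),   # event 3
    ("A", "send", "m3"),   # event 4: never received
]
edges, lost = build_event_graph(log)
```

Here event 0 causally precedes event 3 through the chain of messages `m1` and `m2`, and `m3` is reported as lost. The graph relates events to one another; nothing in it identifies the data that those events produced, which is the limitation noted above.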
If the logs created by an application are detailed enough, then the application can be successfully restarted using what are known as rollback-recovery protocols. Elnozahy et al. survey the research in this area [61]. Rollback-recovery protocols rely on snapshots of a distributed system’s state called checkpoints. Essentially, each program executing in the distributed system captures its state periodically; then, when an error occurs, these states can be meshed together to form a total picture of the system’s state [39]. The system can then be “rolled back” to a previous stable state and restarted. One difficulty for these protocols is how to mesh together states created at different times. To address this problem, checkpoints are combined with message logging data to allow the relationship between checkpoints to be ascertained. One way of identifying the relationship between checkpoints is to determine the causal connection between distributed checkpoints [174]. This approach is superior to related techniques because it isolates executing programs from the failure of other programs, avoids synchronised checkpointing, and reduces storage overhead because only the most recent checkpoint is stored [61]. Rollback-recovery protocols have been widely researched. However, they have not been widely deployed in practice, probably due to the complex modifications to the operating system or runtime that they require [61]. For example, support for checkpointing in Java required rewriting the Java Virtual Machine [175]. In science, checkpointing has been used in the context of long-running programs on multi-processor computers [152].
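The difficulty of meshing together checkpoints taken at different times can be illustrated with a simplified sketch of rollback propagation. It is not the causal approach of [174], nor any specific protocol from the survey [61]; the function `recovery_line` and its data layout are assumptions for illustration. Each process numbers its checkpoints from 0, and interval i denotes the execution following checkpoint i. Restoring a process to checkpoint c undoes every interval numbered c or higher. A set of checkpoints is inconsistent if some message is recorded as received while its sending has been undone.

```python
def recovery_line(latest, messages):
    """Find a consistent set of checkpoints to restart from.

    latest   -- dict mapping each process to its most recent checkpoint
    messages -- iterable of (sender, send_interval, receiver, recv_interval)

    Starting from the latest checkpoints, any receiver that has kept a
    message whose send was undone is rolled back until no such message
    remains.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, send_interval, receiver, recv_interval in messages:
            send_undone = send_interval >= line[sender]
            recv_kept = recv_interval < line[receiver]
            if send_undone and recv_kept:
                # Roll the receiver back far enough to undo the receive.
                line[receiver] = recv_interval
                changed = True
    return line


messages = [
    ("Q", 1, "P", 1),   # Q sends in its interval 1; P receives in interval 1
    ("P", 2, "Q", 1),   # P sends in its interval 2; Q receives in interval 1
]
line = recovery_line({"P": 2, "Q": 2}, messages)
```

In this example the second message forces Q back to checkpoint 1, which in turn undoes the sending of the first message and forces P back to checkpoint 1 as well. The loop terminates because checkpoint numbers only ever decrease and cannot fall below 0.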
Both rollback-recovery protocols and monitoring systems provide mechanisms to trace the execution of distributed applications. However, this is not adequate for determining provenance. First, they capture events and not data, and thus do not allow the provenance of data to be determined. Second, none of the systems described provides a mechanism for identifying a particular digital object and retrieving its provenance. Third, even if the systems were modified to have this functionality, the data they provide is at too low a level to be of use to scientists. The systems do not represent the data at multiple levels of abstraction: a scientist cannot view past processes at a scientific level and then “drill down” to obtain more technical information. In summary, distributed debugging, monitoring and recovery systems provide valuable insight
into capturing processes in distributed environments but do not provide the requisite functionality to determine provenance.