From our analysis, it is apparent that knowing the connections and relationships between data is critical for understanding the provenance of digital objects. Specifically, causal relationships allow us to create the causality graphs described in Section 2.4.2. Thus, we discuss the notion of causation in more detail and provide a specific definition suitable for provenance in multi-institutional environments. Intuitively, most people have a general idea that causality is the relationship between cause and effect. However, philosophers have argued about its exact meaning for thousands of years. We best understand it
Chapter 2 A Critical Analysis of Provenance Systems 35
using a counterfactual definition [116], that is: if A had not occurred, then B would not have occurred, all else being equal.
Much of the work on causation in computer science has focused on inferring causal relationships from data sets [168]. Inferring causal relationships helps solve problems in a variety of areas including artificial intelligence [144] and data mining [160]. Pearl gives a systematic and mathematical treatment of causality [145]. He defines causality in terms of probabilistic functions and directed acyclic graphs. Specifically, the following definition for causal structure is given.
A causal structure of a set of variables V (defined as probability distributions) is a directed acyclic graph (DAG) in which each node corresponds to a distinct element of V, and each link represents a direct functional relationship among the corresponding variables.
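The structural part of this definition can be illustrated with a minimal sketch. It only checks that a set of links over V forms a DAG; it does not model the probability distributions or the functions themselves. The function name `is_causal_structure` and the parent-map representation are assumptions made for this illustration and do not come from Pearl.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Set


def is_causal_structure(variables: Set[str], parents: Dict[str, Set[str]]) -> bool:
    """Check the graph-shaped part of the definition (hypothetical helper).

    `parents[v]` holds the variables that v is a direct function of.
    """
    # Every link must connect two elements of V.
    for child, direct_causes in parents.items():
        if child not in variables or not direct_causes <= variables:
            return False
    # The links must not form a cycle.
    try:
        TopologicalSorter(parents).prepare()
    except CycleError:
        return False
    return True


V = {"rain", "sprinkler", "wet_grass"}
print(is_causal_structure(V, {"wet_grass": {"rain", "sprinkler"}}))
print(is_causal_structure(V, {"wet_grass": {"rain"}, "rain": {"wet_grass"}}))
```

The first call describes a valid structure; the second contains a cycle between two variables and is rejected.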
From this definition, Pearl goes on to present tools for mathematically reasoning about and inferring causality. A wide variety of techniques based on similar models are available for the inference of causal relationships, particularly using Bayesian methods [168]. Unlike this work, we are not trying to infer causality from a data set; instead, we rely on observations to document causality within systems, specifically within distributed systems. We use the notion of “observation by participation”, that is, a component within a distributed system can observe data or events when it processes such data or generates such events.
In distributed systems research, causality is discussed primarily with respect to asynchronous distributed systems. Such systems are modelled by sets of automata that perform three kinds of actions: sending a message (send event), receiving a message (receive event), and internal events [119, 458]. For modelling reliable first in, first out communication channels between automata, Lynch defines a cause function that maps a receive event to a prior send event in the same channel, β [119, 460]. This function is defined as follows:
1. For every receive event E1, E1 and cause(E1) contain the same message argument.
2. cause is surjective (onto).
3. cause is injective (one-to-one).
4. cause preserves order, that is, there do not exist receive events E1 and E2 with E1 preceding E2 in β and cause(E2) preceding cause(E1) in β.
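The four properties can be checked mechanically on a finite event sequence. The following is a minimal sketch under stated assumptions: β is represented as a Python list of events for a single channel, and a candidate cause function is a dictionary from the index of each receive event to the index of a send event. The names `Event` and `is_valid_cause` are hypothetical and are not taken from Lynch.

```python
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    kind: str     # "send" or "receive"
    message: str  # the message argument of the event


def is_valid_cause(beta: List[Event], cause: Dict[int, int]) -> bool:
    """Check a candidate cause function against properties 1 to 4."""
    receives = [i for i, e in enumerate(beta) if e.kind == "receive"]
    sends = {i for i, e in enumerate(beta) if e.kind == "send"}
    # cause must be defined on exactly the receive events.
    if set(cause) != set(receives):
        return False
    for r, s in cause.items():
        # Each receive maps to a prior send event.
        if s not in sends or s >= r:
            return False
        # Property 1: same message argument.
        if beta[r].message != beta[s].message:
            return False
    # Property 2: surjective, every send event is the cause of some receive.
    if set(cause.values()) != sends:
        return False
    # Property 3: injective, no send event causes two receives.
    if len(set(cause.values())) != len(cause):
        return False
    # Property 4: order preserving, checked on consecutive receive events.
    ordered = sorted(cause)
    return all(cause[a] < cause[b] for a, b in zip(ordered, ordered[1:]))


beta = [Event("send", "m"), Event("send", "m"),
        Event("receive", "m"), Event("receive", "m")]
print(is_valid_cause(beta, {2: 0, 3: 1}))  # first in, first out delivery
print(is_valid_cause(beta, {2: 1, 3: 0}))  # violates property 4
```

Checking only consecutive receive events is sufficient for property 4 because a mapping that is increasing on every adjacent pair is increasing on all pairs.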
Intuitively, the definition is saying that receipt of a message is caused by the sending of that same message. Lynch expands the notion of causality to include the idea that an
event occurring in an automaton is caused by all the preceding events in that automaton. This is termed the depends on relationship and is defined as follows [119, 465]:
An event E2 depends on E1 if one of the following holds:
1. E1 and E2 are events of the same automaton where E1 precedes E2.
2. E1 is a send event and E2 is the corresponding receive event (as defined by the cause relationship above).
3. E1 and E2 are related by a chain of relationships of types 1 and 2.
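The three cases amount to a transitive closure over two base relations. The sketch below illustrates this under simple assumptions: each event is identified by a hypothetical pair (automaton name, position in that automaton's local order), and the send/receive pairs given by cause are supplied explicitly. The function name `depends_on` is chosen for this illustration only.

```python
from itertools import product
from typing import List, Set, Tuple

EventId = Tuple[str, int]  # (automaton name, local position), an assumed encoding


def depends_on(events: List[EventId],
               cause_pairs: Set[Tuple[EventId, EventId]]
               ) -> Set[Tuple[EventId, EventId]]:
    """Return all pairs (e1, e2) such that e2 depends on e1."""
    # Type 2: a receive event depends on its corresponding send event.
    relation = set(cause_pairs)
    # Type 1: same automaton, e1 precedes e2.
    for e1, e2 in product(events, repeat=2):
        if e1[0] == e2[0] and e1[1] < e2[1]:
            relation.add((e1, e2))
    # Type 3: chains of types 1 and 2 (naive transitive closure).
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(relation), repeat=2):
            if b == c and (a, d) not in relation:
                relation.add((a, d))
                changed = True
    return relation


events = [("P", 0), ("P", 1), ("Q", 0), ("Q", 1), ("Q", 2)]
# P's second event sends a message that Q receives as its second event.
rel = depends_on(events, {(("P", 1), ("Q", 1))})
print((("P", 0), ("Q", 2)) in rel)  # chain: type 1, type 2, type 1
print((("P", 0), ("Q", 0)) in rel)  # Q's first event precedes the receive
```

The closure is quadratic per pass and is meant only to make the definition concrete, not to be efficient.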
The definition of depends on given by Lynch is the same as the happened before relationship described by Lamport for the ordering of events in a distributed system [111]. Thus, as stated by Lamport, this definition encompasses all possible causal relationships within an automaton (i.e. because all causal relationships are temporal, capturing all happened before relationships captures all possible causal relationships). This approach is conservative and appropriate in systems where causality need only be understood in an abstract manner, for example, where understanding the causal relationships existing within an automaton is unnecessary.
However, we are particularly interested in understanding causality between data, and thus we need to know in more detail exactly how data was transformed and what the explicit causal connections between data items within automata are. The definition provided by Lynch does not support the detailed, specific expression of the causal connections between data items that is required for the provenance of data to be adequately expressed. To provide greater detail, we therefore define causality by combining the ideas presented above. First, because multi-institutional scientific systems are distributed, we adopt the notion that the receiving of a message is caused by its sending. Second, within automata (or services), we use the definition provided by Pearl, namely, that causality is expressed as functional relationships between variables (i.e. data). We expect such functional relationships to be expressed by the automaton that executes the function, i.e. the observer of the execution. Using this functional notion allows causality between data items to be clearly indicated. Furthermore, it allows the type of causality to be identified. For example, we can say not only that data item D2 was caused by data item D1 but also that D2 was caused by a Fast Fourier transformation on D1.
To bring these two definitions together, we amalgamate data and events under the notion of an occurrence. An occurrence is either an event or a data item at an event. Therefore, sending a message is an occurrence in which sending is the event and the message is the data at that event. This association locates a data item at a given point in time as defined by an event. Again, we reiterate the point that
an occurrence can only be known by its executor; any other automaton or service would only be able to infer the existence of the occurrence.
Using the notion of occurrence, we define our own notion of causality as follows:
Definition 2.1 (Causation). An occurrence O2 is caused by an occurrence O1 if one of the following holds:
1. O2 is functionally related to O1 (i.e. O2 has a direct functional relationship to O1 from Pearl).
2. O1 is the sending of a message and O2 is the corresponding receiving of the message (as defined by the cause relationship from Lynch).
3. O1 and O2 are related by a chain of relationships of types 1 and 2.
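Definition 2.1 can be made concrete with a small sketch that records direct relationships as they are observed and answers "what caused this occurrence?" by following chains. This is an illustration only, not an implementation described in this work: the class `CausalityGraph`, its method names, the string encoding of occurrences, and the function labels such as "fft" are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class CausalityGraph:
    """Records direct causal links between occurrences (hypothetical helper)."""

    def __init__(self) -> None:
        # effect -> list of (cause, label describing the kind of relationship)
        self._direct: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_functional(self, effect: str, cause: str, function: str) -> None:
        # Type 1: effect is functionally related to cause via `function`.
        self._direct[effect].append((cause, function))

    def add_message(self, send: str, receive: str) -> None:
        # Type 2: the receiving of a message is caused by its sending.
        self._direct[receive].append((send, "message"))

    def causes_of(self, occurrence: str) -> Set[str]:
        # Type 3: follow chains of type 1 and type 2 links.
        seen: Set[str] = set()
        stack = [occurrence]
        while stack:
            current = stack.pop()
            for cause, _label in self._direct.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


g = CausalityGraph()
g.add_functional("D2@A", "D1@A", "fft")            # D2 is the FFT of D1 in service A
g.add_functional("send(m)@A", "D2@A", "wrap")      # the message sent by A carries D2
g.add_message("send(m)@A", "receive(m)@B")         # B receives the message from A
g.add_functional("D3@B", "receive(m)@B", "filter") # B derives D3 from the message
print(sorted(g.causes_of("D3@B")))
```

Each link is added by the service that executed it, which mirrors the point that an occurrence is known only to its executor. Keeping the function label on each link is what lets the type of causality be identified, not only its existence.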
In summary, Definition 2.1 differs from the definitions provided by Lamport and Lynch in two important aspects. First, the definition deals with both events and data. Second, it provides a specific traceable causal connection between the reception of a message and subsequent sending of another message. These two aspects make Definition 2.1 more suitable for determining the provenance of results.
In this section, we have briefly discussed two views of causality from distributed systems and causal inference research. Based on these views, we defined a particular notion of causality suited to provenance and multi-institutional scientific systems.