The purpose of this thesis is to construct a process model on the basis of an event log, as described above. Assuming that there is a set of activity labels, A, the goal of a process model is to decide which activities need to be executed and in what order. Activities can be executed sequentially, activities can be optional or concurrent, and the repeated execution of the same activity may be possible.
This thesis focuses on process models that represent causal dependencies. For instance, if an activity (event) is always followed by another activity (event), it is likely that there is a dependency relation between the two. Process mining algorithms such as the Heuristic Miner algorithm (Weijters & Ribeiro, 2010; Weijters et al., 2006) can automatically generate these kinds of process models. Figure 3-2 shows an example of dependency graphs of writing processes. The numbers in the boxes indicate the frequency of the writing activities. The decimal numbers along the arcs show the dependency measures (described below) of transitions between two activities, and the natural numbers indicate how many times each transition occurs. The graphs contain five types of activities: start, end, drafting, revising, and editing.
The heuristic mining algorithm described in (Weijters & Ribeiro, 2010) generates dependency graphs. Moreover, this algorithm takes the frequencies of events and sequences into account when constructing a process model. The basic idea is that infrequent paths should not be incorporated into the model. Both the representational bias provided by dependency graphs and the use of frequencies make the approach able to handle noise in the log files and much more robust than most other approaches.
There are three basic relations between two activities, based on the sequence of their execution. In the following, a and b are two activities in a sequence of an event log, W:
1. a>b: a is directly followed by b (direct successor)
2. a>>b: a is directly followed by b and then by a again (length-two loops)
3. a>>>b: a is eventually followed by b (indirect successor)
Note that a length-one loop corresponds to the relation a>a.
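The first two relations can be counted directly from the traces of an event log. The following is a minimal sketch, not the ProM implementation: the function name `count_relations` and the example traces are hypothetical, and excluding a = b from the length-two count is an assumption of this sketch (such patterns are treated as length-one loops here).

```python
from collections import Counter

def count_relations(traces):
    """Count a>b (direct successor) and a>>b (length-two loop) over all traces."""
    direct = Counter()  # (a, b) -> |a>b|
    loop2 = Counter()   # (a, b) -> |a>>b|, i.e. the pattern a, b, a
    for trace in traces:
        # Every adjacent pair contributes one observation of a>b.
        for i in range(len(trace) - 1):
            direct[(trace[i], trace[i + 1])] += 1
        # Every window a, b, a with a != b contributes one observation of a>>b.
        for i in range(len(trace) - 2):
            if trace[i] == trace[i + 2] and trace[i] != trace[i + 1]:
                loop2[(trace[i], trace[i + 1])] += 1
    return direct, loop2

# Hypothetical writing-process traces, for illustration only.
traces = [
    ["start", "drafting", "revising", "drafting", "editing", "end"],
    ["start", "drafting", "drafting", "editing", "end"],
]
direct, loop2 = count_relations(traces)
```

In this example, `direct[("drafting", "drafting")]` records the length-one loop in the second trace, and `loop2[("drafting", "revising")]` records the drafting, revising, drafting pattern in the first trace.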
The heuristic mining algorithm mainly considers the first two relations. In particular, the algorithm uses the dependency measures defined below, where |a>b| is the number of times a>b occurs in the sequence W.
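$$a \Rightarrow_W b = \frac{|a > b| - |b > a|}{|a > b| + |b > a| + 1} \quad \text{if } a \neq b$$

$$a \Rightarrow_W a = \frac{|a > a|}{|a > a| + 1}$$

$$a \Rightarrow^2_W b = \frac{|a \gg b| - |b \gg a|}{|a \gg b| + |b \gg a| + 1}$$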
First, remark that the value of $a \Rightarrow_W b$ is always between -1 and 1. Some simple examples demonstrate the rationale behind the equations above. If in 5 traces activity A is directly followed by activity B but the other way around never occurs, the value of $A \Rightarrow_W B$ = 5/6 = 0.833, indicating that we cannot be completely sure of the dependency relation (only 5 observations, possibly caused by noise). However, if there are 50 traces in which A is directly followed by B but the other way around never occurs, the value of $A \Rightarrow_W B$ = 50/51 = 0.980 indicates that the probability of the dependency relation is high. If there are 50 traces in which activity A is directly followed by B and noise caused B to be followed by A once, the value of $A \Rightarrow_W B$ is 49/52 = 0.94, still indicating that the probability of the dependency relation is high.
A high value of $a \Rightarrow_W b$ strongly suggests that there is a dependency relation between activities a and b. The algorithm computes the dependency measures for all pairs of activities and constructs the dependency diagrams based on these measures and on user-defined parameters, as explained below.
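The three worked examples can be reproduced with a few lines of code. This is only an illustrative sketch: the function names `dependency` and `loop1_dependency` are hypothetical, and the inputs are the counts |a>b| and |b>a| taken from the examples in the text.

```python
def dependency(ab, ba):
    """Dependency measure for a != b, given the counts |a>b| and |b>a|."""
    return (ab - ba) / (ab + ba + 1)

def loop1_dependency(aa):
    """Dependency measure for a length-one loop, given the count |a>a|."""
    return aa / (aa + 1)

five_clean = dependency(5, 0)    # 5/6
fifty_clean = dependency(50, 0)  # 50/51
fifty_noisy = dependency(50, 1)  # 49/52

print(round(five_clean, 3))   # 0.833
print(round(fifty_clean, 3))  # 0.98
print(round(fifty_noisy, 3))  # 0.942
```

The "+ 1" in the denominator is what keeps the measure below 1 for small counts, so that a handful of observations yields a lower value than many observations of the same pattern.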
3.2.2.1 Parameters of Heuristic Miner
This thesis uses the Heuristic Miner (HM) algorithm as implemented in the process mining framework ProM (ProM, 2013). There are two different options for constructing dependency graphs: with and without “all-tasks-connected”, in which “tasks” refers to activities.
Without the all-tasks-connected option, three threshold parameters are available in HM to indicate which dependency relations are accepted: (i) the dependency threshold, (ii) the length-one loops threshold, and (iii) the length-two loops threshold. With these thresholds, only dependency relations between activities whose dependency measure is above the corresponding threshold are accepted, which results in a control-flow model containing only the most frequent activities and behaviour. By using different parameter values it is, for instance, possible to build a model without length-one loops (choose the length-one loops threshold = 1.0). By changing the parameters one can influence how complete the control-flow model becomes (Weijters & Ribeiro, 2010).
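The effect of a threshold can be illustrated with a small sketch. It is not the ProM implementation: the function name `accepted_relations` and all measure values are hypothetical, and "above the threshold" is read here as a strict comparison. The example also shows why a length-one loops threshold of 1.0 removes all length-one loops: the measure |a>a| / (|a>a| + 1) never reaches 1.

```python
def accepted_relations(measures, threshold=0.9):
    """Keep the pairs whose measure lies strictly above the threshold."""
    return {pair for pair, value in measures.items() if value > threshold}

# Hypothetical dependency measures between writing activities.
dependency = {("drafting", "revising"): 0.98, ("revising", "editing"): 0.83}
# Length-one loop measure |a>a| / (|a>a| + 1) for 99 observed repetitions.
loop1 = {("drafting", "drafting"): 99 / (99 + 1)}

accepted = accepted_relations(dependency, threshold=0.9)
loops_kept = accepted_relations(loop1, threshold=0.9)
loops_dropped = accepted_relations(loop1, threshold=1.0)
```

With these values, only the drafting to revising relation is accepted, the drafting self-loop survives a threshold of 0.9, and no self-loop survives a threshold of 1.0.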
The advantage of using the all-tasks-connected heuristic is that many dependency relations are tracked without being influenced by any parameter setting. The result is a relatively complete and understandable control-flow model, even if there is some noise in the log. The underlying intuition of the all-tasks-connected heuristic is that each non-start task must have at least one other task that is its cause, and each non-end task must have at least one dependent task. Using this information, HM builds a workflow model by taking the best candidates (i.e., those with the highest $a \Rightarrow_W b$ measure).
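The intuition behind the all-tasks-connected heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not the ProM implementation: the function name `all_tasks_connected`, the `dep` dictionary of dependency measures, and the measure values are all hypothetical, and ties between equally good candidates are not treated specially.

```python
def all_tasks_connected(dep, activities, start="start", end="end"):
    """Connect every task to its best cause and its best dependent task."""
    edges = set()
    # Each non-start task gets the cause with the highest dependency measure.
    for b in activities:
        if b == start:
            continue
        causes = [a for a in activities if a != b and (a, b) in dep]
        if causes:
            edges.add((max(causes, key=lambda a: dep[(a, b)]), b))
    # Each non-end task gets the dependent task with the highest measure.
    for a in activities:
        if a == end:
            continue
        followers = [b for b in activities if b != a and (a, b) in dep]
        if followers:
            edges.add((a, max(followers, key=lambda b: dep[(a, b)])))
    return edges

activities = ["start", "drafting", "revising", "editing", "end"]
# Hypothetical dependency measures for observed direct-successor pairs.
dep = {
    ("start", "drafting"): 0.9,
    ("drafting", "revising"): 0.8,
    ("drafting", "editing"): 0.5,
    ("revising", "editing"): 0.75,
    ("editing", "end"): 0.9,
}
edges = all_tasks_connected(dep, activities)
```

In this example the weaker relation from drafting to editing is left out, because both activities already have a better candidate, while every activity still remains connected.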
Without the all-tasks-connected option, HM accepts dependency relations between tasks that (i) have a dependency measure above the value of the dependency threshold, and (ii) have a dependency measure close to the best accepted dependency value (i.e., the difference from the best dependency measure is lower than the value of the relative-to-best threshold). However, if this heuristic is used in the context of a low-structured process, the result is a very complex model with all tasks and a high number of connections. Therefore, this option is not preferable for this thesis. Full details of the parameters of Heuristic Miner can be found in (Weijters & Ribeiro, 2010).
Therefore, to extract writing process models in the form of dependency diagrams, this thesis uses the all-tasks-connected option with the default threshold parameters. All three thresholds are set to 0.9: (i) the dependency threshold, (ii) the length-one loops threshold, and (iii) the length-two loops threshold. This research also adds two artificial activities, start and end, to all process cases in order to specify the initial and final activities of the processes.
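Adding the artificial start and end activities amounts to a simple preprocessing step on each process case. The sketch below is illustrative only: the function name `add_start_end` and the example cases are hypothetical, and the actual preprocessing used with ProM may differ.

```python
def add_start_end(traces, start="start", end="end"):
    """Wrap every process case in artificial start and end activities."""
    return [[start, *trace, end] for trace in traces]

# Hypothetical process cases of writing activities.
cases = [["drafting", "revising", "editing"], ["drafting", "editing"]]
wrapped = add_start_end(cases)
```

After this step every case begins with start and finishes with end, so the initial and final activities of the process are unambiguous.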
3.2.2.2 Conformance checking
Conformance checking is a technique that relates events in the event log to activities in the process model and compares the two. The goal is to find commonalities and discrepancies between the modelled behaviour and the observed behaviour. In particular, conformance checking techniques can be used to measure the quality of process discovery algorithms. Determining the quality of a process mining result is difficult and involves many dimensions. In his book, van der Aalst (2011) refers to four quality criteria for discovered process models: fitness, precision, generalization, and simplicity. These criteria are described in detail in the book. Of the four quality criteria, fitness is the most closely related to conformance checking. This thesis focuses exclusively on fitness (i.e., the proportion of events in the log that can be explained by a process model). Process models discovered by a process mining algorithm such as Heuristic Miner are used to extract patterns of writing activities in this research. Therefore, it is important to measure how much of the observed behaviour in the event log is captured by the process model. This measurement is indicated by the fitness.
The computation of the fitness mainly relies on two data structures: (i) the process model, which is the dependency graph (DG), and (ii) the event log, which contains information about the ordering of the activities. One way to measure the fitness between event logs and process models is to replay the log in the model and measure the mismatch. The replay of every logical log trace starts with the marking of the initial place in the model. Then, the transitions that belong to the logged events (activities) in the trace are fired one after another. While the replay progresses, we count the number of tokens that had to be created artificially (i.e., the transition belonging to the logged event was not enabled and therefore could not be successfully executed) and the number of tokens that were left in the model, which indicates that the process was not properly completed. The value of fitness(L,N) defined in (van der Aalst, 2011) is between 0 (very poor fitness) and 1 (perfect fitness). The intuition of fitness(L,N) = 0.9 is that about 90% of the events can be replayed correctly. This thesis calculates the fitness of a process model using the fitness utility of ProM (ProM, 2013).
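Once the token counts from the replay are available, the fitness value can be computed in a few lines. The sketch below assumes the commonly stated token-based replay formula, fitness = 1/2 (1 - m/c) + 1/2 (1 - r/p), where p, c, m, and r are the numbers of produced, consumed, missing (artificially created), and remaining tokens summed over all traces. Whether the ProM fitness utility uses exactly this formula is an assumption here, and the function name `replay_fitness` and the counts are hypothetical.

```python
def replay_fitness(counts):
    """Token-replay fitness from per-trace (produced, consumed, missing, remaining)."""
    produced = consumed = missing = remaining = 0
    for p, c, m, r in counts:
        produced += p
        consumed += c
        missing += m
        remaining += r
    # Missing tokens are penalised relative to consumed tokens,
    # remaining tokens relative to produced tokens.
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)

# Hypothetical replay counts for two traces: one replays perfectly,
# the other needs one artificial token and leaves one token behind.
counts = [(6, 6, 0, 0), (6, 6, 1, 1)]
fitness = replay_fitness(counts)
perfect = replay_fitness([(6, 6, 0, 0)])
```

For these counts the totals are p = c = 12 and m = r = 1, so the fitness is 11/12, while a log that replays without missing or remaining tokens has a fitness of 1.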