
The existence of a correlation suggests a strong statistical relationship among the linked items (or events, in our case), but that relationship is not always genuine [102]. The relationship between events can be made certain by finding evidence of causation. A cause-and-effect connection demonstrates that the events are dependent on each other under certain conditions, and such connections are defined with the help of Markov chains [103]. A Markov chain describes a sequence of random events that cause the system to transition from one state to another. It also includes the probability of each transition, which depends only on the state reached due to the previous event. A number of algorithms have been devised that can find causality among items. The process of determining causal relationships can be broadly classified into two main categories: CCC and CCU triplet causality [104].
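The Markov chain described above can be illustrated with a minimal sketch. The two states and their transition probabilities are hypothetical values chosen for illustration and do not come from any cited work; the helper names `step`, `simulate` and `path_probability` are likewise assumptions.

```python
import random

# Hypothetical two-state system; the states and probabilities are illustrative only.
TRANSITIONS = {
    "idle": {"idle": 0.7, "active": 0.3},
    "active": {"idle": 0.4, "active": 0.6},
}


def step(state, rng):
    """Sample the next state using only the current state (Markov property)."""
    targets = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate(start, n_steps, seed=0):
    """Generate a path of n_steps transitions starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path


def path_probability(path):
    """Probability of a path as the product of its one-step transition probabilities."""
    probability = 1.0
    for current, nxt in zip(path, path[1:]):
        probability *= TRANSITIONS[current][nxt]
    return probability
```

Because each transition probability is looked up from the current state alone, the probability of a whole path factorises into one-step terms, which is the property the causal discovery approaches in this section rely on.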

Both techniques are based on induction and statistical probability principles. Assume there are three items: A, B and C. In the CCC mechanism, all three possible pairs of items are correlated and one item is already known to have no cause. For example, given the pairs (A, B), (B, C) and (A, C), if A has no causes, then the causal relations are that A causes B and, in turn, B causes C. In the CCU mechanism, on the other hand, only two pairs are correlated and the remaining pair is not. For example, if the pairs (A, B) and (A, C) are correlated and the pair (B, C) is not, then B and C cause A. The CCU mechanism is deemed better in performance, while the CCC mechanism is better at producing quality results.
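The two triplet rules can be sketched as a small function. This is only an illustration of the rules as stated above, not the procedure of [104]: the function name `infer_triplet` is hypothetical, the correlated pairs are assumed to be given, and for the CCC case the sketch assumes the caller also supplies which item sits in the middle of the chain (`middle`), since the description above does not say how it is chosen.

```python
from itertools import combinations


def infer_triplet(items, correlated, no_cause=None, middle=None):
    """Apply the CCC or CCU rule to three items.

    items      -- the three item names
    correlated -- iterable of correlated pairs, e.g. [("A", "B"), ("A", "C")]
    no_cause   -- CCC only: the item known to have no causes
    middle     -- CCC only (assumption of this sketch): the item between the other two
    Returns a set of (cause, effect) tuples; empty if neither rule applies.
    """
    all_pairs = {frozenset(p) for p in combinations(items, 2)}
    corr = {frozenset(p) for p in correlated}
    if not corr <= all_pairs:
        raise ValueError("correlated pairs must be drawn from the three items")

    if corr == all_pairs:
        # CCC: all three pairs correlated, one item has no causes.
        if no_cause is None or middle is None or middle == no_cause:
            return set()
        (last,) = set(items) - {no_cause, middle}
        return {(no_cause, middle), (middle, last)}

    if len(corr) == 2:
        # CCU: the item shared by both correlated pairs is the common effect.
        first, second = corr
        (effect,) = first & second
        return {(cause, effect) for cause in set(items) - {effect}}

    return set()
```

With the pairs (A, B) and (A, C) correlated and (B, C) uncorrelated, the sketch returns B causes A and C causes A, matching the CCU example in the text.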

Many techniques have been introduced in previous years that can discover causal relationships. Most of the algorithms are based on either Bayesian [105] or Constraint-based [106] approaches. Both approaches are based on Markov chains but differ in theory and practice. The Constraint-based approach makes categorical decisions about the conditional dependence and independence constraints among the data items, whereas the Bayesian approach weighs, in terms of probability, the degree to which these constraints hold. If two items are conditionally dependent (that is, not conditionally independent), they are taken to be in a cause-and-effect relationship. Any Constraint-based approach consists of two steps: conduct statistical tests to determine conditional (in)dependence among the data items, and use the test results to define the types of causal connection that exist among the items. This approach uses different kinds of statistical tests to reduce the complexity and complete the process in a reasonable time. Furthermore, Constraint-based solutions have been widely utilised in several real-world problems as they are more generalised. Bayesian solutions, on the other hand, find all possible causal structures and use a probabilistic framework to determine whether ‘Event A causes Event B’. This approach requires user-specified probabilities in the graph structure. If it is not feasible for the user to provide this information, then non-informative priors can be used. For large and complex observational data, Bayesian solutions require high computational power to identify and process all causal structures. Moreover, incomplete or hidden variables may also make it difficult to directly observe the data and produce accurate results.
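To make the contrast concrete, the following sketch shows the Bayesian style of reasoning for two binary items: instead of a yes/no verdict, it returns the posterior probability that B depends on A. It is an illustrative assumption, not the method of [105]: it compares only two candidate structures, gives them equal prior weight, and uses a uniform Beta(1, 1) prior as the non-informative prior. It scores dependence only and says nothing about the direction of the relationship. The names `log_marginal` and `posterior_dependence` are hypothetical.

```python
from math import exp, lgamma


def log_marginal(ones, zeros):
    """Log marginal likelihood of binary outcomes under a uniform Beta(1, 1) prior."""
    return lgamma(ones + 1) + lgamma(zeros + 1) - lgamma(ones + zeros + 2)


def posterior_dependence(samples):
    """Posterior probability that B depends on A, from (a, b) pairs with values 0 or 1.

    Assumes equal prior weight on the two structures. The likelihood of A is the
    same under both structures, so it cancels and is omitted.
    """
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in samples:
        counts[(a, b)] += 1

    # Structure 1: B ignores A, so a single parameter governs B.
    log_indep = log_marginal(counts[0, 1] + counts[1, 1], counts[0, 0] + counts[1, 0])
    # Structure 2: B has a separate parameter for each value of A.
    log_dep = log_marginal(counts[0, 1], counts[0, 0]) + log_marginal(counts[1, 1], counts[1, 0])

    return 1.0 / (1.0 + exp(log_indep - log_dep))
```

A Constraint-based method would reduce the same counts to a single accept or reject decision, whereas the value returned here can be carried forward as a degree of belief.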

One of the earlier research studies presented an algorithm, called Local causal discovery (LCD) [107], which uses a Constraint-based approach to determine pairwise causal relationships in purely observational data. The LCD algorithm uses the CCC mechanism (but not the CCU) to conduct independence tests between all data items. It uses a function, called Independent(A, B, C), which implicitly scans the entire set of relationships R to determine whether A is independent of B, given C. The function returns true if they are found to be independent, and false otherwise. This function is applied to each pair of items and the causal structures are represented as a graph. A recent study presented an algorithm, called Logical Causal Inference (LoCI), that conducts minimal conditional (in)dependence tests and then converts them into logical statements to form initial causal relations. After that, the algorithm combines all statements using aggregation, elimination and basic properties of causality to output a complete causal structure. Another study presented a Constraint-based algorithm, called Markov blanket/collider set (MBCS*) [108], which starts by finding the Markov Blanket (MB) of each item, that is, the items (its parents, its children and the other parents of its children) that shield this particular item from the rest of the network. So, instead of conducting conditional-independence tests for all pairs of items in the data to infer causal relationships, it only considers the MB of each item and provides better results in terms of accuracy and efficiency. Several Constraint-based approaches have also been developed for learning a single, collective causal structure from joint observational datasets. One simple algorithm for this purpose is Structure learning using prior results (SLPR) [109]. The algorithm starts by finding the causal structure for each dataset individually and then combines them by removing common edges to get a full graph. After that, it reapplies the causal inference mechanism to the graph and outputs a final causal structure.
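A function in the spirit of Independent(A, B, C) can be sketched as follows. This is not the test used in [107]: the sketch estimates the conditional mutual information between two discrete items given a third and compares it with a cut-off, and both the `threshold` default and the explicit `rows` argument (in place of the implicit set R) are assumptions made for illustration.

```python
from collections import Counter
from math import log


def conditional_mutual_information(rows, a, b, c):
    """Empirical I(a; b | c) in nats, from rows given as dicts of discrete values."""
    n = len(rows)
    n_abc = Counter((r[a], r[b], r[c]) for r in rows)
    n_ac = Counter((r[a], r[c]) for r in rows)
    n_bc = Counter((r[b], r[c]) for r in rows)
    n_c = Counter(r[c] for r in rows)

    total = 0.0
    for (va, vb, vc), count in n_abc.items():
        # p(a,b,c) * log( p(a,b|c) / (p(a|c) * p(b|c)) ), written in terms of counts
        total += (count / n) * log(count * n_c[vc] / (n_ac[va, vc] * n_bc[vb, vc]))
    return total


def independent(rows, a, b, c, threshold=0.01):
    """Return True if a and b look independent given c (hypothetical cut-off)."""
    return conditional_mutual_information(rows, a, b, c) < threshold
```

In practice a statistical test with a significance level would replace the fixed cut-off, since the raw estimate is biased upwards on small samples.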

A research work presented a method for discovering causation through a Bayesian approach by using a mixture of observed (deterministic) and experimental (non-deterministic) data [110]. The proposed model was tested on a manually created dataset of potential anaesthesia problems in the operating room that contains 37 nodes and 46 edges. An improvement made by this solution is the automatic assignment of initial probabilities, which are then used to build a Bayesian network. After that, it used the joint probability distribution to successfully infer the causal structures and parameters among randomly selected item-pairs. A variant of the LCD algorithm has also been proposed, called the Bayesian Local Causal Discovery algorithm, which reduces the number of false positives and increases efficiency [111]. It uses a Bayesian scoring metric instead of conditional independence testing. The results are modelled as a probabilistic graph of connected items, known as a Bayesian network. The Bayesian network satisfies the Markov chain property, and the availability of a probability distribution allows efficient inference between two random items. After that, it applies a heuristic-based greedy search in the Bayesian network to derive the Markov Blanket (MB) of each node. By limiting the causal learning to the MB of each node, the algorithm significantly reduces the size of the probability-based inference calculations, and thus improves accuracy and efficiency. A similar research study presents two hybrid (Bayesian- and Constraint-based) algorithms, CD-B and CD-H, that can discover causal relationships from observational data [112]. The CD-B algorithm employs a greedy search to determine the MB of a node, and then uses it in a scoring method to identify the parents and children. This process is repeated for every node and a global Bayesian network is constructed to discover all causal structures. The CD-H algorithm, on the other hand, is based on a two-step mechanism that performs conditional dependence and independence tests and determines the parent-child relationships of all nodes. The rest of the process is similar to the CD-B algorithm.
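The notion of a Markov Blanket can be made concrete with a short sketch. It assumes the network structure is already known and given as a list of directed edges, whereas the algorithms above have to learn the blanket from data through search and scoring. The function name `markov_blanket` is hypothetical, and the sketch uses the standard definition in which the blanket of a node consists of its parents, its children and the other parents of its children.

```python
def markov_blanket(edges, node):
    """Return the Markov blanket of `node` in a DAG given as (parent, child) edges."""
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    # Other parents of the node's children (sometimes called spouses).
    spouses = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses
```

For a hypothetical network with edges A to C, B to C, C to D, E to D and D to F, the blanket of C contains A, B, D and E but not F, so any inference about C can ignore F once the blanket is known.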

There are also other causal mining techniques that are not based on Bayesian or Constraint-based approaches. An algorithm, called causal association rule discovery [113], presented a correlation mining approach to determine causal rules in observational data. It starts by extracting all association rules, and for each rule (A → B), it hypothesises that the left item (A) causes the right item (B). After that, it applies the retrospective cohort study mechanism to find the odds ratio for each corresponding hypothesis and discover persistent causal rules. Another study presents a Relational causal discovery algorithm [114], which uses automated relational blocking, instead of statistical tests, to determine causal relationships. Relational blocking is essentially a manual technique that divides the data into unrelated groups based on certain criteria (the blocking factor). In the proposed algorithm, groups are defined based on a graph structure that aims at reducing variation and adjusting for common causes. The groups are then used to determine causal structures. However, unlike conditional (in)dependence tests, relational blocking is not capable of inducing dependence when discovering common effects. Apart from these algorithms, certain mechanisms have been developed to obtain evidence of a causal connection between a presumed cause and an observed effect, for example, the Bradford Hill [115] and Granger [116] models. They claim that a causal rule should satisfy the following factors: association, specificity, consistency, plausibility, temporal order, coherence and experimental evidence. Another approach to inferring causality is presented by Popper, which is based on three considerations: temporal precedence, dependency, and no hidden variables [117]. This approach constructs an incremental causal network and has been successfully exploited on unbounded data streams of events to build causal networks.
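The odds ratio step of causal association rule discovery can be sketched as follows. The sketch is an illustration only and not the procedure of [113]: records are assumed to be sets of items, the names `contingency`, `odds_ratio` and `supports_rule` are hypothetical, and the fixed `min_odds_ratio` cut-off is an assumption standing in for whatever significance criterion the original method applies.

```python
def contingency(records, cause, effect):
    """Count records by presence of `cause` and `effect`; each record is a set of items."""
    table = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for record in records:
        table[(cause in record, effect in record)] += 1
    return table


def odds_ratio(table):
    """Odds of the effect among records with the cause, divided by the odds without it."""
    both = table[(True, True)]
    cause_only = table[(True, False)]
    effect_only = table[(False, True)]
    neither = table[(False, False)]
    if cause_only == 0 or effect_only == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    return (both * neither) / (cause_only * effect_only)


def supports_rule(records, cause, effect, min_odds_ratio=1.5):
    """Return True if the hypothesis 'cause -> effect' passes the hypothetical cut-off."""
    return odds_ratio(contingency(records, cause, effect)) >= min_odds_ratio
```

An odds ratio close to 1 indicates that the effect is about as likely with the presumed cause as without it, so the corresponding hypothesis would be discarded.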
