CAUSAL RELATIONS IN ASSOCIATION RULES

Santiago Zapata Cáceres, Luis Escobar Ramirez

Department of Computer Science

Engineering Faculty

Metropolitan Technological University of Chile

Abstract: Although techniques exist for discovering causal relationships in controlled (experimental) data, it remains unresolved how to do so in purely observed data (data that are captured and stored, and on which experiments cannot be performed). To find causal relationships in observed data, graph-based methods have been used, but these become excessively complex when applied to a typical large data set.

The theory of causal inference in data mining consists of two parts: causal inference in statistical data (which covers the mathematics and the causal philosophy), and the data mining part, which deals with techniques for selecting a set of observed data that may be causally related. All of this takes place in large volumes of data stored in databases.

The existing causal mining techniques are based on more complex techniques for causal discovery in statistical and experimental data. Thus, starting from the causal philosophy and the mathematics, it is also possible to infer causal relations in large databases. It should be stressed that the search for causation in databases is complex, in the sense that it is not possible to start from a hypothesis. However, by restricting attention to a subset of the data through the data mining technique known as association rules, one could obtain information about presumed causation among them.

Keywords: Data Mining, Causation, Association Rules, Knowledge Extraction (KDD).

1 Introduction

1.1. The Problem

The interest in association rules is that they offer the promise (or illusion) of causation, or at least of predictive relationships. However, association rules only calculate the frequency of joint occurrences of attributes; they do not express a causal relationship. It would be very useful to be able to discover causal relationships starting from association rules in the context of data mining.

1.2. Objectives and Motivation of the Research

The objective of this research is to review the concepts related to causation from various points of view (philosophical, mathematical, etc.). To that end, different mathematical models and techniques used to represent causal relationships, and to judge whether or not a causal relationship is valid, have been gathered. On this basis, the aim is to propose, within the field of data mining, a technique for mining causal relationships through the post-processing of the results delivered by the search for association rules.

2. INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY IN DATABASES

The search for new knowledge in data stored in databases is known as Knowledge Discovery in Databases (KDD). The process includes the intelligent analysis of the data and the steps that lead to knowledge.

Data mining refers to the techniques for extracting knowledge from large volumes of data stored in databases. In general, the term data mining is used by systems analysts, statisticians and information management specialists, while KDD is used by specialists in machine learning and artificial intelligence.

Fundamentally, data mining uses techniques for the prediction or description of events, which can be obtained automatically from the large volume of data stored in databases. Prediction uses attributes of a database to predict the future behavior of other values of interest to users (to know a priori how unknown data will appear), while description is concerned with finding patterns that allow users to interpret the values of the stored attributes (description of data and patterns).

The objectives of prediction and description can be achieved through data mining by applying various techniques, such as association rules, decision trees, clustering, neural networks, artificial intelligence, etc.

3. PHILOSOPHICAL ASPECTS OF CAUSATION

Ordinary generalizations tell us how things are: mountains are irregular. They form descriptive judgments about reality. Causal generalizations, in contrast, explain the reason for things: Why does it rain? Why is there sun? Why are there wars?

In ancient Greece people began to speak and think philosophically about the concept of cause. Aristotle distinguished four types of causes:

1. the material cause (the material of the painting: canvas, pigments, etc.)

2. the formal cause (the idea represented by the artist in the painting)

3. the efficient cause (the artist's action on the canvas)

4. the final cause (the purpose for which the painting has been made).

Aristotle [1] created the well-known and popular Aristotelian doctrine of causes, which persisted in the official culture of the West until the Renaissance.

In modern science, formal and final causes were set aside because they were considered to lie beyond the reach of scientific experimentation. This led to the conclusion that, in the modern conception of the world, matter is essentially the subject of change, not that of which a thing is made and which persists. "Therefore, of Aristotle's four causes, only the efficient cause is worthy of scientific investigation" [2].

In Galileo Galilei we find a twofold idea of cause, as temporal succession and as rational necessity; for Hobbes [3], the cause is the sum of the antecedents of the effect. For Galileo, the efficient cause is the necessary and sufficient condition for the appearance of something: that, and no other, should be called the cause whose presence is always followed by the effect and whose removal makes the effect disappear. The problem with Galileo's definition is that it renders the concept of cause unusable. Causal analysis would be impossible because of the infinity of factors presumably making up the cause, and empirical testing of a causal hypothesis would be equally impossible, because suppressing any one of the infinitely many factors would make a difference, so an infinity of parameters would have to be taken into account.

The seventeenth century, taking up the process of intellectual criticism of Aristotelian final and material causes, put an end to this conception. For Descartes, the final cause is reserved for the knowledge of God. As for the material cause, he holds that a thing's own matter cannot be understood as the cause of that thing [4].

Hume pointed out in "A Treatise of Human Nature" and in "An Enquiry Concerning Human Understanding" that an empirical examination of causation shows no necessary connection between cause and effect, only a mere succession in time.

4. MATHEMATICS, CAUSAL PHILOSOPHY AND CAUSAL MINING

4.1. Causation and Mathematics

Turning to the subject itself, the theory of causal inference unites three parts: two mathematical parts and one philosophical part. The two mathematical parts are directed acyclic graphs (DAGs) and probability theory (focused on the conditional independence of variables); the philosophical part concerns causation among variables.

One of the major problems in searching for causal relationships in a universe of data is the selection of the subset of data that is likely to correspond to causal relationships.

The current algorithms used for discovery in observed data frequently use probabilistic correlation and independence to find possible causal relationships. For example, if two variables are statistically independent, one can affirm that there is no causal relationship between them.

Two approaches exist for the discovery of causation in observed data: the constraint-based approach and the approach based on Bayesian networks.

Causal discovery with Bayesian networks consists of finding the most probable causal structure and the parameters corresponding to that structure. The first point of complexity encountered is the number of possible models. For just three variables there are twenty-five possible models. Clearly, it would not be feasible to enumerate all possible models, even for a small set of variables. This problem makes it practically infeasible to apply the Bayesian algorithm to observed data.
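The figure of twenty-five models for three variables can be checked with a short sketch. It assumes that a "model" here means a labeled directed acyclic graph, and it uses Robinson's recurrence for counting such graphs; the function name `count_dags` is only illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence.

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


if __name__ == "__main__":
    # count_dags(3) is 25, matching the number quoted in the text.
    for n in range(1, 7):
        print(n, count_dags(n))
```

The count grows super-exponentially (543 graphs for four variables, 29281 for five), which illustrates why exhaustive enumeration is not practical.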

In general, the algorithms applied in the search for causation in observed data are constraint-based algorithms.

In this context the theory can be separated into three parts: directed graphs, probability, and causation. The theory connects causal structure to probability. To infer a causal structure from a set of statistical data, a series of assumptions must be made.

It should be noted that many other theories exist for drawing causal conclusions from statistical data. The Markov approach has been chosen here because the objective is to explain, in some sense, the relationship among mathematics, philosophy and causal mining.

The Markov approach provides a method for determining causation in statistical data, based on determining probabilistic dependence. Explaining this method shows the relationship between the mathematics and the causal philosophy.

A DAG is a completely abstract mathematical object. In the theory of causal inference, DAGs have two different functions: first, they represent sets of probability distributions; second, they represent causal structures. The way in which they represent probability distributions is given by the Markov condition, which for DAGs becomes a more useful graphical relation: d-separation (Pearl 1988), a relation among three disjoint sets of vertices in a directed graph. The basic idea of d-separation is to check whether a set Z of vertices blocks all connections (of a certain type) between X and Y in a graph G. If so, then X and Y are d-separated by Z in G, that is, X and Y are independent conditional on Z.
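The d-separation check described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular paper: it uses the standard equivalent test on the moralized ancestral graph, and the graph encoding (a dictionary mapping each node to the set of its parents) and the function name `d_separated` are assumptions made for the example.

```python
from itertools import combinations


def d_separated(parents, xs, ys, zs):
    """Return True if xs and ys are d-separated by zs in the DAG.

    parents: dict mapping each node to the set of its parents (assumed encoding).
    Method: restrict to ancestors of xs, ys and zs, "marry" the parents of
    every node, drop edge directions, delete zs, then test reachability.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestral set of all nodes involved.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    neighbours = {v: set() for v in relevant}
    for child in relevant:
        pars = list(parents.get(child, ()))
        for p in pars:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for p, q in combinations(pars, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # 3. Search from xs, never entering a node of zs.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in ys:
            return False  # an unblocked connection exists
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in neighbours[node] if n not in zs and n not in seen)
    return True


# Chain X1 -> X2 -> X3 and collider X1 -> X2 <- X3 (illustrative graphs).
chain = {"X1": set(), "X2": {"X1"}, "X3": {"X2"}}
collider = {"X1": set(), "X3": set(), "X2": {"X1", "X3"}}

if __name__ == "__main__":
    print(d_separated(chain, {"X1"}, {"X3"}, {"X2"}))
    print(d_separated(collider, {"X1"}, {"X3"}, {"X2"}))
```

In the chain, conditioning on X2 blocks the only connection between X1 and X3; in the collider, conditioning on X2 opens it, which is why the two graphs encode different independence relations.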

The notation used for independence is the one introduced by Phil Dawid (1979): $X_1 \perp\!\!\!\perp X_3 \mid X_2$ means that X1 and X3 are independent conditional on X2.

It is important to stress that, as long as DAGs are not given an interpretation, they are merely mathematical objects that can be connected to probability distributions in any way one wishes. What matters is that, in any case, DAGs make it possible to identify sets of independence relations (sets of independent variables). When a DAG is given a causal interpretation (causal philosophy), d-separation is the connection between the causal DAG and the probability distributions.

There are usually many different DAGs that represent exactly the same set of independence relations, and therefore the same set of distributions.

In this context, various algorithms have been developed that compute d-separation for any graph and that can also generate all possible DAGs given a specific (and known) set of independence relations (Figure 4.1).

Figure 4.1: DAGs generated from a set of independence relations among variables.

One such algorithm is the so-called "PC algorithm". Its input is a set of independence relations over a set of variables, and its output is a set of DAGs over these variables that are d-separation equivalent, or Markov equivalent. Applying the PC algorithm to the independence relation "X1 and X3 are independent conditional on X2", it is possible to generate two other DAGs that are d-separation equivalent, as can be seen in Figure 4.2. The PC algorithm generates all the DAGs that are equivalent in d-separation.

Figure 4.2: DAGs generated by the PC algorithm from the independence relation "X1 and X3 are independent conditional on X2."
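The equivalence class in this example can be reproduced by brute force. The sketch below is not the PC algorithm itself: it enumerates every DAG on three variables and keeps those that are Markov equivalent to the chain X1 -> X2 -> X3, using the known characterization that two DAGs are Markov equivalent when they share the same skeleton and the same v-structures. All function names are illustrative.

```python
from itertools import combinations, product

NODES = ("X1", "X2", "X3")
PAIRS = list(combinations(NODES, 2))


def is_acyclic(edges):
    """Repeatedly strip nodes without incoming edges; fail if none can be found."""
    remaining, edges = set(NODES), set(edges)
    while remaining:
        sources = {v for v in remaining if not any(b == v for _, b in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(a, b) for a, b in edges if a not in sources}
    return True


def all_dags():
    """Yield every DAG on NODES as a frozenset of directed edges (a, b)."""
    for choice in product((None, "fwd", "bwd"), repeat=len(PAIRS)):
        edges = set()
        for (a, b), direction in zip(PAIRS, choice):
            if direction == "fwd":
                edges.add((a, b))
            elif direction == "bwd":
                edges.add((b, a))
        if is_acyclic(edges):
            yield frozenset(edges)


def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Triples (p, child, q) with p -> child <- q and p, q not adjacent."""
    skel = skeleton(edges)
    found = set()
    for child in NODES:
        pars = sorted(a for a, b in edges if b == child)
        for p, q in combinations(pars, 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found


def markov_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)


reference = frozenset({("X1", "X2"), ("X2", "X3")})  # X1 -> X2 -> X3
equivalent = [dag for dag in all_dags() if markov_equivalent(dag, reference)]

if __name__ == "__main__":
    for dag in equivalent:
        print(sorted(dag))
```

Under these assumptions the class contains three DAGs: the chain X1 -> X2 -> X3, the reversed chain X1 <- X2 <- X3, and the fork X1 <- X2 -> X3. The collider X1 -> X2 <- X3 has the same skeleton but is excluded because it introduces a v-structure.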

If d-separation and the Markov condition are considered only as the mathematics connecting DAGs and probability distributions, and causation is not involved at all, then what one has is a mathematical theory for generating representations of independence structures. However, if a DAG is interpreted causally (with the causal philosophy), the Markov condition and d-separation are in fact the correct connection between causal structure and probabilistic independence. In this context, the causal interpretation of the Markov condition is called the "Causal Markov Condition".

DAGs that are interpreted causally are called causal graphs. For example, in a set V of variables, let S be the variable representing the habit of smoking, Y the variable representing nicotine-stained fingers, and C the variable representing lung cancer. The following causal graph (Figure 4.3) represents what the causal structure among these variables might be.

Figure 4.3: Causal DAG "smoking habit and lung cancer"

This graph omits: 1) many other causes of lung cancer, for example environmental pollution, asbestos inhalation, genetic factors, etc.; 2) many variables that may lie between a specified cause and its effect, for example a bronchial problem could lie on the "true" causal path from smoking to lung cancer. However, these omissions have no bearing on whether the graph above is complete or incomplete.

A causal graph is said to be complete if it satisfies the following two conditions: first, all common causes of the specified variables must be present; second, all causal relationships among the specified variables must be represented.

The theory assumes a complete graph in the sense that all common causes of the specified variables must be present (included in the graph). For example, looking at the graph in Figure 4.3, one can see that it lacks a genetic-factor variable that is a (common) cause of both smoking behavior and cancer. Since this common cause of the two variables is not represented in the graph, one can say that the causal graph (Figure 4.3) is not an exact representation of the causal structure among these three variables; that is, it is an incomplete graph.

The causal graph is also assumed to be complete in the sense that all causal relationships among the specified variables are represented in it. For example, the graph in Figure 4.3 has no directed edge from Y to S; if nicotine stains on the fingers were a cause of smoking behavior, this would be grounds to say that the graph is incomplete or not "exact".

4.2. Causation, Mathematics and the Mining of Causation

In the search for causal relationships it is very important to be able to manipulate the data, that is, to control the variables so that the behavior of one or more of them can be observed when changes occur in the other variables.

Causal mining seeks efficient methods for discovering causal relationships in observational databases. The search for causation among the data of a database is called "Causal Mining" or "Causal Discovery".

On the other hand, the search for causation in observed data is important because it may reveal causal relations that would be practically impossible to find in experimental data. There is therefore great potential for the use of causation search techniques in large databases.

Causal mining corresponds to the stage following data mining, since it is through causal mining that the data generated by data mining are interpreted.

The theory of causal inference in data mining consists of two parts: causal inference in statistical data (which covers the mathematics and the causal philosophy), and the data mining part, which deals with techniques for selecting a set of observed data that may be causally related. All of this takes place in large volumes of data stored in databases.

unemployment increases and crime increases; recall that, statistically, correlation does not necessarily imply causation.

4.3. Association Rules

"The discovery of association rules is defined as the problem of finding all existing association rules whose confidence and support are greater than the thresholds specified by the user, called minconf and minsup respectively" [5].
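As a minimal sketch of this definition, the following code computes support and confidence for rules with a single antecedent and a single consequent over a small, hypothetical set of market baskets. The data, the function names and the restriction to single-item rules are assumptions made for illustration; the thresholds are applied with "greater than or equal", which is one common convention.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def association_rules(transactions, minsup, minconf):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    Only single-antecedent, single-consequent rules are considered here.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for ante, cons in ((a, b), (b, a)):
            sup = support(transactions, {ante, cons})
            sup_ante = support(transactions, {ante})
            if sup_ante == 0:
                continue
            conf = sup / sup_ante  # estimate of P(consequent | antecedent)
            if sup >= minsup and conf >= minconf:
                rules.append((ante, cons, sup, conf))
    return rules


# Hypothetical market-basket data.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

if __name__ == "__main__":
    for ante, cons, sup, conf in association_rules(baskets, minsup=0.4, minconf=0.7):
        print(f"{ante} => {cons}  support={sup:.2f}  confidence={conf:.2f}")
```

Both quantities are frequencies of joint occurrence, so a rule that passes the thresholds says nothing by itself about which item, if any, causes the other.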

5. ALGORITHMS BASED ON LCD

Silverstein's approach [6] to mining causal structures, built on the LCD algorithm [7], keeps time complexity polynomial. These algorithms do not attempt to discover the complete causal structure, as the IC and PC algorithms do. The LCD algorithm finds causal structures in the form of chains and forks. Without adding complexity, an additional causal structure is found, which the authors call CCU causation. CCU causation is represented by structures of the form a -> b <- c, also known as v-structures.
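A rough sketch of how CCU structures could be detected is given below. It assumes that the rule fires for a triple when a and b are dependent, c and b are dependent, and a and c are not dependent, which yields a -> b <- c. The pairwise dependence decisions are taken as given input (a set of unordered pairs), and the variable names are hypothetical.

```python
from itertools import combinations


def ccu_structures(variables, dependent_pairs):
    """Return triples (a, b, c) read as a -> b <- c.

    dependent_pairs: set of frozensets {u, v} judged statistically dependent
    (assumed to come from some pairwise test applied beforehand).
    """
    def dependent(u, v):
        return frozenset((u, v)) in dependent_pairs

    found = []
    for b in variables:
        others = [v for v in variables if v != b]
        for a, c in combinations(others, 2):
            # Correlated-Correlated-Uncorrelated pattern around b.
            if dependent(a, b) and dependent(c, b) and not dependent(a, c):
                found.append((a, b, c))
    return found


# Hypothetical example: two independent causes of a common effect.
variables = ["rain", "sprinkler", "wet_grass"]
dependent_pairs = {
    frozenset(("rain", "wet_grass")),
    frozenset(("sprinkler", "wet_grass")),
}

if __name__ == "__main__":
    print(ccu_structures(variables, dependent_pairs))
```

The loop examines every triple of variables once per candidate middle node, so the cost stays polynomial in the number of variables.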

The authors use a measure called support (support means that a particular value must occur with at least a certain frequency in the data). This restricts the causation tests to the items of greatest interest. A confidence threshold is used as the statistical confidence level for deciding whether two variables are dependent. The chi-square (χ2) statistical test is used to determine the dependence of two variables. The χ2 statistic is computed as $\chi^2 = \sum (O - E)^2 / E$, summed over the cells of the contingency table, where O is the observed value and E is the expected value for each cell. If the statistic is greater than the critical value χ2α, where α is the significance level of the test, then the two variables are dependent. Using the statistic in combination with support and confidence reduces the error in determining dependence.
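The chi-square test on two Boolean variables can be sketched in plain Python as follows. The function names are illustrative, and the critical value 3.841 corresponds to one degree of freedom at the 5% significance level; a different level would need a different critical value.

```python
def chi_square_2x2(xs, ys):
    """Pearson chi-square statistic for two Boolean sequences of equal length."""
    n = len(xs)
    observed = {(a, b): 0 for a in (False, True) for b in (False, True)}
    for x, y in zip(xs, ys):
        observed[(bool(x), bool(y))] += 1

    # Marginal totals of the 2x2 contingency table.
    row = {a: observed[(a, False)] + observed[(a, True)] for a in (False, True)}
    col = {b: observed[(False, b)] + observed[(True, b)] for b in (False, True)}

    stat = 0.0
    for (a, b), o in observed.items():
        e = row[a] * col[b] / n  # expected count under independence
        if e > 0:
            stat += (o - e) ** 2 / e
    return stat


CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def dependent(xs, ys, critical=CRITICAL_95):
    """Declare dependence when the statistic exceeds the critical value."""
    return chi_square_2x2(xs, ys) > critical


if __name__ == "__main__":
    same = [1] * 50 + [0] * 50
    print(chi_square_2x2(same, same), dependent(same, same))
```

A usage note: two identical columns of 50 ones and 50 zeros give a statistic of 100, far above the critical value, while two columns whose four value combinations occur equally often give a statistic of 0.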

Silverstein [6] does not assume that, in the case a -> b, a is the direct cause. This allows for the possible existence of hidden variables. Other assumptions, which it would be desirable to eliminate, are made: only Boolean data without missing values are assumed. For example, this method can be used on Boolean data extracted from market baskets.

By dispensing with the graphical representation, this approach is computationally viable. However, it does not overcome the difficulty of identifying structures with a given degree of confidence. Ultimately, these methods assume only a certain statistical correlation in a particular orientation among the variables to decide whether one variable causes another. Usually, the temporal order of the variables is assumed to be known in advance. When the temporal order is unknown, Pearl [8] introduces the notion of statistical time: any ordering of the variables that agrees with the causal structure.

5. CONCLUSION

5.1. Main Ideas of Causation in Knowledge Discovery in Databases

Data mining has techniques that make it possible, to some extent, to extract objects or information from large volumes of data; however, these objects do not necessarily represent knowledge.

The post-processing of the results delivered by data mining is one of the stages of the process of knowledge discovery in databases. It consists of processing the knowledge objects produced in earlier stages in order to simplify, clean, validate and visualize the extracted knowledge.

The transformation of these objects into knowledge depends on the interpretation made by the user. In many cases knowledge is obtained without much analysis of the objects; in other cases, however, an exhaustive and complicated analysis is required, and tools or techniques are needed to support the analysis that follows data mining.

Association rules indicate the strength of the association between two or more data attributes. The interest in association rules is that they offer the promise (or illusion) of causation, or at least of predictive relationships. However, association rules only calculate the frequency of joint occurrences of two or more data attributes; they do not express a causal relationship. If it were possible to discover causal relationships, this would be very useful for discovering knowledge. The objective of this work has been to explore causation in the context of data mining.

The discovery of causation in observed data relies on techniques that combine mathematical and philosophical elements. However, the problem is not solved. Algorithms for discovery in observational data generally use probabilistic correlation and independence. Applying these algorithms only allows one to state that two variables are statistically independent, from which it can be concluded that no associated causation exists. The converse, however, is not necessarily true. Given this situation, the user must carry out a very exhaustive manual analysis to determine whether causation exists among the association rules generated by data mining. Put another way, when two variables are correlated, how can one decide which one causes the other?

5.2. Future Work

In the course of this work, certain issues have been discussed that, although resolved in some cases, can be improved, leaving several lines of research open for future work. A research proposal is briefly described below.

The proposal is to adapt and apply (test), on a specific database, the algorithm proposed by Silverstein et al. (1998) in "Scalable Techniques for Mining Causal Structures" for finding causal structures in observed data. This algorithm uses the chi-square (χ2) statistical test to determine the independence of two variables in the universe of observed data. In this way, one obtains the subset of observed data on which the algorithm for searching for causation is applied.

This adaptation could be implemented by applying the association rule technique to a given database, selecting the rules that exceed a certain threshold, and using them to build a digraph in order to determine, by means of the algorithm, which of the obtained rules could correspond to causal relationships and which could not.
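The first steps of this proposal, selecting the rules above a threshold and assembling them into a digraph, could look like the sketch below. The rule format (antecedent, consequent, confidence), the choice of confidence as the thresholded measure and the adjacency-set representation are all assumptions made for illustration; the causal analysis itself is not shown.

```python
def rules_to_digraph(rules, threshold):
    """Build a digraph (dict: node -> set of successors) from selected rules.

    rules: iterable of (antecedent, consequent, confidence) tuples (assumed format).
    Only rules whose confidence exceeds the threshold contribute an edge.
    """
    graph = {}
    for antecedent, consequent, confidence in rules:
        if confidence > threshold:
            graph.setdefault(antecedent, set()).add(consequent)
            graph.setdefault(consequent, set())
    return graph


# Hypothetical rules, e.g. as produced by an association-rule miner.
example_rules = [
    ("butter", "bread", 1.00),
    ("bread", "milk", 0.75),
    ("milk", "bread", 0.75),
    ("bread", "butter", 0.50),
]

if __name__ == "__main__":
    print(rules_to_digraph(example_rules, threshold=0.7))
```

Each edge of the resulting digraph is only a candidate: it records an association that passed the threshold and still has to be examined by the causal algorithm.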

This work should make use of certainty factors in the search for association rules. To that end, the certainty factor proposed by Dr. Daniel Sánchez Fernández [5] would be applied (tested). The proposed factor is able to detect the degree of dependence or independence between the antecedent and the consequent of a rule. Testing the certainty factor proposed by Dr. Sánchez is important because other methods are usually unable to detect that degree of dependence or independence.
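As an illustration, a certainty factor for a rule A => C can be computed from the rule's confidence and the support of the consequent. The definition used below is the one commonly given for certainty factors of association rules and is stated here as an assumption, since the text does not spell out the formula; the function name is illustrative.

```python
def certainty_factor(confidence, consequent_support):
    """Certainty factor of a rule A => C (assumed definition).

    Positive when knowing A raises the probability of C, negative when it
    lowers it, and 0 when antecedent and consequent look independent.
    """
    if confidence > consequent_support:
        return (confidence - consequent_support) / (1 - consequent_support)
    if confidence < consequent_support:
        return (confidence - consequent_support) / consequent_support
    return 0.0


if __name__ == "__main__":
    # A rule with confidence 0.75 whose consequent appears in half the data.
    print(certainty_factor(0.75, 0.5))
```

Under this definition the value ranges from -1 to 1, and a rule with high confidence but an equally frequent consequent receives a factor of 0, which is the independence case that plain confidence cannot distinguish.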

This work would make it possible to evaluate the benefit an organization could obtain from searching for causation in its databases, applying the association rule technique to select the data to be analyzed with the algorithm proposed by Silverstein et al. (1998) [6].

6. BIBLIOGRAPHY

[1] Aristóteles. La Metafísica.

[2] Mario Bunge. Causalidad. Editorial Universitaria, Buenos Aires, 1978.

[3] Thomas Hobbes. Leviatán. Fondo de Cultura Económica, México, 1996 (1651).

[4] Jean Wahl. Introducción a la filosofía. Fondo de Cultura Económica, México, 1954.

[5] Fernando Berzal, Ignacio Blanco, Daniel Sánchez and María Amparo Vila. Measuring the accuracy and interest of association rules: A new framework. Department of Computer Science and Artificial Intelligence, University of Granada, E.T.S.I.I., March 2002.

[6] C. Silverstein, S. Brin, et al. "Scalable Techniques for Mining Causal Structures". Proceedings of the 1998 International Conference on Very Large Data Bases, NY, 1998, 594-605.

[7] G. Cooper. "A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships". Data Mining and Knowledge Discovery, v. 1, n. 2, 1997, 203-224.