
Frequent Pattern Mining (FPM) is a very popular data mining technique for finding useful patterns in data. Since it was introduced by [Agrawal et al., 1993], FPM has received a great deal of attention and abundant literature has been dedicated to this research (see [Han et al., 2007]).

In this chapter, we study the application of pattern mining in the supervised setting, where we have a specific class variable (the outcome) and we want to find patterns (defining subpopulations of data instances) that are important for explaining and predicting this variable. These patterns are presented to the user as if-then rules that are intuitive and easy to understand. An example of such a rule is: “If a patient smokes and has a positive family history, then he is at a significantly higher risk for lung cancer than the rest of the patients”. This task has high practical relevance in many domains of science and business. For example, finding a pattern that clearly and concisely defines a subpopulation of patients that respond better (or worse) to a certain treatment than the rest of the patients can speed up the validation of this finding and its future use in patient management.

We use FPM to explore the space of patterns because it performs a more systematic search than heuristic rule induction approaches, such as greedy sequential covering [Clark and Niblett, 1989, Cohen, 1995, Cohen and Singer, 1999, Yin and Han, 2003]. However, the disadvantage of FPM is that it often produces a large number of patterns. Moreover, many of these patterns are redundant because they are only small variations of each other. This large number of patterns (rules) easily overwhelms the domain expert and hinders the process of knowledge discovery. Therefore, it is crucial to devise an effective method for selecting a small set of predictive and non-redundant patterns from a large pool of frequent patterns.

To achieve this goal, we propose the Minimal Predictive Patterns (MPP) framework. This framework applies Bayesian inference to evaluate the quality of the patterns. In addition, it considers the structure of patterns to ensure that every pattern in the result offers a significant predictive advantage over all of its generalizations (simplifications). We present an efficient algorithm for mining the MPP set. As opposed to the widely used two-phase approach (see Section 2.4.4), our algorithm integrates pattern selection with frequent pattern mining. This integration allows us to prune large parts of the search space and thereby speed up the mining.

The rest of the chapter is organized as follows. Section 3.1 provides some definitions that will be used throughout the chapter. Section 3.2 describes the problem of supervised descriptive rule discovery. Section 3.3 describes the problem of pattern-based classification. Section 3.4 illustrates the problem of spurious patterns. Section 3.5 presents our approach for mining minimal predictive patterns. We start by defining a Bayesian score to evaluate the predictiveness of a pattern compared to a more general population (Section 3.5.1). Then we introduce the concept of minimal predictive patterns to deal with the problem of spurious patterns (Section 3.5.2). After that, we present our mining algorithm and introduce two effective pruning techniques (Section 3.5.3). Section 3.6 presents our experimental evaluation on several synthetic and publicly available datasets. Finally, Section 3.7 summarizes the chapter.

3.1 DEFINITIONS

We are interested in applying pattern mining in the supervised setting, where we have a special target variable Y (the class variable) and we want to find patterns that are important for describing and predicting Y. In this chapter, we focus on supervised pattern mining for relational attribute-value data D = {(x_i, y_i)}_{i=1}^n, where every instance x_i is described by a fixed number of attributes and is associated with a class label y_i ∈ dom(Y). We assume that all attributes have discrete values (numeric attributes must be discretized [Fayyad and Irani, 1993, Yang et al., 2005]). As we discussed in Section 2.1, the data can be converted into an equivalent transactional format.
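As a rough illustration of the conversion mentioned above, the sketch below maps one attribute-value instance to a set of (attribute, value) pairs. The function name `to_transaction` and the dictionary representation of an instance are assumptions made for this example; the exact encoding described in Section 2.1 may differ.

```python
def to_transaction(x):
    """Turn an attribute-value instance (a dict) into a set of items.

    Each item is an (attribute, value) pair, so two instances that agree
    on an attribute share the corresponding item.
    """
    return frozenset(x.items())


# Hypothetical instance with two discrete attributes.
instance = {"Education": "PhD", "Marital-status": "Single"}
transaction = to_transaction(instance)
```

Representing a transaction as a `frozenset` makes it hashable and lets set operations express containment tests directly.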

We call every attribute-value pair an item and a conjunction of items an itemset pattern, or simply a pattern. A pattern that contains k items is called a k-pattern (an item is a 1-pattern). For example, Education = PhD ∧ Marital-status = Single is a 2-pattern. Pattern P is a subpattern of pattern P', denoted as P ⊂ P', if every item in P is contained in P' and P ≠ P'. In this case, P' is a superpattern of P. For example, P1: Education = PhD is a subpattern of P2: Education = PhD ∧ Marital-status = Single. The subpattern (more-general-than) relation defines a partial ordering of patterns, i.e. a lattice structure, as shown in Figure 3.

Figure 3: The box on the left shows the set of all patterns and the box on the right shows the set of all instances. Each pattern is associated with a group of instances that satisfy the pattern. The patterns are organized in a lattice structure according to the subpattern-superpattern relation.
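The pattern definitions can be made concrete with a minimal sketch, assuming a pattern is stored as a frozenset of (attribute, value) items. The helper names `is_subpattern` and `subpatterns` are hypothetical and are not part of the chapter's algorithm.

```python
from itertools import combinations

# An item is an (attribute, value) pair; a pattern is a frozenset of items.
p1 = frozenset({("Education", "PhD")})                                # 1-pattern
p2 = frozenset({("Education", "PhD"), ("Marital-status", "Single")})  # 2-pattern


def is_subpattern(p, q):
    """True if every item of p is contained in q and p != q (P ⊂ P')."""
    return p < q  # proper-subset test on frozensets


def subpatterns(p):
    """All proper subpatterns (generalizations) of p, including the empty pattern."""
    items = sorted(p)
    return [frozenset(c) for k in range(len(items)) for c in combinations(items, k)]
```

Here `is_subpattern(p1, p2)` holds while `is_subpattern(p2, p1)` does not, which mirrors the partial ordering of the lattice; a k-pattern has 2^k − 1 proper subpatterns.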

Instance x_i satisfies pattern P, denoted as P ∈ x_i, if every item in P is present in x_i. Every pattern P defines a group (subpopulation) of the instances that satisfy P: G_P = {(x_i, y_i) : x_i ∈ D ∧ P ∈ x_i}. If we denote the empty pattern by φ, G_φ represents the entire data D. Note that P ⊂ P' (P is a subpattern of P') implies that G_P ⊇ G_P' (see Figure 3). The support of pattern P in dataset D, denoted as sup(P, D), is the number of instances in D that satisfy P (the size of G_P). Given a user-defined minimum support threshold σ, P is called a frequent pattern if sup(P, D) ≥ σ.
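The following sketch computes the group G_P, the support sup(P, D), and the frequency test for a toy dataset. The data, the class labels "high" and "low", and all function names are made up for illustration; instances are assumed to be dicts and patterns frozensets of (attribute, value) items.

```python
# Toy dataset D: each entry is (x_i, y_i); values are invented for illustration.
D = [
    ({"Education": "PhD", "Marital-status": "Single"}, "high"),
    ({"Education": "PhD", "Marital-status": "Married"}, "high"),
    ({"Education": "BSc", "Marital-status": "Single"}, "low"),
    ({"Education": "PhD", "Marital-status": "Single"}, "low"),
]

empty = frozenset()
p1 = frozenset({("Education", "PhD")})
p2 = frozenset({("Education", "PhD"), ("Marital-status", "Single")})


def satisfies(x, pattern):
    """True if every item of the pattern is present in instance x."""
    return all(x.get(attr) == val for attr, val in pattern)


def group(pattern, data):
    """G_P: the labeled instances whose attributes satisfy the pattern."""
    return [(x, y) for x, y in data if satisfies(x, pattern)]


def support(pattern, data):
    """sup(P, D): the number of instances in the data that satisfy the pattern."""
    return len(group(pattern, data))


def is_frequent(pattern, data, sigma):
    """True if sup(P, D) >= sigma, the minimum support threshold."""
    return support(pattern, data) >= sigma
```

On this data the empty pattern covers all four instances, and support can only shrink as items are added (p1 covers three instances, p2 covers two), which is the anti-monotone property implied by G_P ⊇ G_P'.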

mining rules that predict the class variable. Hence, a rule is defined as P ⇒ y, where P (the condition) is a pattern and y ∈ dom(Y) (the consequent) is a class label. We say that P ⇒ y is a subrule of P' ⇒ y' if P ⊂ P' and y = y'.

A rule is usually assessed by its coverage and confidence. The coverage of P ⇒ y, denoted as cov(P ⇒ y), is the proportion of instances in the data that satisfy P. The confidence of P ⇒ y, denoted as conf(P ⇒ y), is the proportion of instances from class y among all the instances that satisfy P. Using D_y to denote the instances in D that belong to class y:

conf(P ⇒ y) = sup(P, D_y) / sup(P, D)

We can see that the confidence of P ⇒ y is the maximum likelihood estimate of Pr(Y = y | G_P). Intuitively, if pattern P is predictive of class y, we expect conf(P ⇒ y) to be larger than the prior probability of y in the data.
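A small sketch of these two measures, again on invented data, shows how confidence can be compared against the class prior. The dataset, the labels, and the function names are assumptions for illustration only; raising an error for a pattern with zero support is a design choice of this sketch, since confidence is undefined in that case.

```python
# Toy dataset D: each entry is (x_i, y_i); values are invented for illustration.
D = [
    ({"Education": "PhD", "Marital-status": "Single"}, "high"),
    ({"Education": "PhD", "Marital-status": "Married"}, "high"),
    ({"Education": "BSc", "Marital-status": "Single"}, "low"),
    ({"Education": "PhD", "Marital-status": "Single"}, "low"),
]

p1 = frozenset({("Education", "PhD")})


def support(pattern, data):
    """sup(P, D): number of instances whose attributes contain every item of P."""
    return sum(all(x.get(a) == v for a, v in pattern) for x, _ in data)


def coverage(pattern, data):
    """cov(P => y): proportion of instances in the data that satisfy P."""
    return support(pattern, data) / len(data)


def confidence(pattern, label, data):
    """conf(P => y) = sup(P, D_y) / sup(P, D)."""
    d_y = [(x, y) for x, y in data if y == label]  # D_y: instances of class y
    sup_all = support(pattern, data)
    if sup_all == 0:
        raise ValueError("confidence is undefined for a pattern with zero support")
    return support(pattern, d_y) / sup_all


def prior(label, data):
    """Proportion of instances in the data that belong to the given class."""
    return sum(y == label for _, y in data) / len(data)
```

For the rule Education = PhD ⇒ high, two of the three matching instances belong to class "high", so the confidence exceeds the prior of one half; this is the intuition stated above, before any correction for sample size.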