Chapter 8. Classification of Imbalanced Databases Using Significant Rules
Florian Verhein
The application of association rule mining to classification has led to a new family of classifiers, often referred to as Associative Classifiers (ACs). The advantage of ACs is that they are easy to interpret and perform a global search, whereas many other rule-based approaches use a greedy search strategy.
Rule-based classifiers can play an important role in applications such as medical diagnosis and fraud detection, where data sets are almost always imbalanced. The focus of this chapter is to extend ACs to classification on imbalanced data sets using statistics-based techniques.
This work combines the use of statistically significant rules with a new measure, the Class Correlation Ratio (CCR), to build an AC called SPARCCC. A detailed set of experiments shows that, in terms of classification quality, SPARCCC performs comparably on balanced data sets and greatly outperforms other AC techniques on imbalanced data sets. It also has a significantly smaller rule base and is more computationally efficient than traditional support-confidence based associative classifiers.
8.1 Introduction
Since the introduction of CBA [61], many variations on Associative Classifiers (ACs) have been proposed in the literature [60, 13, 110, 100, 28, 30, 15, 89]. Most ACs are based on rules discovered using the support-confidence paradigm, and the classifier itself is a collection of rules ranked using confidence or a variation thereof. In many application domains, the data sets are imbalanced, i.e., the proportion of samples from one class is much smaller than that of the other class(es). Additionally, the smaller class is the class of interest. Examples include fraud detection and medical diagnosis. Unfortunately, the support-confidence framework does not perform well in such cases.
Many of the rules mined using support-confidence are spurious: they are irregularities in the data rather than properties of the underlying population or process. This motivated the statistically significant rules proposed by Webb [103]. The same holds true of rules used for classification. It is also well known that confidence has non-intuitive properties in imbalanced data sets. For example, high-confidence rules can also be negatively correlated. This chapter combines statistically significant rules with a new measure, the Class Correlation Ratio (CCR), which leads to a better classifier. Furthermore, the proposed method does not use the support-confidence paradigm.
8.1.1 Contributions
This chapter makes the following contributions:
• It proposes the Class Correlation Ratio (CCR), which measures the relative class correlation of a rule. A high CCR is desirable because it means the rule is more positively correlated with the class it predicts than with the alternative(s). CCR also forms the basis of an effective rule ranking method that does not require confidence.
• It proves that confidence and support are biased toward the majority class in imbalanced data sets in the context of CCR. This result also motivates a correction for confidence's bias, and is a key component in making the classifier perform well on both balanced and imbalanced data sets.
• An associative classifier is proposed that is based on statistical techniques. The classifier is called Significant, Positively Associated and Relatively Class Correlated Classification (SPARCCC) because it uses only rules that are statistically significant and positively associated, and where the antecedent is more correlated with the class it predicts than with the other class(es). It also searches directly for significant rules and uses this to prune the search space. SPARCCC outperforms support-confidence based associative classifiers on balanced data sets in terms of computational performance, and on imbalanced data sets in both computational and classification performance. SPARCCC is parameter-free, in the sense that it does not use thresholds except standard levels of significance to prune rules. Finally, since the rules are statistically significant and relatively class correlated, they may be examined for insights into the data.
8.1.2 Organisation
The remainder of this chapter is organised as follows: Section 8.2 gives a brief background in associative classification. Section 8.3 describes the class correlation ratio and the significance test used. Section 8.4 proves that confidence (and support) are biased against the minority class under CCR. Section 8.5 describes the SPARCCC technique. Section 8.7 contains experimental results. Related work is surveyed in Section 8.8, and the chapter concludes in Section 8.9.
8.2 Background: Associative Classification
8.2.1 Association Rule Mining
In Association Rule Mining (ARM), the data is a set of transactions $T = \{t_1, ..., t_{|T|}\}$, each of which is a subset of the set of items: $t_i \subseteq I$, $I = \{i_1, ..., i_{|I|}\}$. The support of an itemset $X \subseteq I$ is $sup(X) = |\{t_i : X \subseteq t_i \wedge t_i \in T\}|$. An association rule $X \to Y$ is an implication between two mutually exclusive itemsets $X$ and $Y$. The support of $X \to Y$ is $sup(X \to Y) = sup(X \cup Y)$ and its confidence, an estimate of the probability that $Y$ occurs given that $X$ occurs, is $conf(X \to Y) = sup(X \to Y) / sup(X)$.
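The definitions of support and confidence above can be sketched in a few lines of Python. This is only an illustration: the function names `sup` and `conf` mirror the notation of the text, while the toy transactions and item names are hypothetical and not taken from the chapter.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def sup(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Absolute support: number of transactions containing every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def conf(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Confidence of the rule X -> Y, i.e. sup(X u Y) / sup(X)."""
    x, y = frozenset(x), frozenset(y)
    if x & y:
        raise ValueError("X and Y must be mutually exclusive itemsets")
    sup_x = sup(x, transactions)
    if sup_x == 0:
        raise ValueError("sup(X) is zero, so the confidence is undefined")
    return sup(x | y, transactions) / sup_x


# Hypothetical toy data, for illustration only.
T = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(sup({"bread"}, T))               # support of the itemset {bread}
print(conf({"bread"}, {"milk"}, T))    # confidence of bread -> milk
```

Support is kept as an absolute count here, matching the definition in the text; dividing by `len(T)` would give the relative support used in some other formulations.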
8.2.2 Associative Classification
This chapter assumes a discrete data set $D$ with attributes $A = \{a_1, a_2, ..., a_{|A|}\}$, one of which is the class attribute $a_c$. In every instance $d \in D$, each attribute $a_i \in A$ takes one of a finite number of possible values $V_i = \{v_{i,1}, ..., v_{i,|V_i|}\}$, so that $d = [v_{1,j}, v_{2,k}, ..., v_{|A|,l}]$ (for some $j, k, ..., l$). As an ARM task, the attribute-value pairs become items (namely, $i_{|V_1| + ... + |V_{i-1}| + j} \equiv (a_i = v_{i,j})$) and the instances become corresponding transactions. The previous instance $d$ then becomes the transaction $t = \{(a_1 = v_{1,j}), (a_2 = v_{2,k}), ..., (a_{|A|} = v_{|A|,l})\}$. Clearly, there will be $\sum_{i=1}^{|A|} |V_i| = |I|$ items and each transaction will have size $|A|$. Since the described mapping is a bijection, one can freely interchange instances and transactions when convenient.
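The mapping from instances to transactions can be illustrated with a minimal Python sketch. The helper names `to_transaction` and `item_index`, as well as the attribute domains, are hypothetical and introduced only for this example.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Item = Tuple[str, str]  # an attribute-value pair (a_i, v_ij)


def to_transaction(instance: Sequence[str], attributes: Sequence[str]) -> FrozenSet[Item]:
    """Turn an instance d = [v_1, ..., v_|A|] into a set of attribute-value items."""
    if len(instance) != len(attributes):
        raise ValueError("an instance needs exactly one value per attribute")
    return frozenset(zip(attributes, instance))


def item_index(domains: Dict[str, List[str]]) -> Dict[Item, int]:
    """Number the items 1..|I|: attribute by attribute, value by value."""
    index: Dict[Item, int] = {}
    for attribute, values in domains.items():
        for value in values:
            index[(attribute, value)] = len(index) + 1
    return index


# Hypothetical domains V_i; "play" stands in for the class attribute a_c.
domains = {
    "outlook": ["sunny", "rain"],
    "windy": ["yes", "no"],
    "play": ["yes", "no"],
}
attributes = list(domains)

d = ["rain", "no", "yes"]
t = to_transaction(d, attributes)
items = item_index(domains)

print(sorted(t))
print(len(items), len(t))  # |I| is the sum of the |V_i|; |t| equals |A|
```

Because each attribute contributes exactly one item to a transaction, the original instance can be recovered from the transaction, which is the bijection the text relies on.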
8.2.3 Associative Classification Rule Mining
The Associative Classification Rule Mining (ACRM) task is to find interesting rules $X \to y$, where $X$ is a set of legal attribute-value pairs (an attribute cannot occur more than once) and $y$ is one of the class attribute-value pairs. Interesting rules are rules that, in conjunction with other mined rules, are likely to perform well for classification of unseen data.
8.3 Significance and Class Correlation Ratio for Rules
8.3.1 Fisher's Exact Test
There are strong arguments for mining statistically significant rules [103]. These also hold true when the rules are used for classification, as one would like to make a decision based on significant evidence.
Support is often used as a measure of significance, the reasoning being that rules with high support are intuitively more likely to capture the underlying process generating the data, rather than being artifacts of the data set or generated by noise. However, this is simply not the case, and one can easily generate counterexamples showing insignificant high-support rules or significant low-support rules.
This work considers rules $X \to y$ that are statistically significant in the positively associated direction. Toward that end, Fisher's Exact Test (FET) is used on contingency tables of the form of figure 8.1.¹
¹ Statistical tests on such tables determine whether there is a significant association between the variables, compared to the null hypothesis of no association. If the sampling scheme is such that only the total ($n$) is fixed (or it is unrestricted), then the null hypothesis is that the variables are independent of each other, in the sense that the probability of falling into a particular row is independent of the column a particular subject is in, and vice versa [31]. (This symmetry means that the tests give the same result for $[a, b; c, d]$ as they do for $[a, c; b, d]$.) For example, $P(V_1, V_2) = P(V_1) \cdot P(V_2)$. Statistical tests compute the probability (the p-value) of obtaining a table at least as unusual as the observed table. If the p-value is below a level of significance, then there is assumed to be sufficient evidence to reject the null hypothesis, and therefore we can say with some confidence level that the variables are correlated.
FET is an exact test (permutation test) that computes the p-value of an observed contingency table by explicitly calculating the probability of different table configurations, rather than using an approximate or limiting distribution. This work uses the positive one-sided FET to test whether rules are significant in the positive direction. Given a table $[a, b; c, d]$, FET finds the probability (p-value) of obtaining the given table, or a table where $X$ and $y$ are more positively associated, under the null hypothesis that $\{X, \neg X\}$ and $\{y, \neg y\}$ are independent and that the margin sums are fixed. With $n = a + b + c + d$, the p-value is given by:
$$p([a, b; c, d]) = \sum_{i=0}^{\min(b,c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$$
Only rules whose p-values are below the desired level of significance are used, as they are statistically significant in the positively associated direction.
FET's continuous approximation, the $\chi^2$ test, could also be used, but since it is a two-sided test it cannot distinguish positive from negative associations and is thus less desirable.
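The one-sided p-value above can be computed directly from its definition. The following is a minimal sketch, not the implementation used in this work: the function name `fet_positive` is hypothetical, and the cell layout (rows $X$, $\neg X$; columns $y$, $\neg y$) is an assumption chosen to be consistent with the formulas of this chapter. Each term of the sum is written with binomial coefficients, which is algebraically the same as the factorial form.

```python
from math import comb


def fet_positive(a: int, b: int, c: int, d: int) -> float:
    """Positive one-sided Fisher's Exact Test for the table [a, b; c, d].

    Assumed layout: a = |X and y|, b = |X and not y|,
                    c = |not X and y|, d = |not X and not y|.
    Returns the probability, with all margin sums fixed, of the observed
    table or of any table in which X and y are more positively associated.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    # Number of ways to place the (a + c) occurrences of y among n instances.
    total = comb(n, a + c)
    # Shifting i units from the off-diagonal (b, c) onto the diagonal (a, d)
    # keeps every margin fixed and strengthens the positive association.
    favourable = sum(
        comb(a + b, a + i) * comb(c + d, c - i)
        for i in range(min(b, c) + 1)
    )
    return favourable / total


alpha = 0.05  # a standard level of significance
p = fet_positive(100, 20, 20, 10)
print(p, p < alpha)
```

Exact integer arithmetic is used until the final division, so very small p-values are not lost to floating-point underflow in intermediate factorials. Where SciPy is available, `scipy.stats.fisher_exact` with `alternative="greater"` should give the same one-sided result for the table `[[a, b], [c, d]]`.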
8.3.2 Correlation (Interest Factor)
Correlation also forms an important component of the technique in this work. This chapter proposes that rules $X \to y$ should be used when $X$ is more positively correlated with $y$ than it is with $\neg y$. The following definition of correlation² is used:
$$\widehat{corr}(X \to y) = \frac{sup(X \cup y) \cdot |D|}{sup(X) \cdot sup(y)} = \frac{a \cdot n}{(a+c) \cdot (a+b)}$$
$X$ and $y$ are positively (negatively) correlated if $\widehat{corr}(X \to y) > 1$ ($< 1$), and independent otherwise. Note that $\widehat{corr}(X \to y) = I(X, y)$, where $I(X, y)$ is the Interest Factor [88]. This measure has downsides when used by itself. Increasing the size of the data set by increasing $d$ (refer to figure 8.1) increases the correlation between $X$ and $y$, even though it is actually increasing the association between $\neg X$ and $\neg y$. The reverse holds for decreasing $d$.
Example 8.1. Consider the table $T_1 = [100, 20; 20, 10]$, where $X$ and $y$ have a strong association but $\widehat{corr}(X \to y) = 1.04$ (almost independent). If $d$ is increased to get $T_2 = [100, 20; 20, 200]$, then clearly $\neg X$ and $\neg y$ are strongly associated, but $\widehat{corr}(\neg X \to \neg y) = 1.4$ while now $\widehat{corr}(X \to y) = 2.36$. This is clearly undesirable.
² To be more precise, $\widehat{corr}(X \to y)$ is the estimate of $corr(X \to y) = \frac{P(X \cup y \subseteq t)}{P(X \subseteq t) \cdot P(y \in t)}$, where $corr(X \to y)$ is defined over the underlying process that generates the data.
This problem arises only in imbalanced data sets; note that changing $d$ alters the class distribution.
Therefore, SPARCCC does not search for positively correlated rules using $\widehat{corr}$. When a rule is described as being positively associated or correlated, this is meant in the sense of the one-sided test of significance using FET. FET does not have the downside described above because of the constant margin sum restriction. Indeed, $p(T_1) = 0.041$ (significant at the 0.05 level) and $p(T_2) = 1.07 \cdot 10^{-44}$ (highly significant).
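The correlation values quoted in Example 8.1 can be reproduced with a short sketch. The names `corr_hat` and `negate` are hypothetical, and the cell layout (rows $X$, $\neg X$; columns $y$, $\neg y$) is assumed from the correlation formula above.

```python
from typing import Tuple

Table = Tuple[int, int, int, int]  # (a, b, c, d) for the table [a, b; c, d]


def corr_hat(a: int, b: int, c: int, d: int) -> float:
    """Estimated correlation (Interest Factor) of X -> y: a*n / ((a+c)*(a+b))."""
    n = a + b + c + d
    if (a + c) == 0 or (a + b) == 0:
        raise ValueError("sup(X) and sup(y) must both be non-zero")
    return a * n / ((a + c) * (a + b))


def negate(table: Table) -> Table:
    """Table seen from the rule (not X) -> (not y): d takes the place of a."""
    a, b, c, d = table
    return d, c, b, a


T1: Table = (100, 20, 20, 10)
T2: Table = (100, 20, 20, 200)  # same as T1 except that d is increased

print(round(corr_hat(*T1), 2))          # 1.04
print(round(corr_hat(*T2), 2))          # 2.36
print(round(corr_hat(*negate(T2)), 2))  # 1.4
```

Only `d` differs between the two tables, yet the estimate for $X \to y$ more than doubles, which is the undesirable behaviour the example points out.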
8.3.3 Class Correlation Ratio
SPARCCC uses $\widehat{corr}(\cdot)$ to measure how correlated $X$ is with $y$ compared to $\neg y$, using the proposed Class Correlation Ratio (CCR):
Definition 8.2. The Class Correlation Ratio (CCR) is defined as:
$$CCR(X \to y) = \frac{\widehat{corr}(X \to y)}{\widehat{corr}(X \to \neg y)} = \frac{a \cdot (b+d)}{b \cdot (a+c)}$$
The CCR measures how much more positively the antecedent is correlated with the class it predicts, relative to the alternative class(es). This avoids the downsides of using an absolute correlation measure; indeed, the factors $n$ and $sup(X) = a + b$ cancel out. Furthermore, intuitively one would not want to use a rule that is more correlated with classes other than the one it predicts!
Example 8.3. Returning to Example 8.1, $CCR(X \to y) = 1.25$ for $T_1$ and $CCR(X \to y) = 9.17$ for $T_2$. This also says that $X \to y$ is a better rule under $T_2$ than under $T_1$. This is true: the rule is much more discriminative under $T_2$, because under $T_1$, $y$ is already the majority class and therefore the rule does not provide much additional information. In fact, the information gain of using $X \to y$ over $\emptyset \to y$ is only 0.072 bits under $T_1$ but is 0.215 bits under $T_2$. Recall also that the rule was much more significant under $T_2$.
SPARCCC uses only rules with $CCR > 1$, so that no rules are used that are more positively associated with the classes they do not predict. Furthermore, CCR is also used in the strength score, which is used to rank rules for classification in order to correct for the bias of confidence. This will be covered shortly.
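As an illustration of Definition 8.2 and the $CCR > 1$ filter, the following sketch computes the CCR from the cells of a contingency table. The function name `ccr` is hypothetical, the cell layout (rows $X$, $\neg X$; columns $y$, $\neg y$) is assumed from the formula, and returning infinity when $b = 0$ is a choice made for this example rather than something specified in the text.

```python
import math


def ccr(a: int, b: int, c: int, d: int) -> float:
    """Class Correlation Ratio of X -> y: a*(b+d) / (b*(a+c))."""
    if (a + b) == 0:
        raise ValueError("sup(X) is zero, so the rule never applies")
    if (a + c) == 0 or (b + d) == 0:
        raise ValueError("both y and not-y must occur in the data")
    if b == 0:
        # Assumption for this sketch: X never occurs with not-y,
        # so the ratio is treated as infinite.
        return math.inf
    return a * (b + d) / (b * (a + c))


# Tables from Example 8.1, keyed by a label for display.
tables = {
    "T1": (100, 20, 20, 10),
    "T2": (100, 20, 20, 200),
}

for name, cells in tables.items():
    value = ccr(*cells)
    print(name, round(value, 2), "kept" if value > 1 else "discarded")
```

Both tables pass the $CCR > 1$ filter, and the ratio is considerably larger under $T_2$, in line with Example 8.3.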