PART II: METHODOLOGICAL FRAMEWORK
3.3 Instrument for data collection
This algorithm of Clark and Niblett’s was sketched earlier. It aims to modify the basic AQ algorithm of Michalski in such a way as to equip it to cope with noise and other complications in the data. In particular, during its search for good complexes CN2 does not automatically remove from consideration a candidate that is found to include one or more negative examples. Rather, it retains a set of complexes in its search that is evaluated statistically as covering a large number of examples of a given class and few of other classes. Moreover, the manner in which the search is conducted is general-to-specific. Each trial specialisation step takes the form of either adding a new conjunctive term or removing a disjunctive one. Having found a good complex, the algorithm removes those examples it
74 Machine Learning of rules and trees [Ch. 5
covers from the training set and adds the rule “if <complex> then predict <class>” to the end of the rule list. The process terminates for each given class when no more acceptable complexes can be found.
Clark & Niblett’s (1989) CN2 algorithm has the following main features: 1) the dependence on specific training examples during search (a feature of the AQ algorithm) is removed; 2) it combines the efficiency and ability to cope with noisy data of decision-tree learning with the if-then rule form and flexible search strategy of the AQ family; 3) it contrasts with other approaches to modify AQ to handle noise in that the basic AQ algorithm itself is generalised rather than “patched” with additional pre- and post-processing techniques; and 4) it produces both ordered and unordered rules.
CN2 takes a set of training examples µ as input and outputs a list of rules called rule_list. The core of CN2 is the following procedure, which relies on a sub-procedure to return the value of best_cpx:
1. Let rule_list be the empty list;
2. Let best_cpx be the best complex found from µ;
3. If best_cpx or µ is empty then stop and return rule_list;
4. Remove the examples covered by best_cpx from µ and add the rule “if best_cpx then class = m” to the end of rule_list, where m is the most common class of examples covered by best_cpx; re-enter at step (2).
This procedure is used for producing ordered rules. CN2 can also produce a set of unordered rules, using a slightly different procedure: the above procedure is repeated for each class in turn and, in step 4, only the positive examples are removed.
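The outer covering loop for ordered rules can be sketched as follows. This is a minimal illustration rather than the StatLog implementation: complexes are assumed to be dictionaries of attribute-value selectors, and `toy_finder` is a hypothetical stand-in for the best-complex search that scores single selectors by purity and coverage.

```python
from collections import Counter

def covers(cpx, attrs):
    """True if every selector (attribute == value) in the complex holds."""
    return all(attrs.get(a) == v for a, v in cpx.items())

def cn2_ordered(examples, find_best_complex):
    """Outer covering loop; examples are (attribute_dict, class_label) pairs."""
    remaining = list(examples)            # working copy of the training set
    rule_list = []                        # step 1
    while remaining:                      # step 3: stop when no examples remain
        best_cpx = find_best_complex(remaining)        # step 2
        if best_cpx is None:              # step 3: stop when no complex is found
            break
        covered = [e for e in remaining if covers(best_cpx, e[0])]
        if not covered:                   # extra guard (assumption) so the loop ends
            break
        majority = Counter(label for _, label in covered).most_common(1)[0][0]
        rule_list.append((best_cpx, majority))         # step 4: append the rule
        remaining = [e for e in remaining if not covers(best_cpx, e[0])]
    return rule_list

def toy_finder(examples):
    """Hypothetical finder: best single selector by (purity, coverage)."""
    selectors = sorted({(a, v) for attrs, _ in examples for a, v in attrs.items()})
    best, best_key = None, None
    for a, v in selectors:
        labels = [lab for attrs, lab in examples if attrs.get(a) == v]
        top = Counter(labels).most_common(1)[0][1]
        key = (top / len(labels), len(labels))
        if best_key is None or key > best_key:
            best, best_key = {a: v}, key
    return best
```

For the unordered variant, the same loop would be run once per class, removing only the covered examples of that class in step 4.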
The procedure for finding the best complex is as follows:
1. Let the set star contain only the empty complex and best_cpx be nil;
2. Let selectors be the set of all possible selectors;
3. If star is empty, then return the current best_cpx;
4. Specialise all complexes in star as newstar, which is the set $\{x \wedge y \mid x \in \text{star},\ y \in \text{selectors}\}$, and remove all complexes in newstar that are either in star (i.e. the unspecialised ones) or are null (e.g. $\text{big} = \text{yes} \wedge \text{big} = \text{no}$);
5. For every complex $C_i$ in newstar, if $C_i$ is statistically significant (in significance) when tested on µ and better than (in goodness) best_cpx according to user-defined criteria when tested on µ, then replace the current value of best_cpx by $C_i$; remove all worst complexes from newstar until the size of newstar is below the user-defined maximum; set star to newstar and re-enter at step (3).
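A rough sketch of this best-complex search is given below. It makes several assumptions: complexes are frozensets of `(attribute, value)` selectors, goodness defaults to a Laplace-corrected accuracy, and the significance test is replaced by a placeholder (`covers_something`) that only checks for non-empty coverage. All function names are illustrative.

```python
from collections import Counter

def covers(cpx, attrs):
    """A complex covers an example if all of its selectors hold."""
    return all(attrs.get(a) == v for a, v in cpx)

def is_null(cpx):
    """Null complex: two selectors test the same attribute for different values."""
    attributes = [a for a, _ in cpx]
    return len(attributes) != len(set(attributes))

def laplace_accuracy(cpx, examples):
    """Assumed goodness measure: (majority count + 1) / (covered + classes)."""
    n_classes = len({label for _, label in examples})
    counts = Counter(label for attrs, label in examples if covers(cpx, attrs))
    return (max(counts.values(), default=0) + 1) / (sum(counts.values()) + n_classes)

def covers_something(cpx, examples):
    """Placeholder for the significance test (assumption, not the real statistic)."""
    return any(covers(cpx, attrs) for attrs, _ in examples)

def find_best_complex(examples, selectors, goodness=laplace_accuracy,
                      is_significant=covers_something, max_star=5):
    star = {frozenset()}                       # step 1: only the empty complex
    best_cpx, best_score = None, None
    while star:                                # step 3: stop when star is empty
        # Step 4: conjoin every complex with every selector, then drop
        # unchanged and null complexes.
        newstar = {x | {y} for x in star for y in selectors}
        newstar = {c for c in newstar if c not in star and not is_null(c)}
        ranked = sorted(newstar, key=lambda c: (-goodness(c, examples), sorted(c)))
        for c in ranked:                       # step 5: update the best complex
            score = goodness(c, examples)
            if is_significant(c, examples) and (best_score is None or score > best_score):
                best_cpx, best_score = c, score
        star = set(ranked[:max_star])          # keep at most max_star complexes
    return best_cpx
```

The loop terminates because every surviving specialisation has one more selector than its parent, and a non-null complex cannot have more selectors than there are attributes.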
As can be seen from the algorithm, the basic operation of CN2 is that of generating a complex (i.e. a conjunct of attribute tests) which covers (i.e. is satisfied by) a subset of the training examples. This complex forms the condition part of a production rule “if condition then class = m”, where m is the most common class in the (training) examples which satisfy the condition. The condition is a conjunction of selectors, each of which represents a test on the values of an attribute such as “weather=wet”. The search proceeds in both AQ and CN2 by repeatedly specialising candidate complexes until one which covers a large number of examples of a single class and few of other classes is located. Details of each search are outlined below.
Sec. 5.2] StatLog’s ML algorithms 75
The search for specialisations
The CN2 algorithm works in an iterative fashion, each iteration searching for a complex covering a large number of examples of a single class m and few of other classes. (The examples of the current class are called “positive” examples and the other examples are called “negative” examples.) The complex must be both predictive and reliable, as determined by CN2’s evaluation functions. Having found a good complex, those examples it covers are removed from the training set and the rule “if <complex> then class = m” is added to the end of the rule list. This greedy process iterates until no more satisfactory complexes can be found.
To generate a single rule, CN2 first starts with the most general rule “if true then class=c” (i.e. all examples are class c), where c is the current class. Then CN2 searches for complexes by carrying out a general-to-specific beam search. The extent of the beam search for a complex can be regulated by controlling the width (i.e. number of complexes explored in parallel) of the beam. At each stage in the search, CN2 retains a size-limited set or star of “complexes explored so far”. The system examines only specialisations of this set, carrying out a beam search of the space of complexes. A complex is specialised by adding a new conjunctive term or by removing a disjunctive element in one of its selectors. Each complex can be specialised in several ways, and CN2 generates and evaluates all such specialisations. The star is trimmed after completion of this step by removing its lowest ranking elements as measured by an evaluation function that we will describe shortly.
The implementation of the specialisation step in CN2 is to repeatedly intersect the set of all possible selectors with the current star, eliminating all the null and unchanged elements in the resulting set of complexes. (A null complex is one that contains a pair of incompatible selectors, for example, $\text{big} = \text{yes} \wedge \text{big} = \text{no}$.)
Search heuristics
There are two heuristics used in the search for the best complexes and both can be tuned by the user depending on the specific domain: the significance level and the goodness measure. Significance is an absolute threshold such that any (specialised) complexes below the threshold will not be considered for selecting the best complex (but they are still used for further specialisation in the star). Goodness is a measure of the quality of the complexes so it is used to order the complexes that are above the significance threshold to select the best complex.
Several different functions can be chosen to guide the search for a good rule in the CN2/AQ system, for example:
“Number of correctly classified examples divided by total number covered”. This is the traditional AQ evaluation function.
Entropy, similar to the information gain measure used by ID3 (Quinlan, 1986) and other decision tree algorithms.
The Laplacian error estimate: $(k - k_c + d - 1)/(k + d)$, where $k$ is the total number of examples covered by the rule, $k_c$ is the number of positive examples covered by the rule and $d$ is the number of classes in the problem.
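The three candidate goodness measures listed above can be sketched as small functions. The names are illustrative, and the entropy here is assumed to use base-2 logarithms over the class counts of the covered examples; the source does not fix the base.

```python
import math

def covered_accuracy(n_correct, n_covered):
    """Traditional AQ measure: correctly classified / total covered."""
    return n_correct / n_covered

def entropy(class_counts):
    """Entropy (in bits, an assumption) of the class counts among covered examples."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def laplace_error(k, k_c, d):
    """Laplacian error estimate: k covered, k_c positive, d classes."""
    return (k - k_c + d - 1) / (k + d)
```

Lower entropy and lower Laplacian error both indicate a purer rule, whereas a higher accuracy ratio is better, so an implementation has to take care with the direction of each comparison.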
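The intersection of set $A$ with set $B$ is the set $\{x \wedge y \mid x \in A,\ y \in B\}$. For example, using ‘·’ to abbreviate ‘$\wedge$’, $\{a \cdot b,\ a \cdot c,\ b \cdot d\}$ intersected with $\{a, b, c, d\}$ is $\{a \cdot b,\ a \cdot b \cdot c,\ a \cdot b \cdot d,\ a \cdot c,\ a \cdot c \cdot d,\ b \cdot d,\ b \cdot c \cdot d\}$. If we now remove unchanged elements in this set we obtain $\{a \cdot b \cdot c,\ a \cdot b \cdot d,\ a \cdot c \cdot d,\ b \cdot c \cdot d\}$.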
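The set intersection used for specialisation can be checked with a few lines of Python. In this sketch the letters a to d are hypothetical selectors, and each complex is modelled as a frozenset so that conjunction becomes set union.

```python
def intersect(set_a, set_b):
    """All pairwise conjunctions x AND y with x from set_a and y from set_b."""
    return {x | y for x in set_a for y in set_b}

# A star of three complexes and the four single-selector complexes.
star = {frozenset("ab"), frozenset("ac"), frozenset("bd")}
selectors = {frozenset(s) for s in "abcd"}

product = intersect(star, selectors)   # includes unchanged complexes
specialised = product - star           # remove the unchanged elements
```

Because conjunction is idempotent, intersecting a complex with a selector it already contains returns the complex unchanged, which is why the unchanged elements have to be filtered out afterwards.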
CN2 uses one of these criteria according to the user’s choice to order the goodness of rules. To test significance, CN2 uses the entropy statistic. This is given by:
$$2 \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{q_i}\right),$$
where the distribution $p_1, \ldots, p_n$ is the observed frequency distribution of examples among the $n$ classes satisfying a given complex and $q_1, \ldots, q_n$ is the expected frequency distribution of the same number of examples under the assumption that the complex selects examples randomly. This is taken as the $\sum_{i} p_i$ covered examples distributed among classes with the same probability as that of examples in the entire training set. This statistic provides an information-theoretic measure of the (non-commutative) distance between the two distributions. The user provides a threshold of significance below which rules are rejected.
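As an illustration of the significance test, the sketch below computes the statistic from the class counts of the covered examples and the class counts of the whole training set. The natural logarithm is an assumption (the source only writes "log"), and the function name is hypothetical.

```python
import math

def significance(observed, class_totals):
    """2 * sum(p_i * log(p_i / q_i)) with q_i the expected count of class i.

    observed     -- class counts among the examples covered by the complex
    class_totals -- class counts in the entire training set
    """
    n_covered = sum(observed)
    n_total = sum(class_totals)
    statistic = 0.0
    for p, total in zip(observed, class_totals):
        q = n_covered * total / n_total   # expected count under random selection
        if p > 0:                          # terms with p_i = 0 contribute nothing
            statistic += p * math.log(p / q)
    return 2.0 * statistic
```

A complex whose covered examples mirror the overall class proportions scores zero, and the score grows as the covered distribution departs from those proportions.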
Missing values
Similar to NewID, CN2 can deal with unknown or don’t-care values. During rule generation, a similar policy of handling unknowns and don’t-cares is followed: unknowns are split into fractional examples and don’t-cares are duplicated.
Each rule produced by CN2 is associated with a set of counts which corresponds to the number of examples, covered by the rule, belonging to each class. Strictly speaking, for the ordered rules the counts attached to rules when writing the rule set should be those encountered during rule generation. However, for unordered rules, the counts to attach are generated after rule generation in a second pass, following the execution policy of splitting an example with unknown attribute value into equal fractions for each value rather than the Laplace-estimated fractions used during rule generation.
When normally executing unordered rules without unknowns, for each rule which fires the class distribution (i.e. distribution of training examples among classes) attached to the rule is collected. These are then summed. Thus a training example satisfying two rules with attached class distributions [8,2] and [0,1] has an expected distribution [8,3], which results in $m_1$ being predicted, or $m_1 : m_2 = 8/11 : 3/11$ if probabilistic classification is desired. The built-in rule executer follows the first strategy (the example is classed simply as $m_1$).
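The summation of class distributions for unordered rules is simple enough to sketch directly. The function name is illustrative, and exact fractions are used so that the 8/11 : 3/11 split from the example can be reproduced without rounding.

```python
from fractions import Fraction

def classify_unordered(fired_distributions):
    """Sum the class distributions of all fired rules.

    Returns (summed counts, class probabilities, index of predicted class).
    """
    totals = [sum(column) for column in zip(*fired_distributions)]
    grand_total = sum(totals)
    probabilities = [Fraction(t, grand_total) for t in totals]
    predicted = max(range(len(totals)), key=totals.__getitem__)
    return totals, probabilities, predicted

totals, probabilities, predicted = classify_unordered([[8, 2], [0, 1]])
```

Here `predicted` is a zero-based class index, so index 0 corresponds to the first class in each distribution.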
With unordered CN2 rules, an attribute test whose value is unknown in the training example causes the example to be examined. If the attribute has three values, 1/3 of the example is deemed to have passed the test and thus the final class distribution is weighted by 1/3 when collected. A similar rule later will again cause 1/3 of the example to pass the test. A don’t-care value is always deemed to have passed the attribute test in full (i.e. weight 1). The normalisation of the class counts means that an example with a don’t-care can only count as a single example during testing, unlike NewID where it may count as representing several examples.
With ordered rules, a similar policy is followed, except that after a rule has fired absorbing, say, 1/3 of the testing example, only the remaining 2/3 is sent down the remainder of the rule list. The first rule will cause 1/3 × class frequency to be collected, but a second similar rule will cause 2/3 × 1/3 × class frequency to be collected. Thus the fraction of the example gets less and less as it progresses down the rule list. A don’t-care value always
passes the attribute test in full, and thus no fractional example remains to propagate further down the rule list.
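The way a fractional example is absorbed along an ordered rule list can be illustrated as follows. This is a sketch under the assumption that each rule is summarised by the fraction of the incoming example that passes its test; the function name is hypothetical.

```python
from fractions import Fraction

def ordered_weights(pass_fractions):
    """Weight collected by each rule as an example travels down the list.

    pass_fractions[i] is the fraction of the incoming example that passes
    the test of rule i (1 for a don't-care, 1/3 for an unknown value of a
    three-valued attribute).
    """
    remaining = Fraction(1)
    weights = []
    for fraction in pass_fractions:
        absorbed = remaining * fraction   # part of the example this rule takes
        weights.append(absorbed)
        remaining -= absorbed             # only the rest continues down the list
    return weights, remaining
```

With two successive rules that each pass 1/3 of the example, the collected weights are 1/3 and 2/3 × 1/3 = 2/9; with a don't-care in the first rule the whole example is absorbed and nothing propagates.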
Numeric attributes and rules
For numeric attributes, CN2 will partition the values into two subsets and test which subset each example belongs to. The drawback with a naïve implementation of this is that it requires $2^{k-1}$ evaluations, where $k$ is the number of attribute values. Breiman et al. (1984) proved that in the special case where there are two class values it is possible to find an optimal split with $k - 1$ comparisons. In the general case heuristic methods must be used.
The AQ algorithm produces an unordered set of rules, whereas the version of the CN2 algorithm used in StatLog produces an ordered list of rules. Unordered rules are on the whole more comprehensible, but require also that they are qualified with some numeric confidence measure to handle any clashes which may occur. With an ordered list of rules, clashes cannot occur as each rule in the list is considered to have precedence over all subsequent rules.
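The two-class shortcut for subset splits can be sketched as follows: order the attribute values by their proportion of one class and evaluate only the splits between adjacent values in that order. Using Gini impurity as the split criterion is an assumption made for this illustration, as is the requirement that every value has at least one example.

```python
def gini(n0, n1):
    """Two-class Gini impurity of a node with n0 and n1 examples."""
    n = n0 + n1
    return 0.0 if n == 0 else 2.0 * (n0 / n) * (n1 / n)

def best_two_class_split(value_counts):
    """value_counts maps each attribute value to (count_class0, count_class1).

    Returns (best left-hand subset of values, number of splits evaluated).
    """
    # Order values by their proportion of class 1 (ties broken by value).
    ordered = sorted(value_counts,
                     key=lambda v: (value_counts[v][1] / sum(value_counts[v]), v))
    total0 = sum(c0 for c0, _ in value_counts.values())
    total1 = sum(c1 for _, c1 in value_counts.values())
    n = total0 + total1
    best_subset, best_impurity, evaluated = None, None, 0
    left0 = left1 = 0
    for i, value in enumerate(ordered[:-1]):      # only k - 1 candidate splits
        left0 += value_counts[value][0]
        left1 += value_counts[value][1]
        right0, right1 = total0 - left0, total1 - left1
        impurity = ((left0 + left1) * gini(left0, left1)
                    + (right0 + right1) * gini(right0, right1)) / n
        evaluated += 1
        if best_impurity is None or impurity < best_impurity:
            best_subset, best_impurity = set(ordered[:i + 1]), impurity
    return best_subset, evaluated
```

For three values the function evaluates two candidate splits rather than enumerating every binary partition.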
Relation between CN2 and AQ
There are several differences between these two algorithms; however, it is possible to show that strong relationships exist between the two, so much so that simple modifications of the CN2 system can be introduced to enable it to emulate the behaviour of the AQ algorithm. See Michalski & Larson (1978).
AQ searches for rules which are completely consistent with the training data, whereas CN2 may prematurely halt specialisation of a rule when no further rules above a certain threshold of statistical significance can be generated via specialisation. Thus, the behaviour of AQ in this respect is equivalent to setting the threshold to zero.
When generating specialisations of a rule, AQ considers only specialisations which exclude a specific negative example from the coverage of a rule, whereas CN2 considers all specialisations. However, specialisations generated by CN2 which don’t exclude any negative examples will be rejected, as they do not contribute anything to the predictive accuracy of the rule. Thus, the two algorithms search the same space in different ways.
Although published descriptions of AQ leave open the choice of evaluation function to use during search, the published norm is that of “number of correctly classified examples divided by total examples covered”. The original CN2 algorithm uses entropy as its evaluation function. To obtain a synthesis of the two systems, the choice of evaluation function can be user-selected when the system is started.
AQ generates order-independent rules, whereas CN2 generates an ordered list of rules. To modify CN2 to produce order-independent rules requires a change to the evaluation function, and a change to the way examples are removed from the training set between iterations of the complex-finding algorithm. The basic search algorithm remains unchanged.
5.2.8 ITrule
Goodman & Smyth’s (1989) ITrule algorithm uses a function called the J-measure to rank the hypotheses during decision rule construction. Its output is a set of probability rules, which are the most informative selected from the possible rules depending on the training data.
The algorithm iterates through each attribute (including the class attribute) value in turn to build rules. It keeps a ranked list of the $K$ best rules determined to that point of the algorithm execution ($K$ is the size of the beam search). The J-measure of the $K$th rule is used as the running minimum to determine whether a new rule should be inserted into the rule list. For each attribute value the algorithm must find all possible conditions to add to the left hand side of an over-general rule to specialise it, or it may decide to drop a condition to generalise an over-specific rule. The rules considered are those limited by the minimum running J-measure value, which prevents the algorithm from searching a large rule space.
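The ranked list with a running minimum can be sketched with a bounded min-heap. The class name `RuleBeam` and its methods are hypothetical, and rules are represented as plain strings purely for illustration.

```python
import heapq

class RuleBeam:
    """Keeps the K rules with the highest J-measure seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []   # min-heap of (j_value, rule); the root is the K-th best

    def running_minimum(self):
        """J-measure a new rule must exceed once the list is full."""
        return self._heap[0][0] if len(self._heap) == self.k else float("-inf")

    def offer(self, j_value, rule):
        """Insert the rule if it beats the running minimum; report success."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (j_value, rule))
            return True
        if j_value > self._heap[0][0]:
            heapq.heapreplace(self._heap, (j_value, rule))
            return True
        return False

    def ranked(self):
        """Rules ordered from highest to lowest J-measure."""
        return sorted(self._heap, reverse=True)
```

A heap keeps the K-th best value at the root, so the test against the running minimum costs constant time and each accepted insertion costs O(log K).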
Three points should be noted. First, ITrule produces rules for each attribute value, so it can also capture the dependency relationships between attributes, between attributes and classes and between class values. Secondly, ITrule not only specialises existing rules but also generalises them if the need arises. Specialisation is done through adding conditions to the left hand side of the rule and generalisation is done through dropping conditions. Finally, ITrule only deals with categorical examples, so it generally needs to convert numeric attributes into discrete values.
Evaluation function: the J-measure
Let $A$ be an attribute with values in the set $\{a_1, \ldots, a_m\}$ and $B$ be an attribute with values in $\{b_1, \ldots, b_n\}$. The J-measure is a method for calculating the information content of attribute $A$ given the value $B = b$ of attribute $B$. It is
$$J(A \mid B = b) = \sum_{i=1}^{m} p(a_i \mid b) \ln\left(\frac{p(a_i \mid b)}{p(a_i)}\right),$$
where $p(a_i \mid b)$ is the conditional probability of $A = a_i$ given $B = b$ and $p(a_i)$ is the a priori probability of $A = a_i$. These can normally be estimated from the (conditional) relative frequency of the value of $A$. When the distribution is uniform and the data set is sufficient such estimates can be reasonably accurate. The ITrule algorithm uses a maximum entropy estimator:
$$\hat{p} = \frac{\alpha + k + 1}{\alpha + \beta + N + 2},$$
where $\alpha$ and $\beta$ are parameters of an initial density estimate, $k$ is the number of occurrences of the (conditional) event $A = a_i$ in the data and $N$ is the (conditional) total number of $A$. The average information content is therefore defined as:
$$J_{\mathrm{avg}}(A \mid B = b) = \sum_{j=1}^{n} p(b_j)\, J(A \mid B = b_j) = p(b)\, J(A \mid B = b).$$
The above is true because it takes into account the fact that the probabilities of other values of $B$ are zero. The first term $p(b)$ can be interpreted as a measure of the simplicity of the hypothesis that $A$ is related to $B$. The second term $J(A \mid B = b)$ is equal to the cross-entropy of the variable $A$ with the condition “$A$ is dependent on the event $B = b$”. Cross-entropy is known to measure the goodness of fit between two distributions; see Goodman & Smyth (1989).
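The J-measure, its average form and the maximum entropy estimator can be sketched as follows. The function names are illustrative, the natural logarithm is an assumption, and the probabilities are passed in as parallel lists over the values of the target attribute.

```python
import math

def j_measure(p_a_given_b, p_a):
    """Sum over values a_i of p(a_i|b) * ln(p(a_i|b) / p(a_i))."""
    return sum(pc * math.log(pc / pa)
               for pc, pa in zip(p_a_given_b, p_a) if pc > 0)

def average_j(p_b, p_a_given_b, p_a):
    """Average information content: p(b) * J(A | B = b)."""
    return p_b * j_measure(p_a_given_b, p_a)

def max_entropy_estimate(k, n_total, alpha, beta):
    """Estimator (alpha + k + 1) / (alpha + beta + N + 2) for a probability."""
    return (alpha + k + 1) / (alpha + beta + n_total + 2)
```

When the conditional distribution equals the prior, the J-measure is zero, reflecting that knowing the condition adds no information; weighting by `p_b` then favours conditions that also occur often.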
Rule searching strategy
ITrule performs both generalisation and specialisation. It starts with a model driven strategy much like CN2. But its rules all have probability attached from the beginning. So a universal rule will be
Sec. 5.3] Beyond the complexity barrier 79 IfÛàà { ÊDk ëÌiÊDkÉ Then { ú Ã.ÃYk'Ì { à¶¹ÉDÉ with probability f
To specialise a rule such as one with current
§ -value § Ó0:<; c¶ ¼ Ö Ö g it calculates the all possible values of§
Ó0:<;2c¶ ¼ Ö