CHAPTER VI. CONCLUSIONS
5. Conclusions
We used the proposed False-Positives criterion to construct decision trees both from a real-world data set and from test databases supplied by the UCI Machine Learning Repository [1]. The real-world data set comes from the domain of Molecular Biology, specifically DNA sequence analysis, namely the “promoter recognition” problem [15], with 106 instances described by 58 attributes. Note that DNA sequence analysis problems are used as benchmarks for comparing the performance of learning systems. The test databases are the “Mushrooms”, “MONK” and “1984 United States Congressional Voting Records” data sets supplied by [1]. The “Mushrooms” data set, having 8124 instances described by 23 attributes, classifies mushrooms as poisonous or edible,
180 Basilis Boutsinas and Ioannis X. Tsekouronas
in terms of their physical characteristics. The MONK’s Problem data set, having 432 instances described by 8 attributes, describes an artificial domain over the same attribute space. It was the basis of a first international comparison of learning algorithms. The “1984 United States Congressional Voting Records” data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the Congressional Quarterly Almanac.
We compared the results obtained by the False-Positives criterion with those obtained by the Gain criterion. A first measure of comparison is the number of times that the two criteria agree on which attribute to choose. Instead of calculating this measure using a great number of different data sets, we used 100 different subsets of each of the “promoter recognition” and “Mushrooms” data sets. All subsets of a given data set have the same size, which is carefully chosen. If the size is too large, both criteria always choose the same attribute. If it is too small, no meaningful classification accuracy can be observed. The size of the subsets of the “promoter recognition” and “Mushrooms” data sets is chosen to be 64 and 25, respectively. The two criteria agree in 62% of cases for the “promoter recognition” data set and in 77% of cases for the “Mushrooms” data set.
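The agreement measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `agreement_rate`, `select_a` and `select_b` are hypothetical names, each selector is assumed to be a function that takes the instances and class labels of a subset and returns the index of the attribute it would choose, and the subsets are drawn by simple random sampling, which does not guarantee that all of them are distinct.

```python
import random


def agreement_rate(rows, labels, select_a, select_b,
                   subset_size, n_subsets=100, seed=0):
    """Fraction of random subsets on which two criteria pick the same attribute.

    select_a / select_b are hypothetical callables:
    (rows, labels) -> index of the chosen attribute.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_subsets):
        # Draw one subset of fixed size without replacement.
        chosen = rng.sample(range(len(rows)), subset_size)
        sub_rows = [rows[i] for i in chosen]
        sub_labels = [labels[i] for i in chosen]
        if select_a(sub_rows, sub_labels) == select_b(sub_rows, sub_labels):
            agree += 1
    return agree / n_subsets
```

With selectors for the Gain and False-Positives criteria plugged in, a call such as `agreement_rate(rows, labels, gain_select, fp_select, subset_size=64)` would correspond to the setting used for the “promoter recognition” data set; `gain_select` and `fp_select` are placeholders, not functions defined here.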
A second measure of comparison concerns the classification accuracy of decision trees obtained by the two criteria. Using four different subsets of each of the above data sets, we built four decision trees for each one of the two criteria. We used the standard steps of the ID3 algorithm to build the decision trees:
Step 1 Select the best test/attribute as the root. Make branches for all different values the selected test/attribute can have;
Step 2 If all instances at a particular leaf node belong to the same class, this leaf node is labelled with this class. If all leaves are labelled with a class, the algorithm terminates;
Step 3 Otherwise, the node is labelled with the best test/attribute that does not occur on the path to the root. Make branches for all different values the selected test/attribute can have. Continue with Step 2.
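The three steps above can be sketched recursively as follows. This is a minimal illustration, not the implementation used in the experiments: the names `id3`, `gain` and `classify` are chosen here for convenience, the selection criterion is passed in as a function so that either criterion could be plugged in, the default is the usual entropy-based information gain, and the majority-class fallback for a node with no attributes left is an assumption that the steps do not cover.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Information gain obtained by splitting on attribute index `attr`."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder


def id3(rows, labels, attrs, score=gain):
    """Return a class label (leaf) or a pair (attribute index, branches)."""
    # Step 2: a node whose instances share one class becomes a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption (not part of the steps): majority class if no attribute is left.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1 and 3: pick the best attribute not yet used on this path.
    best = max(attrs, key=lambda a: score(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = id3([r for r, _ in pairs], [l for _, l in pairs],
                              remaining, score)
    return (best, branches)


def classify(tree, row):
    """Follow branches until a leaf; raises KeyError on an unseen value."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree
```

Class labels are assumed to be non-tuple values such as strings, since tuples are used to represent internal nodes.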
The best test/attribute is selected according to the Gain and False-Positives criteria. As far as the False-Positives criterion is concerned, given as input the instances included in a node, the algorithm of the previous section:
1. enumerates the positive and negative instances for each possible value of every attribute;
2. calculates the true-positive (TP) and the false-positive (FP) values for each possible value of every attribute by setting TP and FP to the maximum of the numbers of positive and negative instances, respectively;
3. calculates the NA value, through the calculation of NV = TP - FP and then the sum of the NV values of every possible value of each attribute. It also calculates the MaxNA and MaxNV values;
4. calculates the FPC (False Positive Criterion) value for each attribute;
5. outputs the test/attribute with the highest FPC value.
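The counting part of this procedure (enumeration, TP/FP, NV and NA) can be sketched as below. The sketch rests on one loudly stated assumption: for each attribute value, TP is read as the larger and FP as the smaller of the positive and negative counts. That is one possible interpretation of the description, not a confirmed definition. The function name `nv_na` and the `positive` parameter are hypothetical, and the FPC formula itself, together with MaxNA and MaxNV, is defined in the previous section and is not reproduced here.

```python
from collections import defaultdict


def nv_na(rows, labels, attr, positive="+"):
    """Return ({value: NV}, NA) for the attribute at index `attr`."""
    # Enumerate positive and negative instances per attribute value.
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, negatives]
    for row, label in zip(rows, labels):
        counts[row[attr]][0 if label == positive else 1] += 1

    nv = {}
    for value, (pos, neg) in counts.items():
        # ASSUMPTION: TP is the larger count and FP the smaller one.
        tp, fp = max(pos, neg), min(pos, neg)
        nv[value] = tp - fp  # NV = TP - FP
    # NA is the sum of the NV values over all values of the attribute.
    return nv, sum(nv.values())
```

Under this reading, an attribute whose values separate the classes perfectly gets an NA equal to the number of instances, while an attribute whose values are evenly mixed gets an NA of zero.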
Splitting Data in Decision Trees Using the New False-Positives Criterion 181
Then we measured the classification accuracy obtained by each of the eight decision trees on the same test set. The mean classification accuracy for each criterion applied to the four data sets is shown in Table 1.
5 Conclusion
We presented a new splitting criterion for constructing Decision Trees, the False-Positives criterion. Based on the presented experimental tests, we can conclude that the proposed criterion is almost as accurate as the Gain criterion. Thus, as far as classification accuracy is concerned, the proposed False-Positives criterion is only a minimal improvement over the well-known Gain criterion.
The time complexity of selecting the best test/attribute using the Gain criterion is O(XN), where X is the number of instances and N is the number of attributes in the training set. The step of selecting the best test/attribute using the proposed criterion has the same time complexity. Also, in order to build a decision tree, there are calculations, in the worst case, where B is the maximum number of possible values for an attribute. However, in the case of the Gain criterion, these calculations are logarithmic calculations. Since [14], the Gain criterion is based on extended logarithmic calculations, especially when deep decision trees are going to be constructed. Thus, the proposed criterion is cheaper to compute in practice, owing to the lack of logarithmic calculations.
We are currently working on extending the proposed criterion in order to remove its preference for tests/attributes with a greater number of possible values, analogously to the Gain ratio criterion. Of course, one can always attack this problem by removing such attributes in a preprocessing phase.
We are also investigating the types of training data for which the proposed criterion outperforms other criteria. Thus, we can identify certain types of applications where the proposed criterion can be used instead of other known criteria.
We are also working on improving the proposed criterion through a heuristic improvement of its parameters, such as the Countzero parameter. However, although such heuristics may improve the accuracy, the basic strategy presented in this paper performs well on numerous different data sets.
References
1. Blake C. L., Merz C. J.: UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science.
2. Boutsinas B., Vrahatis M. N.: Artificial Nonmonotonic Neural Networks. Artif. Intelligence, Elsevier Science Publishers B.V., 132(1), (2001) 1–38.
3. Breiman L., Friedman J. H., Olshen R. A., Stone C. J.: Classification and Regression Trees. Wadsworth & Brooks, Calif., (1984).
4. Buntine W.: Graphical Models for Discovering Knowledge. In Fayyad U. M., Piatetsky-Shapiro G. and Smyth P., editors, Advances in Knowledge Discovery and Data Mining, (1996) 59–82.
5. Clark P., Niblett T.: The CN2 Induction Algorithm. Machine Learning, 3(4), (1989) 261–283.
6. Cost S., Salzberg S.: A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10, (1993) 57–78.
7. Dzeroski S.: Inductive Logic Programming and Knowledge Discovery in Databases. In Fayyad U. M., Piatetsky-Shapiro G. and Smyth P., editors, Advances in Knowledge Discovery and Data Mining, (1996) 117–152.
8. Fayyad U. M., Piatetsky-Shapiro G., Smyth P.: Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, (1996).
9. Friedman J. H.: Multivariate Adaptive Regression Splines. Annals of Statistics, 19, (1991) 1–141.
10. Muggleton S.: Inductive Logic Programming, vol. 38 of A.P.I.C. series, Academic Press, London, (1992).
11. Quinlan J. R.: Induction of Decision Trees. Machine Learning, 1, (1986) 81–106.
12. Quinlan J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, CA, (1993).
13. Rumelhart D. E., Hinton G. E., Williams R. J.: Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, (1986) 318–363.
14. Utgoff P. E.: Incremental Induction of Decision Trees. Machine Learning, 4, (1989) 161–186.
15. Watson J. D., Hopkins N. H., Roberts J. W., Steitz J. A., Weiner A. M.: Molecular Biology of the Gene. Benjamin Cummings: Menlo Park, 1, (1987).