In this chapter, we have established nugget discovery as a data mining task in its own right and shown that it must be guided by a suitable measure of interest. We have examined the properties of a nugget that such a measure should capture and presented a partial ordering of rules that it should enforce. We have then reviewed the measures of interest traditionally used by classification algorithms (complete or partial) and established that many do not uphold the partial ordering, so they are not suitable for ordering and selecting nuggets. Finally, we have established the fitness measure, which upholds the partial ordering and allows the search criteria to be varied towards more accurate or more general rules.
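The interplay between a partial ordering of rules and a tunable measure of interest can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the definition used in this chapter: the `Rule` fields, the dominance test (at least as accurate and at least as general, and strictly better in one respect), and the hypothetical score `lam * correct - covered`, in which the parameter `lam` shifts the preference between accurate and general rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    covered: int  # records matching the rule's condition (assumed > 0)
    correct: int  # covered records that also belong to the target class


def dominates(r1: Rule, r2: Rule) -> bool:
    """Assumed partial ordering: r1 is at least as accurate and at least
    as general as r2, and strictly better in one of the two respects."""
    # Compare accuracies correct/covered by cross-multiplication,
    # which avoids floating-point equality tests.
    acc_cmp = r1.correct * r2.covered - r2.correct * r1.covered
    gen_cmp = r1.correct - r2.correct
    return acc_cmp >= 0 and gen_cmp >= 0 and (acc_cmp > 0 or gen_cmp > 0)


def score(rule: Rule, lam: float) -> float:
    """Hypothetical measure of interest; a larger lam favours generality."""
    return lam * rule.correct - rule.covered


def best_nugget(rules: list[Rule], lam: float) -> Rule:
    """Pick the highest-scoring rule from a set produced by any algorithm."""
    return max(rules, key=lambda r: score(r, lam))


general = Rule("general", covered=200, correct=150)   # accuracy 0.75
accurate = Rule("accurate", covered=50, correct=48)   # accuracy 0.96
rules = [general, accurate]

print(best_nugget(rules, lam=1.2).name)  # low lam: the accurate rule wins
print(best_nugget(rules, lam=2.0).name)  # high lam: the general rule wins
```

Neither of the two example rules dominates the other, which is exactly the situation in which a parameterised measure is needed to express a preference.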
The heuristic algorithms for nugget discovery have been reviewed. Other algorithms, which can also be used for this task, were briefly introduced. They use the different measures of interest reviewed to produce complete or partial classifications.
Using four databases from the UCI repository, nugget discovery has been performed with all the algorithms presented. In each case, the fitness measure has been used to choose the most interesting nuggets from the set of rules produced by the algorithm. Examination of the nuggets obtained has established that algorithms such as C5, Brute and KnowledgeSEEKER are capable of obtaining good nuggets when their own guiding criteria direct the search and the fitness measure is then used to choose the most interesting nugget from the resulting set. The heuristics have produced the best overall results and, since the delivery of interesting nuggets is itself their guiding search criterion, they are the best choice for this task.
Tabu Search has shown great potential, and more sophisticated implementations of Tabu Search for data mining must be the focus of future research. Some further work, in particular using the Simulated Annealing algorithm, is also being implemented in the commercial data mining toolkit Datalamp (Howard, 1999) (http://www.datalamp.com). Adaptation of the techniques to databases with many missing values is also an area of research for the future. Algorithms that search for all conjunctive rules that are best according to some criteria of accuracy and applicability (Bayardo & Agrawal, 1999) have been proposed for categorical data. This is a promising area of research, which would also provide a good benchmark to analyse the performance of the heuristic algorithms, so some of the research efforts in our group have been directed to an all-rules search algorithm.
REFERENCES
Agrawal, R., Imielinski, T., and Swami, A. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914-925.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, I. (1996). Fast discovery of association rules. In Fayyad et al., 307-328.
Ali, K., Manganaris, S., and Srikant, R. (1997). Partial classification using association rules. In Heckerman et al., 115-118.
Auer, P., Holte, R., and Maass, W. (1995). Theory and application of agnostic PAC-learning with small decision trees. In Prieditis and Russell, 21-29.
Bayardo, R. J. (1997). Brute force mining of high-confidence classification rules. In Heckerman et al., 123-126.
Bayardo, R. J. and Agrawal, R. (1999). Mining the most interesting rules. In Chaudhuri and Madigan, 145-154.
Bayardo, R. J., Agrawal, R., and Gunopulos, D. (1999). Constraint-based rule mining in large, dense datasets. In Proc. of the 15th Int. Conf. On Data Engineering, 188-197.
Biggs, D., de Ville, B., and Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18(1), 49-62.
Brin, S., Rastogi, R., and Shim, K. (1999). Mining optimized gain rules for numeric
Clark, P. C. and Boswell, R. (1991). Rule induction with CN2: Some recent improvements. In Y. Kodratoff (Ed.), Machine Learning – Proc. of the Fifth European Conf. Berlin: Springer-Verlag, 151-163.
Clark, P. C. and Niblett, T. N. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261-283.
Chaudhuri, S. and Madigan, D. (Ed.) (1999). Proceedings of the 5th ACM SIGKDD Int. Conf. On Knowledge Discovery and Data Mining. New York: ACM.
Cohen, W. W. (1995). Fast effective rule induction. In Prieditis and Russell, 115-123.
de la Iglesia, B. (2001). The development and application of heuristic techniques for the data mining task of nugget discovery. PhD Thesis, University of East Anglia.
de la Iglesia, B., Debuse, J. C. W. and Rayward-Smith V. J. (1996). Discovering knowledge in commercial databases using modern heuristic techniques. In E. Simoudis, J. W. Han, and U. M. Fayyad (Ed.). Proceeding of the Second Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press, 44-49.
de Ville, B. (1990). Applying statistical knowledge to database analysis and knowledge base construction. In Proc. Of the 6th IEEE Conf. On Artificial Intelligence Applications. Washington: IEEE Computer Society, 30-36.
Debuse, J. C. W., de la Iglesia, B., Howard, C. M., and Rayward-Smith, V.J. (2000). Building the KDD Roadmap: A Methodology for Knowledge Discovery. In R. Roy (Ed.). Industrial Knowledge Management, London: Springer-Verlag, 170-196.
Domingos, P. (1995). Rule induction and instance-based learning: A unified approach. In Proc. Of the 14th Int. Joint Conf. on Artificial Intelligence.
Domingos, P. (1996). From instances to rules: A comparison of biases. In Proc. Of the 3rd Int. Workshop on Multistrategy Learning, 147-154.
Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. (1996). From Data Mining to Knowledge Discovery: An overview. In Fayyad et al., 1-34.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Ed.) (1996). Advances in Knowledge Discovery and Data Mining. California: AAAI Press/MIT Press.
Frank, E. and Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Proc. Of the Int. Conf. on Machine Learning. Morgan Kaufmann, 144-151.
Fukuda, T., Morimoto, Y., Morishita, S. and Tokuyama, T. (1996). Data mining using two-dimensional optimized association rules: schemes, algorithms and visualisation. In Proc. Of the ACM SIGMOD Conference on Management of Data, 3-26.
Heckerman, D., Mannila, H., Pregibon, D. and Uthurusamy, R. (Eds) (1997). Proceedings of the Third Int. Conf. on Knowledge Discovery and Data Mining. California: AAAI Press.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1), 63-91.
Howard, C. M. (1999). DMEngine Class Reference. SYS Technical Report SYS-C99-03, University of East Anglia.
International Business Machines. (1997). IBM Intelligent Miner. User’s Guide, Version 1, Release 1.
Liu, B., Hsu, W. and Ma, Y. (1998). Integrating classification and association rule mining. In Agrawal, R. and Stolorz, P. (Ed.). Proceedings of the Fourth Int. Conf. On Knowledge Discovery and Data Mining. California: AAAI Press, 80-86.
Lundy, M. and Mees, A. (1986). Convergence of an annealing algorithm. Mathematical Programming, 34, 111-124.
Discovery. In Expert Systems XI.
Mann, J. W. (1996). X-SAmson v1.5 developers manual. School of Information Systems Technical Report, University of East Anglia, UK.
Merz, C. J. and Murphy, P. M. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Morimoto, Y., Fukuda, T., Matsuzawa, H., Tokuyama, T. and Yoda, K. (1998). Algorithms for mining association rules for binary segmentations of huge categorical databases. In Proc. Of the 24th Very Large Data Bases conference, 380-391.
Nevill-Manning, C., Holmes, G., and Witten, I. H. (1995). The development of Holte’s 1R classifier. In Proc. Artificial Neural Networks and Expert Systems, Dunedin, NZ, 239-242.
Piatetsky-Shapiro, G. (1991). Discovery, Analysis, and Presentation of Strong Rules. In Knowledge Discovery in Databases, (Chapter 13). California: AAAI/MIT Press.
Prieditis, A. and Russell, S. (Ed.) (1995). Proc. Of the 12th International Conf. On Machine Learning. Tahoe City, CA: Morgan Kaufmann Publishers, Inc.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. Reprinted in J. W. Shavlik and T. G. Dietterich (Ed.), Readings in Machine Learning. San Mateo, CA: Morgan Kaufmann (1991). Reprinted in B. G. Buchanan and D. Wilkins (Ed.), Readings in Knowledge Acquisition and Learning. San Mateo, CA: Morgan Kaufmann (1992).
Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27, 221-234.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Rastogi, R. and Shim, K. (1998). Mining optimised association rules with categorical and numeric attributes. In Proc. Of the 14th Int. Conf. On Data Engineering, 503-512.
Rayward-Smith, V., Debuse, J., and de la Iglesia, B. (1995). Using a Genetic Algorithm to data mine in the financial services sector. In Macintosh, A. and Cooper, C. (Ed.). Applications and innovations in Expert Systems III. SGES Publications, 237-252.
Riddle, P., Segal, R., and Etzioni, O. (1994). Representation design and brute-force induction in a Boeing manufacturing domain. Applied Artificial Intelligence, 8, 125-147.
Smith, G. D. and Mann, J. W. (1994). Gameter: A genetic algorithm in X. In Proceedings of the 5th Annual EXUG Conference.
Smyth, P. and Goodman, R. M. (1992). An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4, 301-316.