Heuristics: Evolving Teams of Local Bayesian Learners
Jorge Muruzábal
Universidad Rey Juan Carlos, Spain
Copyright © 2002, Idea Group Publishing.
Evolutionary algorithms are by now well known and appreciated in a number of disciplines, including the emerging field of data mining. In the last couple of decades, Bayesian learning has also experienced enormous growth in the statistical literature. An interesting question concerns the possible synergetic effects between Bayesian and evolutionary ideas, particularly with an eye to large-sample applications. This chapter presents a new approach to classification based on the integration of a simple local Bayesian engine within the learning classifier system rule-based architecture. The new algorithm maintains and evolves a population of classification rules which individually learn to make better predictions on the basis of the data they get to observe. A certain reinforcement policy ensures that adequate teams of these learning rules are available in the population for every single input of interest. Links with related algorithms are established, and experimental results suggesting the parsimony, stability and usefulness of the approach are discussed.
INTRODUCTION
Evolutionary algorithms (EAs) are characterized by the long-run simulation of a population of functional individuals which undergo processes of creation, selection, deployment, evaluation, recombination and deletion. There exists today a fairly wide variety of EAs that have been tested and theoretically investigated (see, e.g., Banzhaf, Daida, Eiben, Garzon, Honavar, Jakiela & Smith, 1999). One of the most interesting and possibly least explored classes of EAs refers to the learning classifier system (LCS) architecture (Holland, 1986; Holland, Holyoak, Nisbett & Thagard, 1986). In this chapter we shall be concerned with a new LCS algorithm (called BYPASS) for rule-based classification. Classification is indeed a most relevant problem in the emerging data mining (DM) arena (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Freitas, 1999), and many issues still require further investigation (Michie, Spiegelhalter & Taylor, 1994; Weiss & Indurkhya, 1998).
In recent years, the sustained growth and affordability of computing power have had a tremendous impact on the wide applicability of Bayesian learning (BL) methods and algorithms. As a result, a number of solid computational frameworks have already flourished in the statistics and DM literature (see, for example, Buntine, 1996; Cheeseman & Stutz, 1996; Gilks, Richardson & Spiegelhalter, 1996; Heckerman, 1996). It seems fair to say that more are on their way (Chipman, George & McCulloch, 1998; Denison, Adams, Holmes & Hand, 2000; Tirri, Kontkanen, Lahtinen & Myllymäki, 2000; Tresp, 2000). BL approaches establish some prior distribution on a certain space of structures, and inferences are based on the posterior distribution arising from this prior distribution and the assumed likelihood for the training data. Predictive distributions for unseen cases can sometimes be computed on the fly, and all these distributions may be coherently updated as new data are collected.
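The prior-to-posterior cycle just described can be illustrated with a minimal sketch. It assumes a conjugate Dirichlet-multinomial model over class labels, chosen here only because the update is a simple count addition; the function name `posterior_predictive` and the label names are hypothetical and are not taken from the chapter.

```python
from collections import Counter

def posterior_predictive(prior_counts, observed_labels):
    """Dirichlet-multinomial update (an illustrative assumption, not the chapter's model).

    prior_counts maps each label to its Dirichlet pseudo-count.
    Returns the posterior pseudo-counts and the predictive probability of each label.
    """
    counts = Counter(observed_labels)
    # Conjugacy: the posterior parameters are the prior parameters plus the observed counts.
    posterior = {label: alpha + counts.get(label, 0)
                 for label, alpha in prior_counts.items()}
    total = sum(posterior.values())
    # The predictive probability of the next label is its normalized posterior count.
    predictive = {label: alpha / total for label, alpha in posterior.items()}
    return posterior, predictive

# Uniform prior over three hypothetical labels, then three observations.
prior = {"A": 1.0, "B": 1.0, "C": 1.0}
posterior, predictive = posterior_predictive(prior, ["A", "A", "B"])

# Coherent updating: processing the same data in two batches gives the same posterior.
step1, _ = posterior_predictive(prior, ["A"])
step2, _ = posterior_predictive(step1, ["A", "B"])
```

Under this assumed model the predictive distribution is available in closed form at any time, which is the sense in which predictions can be computed on the fly.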
Given the wide diversity of these paradigms, it is not surprising that synergetic effects between BL and EAs have been naturally explored along several directions. To begin with, and perhaps most obviously, many EAs are function optimizers, hence they can be used to tackle the direct maximization of the posterior distribution of interest (Franconi & Jennison, 1997). On the other hand, the Bayesian optimization algorithm (Pelikan, Goldberg & Cantú-Paz, 1999) and other estimation of distribution algorithms (see, for example, the rule-oriented approach in Sierra, Jiménez, Inza, Muruzábal & Larrañaga, 2001) replace the traditional operators in EAs (crossover and mutation) with probabilistic structures capturing interdependencies among all problem variables. These structures are simulated to yield new individuals, some of which are used to build a more refined model for the next generation (see also Zhang, 2000).
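The sample-select-re-estimate loop of an estimation of distribution algorithm can be sketched as follows. This is a deliberately simplified stand-in: it uses independent bit-wise marginals (the univariate scheme usually called UMDA) rather than the Bayesian network of the Bayesian optimization algorithm, so it does not capture interdependencies among variables. The OneMax fitness (number of ones), the function name `umda_onemax` and all parameter values are assumptions made for illustration.

```python
import random

def umda_onemax(n_bits=20, pop_size=60, n_select=30, generations=40, seed=0):
    """Univariate EDA on the OneMax toy problem (illustrative assumptions throughout)."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits          # initial model: every bit is 1 with probability 0.5
    best = None
    for _ in range(generations):
        # Simulate the current model to obtain new individuals (no crossover or mutation).
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # Truncation selection: keep the individuals with the most ones.
        population.sort(key=sum, reverse=True)
        selected = population[:n_select]
        if best is None or sum(population[0]) > sum(best):
            best = population[0]
        # Re-estimate the marginals from the selected individuals,
        # clamped away from 0 and 1 so that no bit value is lost for good.
        probs = [min(0.95, max(0.05, sum(ind[i] for ind in selected) / n_select))
                 for i in range(n_bits)]
    return best, probs

best, probs = umda_onemax()
```

Replacing the independent marginals with a model that is re-learned from the selected individuals each generation, such as a Bayesian network, would give the interdependency-aware variant mentioned in the text.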
By way of contrast, the BYPASS approach does not attempt to formulate any global model of either the variables or the population of classification rules. Examples of BL global models (and algorithms) for tree and Voronoi tessellation sets of rules are given in Chipman et al. (1998), Denison et al. (2000) and Paass and Kindermann (1998). All these models are either anchored on non-overlapping rules, or maintain the familiar (exhaustive) tree-based interpretation, or both. The BL in BYPASS is essentially local in that it only affects individual rules in the population. As in other LCSs, the (self-)organization of the population relies entirely on the sequential reinforcement policy. In the case of BYPASS, however, reinforcement is in turn tightly linked to (individual) predictive scoring, which improves substantially when previous experience is adequately reflected via BL. This is the type of synergy or cross-fertilization explored in this chapter.
The attempted organization process in BYPASS is based as usual on performance by the LCS’s match set, a subset (team or committee) of rules available for any given input. Organization of these teams may turn out to be relatively difficult depending on a variety of features including the shape and relative position of the input regions associated with output labels, as well as the appropriateness and complexity of the selected representation scheme and reinforcement policy. The dynamics of the LCS algorithm is typically quite rich: useful new rules have to find their place among other (partially overlapping) rules; tentative, poor rules add noise to the performance subsystem (which in turn affects the reinforcement process); the contribution of the typical rule discovery engine, a genetic algorithm, is hard to tune up, and so on.
These aspects make the BYPASS approach very different from other approaches based on committees of rules as, for example, Breiman’s bagging trees (1996) (see also Domingos, 1997). Standard classification trees (Breiman, Friedman, Olshen & Stone, 1984) constitute a natural reference for BYPASS since they often provide rules which are essentially identical in structure. In the bagging trees algorithm, a basic tree learner is used repeatedly to provide a number of parallel, independently generated solutions, that is, no information between these solutions is exchanged during the building phase. These single-tree-based predictions are combined via majority voting to yield improved performance. The present LCS approach tries to extend this framework by letting the interactions between rules within teams be a critical part of the overall organization process.
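The bagging scheme described above (independent learners fitted on bootstrap resamples, then combined by majority vote) can be sketched in a few lines. As a simplifying assumption, a one-split stump on a single numeric input stands in for a full classification tree; the helper names `fit_stump` and `bagging_predict` and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split stump on (x, label) pairs by choosing the threshold with fewest errors."""
    best = None
    for threshold, _ in sample:
        left = [y for x, y in sample if x <= threshold]    # never empty: includes this point
        right = [y for x, y in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagging_predict(data, x, n_models=25, seed=0):
    """Majority vote over stumps, each fitted independently on a bootstrap resample."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Sample with replacement; no information is shared between the fitted models.
        bootstrap = [rng.choice(data) for _ in data]
        votes.append(fit_stump(bootstrap)(x))
    return Counter(votes).most_common(1)[0][0]

# Toy one-dimensional data with a class boundary between 0.4 and 0.6.
data = [(0.1, "neg"), (0.2, "neg"), (0.3, "neg"), (0.4, "neg"),
        (0.6, "pos"), (0.7, "pos"), (0.8, "pos"), (0.9, "pos")]
low_vote = bagging_predict(data, 0.15)
high_vote = bagging_predict(data, 0.85)
```

Each stump here is built in isolation, which is the feature that distinguishes bagging from an approach in which rules interact within teams during learning.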
Like other LCSs (Wilson, 2000), the BYPASS approach presents many appealing features from the DM perspective. The amount of memory required by the system depends on the size of the population, so huge data sets can be tentatively analyzed via reduced sets of general, long-reaching rules. LCSs are rather autonomous and can be left alone with the data for a while; when performance is deemed appropriate, they can be halted and useful populations extracted and post-processed, or else populations can be merged and reinitialized. The induced probabilistic rules are easy to interpret and allow for characterization of broad regularities. LCSs can also benefit from parallel implementations facilitating faster learning when needed. Last but not least, the architecture is open-ended, so further heuristics can be incorporated in the future.
This chapter reviews the BYPASS algorithm and illustrates its performance in both artificial and real data sets of moderate dimensionality. It is shown that the system can indeed self-organize and thus synthesize populations of manageable size, notable generality, and reasonable predictive power. These empirical results are put in perspective with respect to the tree representation and bagging trees method. The organization is as follows. The second section reviews some relevant