
Practical issues in learning decision and regression trees include determining how deeply to grow the tree and choosing the appropriate splitting criterion. The former is strongly connected to handling noisy data and improving computational efficiency. Below we discuss these issues from the perspective of learning from data streams.

3.3.1 Stopping Decisions

One of the biggest issues in learning decision and regression trees is the problem of overfitting the training examples. Tom Mitchell defines overfitting in his book as follows:

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller error than h over the entire distribution of instances.
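The definition can be restated compactly in symbols. The notation below is an assumption introduced here for illustration: error_train denotes the error measured on the training examples, error_D the error over the entire distribution of instances D, and h′ the alternative hypothesis.

```latex
\[
h \text{ overfits the training data}
\iff
\exists\, h' \in H :\;
\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
\;\text{ and }\;
\mathrm{error}_{\mathcal{D}}(h') < \mathrm{error}_{\mathcal{D}}(h)
\]
```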

There are two main reasons why overfitting may occur. The first reason is noise in the data. Growing each branch of the tree in order to fit the training examples perfectly, i.e., with zero error, would actually mean trying to perfectly fit the noise in the training examples. The second reason is a lack of data: overfitting is possible even when the training examples are noise-free, and might happen when only a small number of examples are associated with a leaf node. In this case, it is quite possible for coincidental regularities to occur, under which a particular attribute happens to partition the examples very well although there is no real connection between that attribute and the target function.

There are several approaches to avoiding overfitting, grouped in two categories: approaches that stop growing the tree before all the data is perfectly partitioned, and approaches that allow the tree to overfit the data and then post-prune it. Researchers have debated extensively which of these approaches is more successful, but there seems to be no strong evidence in favor of either of them. Breiman et al. (1984) and Quinlan (1993) suggest that it is not appropriate to use stopping (i.e., pre-pruning) criteria during tree growth, because this influences the quality of the predictor. Instead, they recommend first growing the tree and pruning it afterwards. C4.5 includes three post-pruning algorithms, namely reduced error pruning, pessimistic error pruning and error based pruning. CART also includes a post-pruning algorithm known as cost complexity pruning.

On the other hand, Hothorn et al. (2006) suggest that using significance tests as stopping criteria (i.e., pre-pruning) could generate trees that are similar to optimally pruned trees. In their algorithm CTREES, the stopping decision is combined with the split selection decision, and is formulated as hypothesis testing. In particular, the most significant attribute for splitting is found by first rejecting the null hypothesis H0 that there is no relationship between the input attributes and the output attribute. A linear statistic for a permutation test is first calculated for all predictor variables. The null hypothesis is rejected when the minimum of the adjusted (by a Bonferroni multiplier) P-values is lower than a pre-specified threshold. If the null hypothesis is not rejected, the tree stops growing; otherwise the best splitting attribute is suggested.
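The combined stopping and selection decision described above can be sketched as follows. This is a simplified illustration, not the CTREES implementation: it estimates each P-value by Monte Carlo permutation of the target rather than from the conditional distribution used by Hothorn et al. (2006), and the function names, the centered linear statistic, and the default threshold `alpha=0.05` are assumptions made for this example.

```python
import random


def permutation_p_value(x, y, n_permutations=999, rng=None):
    """Monte Carlo P-value for the centered linear statistic |sum((x_i - mean(x)) * y_i)|."""
    rng = rng or random.Random(0)
    mean_x = sum(x) / len(x)
    # Centering x gives the statistic zero expectation under random permutation of y.
    x_centered = [xi - mean_x for xi in x]
    observed = abs(sum(a * b for a, b in zip(x_centered, y)))
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)
        permuted = abs(sum(a * b for a, b in zip(x_centered, y_perm)))
        if permuted >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


def select_split_attribute(columns, y, alpha=0.05, n_permutations=999, seed=0):
    """Return (best_attribute_or_None, adjusted_p_values).

    `columns` maps attribute names to lists of numeric values.
    None means the null hypothesis was not rejected, so the tree stops growing.
    """
    rng = random.Random(seed)
    m = len(columns)
    adjusted = {}
    for name, values in columns.items():
        p = permutation_p_value(values, y, n_permutations, rng)
        adjusted[name] = min(1.0, m * p)  # Bonferroni multiplier: number of attributes
    best = min(adjusted, key=adjusted.get)
    if adjusted[best] < alpha:
        return best, adjusted
    return None, adjusted


if __name__ == "__main__":
    y = [float(i) for i in range(40)]
    columns = {
        "signal": [2.0 * v for v in y],                  # strongly related to y
        "noise": [float((7 * i) % 5) for i in range(40)],  # periodic, unrelated to y
    }
    best, adjusted = select_split_attribute(columns, y)
    print(best, adjusted)
```

When only the unrelated attribute is offered, the adjusted P-value stays well above the threshold and the function returns `None`, which corresponds to stopping tree growth at that node.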

Regardless of which approach is used, a key question is which criterion should be used to determine the correct final tree size. From the perspective of learning from data streams it is difficult to think of a final size of the induced decision or regression tree, since the process is rather dynamic, and the evaluation phase is interleaved with the learning phase. The fact that there is an abundance of training data makes it highly improbable that overfitting can occur due to modeling coincidental regularities. Thus, the only reason for overfitting is the possible existence of noise in the data. Under the assumption of a stationary data distribution D, the standard criteria typically used in practice are as follows:

Decision Trees, Regression Trees and Variants 29

• Training and validation set approach: The validation set is assumed unlikely to exhibit the same random fluctuations as the training set and is thus expected to provide a reliable estimate of the predictive accuracy. It is, however, important that the validation set is large enough itself, in order to represent a statistically significant sample of the instances.

• Statistical test validation approach: A statistical test is applied to estimate how likely it is that further expanding the tree would improve the accuracy over the entire instance distribution (e.g., a chi-square test, as in Quinlan (1986)). Naturally, these estimates can only be validated in hindsight.

• Minimum Description Length approach: Based on a measure of the complexity for encoding the training examples and the decision (regression) tree, the tree growth is stopped when this encoding size is minimized. The main assumption is that the maximum a posteriori (MAP) hypothesis can be produced by minimizing the description length of the hypothesis plus the description length of the data given the hypothesis, if an optimal coding for the data and the hypothesis can be chosen.
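As an illustration of the statistical test validation approach, the sketch below computes Pearson's chi-square statistic for a candidate split from a contingency table of class counts per branch, and expands the node only if the statistic exceeds a critical value. The function names are hypothetical and this is not Quinlan's implementation. The default critical value 3.841 corresponds to one degree of freedom (a two-branch, two-class table) at the 0.05 significance level; for other table shapes the caller is assumed to supply the appropriate value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table.

    `table[i][j]` is the number of examples in branch i that belong to class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if branch membership and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


def should_expand(table, critical_value=3.841):
    """Expand the node only if the split is significantly related to the class."""
    return chi_square_statistic(table) > critical_value


if __name__ == "__main__":
    informative = [[30, 10], [10, 30]]    # branches separate the classes well
    uninformative = [[21, 19], [19, 21]]  # branches look almost identical
    print(chi_square_statistic(informative), should_expand(informative))
    print(chi_square_statistic(uninformative), should_expand(uninformative))
```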

Among the existing approaches and criteria for constraining tree growth, and considering that it is difficult to obtain a separate testing set for validation in the streaming setting, the most appropriate method seems to be pre-pruning based either on statistical test validation or on some user-defined criteria (such as maximum tree depth or minimal improvement of accuracy). Another promising approach is to reuse the idea of combining multiple predictions in learning ensembles of predictors and to rely on out-of-bag estimates, which are basically a side product of the bootstrapping approach proposed by Breiman (1996a, 2001). A further discussion of this method will be given in Chapter 4.
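The user-defined criteria mentioned above amount to a simple check performed before a leaf is expanded. The sketch below is only illustrative: the class `PrePruningConfig`, the function `should_expand_leaf`, and the default values are hypothetical names and settings, not taken from any particular algorithm.

```python
from dataclasses import dataclass


@dataclass
class PrePruningConfig:
    """Hypothetical user-defined limits on tree growth."""
    max_depth: int = 10             # leaves at this depth are never expanded
    min_improvement: float = 0.01   # required decrease in estimated error


def should_expand_leaf(depth, error_before, error_after, config):
    """Return True if expanding a leaf at `depth` satisfies both criteria."""
    if depth >= config.max_depth:
        return False
    return (error_before - error_after) >= config.min_improvement


if __name__ == "__main__":
    config = PrePruningConfig(max_depth=3, min_improvement=0.02)
    print(should_expand_leaf(1, 0.30, 0.20, config))   # large gain, shallow leaf
    print(should_expand_leaf(1, 0.30, 0.295, config))  # gain below the threshold
    print(should_expand_leaf(3, 0.30, 0.10, config))   # depth limit reached
```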

3.3.2 Selection Decisions

A second, very important issue, which is strongly present in split selection decisions, is the variable selection bias, described as the tendency to choose the variables with many possible splits. In particular, Quinlan and Cameron-Jones (1995) observed that "...for any collection of training data, there are 'fluke' theories that fit the data well but have low predictive accuracy. When a very large number of hypotheses is explored, the probability of encountering such a fluke increases." However, when training data is abundant, fitting a hypothesis by chance becomes less probable.
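The selection bias can be observed in a small simulation. In the sketch below the target is pure noise, so neither attribute is truly informative; nevertheless, an exhaustive search over split points tends to prefer the attribute that offers many candidate splits. The setup (sum-of-squared-errors reduction as the splitting measure, 50 examples, 200 trials, and all function names) is an assumption chosen for illustration.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_sse_reduction(x, y):
    """Largest SSE reduction over all binary splits of the form x <= threshold."""
    pairs = sorted(zip(x, y))
    ys = [target for _, target in pairs]
    total = sse(ys)
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal attribute values
        reduction = total - sse(ys[:i]) - sse(ys[i:])
        best = max(best, reduction)
    return best


def many_valued_win_rate(n_examples=50, n_trials=200, seed=0):
    """Fraction of trials in which the many-valued attribute is selected."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_examples)]     # pure noise target
        binary = [rng.randint(0, 1) for _ in range(n_examples)]  # one possible split
        many = [rng.random() for _ in range(n_examples)]         # up to n - 1 splits
        if best_sse_reduction(many, y) > best_sse_reduction(binary, y):
            wins += 1
    return wins / n_trials


if __name__ == "__main__":
    print(many_valued_win_rate())
```

An unbiased procedure would select each attribute in roughly half of the trials; a rate well above one half reflects the many-valued attribute having more chances to produce a fluke split.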

Hothorn et al. (2006) connect the drawbacks of exhaustive search procedures (usually applied to fit regression trees) with a selection bias towards covariates with many possible splits or missing values. They argue for a statistical approach to recursive partitioning which takes into account the distributional properties of the measures. While this approach is not guaranteed to improve accuracy, it is expected to reduce the selection bias and, as a result, improve the veracity of the hypothesis and its interpretation. Splitting measures that try to balance the size of the child nodes, on the other hand, have been shown to improve accuracy. They can be employed in the online learning scenario by estimating the probability of an example reaching a particular leaf node.

The different methods and approaches proposed to improve the selection decision in decision and regression tree learning algorithms relate closely to strategies for choosing the next move in the search of the space of possible hypotheses H. In this context, several other ideas (which have led to new types of tree-structured predictors) explore multiple paths, which correspond to different refinements of the current hypothesis or to the introduction of some randomness in the selection process. The latter can be particularly useful when noise is present in the training data.
