In a decision tree, an input enters at the top and, as it traverses down the tree, the data are divided into smaller and smaller subsets.
Tree-based machine learning algorithms use recursive binary partitioning to split the feature space, $R$, into disjoint regions $R_1, \dots, R_M$. Each split is performed with respect to a single feature,
producing a partitioning of $R$ into a set of disjoint “rectangles” in the feature space (represented by the nodes of the tree). At each step, the algorithm selects both the feature and the split point that produce the smallest impurity in the two resulting nodes. The splitting process is repeated recursively to build a tree with multiple levels (Richards et al. 2011).
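The selection of a single split can be sketched as follows. This is a minimal illustration, not the implementation of Richards et al. (2011): it assumes the Gini impurity as the impurity measure, scores a split by the size-weighted average impurity of the two child nodes, and takes midpoints between consecutive distinct feature values as candidate split points. The function names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels (assumed measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best
    axis-aligned split of the rows X with class labels y."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            # Assumption: child impurities are weighted by child size.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best
```

Applying `best_split` recursively to the rows on each side of the chosen threshold yields the nested rectangular regions described above.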
To build a classification tree, begin with a training set of (feature, class) pairs $\{(X_i, Y_i)\}_{i=1}^{N}$, where
$X_i$ denotes the vector of features of the $i$-th source in the training set and $Y_i$ denotes its class.
Following Richards et al. (2011), at node $m$ of the tree – representing a region $R_m$ of the feature
space $R$ – the probability that a source with features in $R_m$ belongs to class $c$ is estimated by
$$\hat{p}_{mc} = \frac{1}{N_m} \sum_{X_i \in R_m} I(Y_i = c). \qquad (4.25)$$
This is the proportion of the $N_m$ training set objects in node $m$ whose class is $c$. The indicator function $I(Y_i = c)$ is defined to be 1 if $Y_i = c$ and 0 otherwise. During the tree-building process,
each subsequent split is chosen among all possible features and split points so that it minimizes a measure of the resulting node impurity. Common measures of node impurity are the Gini index (Gini 1913), $\sum_{c \neq c'} \hat{p}_{mc}\,\hat{p}_{mc'}$, and the entropy, $-\sum_{c=1}^{C} \hat{p}_{mc} \log_2 \hat{p}_{mc}$. This splitting process is repeated
recursively until some pre-defined stopping criterion (such as a threshold on the relative improvement in the objective function) is reached. Once a classification tree has been trained on a training set $\{(X_i, Y_i)\}_{i=1}^{N}$,
it is straightforward to predict the class of a new, unseen source with features $X_{\mathrm{new}}$. Specifically, the algorithm
identifies the region of the decision tree in which $X_{\mathrm{new}}$ resides and then assigns a class according to that
node’s estimated probabilities given in Equ. (4.25). For example, if $X_{\mathrm{new}} \in R_m$, then the assigned
probability that the source is of class $c$ is
$$\hat{p}_c(X_{\mathrm{new}}) = \hat{p}_{mc}, \qquad (4.26)$$
where $\hat{p}_{mc}$ is defined in Equ. (4.25). Using Equ. (4.26), the predicted class is the class with
the highest value of $\hat{p}_c(X_{\mathrm{new}})$, that is, $\hat{p}(X_{\mathrm{new}}) = \arg\max_c \hat{p}_c(X_{\mathrm{new}})$.
The classification output for each new source can then be described either as a vector of class probabilities (giving the probability for each of the C classes) or as its predicted class (with the highest probability).
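As a small worked illustration of Equ. (4.25) and (4.26), the sketch below computes the class proportions in a single node, the two impurity measures mentioned above, and the predicted class. The class labels "A", "B", "C" and all function names are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def node_class_probabilities(node_labels, classes):
    """Equ. (4.25): fraction of the N_m training labels in the node equal to c."""
    n_m = len(node_labels)
    counts = Counter(node_labels)
    return {c: counts[c] / n_m for c in classes}


def gini_index(p):
    """Sum over c != c' of p_c * p_c', which equals 1 - sum_c p_c^2."""
    return 1.0 - sum(v * v for v in p.values())


def entropy(p):
    """-sum_c p_c log2 p_c, with the convention 0 * log2(0) = 0."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def predicted_class(p):
    """Arg max over classes of the node probabilities assigned by Equ. (4.26)."""
    return max(p, key=p.get)


# Hypothetical node containing four training sources.
p = node_class_probabilities(["A", "A", "A", "B"], classes=["A", "B", "C"])
```

Here `p` is the vector of class probabilities for any new source falling into this node, and `predicted_class(p)` returns the single class with the highest probability.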
Decision trees, automatically constructed by machine learning algorithms, can yield powerful classifiers owing to both their conditional structure and their high execution speed. The method described so far tends to construct very large trees, because such trees fit the training set well. However, decision trees often cannot be grown to arbitrary complexity without losing generalization accuracy on new (“unseen”) data: a tree that is too complex becomes overly adapted to the training data and thus overfits. On the other hand, a very lean tree will likely fail to capture the complexity of the underlying process that led to the different classes, and will therefore be insufficient for classification. In the end, the appropriate size of a classification tree depends on the model complexity required for the particular application at hand and hence should be determined by the data.
The standard approach to this problem is to build a large tree and then prune it to find the sub-tree that performs best under verification methods such as those shown in Sec. 4.2.4. Pruning back a fully grown tree may increase the generalization accuracy on unseen data, often at the expense of the accuracy on the training data. Probabilistic methods that allow descent through multiple branches with different confidence measures also do not guarantee optimization
of the training set accuracy. There thus appears to be a fundamental limitation on the complexity of tree classifiers: they should not be grown so complex that they overfit the training data.
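The idea of pruning a large tree against held-out data can be sketched with reduced-error pruning. This particular strategy is chosen here only for brevity; the text does not state which pruning method is meant. The nested-dictionary tree layout, the `majority` field stored at each internal node, and the function names are assumptions made for this illustration.

```python
def predict(node, x):
    """Follow the splits from the given node until a leaf is reached."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def accuracy(tree, X, y):
    """Fraction of rows in X whose predicted class matches y."""
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)


def prune(node, tree, X_val, y_val):
    """Bottom-up reduced-error pruning (in place): replace a subtree by a
    leaf whenever accuracy on the validation set does not decrease."""
    if "label" in node:
        return
    prune(node["left"], tree, X_val, y_val)
    prune(node["right"], tree, X_val, y_val)
    before = accuracy(tree, X_val, y_val)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]  # tentative leaf with the majority class
    if accuracy(tree, X_val, y_val) < before:
        node.clear()
        node.update(saved)  # pruning hurt validation accuracy: restore subtree
```

A call such as `prune(tree, tree, X_val, y_val)` modifies `tree` in place, so splits that only fit noise in the training set are collapsed when they do not help on the validation data.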
The development of ensemble methods has led to significant improvements in classification accuracy. Such methods grow many trees, forming an ensemble, and let the trees “vote” for the most probable class. Carrying out such a divide-and-conquer approach improves the classification performance. The main principle behind ensemble methods is that a group of “weak” classifiers can form a “strong” one. An example of such a method is bagging (Breiman 1996), where for the construction of each tree a bootstrap sample (a random selection with replacement) is drawn from the sources in the training set: given a specific training set $T$, form bootstrap training sets $T_k$, construct classifiers $h(x, T_k)$, and let these vote to form the bagged predictor.
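The bagging procedure can be sketched as follows, assuming a training set given as a list of (feature, class) pairs with a single numeric feature. The base learner `fit_stump`, a tree with only one split, is a hypothetical stand-in for a full tree-growing routine; `bootstrap_sample` and `bagged_predictor` are likewise illustrative names.

```python
import random
from collections import Counter


def bootstrap_sample(T, rng):
    """Draw len(T) items from T uniformly at random with replacement."""
    return [rng.choice(T) for _ in T]


def fit_stump(T):
    """Hypothetical base learner: a one-split tree on one numeric feature."""
    majority = Counter(y for _, y in T).most_common(1)[0][0]
    # Start from the trivial classifier that always predicts the majority class.
    best = (sum(y != majority for _, y in T), None, majority, majority)
    xs = sorted(set(x for x, _ in T))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = Counter(y for x, y in T if x <= t).most_common(1)[0][0]
        right = Counter(y for x, y in T if x > t).most_common(1)[0][0]
        errors = sum((left if x <= t else right) != y for x, y in T)
        if errors < best[0]:
            best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if (t is None or x <= t) else right


def bagged_predictor(T, fit, n_trees, seed=0):
    """Fit one classifier h(x, T_k) per bootstrap set T_k and return a
    predictor that takes the majority vote over all of them."""
    rng = random.Random(seed)
    classifiers = [fit(bootstrap_sample(T, rng)) for _ in range(n_trees)]

    def predict(x):
        votes = Counter(h(x) for h in classifiers)
        return votes.most_common(1)[0][0]

    return predict
```

Because each $T_k$ is drawn with replacement, some sources appear several times and others not at all, so the individual classifiers differ and their vote is more stable than any single one.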
Another example is random split selection (Dietterich 2000), which selects at each node a split at random from among the $K$ best splits. Randomized outputs (Breiman 1998, 1999) grows trees on training sets in which the outputs of the original training set are randomly perturbed: for a fixed number $s$, at each node the $s$ best splits (in terms of minimizing deviance) are found, and the actual split is selected uniformly at random from them. Random feature selection (Amit and Geman 1997; Breiman 1999) looks for the best split over a random subset of the features. The random subspace method by Ho (1998) randomly selects a subset of features with which to grow each tree. Perfect Random Trees Ensembles (Cutler and Zhao 2001) use extreme randomness: at each node, a variable to split on is chosen at random, and on the chosen variable a split point is chosen uniformly at random between two randomly chosen points coming from different classes. The Random Forest Classifier (Breiman 2001; see also Sec. 4.2.3 for a detailed description) is an ensemble method where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.
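Two of the randomization schemes above can be sketched in a few lines. The representation of a candidate split as an (impurity, feature index, threshold) tuple and both function names are assumptions for illustration; the cited papers may differ in detail.

```python
import random


def random_split_selection(candidates, K, rng):
    """Pick uniformly at random among the K candidate splits with the
    lowest impurity. Each candidate is (impurity, feature, threshold)."""
    k_best = sorted(candidates, key=lambda s: s[0])[:K]
    return rng.choice(k_best)


def random_feature_subset(n_features, m, rng):
    """Indices of m distinct features drawn at random; the best split is
    then searched only among these features."""
    return sorted(rng.sample(range(n_features), m))
```

In random split selection the randomness enters after the splits have been scored, whereas in random feature selection it restricts which splits are scored in the first place.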