


Intuitively, a variable can become more important if:

It is decisive many times, i.e. it greatly reduces the impurity of the tree.

It is decisive at the nodes at or near the root of the tree where the proportion of observations compared to the whole data set is high.

A variable with zero importance is one that is not decisive at any node of the tree. Such a variable does not participate in or contribute to the classification or regression process.
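The intuition above can be sketched in a few lines of Python. This is a minimal illustration, assuming that importance is accumulated as the impurity decrease at each node weighted by the proportion of observations reaching that node; the exact formula used in this document is not shown in this section, and the function name, node tuples and variable names below are hypothetical.

```python
from collections import defaultdict

def variable_importance(nodes, n_total, variables):
    """Accumulate weighted impurity decrease per variable.

    nodes: list of (variable, n_node, impurity_decrease) tuples, one per
           internal node of a (hypothetical) tree.
    n_total: number of observations in the whole data set.
    """
    scores = defaultdict(float)
    for var, n_node, decrease in nodes:
        # Nodes near the root see a larger share of the data, so the same
        # impurity decrease counts for more there.
        scores[var] += (n_node / n_total) * decrease
    # Variables that never split a node keep an importance of zero.
    return {v: scores.get(v, 0.0) for v in variables}

# Hypothetical tree: the root splits on "x1", deeper nodes on "x2" and "x1".
nodes = [("x1", 100, 0.30), ("x2", 60, 0.10), ("x1", 40, 0.05)]
importance = variable_importance(nodes, n_total=100, variables=["x1", "x2", "x3"])
```

In this toy tree "x1" scores highest because it is used twice, including at the root, while "x3" is never used and therefore has zero importance.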

C.3. Random Forests (RF)

1. Introduction

Random Forest (Breiman, 2001) is an ensemble learning method that generates many classification and regression trees (CART; Breiman, Friedman, Olshen and Stone, 1984), trains the trees and aggregates their results. Successive trees do not depend on earlier trees: each is independently constructed using a bootstrap sample of the data set.

According to Breiman (2001), the motivation for inventing RF is that CART is unstable and only moderately accurate. Maximum trees usually work well with training data but perform poorly on test data. Tree pruning can improve CART performance on test data and yield trees with relatively higher accuracy. However, CART remains unstable: even a small change in the training data can lead to totally different trees, which makes tree interpretation problematic. Therefore, the main idea of RF is to create many unstable trees (i.e. maximum trees) that fit the training data very well, such that there is no correlation between any pair of trees, and then aggregate the trees’ results.

2. Weak Learners (WL)

According to Schapire (1990), a weak learner (WL) is a learner that can produce a hypothesis that performs only slightly better than random guessing. The author also concluded that it was possible to convert weak learners into a learner with high prediction power.

According to Breiman and Cutler (2004), a weak learner is a prediction function that has low bias, which comes at the cost of high variance. Breiman and Cutler also demonstrate the idea stated in (Schapire, 1990) that converting weak learners in some way can generate a learner with high prediction power. Maximum trees are a good example of weak learners, as they fit the training data very well. Figure A-2 presents the performance of four weak learners, which are regression trees (CART), in predicting outputs of the sine function y=sin(2πx). Training and test data are generated using the given sine function. Four training data sets are randomly generated to fit the four weak learners. The test data set is generated and represented in Figure A-2 as “Original Function”. The test data are then input to the four weak learners, and the outcomes are illustrated in Figure A-2 as “Weak learner 1”, “Weak learner 2”, “Weak learner 3”, and “Weak learner 4”. It is clear that there is high variance between the predicted outputs of the weak learners and the expected outputs from the original function. However, if the outputs of the weak learners for each input are averaged, the averaged output is a better estimate of the original output. According to (Breiman, 2001), as the number of weak learners tends to infinity, the averaged outputs become precisely the outputs of the original function. The converted learner, which is averaging in this example, is called an ensemble learner.
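The averaging experiment described above can be reproduced with a short sketch. It is an illustration under stated assumptions, not the code behind the figure: the sample size (30 points per training set), the Gaussian noise added to the training responses, the random seed and all function names are hypothetical choices. The trees are one-dimensional regression trees grown until each leaf holds a single observation.

```python
import math
import random

def sse(points):
    """Sum of squared errors of the responses around their mean."""
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)

def grow_max_tree(points):
    """Grow a 1-D regression tree until no further split is possible."""
    points = sorted(points)
    if points[0][0] == points[-1][0]:      # single observation or identical x values
        return sum(y for _, y in points) / len(points)   # leaf: mean response
    best = None                            # (score, split position)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue                       # cannot split between equal x values
        score = sse(points[:i]) + sse(points[i:])
        if best is None or score < best[0]:
            best = (score, i)
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return (threshold, grow_max_tree(points[:i]), grow_max_tree(points[i:]))

def predict(tree, x):
    """Follow the splits down to a leaf value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def make_training_set(rng, n=30, noise=0.3):
    """Assumed setup: noisy samples of y = sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, noise)) for x in xs]

rng = random.Random(0)
trees = [grow_max_tree(make_training_set(rng)) for _ in range(4)]

test_x = [i / 100 for i in range(101)]
truth = [math.sin(2 * math.pi * x) for x in test_x]

def mse(pred):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

tree_preds = [[predict(t, x) for x in test_x] for t in trees]
avg_pred = [sum(col) / len(col) for col in zip(*tree_preds)]   # ensemble by averaging
tree_mse = [mse(p) for p in tree_preds]
ensemble_mse = mse(avg_pred)
```

Because squared error is convex, the error of the averaged prediction can never exceed the mean error of the individual trees, which is the effect the paragraph describes.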

3. Randomization

In reality, it is not always possible to obtain several training data sets as in the example illustrated in Figure A-2; in many cases there is only one. It is therefore necessary to use that training data set effectively. A further issue when using maximum trees as WL is that only one maximum tree can be generated from a single training data set.

Figure C‐1: Maximum regression trees as weak learners and averaging weak learners (x-axis: X; y-axis: Y = f(X); series: Weak Learner 1 to 4 and Original Function)


bias of weak learners, whereas the variance is significantly reduced. The authors also point out that the generated WL should have low correlation with one another for the ensemble learner that aggregates them to achieve higher performance.

Independent, identically distributed randomization is used to generate the t-th tree in the two steps below:

1) Generate a bootstrap sample of the training set, called Bt, where t = 1:Ntree (Ntree is the total number of trees).

2) Grow the maximum tree using the generated data set such that:

a) At each node, m variables are selected at random out of M variables.

b) The split used is the best split on these m variables.

A tree is obtained from two randomizations: the training data set for that tree and the selection of variables at each node of the tree. Bt is the set of indices of the observations selected for the t-th tree and is sampled with replacement from TrainingSet.
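The two randomizations can be illustrated with Python's standard `random` module. This is a small sketch rather than the implementation used in this work: the sizes (10 observations, M = 9, m = 3, five trees), the seed and the helper names are hypothetical. The out-of-bag indices are computed only to show which observations a given tree never sees.

```python
import random

def bootstrap_indices(n_obs, rng):
    """Bt: indices drawn with replacement from the training set."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]

def candidate_variables(M, m, rng):
    """m distinct variables chosen at random out of M, for one tree node."""
    return rng.sample(range(M), m)

rng = random.Random(42)
n_obs, M, m, n_tree = 10, 9, 3, 5

# First randomization: one bootstrap sample per tree.
bootstrap_sets = [bootstrap_indices(n_obs, rng) for _ in range(n_tree)]

# Observations never drawn for a tree are "out of bag" for that tree.
out_of_bag = [sorted(set(range(n_obs)) - set(b)) for b in bootstrap_sets]

# Second randomization: repeated at every node while the tree is grown.
node_vars = candidate_variables(M, m, rng)
```

Sampling with replacement means some indices repeat within a bootstrap sample while others are left out, which is what makes the trees differ even though they share one training set.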

4. Learning Algorithm

The algorithm for building RF is presented in Figure A-3(a). Trees in RF are trained in a manner similar to training CART (see Section b). The algorithm for growing trees in RF is presented in Figure A-3(b). The inputs for training RF include:

• The training data set TSSet, which is a part of the whole available data set contained in matrix X. TSSet is the same as the TrainingSet mentioned in Section 0.

• The number m of variables randomly selected at each tree node.

• The number Ntree of trees grown in RF.

According to (Breiman, 2001), m is recommended to be the square root of M (the total number of variables in matrix X), and Ntree is selected by trying different values.

Growing a tree in RF based on the data set Lt is similar to growing CART using Lt, presented in Figure A-3. The main difference is that at each node of a tree in RF, the split is determined among the potential splits given by m randomly selected variables, whereas the split in CART is determined among all potential splits given by all M variables.
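The difference between the two split searches can be shown on a toy regression node. The sketch below assumes squared-error impurity and midpoint thresholds, which are common choices but are not confirmed by this section; the data, seed and function names are hypothetical. The same search routine is used in both cases, and only the set of candidate variables changes.

```python
import math
import random

def sse(values):
    """Sum of squared errors around the mean (regression impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, variables):
    """Best (score, variable, threshold) among the given candidate variables."""
    best = None
    for j in variables:
        levels = sorted(set(row[j] for row in X))
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, thr)
    return best

# Toy node with M = 4 variables; the response depends only on variable 2.
X = [[1, 5, 0.125, 3], [2, 3, 0.25, 1], [3, 6, 0.75, 4],
     [4, 2, 0.875, 2], [5, 4, 0.375, 6], [6, 1, 0.625, 5]]
y = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

M = len(X[0])
m = max(1, round(math.sqrt(M)))          # square-root rule: m = 2 here

# CART: search all M variables.
cart_split = best_split(X, y, range(M))

# RF: search only m variables drawn at random for this node.
rng = random.Random(1)
rf_vars = rng.sample(range(M), m)
rf_split = best_split(X, y, rf_vars)
```

The RF split can never score better than the CART split at the same node, since it searches a subset of the same candidates; the benefit of the restriction comes from making the trees less correlated with one another.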

Besides growing trees, there are other processing steps as presented in Figure A-3(a) such as calculating error Et for the t-th tree, calculating Out-Of-Bag error for the whole RF, and evaluating variable

Figure C‐2: RF training algorithm 

When all trees are grown, RF aggregates the results. For RF regression with Ntree trees grown, let Tt(xi) be the likelihood for the i-th observation xi (i.e. Traffic Situation) given by the t-th tree; the likelihood for the i-th observation given by RF is presented in Equation 15.
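Equation 15 itself does not appear in this section. For orientation only, the usual aggregation for a regression forest is the plain average of the tree outputs, which in the notation above would read as follows; the symbol \bar{T} for the forest output is an assumption introduced here, and the original equation may use different notation.

```latex
\bar{T}(x_i) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} T_t(x_i)
```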
