

Decision trees are powerful tools for classification and prediction that are becoming increasingly popular (Lahiri, 2006; Delen, 2011). Their advantage is that they represent simple rules that humans can understand, unlike methods such as neural networks.

Decision tree

A basic decision tree consists of decision nodes and leaf nodes. A decision node is either the root node (‘Body temperature’ in Figure 2.15) or an internal node (‘Gives birth’), and it specifies a test to be carried out on a single attribute value. The leaf nodes sit at the bottom of the tree and represent the possible classes (Mammals or Non-mammals). Take the flamingo in Table 2.3 as an example. Starting at the root node, the question is whether flamingos have a warm or cold body temperature. Since they have a warm body temperature, we arrive at the ‘Gives birth’ node. Since flamingos do not give birth, we arrive at the leaf node that assigns them to the ‘Non-mammals’ class.
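The traversal described above can be written as two nested tests. The following is a minimal sketch, not the tree from Figure 2.15 itself: the function name `classify`, the dictionary keys, and the leaf reached on the cold-blooded branch (assumed here to be ‘Non-mammals’) are our assumptions.

```python
def classify(animal):
    """Walk a two-level tree from the root node down to a leaf node."""
    # Root node: test on 'Body temperature'.
    if animal["body_temperature"] != "warm":
        # Assumption: the cold-blooded branch ends directly in a leaf.
        return "Non-mammals"
    # Internal node: test on 'Gives birth'.
    if animal["gives_birth"] == "yes":
        return "Mammals"
    return "Non-mammals"


# The unlabelled record from Table 2.3.
flamingo = {"body_temperature": "warm", "gives_birth": "no"}
print(classify(flamingo))  # Non-mammals
```

Each `if` corresponds to one decision node and each `return` to one leaf node, which is why such a tree can be read as a set of simple rules.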

Several algorithms exist for building decision trees, but one of the most popular is C4.5. We elaborate on it here because it is the algorithm behind the J48 implementation in WEKA, an open source data mining tool that makes it easy to apply many data mining and machine learning techniques. C4.5 builds decision trees using the concept of information gain (difference in entropy).

Table 2.3: Unlabelled data

Name      Body temperature  Gives birth  ...  Class
Flamingo  Warm              No           ...  ?

Figure 2.15: Decision tree for the mammal classification problem (Tan, 2006)

Let $p(i|t)$ denote the fraction of records belonging to class $i$ at a given node $t$. Often we just write $p_i$, without referring to a specific node. In a two-class problem, the class distribution at any node can be written as $(p_0, p_1)$, where $p_1 = 1 - p_0$. These distributions are used to find the best splits; the measures for selecting a split are often based on the degree of impurity of the child nodes. A node with class distribution (0.5, 0.5) is highly impure, whereas (1, 0) has zero impurity. The impurity measure used in the C4.5 algorithm is entropy, which is calculated as follows:

$$\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t) \qquad (2.44)$$

where $c$ is the number of classes. The splitting criterion is the information gain, which is the difference in entropy before and after a split. The attribute (‘Body temperature’ or ‘Gives birth’ in the example from Figure 2.15) with the highest information gain is chosen to make the decision. This is repeated until the entire tree is constructed.
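Equation (2.44) translates almost directly into code. The sketch below is illustrative only; the function name `entropy` is our own, and we assume the usual convention that a class with probability 0 contributes nothing to the sum.

```python
from math import log2


def entropy(distribution):
    """Entropy of a class distribution (p_0, ..., p_{c-1}), as in Equation (2.44)."""
    # Classes with p = 0 are skipped, following the convention 0 * log2(0) = 0.
    return sum(-p * log2(p) for p in distribution if p > 0)


print(entropy((0.5, 0.5)))  # 1.0, the most impure two-class node
print(entropy((1.0, 0.0)))  # 0.0, a pure node
```

The two calls reproduce the extremes mentioned in the text: the distribution (0.5, 0.5) has the maximum entropy for two classes, and (1, 0) has zero impurity.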

Let us give an example. We have a training set of 14 instances, of which 5 are mammals and 9 are not. This gives an entropy of:

$$\mathrm{Entropy}(\mathit{IsMammal}) = \mathrm{Entropy}(5, 9) = -(0.36 \log_2 0.36) - (0.64 \log_2 0.64) = 0.94 \qquad (2.45)$$

To calculate the information gain of an attribute, we subtract the entropy after the split from that before the split. For example, the information gain of the attribute ‘Body temperature’ is calculated as follows:

$$\mathrm{Gain}(\mathit{IsMammal}, \mathit{BodyTemp}) = \mathrm{Entropy}(\mathit{IsMammal}) - \mathrm{Entropy}(\mathit{IsMammal}, \mathit{BodyTemp}) = 0.94 - 0.788 = 0.152 \qquad (2.46)$$

where $\mathrm{Entropy}(\mathit{IsMammal}, \mathit{BodyTemp})$ is the weighted average entropy of the two child nodes:

$$\mathrm{Entropy}(\mathit{IsMammal}, \mathit{BodyTemp}) = p(\mathit{Warm}) \cdot \mathrm{Entropy}(4, 3) + p(\mathit{Cold}) \cdot \mathrm{Entropy}(1, 6) = (7/14) \cdot 0.985 + (7/14) \cdot 0.592 = 0.788 \qquad (2.47)$$

Table 2.4 gives the information gain for both the ‘Body temperature’ and the ‘Gives birth’ attributes. The gain when splitting on ‘Body temperature’ is much higher than when splitting on ‘Gives birth’, so we choose ‘Body temperature’ as the decision node. The C4.5 algorithm repeats this process for every branch until each branch ends in a node with an entropy of 0, which becomes a leaf node. An entropy of zero indicates that there is no impurity, which means that the leaf node contains only instances of one class.
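The calculation can be checked numerically. The following sketch recomputes the information gain of both attributes from the class counts in Table 2.4; the function names `entropy_from_counts` and `information_gain` are our own and not part of C4.5 or WEKA.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy of a node given its class counts, e.g. (5, 9)."""
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)


def information_gain(parent_counts, child_counts):
    """Entropy before the split minus the weighted entropy after the split."""
    total = sum(parent_counts)
    after_split = sum(
        (sum(child) / total) * entropy_from_counts(child) for child in child_counts
    )
    return entropy_from_counts(parent_counts) - after_split


# Counts are (mammal, non-mammal), taken from Table 2.4.
parent = (5, 9)
splits = {
    "Body temperature": [(4, 3), (1, 6)],  # Warm, Cold
    "Gives birth": [(3, 4), (2, 5)],       # Yes, No
}

gains = {name: information_gain(parent, children) for name, children in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")  # 0.152 and 0.016

best = max(gains, key=gains.get)
print("Split on:", best)  # Body temperature
```

Choosing the attribute with the largest gain and then repeating the same computation inside each child node gives the recursive construction described above.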

Table 2.4: Information gain of the attributes ‘Body temperature’ and ‘Gives birth’

Attribute         Value  Mammal: Yes  Mammal: No  Information gain
Body temperature  Warm   4            3           0.152
                  Cold   1            6
Gives birth       Yes    3            4           0.016
                  No     2            5

Random forest

A risk of decision tree learning is that too many attributes are included. The tree may then branch early on attributes that yield a good information gain at that point but leave little or no information gain for later splits. A solution to this problem is feature selection, and a widely used technique is the random forest, introduced by Breiman (2001). Several authors have found that growing an ensemble of trees and letting them vote for the most popular class leads to significant improvements in classification. Each tree of the forest is grown as follows:

1. Let N be the number of instances in the training set; sample N cases at random (with replacement) from the original data.

2. If there are M attributes to classify on, a number m ≪ M is specified such that at each split, m features are randomly selected out of the M, and the best split on this subset is used to split the node.

3. Each tree is grown to the largest extent possible.
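The two sources of randomness in steps 1 and 2 can be sketched with Python's standard `random` module. This is only an illustration of the sampling, not a full forest implementation: the helper names, the integer stand-ins for training records, and the attributes ‘Has legs’ and ‘Hibernates’ are hypothetical.

```python
import random


def bootstrap_sample(instances, rng):
    """Step 1: draw N instances with replacement from the N originals."""
    return [rng.choice(instances) for _ in range(len(instances))]


def candidate_features(attributes, m, rng):
    """Step 2: pick m distinct attributes out of the M available for one split."""
    return rng.sample(attributes, m)


rng = random.Random(0)  # fixed seed so that the sketch is repeatable

instances = list(range(14))  # stand-ins for 14 training records
# 'Has legs' and 'Hibernates' are hypothetical attributes added for illustration.
attributes = ["Body temperature", "Gives birth", "Has legs", "Hibernates"]

sample = bootstrap_sample(instances, rng)
subset = candidate_features(attributes, 2, rng)
print(sample)
print(subset)
```

Because the sample is drawn with replacement, some records typically appear several times in a tree's training data and others not at all, which is what makes the trees differ from one another.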

To classify an unseen instance, it is run through each tree in the forest. If, for example, 150 out of 200 trees (a majority) classify the instance as ‘Class 1’, the instance is classified as ‘Class 1’.
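The voting step amounts to counting the predictions of the individual trees. A minimal sketch, in which the function name `majority_vote` and the label ‘Class 2’ are our assumptions:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# 150 of the 200 trees vote for 'Class 1'; the rest vote for a hypothetical 'Class 2'.
votes = ["Class 1"] * 150 + ["Class 2"] * 50
print(majority_vote(votes))  # Class 1
```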

In random forests, the importance of the attributes can be calculated using the Mean Decrease Accuracy, also called permutation importance (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). The idea behind this importance measure is that the more the accuracy of the random forest decreases when a single attribute is excluded (or permuted), the more important that attribute is considered to be.
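The idea can be illustrated with a simplified sketch. We assume here that the model is any function mapping a record to a class and that accuracy is measured on a single labelled data set; the procedure in Strobl et al. (2008) is more involved, and all names and the toy data below are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, attribute, rng):
    """Drop in accuracy after shuffling the values of one attribute across rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[attribute] for row in rows]
    rng.shuffle(shuffled)
    # Copy every row, replacing only the permuted attribute.
    permuted = [{**row, attribute: value} for row, value in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)


def toy_model(row):
    """Hypothetical stand-in for a trained forest; it only looks at 'gives_birth'."""
    return "Mammals" if row["gives_birth"] == "yes" else "Non-mammals"


rows = [
    {"gives_birth": "yes", "colour": "brown"},
    {"gives_birth": "no", "colour": "pink"},
    {"gives_birth": "yes", "colour": "grey"},
    {"gives_birth": "no", "colour": "green"},
]
labels = [toy_model(row) for row in rows]

rng = random.Random(0)
used = permutation_importance(toy_model, rows, labels, "gives_birth", rng)
unused = permutation_importance(toy_model, rows, labels, "colour", rng)
print(used, unused)
```

Permuting an attribute that the model never consults leaves the accuracy unchanged, so its importance is zero, whereas permuting an attribute the model relies on can only lower the accuracy in this toy setting.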

The advantages of random forests are that they do not overfit as more trees are added, a consequence of the Law of Large Numbers, and that they are quite fast (Breiman, 2001). In addition, generated forests can be saved for future use on other data (Strobl et al., 2008).