
Figure 2.8: A binary decision tree for predicting molecular toxicity, with the tests polarity = polar and charge = neutral and leaves labelled + or -. All tests in the tree are boolean. Examples succeeding the test take the left branch, examples failing it the right branch.

Algorithm 2.1 A generic TDIDT algorithm.

function TDIDT(E: set of examples) returns decision tree
    T := set of all possible tests
    τ* := arg max_{τ ∈ T} quality(τ, E)
    if stop_criterion(τ*, E)
        return leaf(local_model(E))
    else
        P := partition induced on E by τ*
        for all Pj in P
            Tj := TDIDT(Pj)
        return node(τ*, ∪_j {(j, Tj)})

In practice, there are several issues that are not covered by this simplified algorithm. For example, extensions have been proposed that specify how to perform tests on non-nominal attributes, how to use pruning criteria to stop growing the tree and how to predict probabilities or even multiple targets. More information about these extensions can be found in [Mitchell, 1997; Blockeel, 2007].
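As a rough illustration, the generic TDIDT procedure of Algorithm 2.1 could be instantiated as follows. This is only a sketch under stated assumptions: tests are boolean checks of the form attribute = value, the quality heuristic is information gain, the stop criterion is "no test improves the heuristic", and the local model is the majority class. None of these choices is prescribed by the algorithm itself, and the four toy molecules are hypothetical data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def quality(test, examples):
    """Information gain of a boolean test (an assumed choice of heuristic)."""
    attr, value = test
    left = [y for x, y in examples if x[attr] == value]
    right = [y for x, y in examples if x[attr] != value]
    if not left or not right:  # the test does not split the examples
        return 0.0
    n = len(examples)
    labels = [y for _, y in examples]
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def tdidt(examples):
    """Return ("leaf", class) or ("node", test, left_subtree, right_subtree)."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]  # local model: majority class
    # T := set of all possible tests, here every observed (attribute, value) pair
    tests = sorted({(a, v) for x, _ in examples for a, v in x.items()})
    best = max(tests, key=lambda t: quality(t, examples), default=None)
    # Stop criterion (assumed): no test yields a positive information gain
    if best is None or quality(best, examples) <= 1e-12:
        return ("leaf", majority)
    attr, value = best
    succeed = [(x, y) for x, y in examples if x[attr] == value]
    fail = [(x, y) for x, y in examples if x[attr] != value]
    # Examples succeeding the test go left, examples failing it go right
    return ("node", best, tdidt(succeed), tdidt(fail))


def predict(tree, x):
    """Sort an example down the tree and return the class stored in its leaf."""
    while tree[0] == "node":
        _, (attr, value), left, right = tree
        tree = left if x.get(attr) == value else right
    return tree[1]


# Hypothetical toy data: (attribute dictionary, class label)
molecules = [
    ({"polarity": "polar", "charge": "neutral"}, "toxic"),
    ({"polarity": "polar", "charge": "positive"}, "non-toxic"),
    ({"polarity": "apolar", "charge": "neutral"}, "non-toxic"),
    ({"polarity": "apolar", "charge": "positive"}, "non-toxic"),
]
tree = tdidt(molecules)
```

Because every accepted test splits the examples into two non-empty subsets, the recursion always terminates. The sketch omits the extensions mentioned above, such as tests on non-nominal attributes and pruning.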

2.5 Measuring the quality of a learned model

An important aspect of machine learning is to assess the quality of the learned models. Predictive performance is an important measure, but one might also be interested in efficiency (e.g., the time to learn the model and to perform predictions) or interpretability (e.g., the possibility of acquiring insights from a knowledge discovery point of view).

Table 2.6: Contingency matrix for a binary classifier.

                      Real positive   Real negative
Predicted positive    TP              FP
Predicted negative    FN              TN

Here we focus on predictive performance measures. Many different measures exist and in a particular context it is not always straightforward to select an appropriate one. Therefore, it is important to understand what exactly an evaluation measure is measuring, such that, given the situation, the most sensible one can be selected.

When measuring the predictive performance of a model, ideally one is interested in the performance on the complete example space, which is unknown. In practice, only a limited amount of data is available. Usually a first part of this data, called the training data, is used to train the model and a second part, which has the same distribution as the first part and is called the test data, is used to estimate the performance. The training data cannot be used to measure performance, since doing so yields overly optimistic estimates. However, when the amount of data is limited, one wants to use as many examples as possible to learn the model. A common way to solve this issue is known as cross-validation. N-fold cross-validation partitions the available data into N equally sized folds. Then, the learning algorithm is repeated N times, each time leaving out one fold as test set and using the remaining N − 1 folds for training. The estimates computed on each test fold are then averaged to get a final estimate.
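The cross-validation procedure described above can be sketched as follows. The function names (`cross_validate`, `train_majority`, `accuracy`) and the toy dataset are hypothetical and serve only as an illustration: the learner simply predicts the majority class of its training data, and the folds are built by dealing out the shuffled examples in turn, so they are equally sized only when the number of examples is a multiple of N.

```python
import random
from collections import Counter


def cross_validate(examples, n_folds, train, evaluate, seed=0):
    """Average the estimates obtained on each of n_folds held-out folds."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples out into n_folds (nearly) equally sized folds
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    estimates = []
    for i, test_fold in enumerate(folds):
        # Train on the remaining n_folds - 1 folds
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        estimates.append(evaluate(model, test_fold))
    return sum(estimates) / n_folds


def train_majority(training):
    """Hypothetical learner: the model is just the majority class label."""
    return Counter(y for _, y in training).most_common(1)[0][0]


def accuracy(model, test_fold):
    """Fraction of test examples whose label equals the predicted class."""
    return sum(1 for _, y in test_fold if y == model) / len(test_fold)


# Hypothetical data: 8 negative and 2 positive examples
data = [(i, "neg") for i in range(8)] + [(i, "pos") for i in range(8, 10)]
estimate = cross_validate(data, 5, train_majority, accuracy)
```

With this data every training set contains a majority of negative examples, so the learner always predicts "neg" and the averaged estimate is 0.8, regardless of how the examples are shuffled.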

In this section, we will discuss several evaluation measures that will be used further on in the text. We only consider evaluation measures for classifiers, that is, models that predict a certain class. For numerical predictors, other evaluation measures are used. For more details, we refer to [Blockeel, 2007].

But first, we present the contingency matrix. Given a binary classification problem, and given the predictions of a binary classifier (predicting positive or negative), one can always construct the matrix depicted in Table 2.6. Each example ends up in one of the four cells of the matrix. The true positives (TP) and the true negatives (TN) are the examples that were correctly classified as positive and negative, respectively. The false positives (FP) are the negative examples that were incorrectly classified as positive. Finally, the false negatives (FN) are the positive examples that were incorrectly classified as negative. If N is the number of examples in the dataset, then N = TP + FP + TN + FN.
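Filling in the four cells of the contingency matrix amounts to a single pass over the examples. The following is a minimal sketch, assuming that labels are encoded as 1 (positive) and 0 (negative); the function name and the example vectors are hypothetical.

```python
def contingency(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1  # correctly classified as positive
        elif predicted == positive:
            fp += 1  # negative example classified as positive
        elif truth == positive:
            fn += 1  # positive example classified as negative
        else:
            tn += 1  # correctly classified as negative
    return tp, fp, fn, tn


# Hypothetical real classes and predictions for six examples
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = contingency(y_true, y_pred)
```

For these six examples the counts are TP = 2, FP = 1, FN = 1 and TN = 2, which sum to N = 6.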


2.5.1 Accuracy

The most straightforward way to assess the predictive performance of a classifier is to count its number of correct predictions. This leads to the accuracy of a classifier: it corresponds to the fraction of correct predictions on a dataset. More precisely, given the contingency matrix, the accuracy of a model is given by

$$\mathrm{Acc} = \frac{TP + TN}{N}$$

However, in some cases, accuracy is not a suitable evaluation measure. For example, when dealing with imbalanced class distributions, the accuracy may give a misleading view of a classifier’s performance.

Example 2.9 Consider a dataset of HIV viruses of which 5% have developed resistance against the drug Indinavir. A classifier predicting every example to be non-resistant will obtain an accuracy of 95%, while it can hardly be considered a good model. On the other hand, a classifier that predicts 15% of the viruses to have developed resistance, among which are all of the 5% that really are resistant, has an accuracy of 90%: it correctly classifies the 5% of resistant examples but misclassifies non-resistant examples amounting to 10% of the dataset. So, based on accuracy, one would judge that the first classifier is better, though the second classifier is clearly more informative as it has learned relevant features to correctly classify viruses that have developed resistance.
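The arithmetic of Example 2.9 can be checked with a few lines of code. The dataset size of 1000 viruses is an assumption made only to obtain whole counts; the percentages are those of the example.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions: (TP + TN) / N."""
    return (tp + tn) / (tp + fp + fn + tn)


# Assumed dataset of 1000 viruses, 50 resistant (5%) and 950 non-resistant.
# First classifier: predicts every virus to be non-resistant.
always_negative = dict(tp=0, fp=0, fn=50, tn=950)
# Second classifier: flags 150 viruses (15%), including all 50 resistant ones.
flags_15_percent = dict(tp=50, fp=100, fn=0, tn=850)

acc_first = accuracy(**always_negative)    # 950 / 1000
acc_second = accuracy(**flags_15_percent)  # 900 / 1000

# Fraction of the resistant viruses that each classifier identifies
found_first = always_negative["tp"] / 50
found_second = flags_15_percent["tp"] / 50
```

The first classifier reaches the higher accuracy (0.95 against 0.90) although it identifies none of the resistant viruses, whereas the second identifies all of them.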

Accuracy also has drawbacks when misclassification costs differ. For example, failing to identify a rare illness in a patient (by incorrectly classifying the patient as healthy) can be far more costly, for instance if it leads to the patient’s death, than incorrectly classifying a healthy patient as ill, which would result in some unnecessary medical treatment.

2.5.2 ROC analysis

We start by introducing two new measures: the true positive rate (TPR), or the proportion of positive examples that is correctly classified, and the false positive rate (FPR), or the proportion of negative examples that is incorrectly classified as positive.

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \text{and} \quad \mathrm{FPR} = \frac{FP}{FP + TN}$$

Figure 2.9: Example of a ROC curve.

A Cartesian coordinate system where the X and Y axes correspond to FPR and TPR, respectively, is known as a Receiver Operating Characteristic (ROC) space. A binary classifier represents a single point in ROC space. The point (0,0) corresponds to a classifier that predicts all the examples as negative, while the point (1,1) corresponds to the classifier that always predicts positive. A classifier that randomly predicts examples as positive with probability p and as negative with probability 1 − p is plotted as a point on the diagonal. A perfect classifier has a TPR of 1 and an FPR of 0.
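To make the position of a classifier in ROC space concrete, the sketch below computes the (FPR, TPR) point of four classifiers from their contingency counts. The counts are hypothetical and assume a dataset of 1000 examples with 50 positives and 950 negatives, in line with Example 2.9.

```python
def roc_point(tp, fp, fn, tn):
    """Return the (FPR, TPR) coordinates of a binary classifier in ROC space."""
    fpr = fp / (fp + tn)  # negatives incorrectly classified as positive
    tpr = tp / (tp + fn)  # positives correctly classified
    return fpr, tpr


# Hypothetical contingency counts on 50 positive and 950 negative examples
always_negative = roc_point(tp=0, fp=0, fn=50, tn=950)
always_positive = roc_point(tp=50, fp=950, fn=0, tn=0)
perfect = roc_point(tp=50, fp=0, fn=0, tn=950)
flags_15_percent = roc_point(tp=50, fp=100, fn=0, tn=850)
```

The first three classifiers land on (0,0), (1,1) and (0,1), respectively. The last one has a TPR of 1 and an FPR of 100/950, so it lies well above the diagonal.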

When a classifier predicts probabilities, it is possible to construct a ROC curve by varying a classification threshold between 0 and 1. All examples with a predicted probability greater than the threshold value are classified as positive and the remaining ones as negative. A plot of a ROC curve is shown in Fig. 2.9.
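The threshold sweep can be sketched as follows. The sketch assumes labels encoded as 1 (positive) and 0 (negative) and predicted probabilities strictly greater than 0, so that the lowest threshold classifies every example as positive. Only the observed probabilities, together with 0 and 1, are tried as thresholds, since other values would not produce new points. The function name and the data are hypothetical.

```python
def roc_curve_points(y_true, probs):
    """Return the (FPR, TPR) points obtained by lowering the threshold from 1 to 0."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    thresholds = sorted(set(probs) | {0.0, 1.0}, reverse=True)
    points = []
    for threshold in thresholds:
        # Examples with a probability greater than the threshold are positive
        tp = sum(1 for y, p in zip(y_true, probs) if p > threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p > threshold and y == 0)
        point = (fp / n_neg, tp / n_pos)
        if not points or point != points[-1]:  # skip repeated points
            points.append(point)
    return points


# Hypothetical real classes and predicted probabilities of being positive
y_true = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve_points(y_true, probs)
```

The resulting list starts at (0,0), where no example is classified as positive, and ends at (1,1), where all of them are.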

The area under the ROC curve (AUROC) measures how well the model can discriminate between positives and negatives. More precisely, it represents the probability that a positive and a negative example chosen randomly from the dataset are ordered correctly, that is, that the negative example receives a lower predicted value than the positive example. Usually, AUROC is preferred over accuracy as it is not affected by imbalanced class distributions and as ROC analysis allows one to trade off the possibly different costs of incorrectly classifying a positive example as negative and vice versa. A perfect classifier has an AUROC of 1, a non-informative classifier an AUROC of 0.5.
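The probabilistic interpretation of the AUROC suggests a direct, if quadratic-time, way to compute it: count the fraction of positive-negative pairs that are ordered correctly. The sketch below follows this idea. Counting tied pairs as half correct is a common convention that is assumed here and not stated in the text; the function name and the data are hypothetical.

```python
def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0  # negative example has the lower value
            elif p == n:
                correct += 0.5  # assumed convention for ties
    return correct / (len(positives) * len(negatives))


# Hypothetical real classes and predicted values
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auroc(y_true, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving an AUROC of 8/9. A model that assigns the same value to every example obtains 0.5 under the tie convention, matching the non-informative case.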

Further background on ROC analysis in machine learning and data mining can be found in Provost and Fawcett [2001] and Fawcett [2006].