
Previous work on the problem of evaluating batch learning has concentrated on making the best use of a limited supply of data. When the number of examples available to describe a problem is on the order of hundreds or fewer, the reasons for this concern are obvious. When data is scarce, ideally all available data should be used to train the model, but this leaves no examples for testing. The methods discussed below are those that have in the past been considered most suitable for evaluating batch machine learning algorithms, and are studied in more detail by Kohavi [82].

The holdout method divides the available data into two mutually exclusive subsets. One subset, the training set, is used for training; the remaining examples, the test or holdout set, are used for testing. Keeping these sets separate ensures that generalization performance is being measured. Common size ratios in practice are 1/2 training and 1/2 test, or 2/3 training and 1/3 test. Because the learner is not given the full amount of data for training, and assuming that it would improve given more data, the performance estimate will be pessimistic. The main criticism of the holdout method in the batch setting is that the data is not used efficiently, as many examples may never be used to train the algorithm. The accuracy estimated from a single holdout can vary greatly depending on how the sets are divided. To mitigate this effect, random subsampling performs multiple runs of the holdout procedure, each with a different random division of the data, and averages the results. Doing so also enables measurement of the variance of the accuracy estimate. Unfortunately, this procedure violates the assumption that the training and test sets are independent: classes over-represented in one set will be under-represented in the other, which can skew the results.
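The holdout and random subsampling procedures described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and a majority-class predictor stands in for a real learner so that the example stays self-contained.

```python
import random
from statistics import mean, variance

def holdout_split(n_examples, train_fraction, rng):
    """Return disjoint (train, test) index lists from one random shuffle."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(round(n_examples * train_fraction))
    return indices[:cut], indices[cut:]

def majority_class_accuracy(labels, train_idx, test_idx):
    """Hypothetical stand-in learner: predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    prediction = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(labels[i] == prediction for i in test_idx) / len(test_idx)

def random_subsampling(labels, train_fraction=2 / 3, runs=10, seed=0):
    """Repeat the holdout with different random splits.

    Returns the mean and the sample variance of the per-run accuracies.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train_idx, test_idx = holdout_split(len(labels), train_fraction, rng)
        accuracies.append(majority_class_accuracy(labels, train_idx, test_idx))
    return mean(accuracies), variance(accuracies)
```

Each run draws a fresh split from the same data, so the runs overlap; the variance returned here is therefore only a rough indication of the estimate's spread.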

In contrast to the holdout method, cross-validation maximizes the use of examples for both training and testing. In k-fold cross-validation the data is randomly divided into k mutually exclusive, approximately equal-sized folds. The evaluation process repeats k times; each time a different fold acts as the holdout set while the remaining folds are combined and used for training. The final accuracy estimate is obtained by dividing the total number of correct classifications by the total number of examples. In this procedure each available example is used k − 1 times for training and exactly once for testing. This method is still susceptible to imbalanced class distributions between folds. To reduce this problem, stratified cross-validation distributes the labels evenly across the folds so that each fold approximately reflects the label distribution of the entire data set. Repeated cross-validation repeats the cross-validation procedure several times, each with a different random partitioning into folds, allowing the variance of the accuracy estimate to be measured.
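A minimal sketch of plain k-fold cross-validation, without stratification, may make the bookkeeping concrete. The names `k_fold_indices`, `cross_validate`, `train_fn` and `predict_fn` are assumptions introduced for illustration; any learner exposing a train and a predict step could be plugged in.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, labels, k, train_fn, predict_fn, seed=0):
    """Return total correct classifications divided by total examples."""
    folds = k_fold_indices(len(examples), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        # Every example outside the held-out fold is used for training.
        train_idx = [i for i in range(len(examples)) if i not in held]
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, examples[i]) == labels[i]
                       for i in held_out)
    return correct / len(examples)

# Hypothetical stand-in learner: always predict the majority training label.
def train_majority(xs, ys):
    return max(sorted(set(ys)), key=ys.count)

def predict_majority(model, x):
    return model
```

Calling `cross_validate(examples, labels, 10, train_majority, predict_majority)` tests every example exactly once and trains on it in the other nine rounds.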

The leave-one-out evaluation procedure is a special case of cross-validation where every fold contains a single example. For a data set of n examples, this means that n-fold cross-validation is performed: n models are induced, each of which is tested on the single example that was held out. In special situations where learners can quickly be made to 'forget' a single training example, this process can be performed efficiently; in most other cases it is expensive. The leave-one-out procedure is attractive because it is completely deterministic and not subject to random effects in dividing folds. However, stratification is not possible, and it is easy to construct examples where leave-one-out fails in its intended task of measuring generalization accuracy. Consider what happens when evaluating on completely random data with two classes and an equal number of examples per class. The best an algorithm can do is predict the majority class of its training data. Because holding out one example leaves its class in the minority, that prediction is always incorrect on the held-out example, resulting in an accuracy of 0% even though the expected estimate should be 50%.
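The failure case can be reproduced in a few lines. The sketch below assumes a majority-class predictor as the learner and ignores attribute values entirely, since they carry no information in completely random data; the function names are hypothetical.

```python
def majority_label(labels):
    """Most frequent label; ties are broken toward the smaller label."""
    return max(sorted(set(labels)), key=labels.count)

def leave_one_out_accuracy(labels):
    """Hold out each example in turn and predict it from the rest."""
    correct = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        correct += majority_label(training) == held_out
    return correct / len(labels)

# Two classes, ten examples each.
balanced = [0] * 10 + [1] * 10
print(leave_one_out_accuracy(balanced))  # prints 0.0
```

Removing any one example leaves nine of its class against ten of the other, so the majority prediction is wrong in all twenty rounds.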

An alternative evaluation method is the bootstrap method introduced by Efron [35]. This method creates a bootstrap sample of a data set by sampling with replacement a training set of the same size as the original. Under sampling with replacement, the probability that a particular example is chosen at least once is approximately 0.632, so the method is commonly known as the 0.632 bootstrap. All examples not present in the training set, on average about 36.8% of the data, are used for testing. The method compensates for the lack of unique training examples by combining accuracies measured on both training and test data to reach a final estimate:

$$\text{accuracy}_{\text{bootstrap}} = 0.632 \times \text{accuracy}_{\text{test}} + 0.368 \times \text{accuracy}_{\text{train}} \qquad (2.1)$$

As with the other methods, repeated random runs can be averaged to increase the reliability of the estimate. This method works well for very small data sets but suffers from problems that can be illustrated by the same situation that causes problems with leave-one-out: a completely random two-class data set. Kohavi [82] argues that although the true accuracy of any model can only be 50%, a classifier that memorizes the training data can achieve a training accuracy of 100%, resulting in a bootstrap accuracy of 0.632 × 50% + 0.368 × 100% = 68.4%. This estimate is more optimistic than the expected result of 50%.
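The figures quoted for the bootstrap can be checked with a short sketch. The function names are hypothetical. The selection probability uses the standard expression 1 − (1 − 1/n)^n for the chance that an example appears at least once in n draws with replacement, which approaches 1 − 1/e ≈ 0.632 as n grows.

```python
import math
import random

def bootstrap_split(n_examples, rng):
    """Draw n indices with replacement; unchosen indices form the test set."""
    train_idx = [rng.randrange(n_examples) for _ in range(n_examples)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n_examples) if i not in chosen]
    return train_idx, test_idx

def prob_selected(n):
    """Probability that a given example appears at least once in n draws."""
    return 1 - (1 - 1 / n) ** n

def bootstrap_632(accuracy_test, accuracy_train):
    """Combine test and training accuracy as in Equation 2.1."""
    return 0.632 * accuracy_test + 0.368 * accuracy_train

# The memorizing classifier on random two-class data:
# 50% on the test examples, 100% on the training examples.
estimate = bootstrap_632(0.5, 1.0)
```

Here `estimate` evaluates to 0.684 up to floating-point rounding, matching the 68.4% figure attributed to Kohavi [82].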

Having considered the various issues with evaluating performance in the batch setting, the machine learning community has settled on stratified ten-fold cross-validation as the standard evaluation procedure, as recommended by Kohavi [82]. For increased reliability, ten repetitions of ten-fold cross-validation are commonly used. Bouckaert [9] warns that results based on this standard should still be treated with caution.
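One way to build the fold assignments for ten repetitions of stratified ten-fold cross-validation is sketched below. The round-robin dealing scheme and the function names are assumptions chosen for brevity, not the procedure of any particular toolkit; other stratification schemes exist.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Deal the examples of each class round-robin across k folds."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    folds = [[] for _ in range(k)]
    position = 0  # carried across classes so fold sizes stay balanced
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for i in members:
            folds[position % k].append(i)
            position += 1
    return folds

def repeated_stratified_folds(labels, k=10, repetitions=10, seed=0):
    """Return one stratified partition per repetition, each reshuffled."""
    rng = random.Random(seed)
    return [stratified_folds(labels, k, rng) for _ in range(repetitions)]
```

With 70 examples of one class and 30 of another, every fold in every repetition receives seven of the first class and three of the second.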
