
The validation error rates reported in table 5-3 are the lowest average validation errors for each method, as obtained with the best hyperparameter combination. Owing to this, it is expected that the validation results would be slightly optimistic compared to test results on unseen data. However, comparing the validation results to the test results in table 5-2, it is found that the test results are significantly worse than the validation results – 15% worse on average. The discrepancy between validation and test results is rather consistent across all method combinations, generally varying between 11% worse and 20% worse, with the exception of the steerable pyramid and LDA combination, where the test result was 8.2% worse than the validation result.

Chapter 5 – Results and discussion: Platinum flotation froths 97

Such a discrepancy between validation and test error is usually a tell-tale sign that overfitting has occurred, that is, the classifier has learnt to fit the random noise or error present in the training set, causing poor generalisation to unseen test data. Overfitting is highly undesirable in any machine learning application, as this makes the model unusable in practice.

Another possible reason for the large difference between validation and test errors is that the training and test data could have significantly different distributions. In the remainder of this section, various reasons for the validation-test error discrepancy are discussed.

Overfitting

A classifier is likely to overfit when:

• an overly complex model is fitted to the data,
• too many hyperparameters are optimised, or
• the dimensionality of the feature set is too high.

In terms of K-NN, a “complex” model is one with a low value for 𝐾𝑁 (the number of nearest neighbours), as this leads to a more complex decision boundary. However, there does not seem to be a trend between 𝐾𝑁 (ranging from 3 for the steerable pyramid features to 11 for the GLCM features) and the difference between validation and test results. Non-regularised LDA and QDA (as used here) produce relatively simple models and have no hyperparameters that affect model complexity. Therefore, it seems unlikely that model complexity was a major cause of overfitting.

During the cross-validation process, a maximum of three feature extraction hyperparameters and one classification hyperparameter were optimised. Four hyperparameters are not considered too many relative to the size of the data set (2600 images in total). Also, the fact that cross-validation was used (as opposed to a single validation split) further reduces the chance of overfitting, since it reduces the probability that a specific hyperparameter set was selected only because it fitted one particular training set very well. Thus, hyperparameter optimisation probably did not contribute much to overfitting.
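The selection of the number of nearest neighbours by cross-validation can be illustrated with a minimal sketch. It is not the implementation used in this study: the synthetic two-class data, the interleaved fold assignment, the number of folds and the candidate values are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cross_val_error(x, y, k, n_folds=5):
    """Misclassification rate pooled over the held-out folds."""
    n = len(x)
    errors = 0
    for fold in range(n_folds):
        # Interleaved fold assignment: an assumption made for this sketch.
        val = [i for i in range(n) if i % n_folds == fold]
        trn = [i for i in range(n) if i % n_folds != fold]
        trn_x = [x[i] for i in trn]
        trn_y = [y[i] for i in trn]
        errors += sum(knn_predict(trn_x, trn_y, x[i], k) != y[i] for i in val)
    return errors / n

# Synthetic two-class data standing in for real feature vectors (hypothetical).
rng = random.Random(0)
x, y = [], []
for _ in range(60):
    x.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
    y.append("class_a")
    x.append((rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)))
    y.append("class_b")

candidates = [3, 5, 7, 9, 11]  # odd values avoid ties between two classes
cv_errors = {k: cross_val_error(x, y, k) for k in candidates}
best_k = min(candidates, key=lambda k: cv_errors[k])
print(cv_errors, best_k)
```

Because the reported error is the minimum over all candidates, it is a slightly optimistic estimate of the error on unseen data, which is the effect described for the validation results above.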

The high dimensionality of some of the feature sets, especially the steerable pyramid feature set that was used together with LDA, can be a cause for concern. Theoretically, the 889 features obtained with steerable pyramids constitute too high a dimensionality for a data set of 2600 data points, with the smallest class containing only 460 data points. However, this feature set with 889 features actually resulted in the smallest difference between the validation and test error (8.2% difference). Again, it seems that high feature set dimensionality could not have been the main cause of overfitting.
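The scale of the dimensionality concern can be made concrete with a short calculation. This is a rough sketch only: it assumes the 75% training fraction is applied exactly to both the full data set and the smallest class, and it ignores the further reduction caused by holding out cross-validation folds.

```python
n_features = 889               # steerable pyramid feature set
n_images_total = 2600          # all images
n_images_smallest_class = 460  # smallest class
train_fraction = 0.75          # first 75% of each class used for training

n_train_total = int(n_images_total * train_fraction)              # 1950
n_train_smallest = int(n_images_smallest_class * train_fraction)  # 345

# Training samples available per feature dimension.
ratio_total = n_train_total / n_features
ratio_smallest = n_train_smallest / n_features
print(round(ratio_total, 2), round(ratio_smallest, 2))  # 2.19 0.39
```

Under these assumptions there are only about two training images per feature overall, and the smallest class has fewer training images than there are features.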

It is still possible that a degree of overfitting has occurred. However, the considerations mentioned here, as well as the fact that the validation-test result discrepancy occurs across the board, suggest that the observation might be explained by a different, underlying phenomenon.


Probability distribution estimate of features

The data for this case study is a time series of images, and the data for each class supposedly represents a steady state, during which one would not expect the probability distribution of features to change drastically over time. For each class, the first 75% of the images were used for training, with the remainder constituting the test set. Therefore, if the probability distribution of the series did change with time, it is possible that the distribution of the training features could be significantly different from the distribution of the test features. This would result in poor test performance, as one of the chief assumptions made during classification is that the training and test data follow the same probability distribution.
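The per-class chronological split described above can be sketched as follows. The function name and the toy labels are hypothetical, and the sketch assumes the samples are already listed in time order.

```python
from collections import defaultdict

def chronological_split(labels, train_fraction=0.75):
    """Per class, assign the earliest samples to training and the rest to test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):  # indices assumed to be in time order
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        cut = int(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy example: 8 images of class "A" followed by 4 images of class "B".
labels = ["A"] * 8 + ["B"] * 4
train_idx, test_idx = chronological_split(labels)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 8, 9, 10]
print(test_idx)   # [6, 7, 11]
```

With such a split, the test images of each class are always later in time than the training images, so any drift in the feature distribution over time shows up as a difference between training and test data.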

Figure 5-2: The probability distribution estimate of the first principal component score of the steerable pyramid features for (a) the training data without fold 3 and the validation data (fold 3), and (b) all the training data and the test data. The estimates were calculated with the MATLAB function ksdensity (default kernel type).


To compare the distributions of training and test features, let us consider the distribution of the features for the steerable pyramid and LDA combination. Figure 5-2 shows the probability density estimate of the first principal component score (explaining 52% of the variance) of this feature set. In figure 5-2 (a), the probability distribution of fold 3 in the training data is shown with the probability distribution of the remaining training data, while figure 5-2 (b) shows the distributions of the training and test sets. It is clear from these graphs that the test data distribution differs significantly from the training data distribution, while the distribution of fold 3 in the training data is not too dissimilar from that of the remainder of the training data. This observation holds for all other feature sets.

The fact that the data distribution did change indicates that either the steady state assumption was false, or a wide range of froth appearances can occur at a single steady state, or both. If steady state had not been reached, the problem might have been remedied by ensuring that it was reached before the images were collected. However, if a wide range of froth appearances may occur at a single steady state, this raises the question of whether the grade of flotation froth can be determined from the visual appearance of the froth alone. While there certainly seems to be a correlation between froth appearance and froth grade, results might be improved by using such a visual measurement in conjunction with other process data, as is suggested in the literature (Bartolacci et al., 2006; Liu & MacGregor, 2008). The collection of more data would help identify the cause of the problem.
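A comparison of this kind can be sketched with a simple Gaussian kernel density estimate. This is an illustration under stated assumptions, not the procedure used for figure 5-2: the rule-of-thumb bandwidth may differ from the ksdensity default, the overlap coefficient is a summary measure introduced here for convenience, and the scores are synthetic values rather than real principal component scores.

```python
import math
import random
import statistics

def gaussian_kde(samples, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    if bandwidth is None:
        # Rule-of-thumb bandwidth for roughly normal data (an assumption).
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def overlap_coefficient(samples_a, samples_b, grid_points=400):
    """Approximate the integral of min(f_a, f_b): near 1 if similar, near 0 if disjoint."""
    f_a, f_b = gaussian_kde(samples_a), gaussian_kde(samples_b)
    lo = min(min(samples_a), min(samples_b))
    hi = max(max(samples_a), max(samples_b))
    pad = 0.25 * (hi - lo)  # extend the grid so the density tails are included
    lo, hi = lo - pad, hi + pad
    step = (hi - lo) / grid_points
    total = 0.0
    for i in range(grid_points):
        x = lo + (i + 0.5) * step  # midpoint rule
        total += min(f_a(x), f_b(x))
    return total * step

# Synthetic scores (hypothetical): validation matches training, test is shifted.
rng = random.Random(1)
train_scores = [rng.gauss(0.0, 10.0) for _ in range(300)]
val_scores = [rng.gauss(0.0, 10.0) for _ in range(100)]
test_scores = [rng.gauss(25.0, 10.0) for _ in range(100)]

overlap_val = overlap_coefficient(train_scores, val_scores)
overlap_test = overlap_coefficient(train_scores, test_scores)
print(overlap_val, overlap_test)
```

A markedly lower overlap between training and test scores than between training and validation scores would correspond to the pattern seen in figure 5-2.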
