7.4 Model optimisation through feature reduction
In this section, the best-performing classification models identified in the previous section are further refined. The competitive evaluation of models in the previous section already included a degree of optimisation in the form of tuning model parameters, but the search for optimal settings can be extended to other aspects of the training process.
The so-called ‘curse of dimensionality’ is a major concern in the field of classification models. The expression was coined by Bellman (1956) to refer to the fact that many models work well in low dimensions but become intractable in higher dimensions. Generalising a model becomes exponentially harder as the number of features in a sample grows, because a fixed-size training set covers an ever smaller fraction of the possible input space.
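The effect can be illustrated with a rough calculation. The following is a minimal sketch in Python and is not part of the original analysis. It assumes that each feature is discretised into a fixed number of bins, so that a training set of n samples can occupy at most n of the resulting cells. The function name, the bin count and the training-set size of 1000 are illustrative assumptions.

```python
def coverage_fraction(n_samples, n_features, bins_per_feature=10):
    """Upper bound on the fraction of input-space cells a training set can occupy.

    Assumption for illustration: every feature is split into
    `bins_per_feature` equal bins, giving bins_per_feature ** n_features cells.
    Each sample falls into exactly one cell, so at most n_samples cells are hit.
    """
    total_cells = bins_per_feature ** n_features
    return min(1.0, n_samples / total_cells)


# Hypothetical training-set size of 1000 samples; 33 is the size of the
# combined feature set used in this chapter.
for d in (2, 5, 10, 33):
    print(d, coverage_fraction(1000, d))
```

Under these assumptions the bound falls by a factor of ten for every feature added once the number of cells exceeds the number of samples.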
Some classification models can also be negatively affected by the presence of non-informative features. Adding non-informative features makes most models more complex, which can degrade the resultant accuracy of prediction. An exception to this rule is the general category of tree-based classification models: CART and Random Forest already include a measure of dynamic feature elimination in the construction of the models. It is for this reason that feature reduction is presented at this point of the chapter. The only other strong contender for best model for this project was the Random Forest classifier, and this discussion is not relevant to that model.
To judge which features are redundant, or even detrimental, to the effectiveness of the model, a search algorithm similar to that used in the previous section to determine the optimal model tuning parameters can be deployed. Such methods search the feature set to determine which features produce the best results when entered into the model. Three approaches exist for feature selection, as follows:
Forward selection: Features are added to the model one at a time, and the resultant prediction is evaluated for statistical significance.
Stepwise selection: Features are removed from the model individually, and the rest kept.
Backward selection: Features are iteratively removed from the model.
A variation on backward selection is called recursive feature elimination (RFE), which avoids training a model each time a feature is removed (Guyon and Elisseeff, 2003). A prerequisite to enable this elimination is determining which features contribute most to the prediction capability of a model.
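The elimination loop can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation used in this research: the callables `score` and `importance` are placeholders for a model evaluation and a feature ranking, and the toy weights are made up. The sketch performs one evaluation per subset size, rather than one per candidate feature as a plain backward search would.

```python
def recursive_feature_elimination(features, score, importance):
    """Drop the least important feature at each step and track the best subset.

    score(subset)      -> float, higher is better (e.g. a resampled ROC AUC)
    importance(subset) -> dict mapping each feature in subset to its importance
    Both callables are placeholders supplied by the caller.
    """
    current = list(features)
    history = [(tuple(current), score(current))]
    while len(current) > 1:
        ranks = importance(current)
        weakest = min(current, key=lambda name: ranks[name])
        current.remove(weakest)
        history.append((tuple(current), score(current)))
    best_subset, best_score = max(history, key=lambda item: item[1])
    return list(best_subset), best_score, history


# Toy example with made-up importances: "noise" carries no information.
TOY_WEIGHTS = {"a": 0.5, "b": 0.3, "c": 0.15, "noise": 0.0}


def toy_score(subset):
    # Reward informative features, charge a small penalty per feature used.
    return sum(TOY_WEIGHTS[f] for f in subset) - 0.02 * len(subset)


def toy_importance(subset):
    return {f: TOY_WEIGHTS[f] for f in subset}


best, best_score, history = recursive_feature_elimination(
    list(TOY_WEIGHTS), toy_score, toy_importance)
```

In the toy example only the uninformative feature is dropped, because removing any further feature lowers the score by more than the per-feature penalty saves.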
Gevrey et al. (2003) use a combination of the absolute values of the weights in neural network models to determine feature significance. The caret R package implements this method as a helper function. Assessing the significance of the features in the neural network model of this research yields the feature importances shown in Figure 7.9. The results show that mel_bin9 and mel_bin11 are considered the most important indicators of structural stability. These features correspond respectively to the energy in the frequency bands [1 200 Hz–2 022 Hz] and [3 000 Hz–3 933 Hz]. These frequency ranges are similar to those that Hanson (1985) found to be effective for his loose rock acoustic assessment system, as discussed in Subsection 2.2.2 on page 13 of this dissertation.
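A weight-based importance measure of this kind can be sketched as follows. This is a minimal Python illustration of one common formulation for a network with a single hidden layer and a single output; it is an assumption that this formulation matches the method cited above, and it is not the caret implementation. The function name and the example weights are hypothetical.

```python
def weights_importance(input_hidden, hidden_output):
    """Relative importance of each input from absolute connection weights.

    input_hidden[i][h] : weight from input i to hidden unit h
    hidden_output[h]   : weight from hidden unit h to the single output
    Returns one value per input; the values sum to 1.
    """
    n_inputs = len(input_hidden)
    totals = [0.0] * n_inputs
    for h, out_weight in enumerate(hidden_output):
        # Absolute input-to-output contribution routed through hidden unit h.
        contrib = [abs(input_hidden[i][h] * out_weight) for i in range(n_inputs)]
        denom = sum(contrib)
        if denom == 0.0:
            continue  # this hidden unit passes nothing on
        # Share of each input within this hidden unit. With a single output
        # the output weight cancels in this ratio whenever it is non-zero.
        for i in range(n_inputs):
            totals[i] += contrib[i] / denom
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Hypothetical weights: two inputs, two hidden units.
example = weights_importance([[1.0, 2.0], [3.0, 2.0]], [0.5, -1.0])
```

Because only absolute values are used, a measure of this form ranks features by strength of influence and says nothing about the direction of the effect.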
It was found that reducing the existing features with the RFE method on the individual models did not result in a statistically significant improvement in the predictive results on the independent set for either of the classification models. The neural network had an initial ROC value of 0.933 for the full feature set, and 0.936 for a set with two features removed. The SVM model with the polynomial function had a full-set ROC value of 0.785, which increased to 0.793 with three features removed. Neither of these improvements is statistically significant.
An alternative considered for feature elimination in this project is a modification of the stepwise feature selection technique, based on the elimination of the specific feature groupings of this project. The three feature groupings derived in Chapter 6 and used in the combined feature set were the spectral descriptors, the frequency band content, and the MFCCs. The assumption was that each of these groupings of features contains information about the acoustic data that is useful and unique. By applying the modified stepwise feature selection technique to the data, this assumption can be tested.
Table 7.3 shows the results from applying the neural network model to reduced feature sets. Each of the possible combinations of the groupings was used. The final result shows that the combination of all three feature groupings gives the best performance. The assumption that each grouping contributes to the combined feature set therefore holds true.
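The set of combinations tested is small enough to enumerate directly. The following Python sketch lists every non-empty subset of the three groupings and looks up the ROC values reported in Table 7.3; the function and variable names are illustrative and the code is not taken from the original analysis.

```python
from itertools import combinations

# ROC values copied from Table 7.3 (neural network model).
ROC_BY_GROUPING = {
    ("Descriptor",): 0.735,
    ("Band",): 0.790,
    ("MFCC",): 0.780,
    ("Descriptor", "Band"): 0.831,
    ("Descriptor", "MFCC"): 0.867,
    ("Band", "MFCC"): 0.892,
    ("Descriptor", "Band", "MFCC"): 0.933,
}


def grouping_subsets(groups):
    """All non-empty subsets of the feature groupings, smallest first."""
    return [subset
            for size in range(1, len(groups) + 1)
            for subset in combinations(groups, size)]


subsets = grouping_subsets(("Descriptor", "Band", "MFCC"))
best = max(subsets, key=ROC_BY_GROUPING.__getitem__)
```

Three groupings give 2^3 - 1 = 7 subsets, which matches the seven rows of the table.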
Another way to reduce the feature dimensionality is to construct new features that attempt to distil and retain the information in the existing features. This approach is called ‘feature extraction’. Feature extraction aims to transform data in a high-dimensional space to a space with fewer dimensions while retaining the variance of the data. Principal component analysis (PCA) is a multivariate technique that summarises systematic patterns of variation in the data by performing a linear mapping of the data. This technique attempts to transform observed features into a set of new features called the ‘principal components’, which are uncorrelated and reveal the dominant types of variation in the samples.
The steps in deriving the principal components are as follows:
1. Compute the mean for each feature (Subsection 6.6).
2. Compute the covariance values of all the features (Subsection 6.6.2).
3. Compute eigenvectors and then the corresponding eigenvalues.
4. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W.
5. Use this matrix to transform samples onto the new subspace y = W^T × x, where x is one sample.
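The five steps can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the computation used in this research: the eigenvectors are found by power iteration with deflation, a simple substitute for a full eigendecomposition routine, samples are mean-centred before projection, and the toy data are hypothetical.

```python
import math
import random


def centre(samples):
    """Step 1: subtract each feature's mean from every sample."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in samples]


def covariance_matrix(centred):
    """Step 2: sample covariance (divisor n - 1) of the centred data."""
    n, d = len(centred), len(centred[0])
    return [[sum(row[i] * row[j] for row in centred) / (n - 1)
             for j in range(d)] for i in range(d)]


def leading_eigenpairs(cov, k, iterations=500, seed=0):
    """Steps 3-4: the k largest eigenvalues with unit eigenvectors, largest first.

    Power iteration with deflation; adequate for a symmetric positive
    semi-definite matrix whose leading eigenvalues are well separated.
    """
    d = len(cov)
    work = [row[:] for row in cov]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(d) for j in range(d))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component that was just found.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def project(centred, pairs):
    """Step 5: y = W^T x for each centred sample; the columns of W are the eigenvectors."""
    return [[sum(v[i] * x[i] for i in range(len(x))) for _, v in pairs]
            for x in centred]


# Toy data (hypothetical): most of the variance lies along the first axis.
samples = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
centred = centre(samples)
pairs = leading_eigenpairs(covariance_matrix(centred), k=1)
scores = project(centred, pairs)
```

With k = 1 each two-dimensional toy sample is reduced to a single score along the direction of greatest variance.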
The choice of how many principal components k to use is another search problem, though the variance of the individual components gives an indication of which ones represent the most information. A plot of the variance per principal component is known as a scree plot, and is shown in Figure 7.10.
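The quantities behind such a plot are straightforward to compute. The following Python sketch converts component variances into the proportion and cumulative proportion of total variance; the function name and the example variances are hypothetical and are not the values from this research.

```python
def variance_profile(variances):
    """Proportion and cumulative proportion of total variance per component.

    `variances` are the component variances (eigenvalues), largest first.
    Returns a list of (proportion, cumulative_proportion) tuples.
    """
    total = sum(variances)
    profile = []
    running = 0.0
    for value in variances:
        running += value
        profile.append((value / total, running / total))
    return profile


# Hypothetical component variances, for illustration only.
profile = variance_profile([4.0, 3.0, 2.0, 1.0])
```

A sharp drop in the first value of each pair, followed by a long flat tail, is the ‘elbow’ that is usually looked for on a scree plot.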
The function prcomp is included in the base R installation, which means it is considered authoritative enough to form the basis of PCA calculations in R. This function was used to create candidate principal components from the ESD feature data. No non-informative components were found in the final transformation that could easily be discarded, so a threshold had to be implemented below which components are discarded.

Table 7.3: Feature grouping stepwise elimination for neural network model

Feature grouping tested    ROC      Sens     Spec
Descriptor                 0.735    0.607    0.850
Band                       0.790    0.673    0.856
MFCC                       0.780    0.672    0.852
Descriptor + Band          0.831    0.707    0.858
Descriptor + MFCC          0.867    0.786    0.876
Band + MFCC                0.892    0.870    0.896
All combined               0.933    0.883    0.925

Figure 7.10: Descending variance of principal components
Based on visual inspection of Figure 7.10, a drastically reduced subset of the data was considered. This subset consisted of the first four principal components. The performance of the SVM model showed significant degradation, from an ROC AUC value of 0.785 to 0.744. The neural network model showed a similar degradation, from 0.933 to 0.895, though the decrease is proportionally smaller than that of the SVM.
An alternative PCA reduction approach is suggested by Venables and Ripley (2002), who state that a standard choice is to discard principal components with a variance less than the standard deviation of the first component. After applying this criterion, a total of 29 components remained. This is a reduction in dimensionality of 4 from the original 33 features.
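A relative cut-off of this kind can be sketched as follows. The Python example assumes the criterion is applied in the form documented for the tol argument of prcomp, where a component is omitted if its standard deviation is less than or equal to tol times that of the first component. The value of tol and the example standard deviations are hypothetical, since the numerical cut-off is not stated here.

```python
def retained_components(sdev, tol):
    """Indices of the components kept under a relative standard-deviation cut-off.

    sdev : component standard deviations, largest first
    tol  : fraction of the first component's standard deviation (assumed value)
    A component is dropped if its standard deviation is <= tol * sdev[0].
    """
    if not sdev:
        return []
    cutoff = tol * sdev[0]
    return [index for index, value in enumerate(sdev) if value > cutoff]


# Hypothetical standard deviations and tolerance, for illustration only.
kept = retained_components([3.0, 2.0, 1.0, 0.2, 0.1], tol=0.1)
```

Unlike a fixed number of components chosen by eye, a relative cut-off adapts to the overall scale of the data.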
The two contender classification models were trained and tested on the result. The SVM showed a significant increase in ROC AUC value from 0.785 to 0.813. However, the neural network still displayed the best performance, with an ROC AUC value increase from 0.933 to 0.942. The final plot of the neural network with the reduced feature set is shown in Figure 7.11 on the facing page. The neural network model, like the other models evaluated in this project, outputs a single value on a scale between 0 and 1, where 0 indicates ‘safe’ and 1 indicates ‘unsafe’. To determine the interpretation of the value, a threshold boundary is applied. By examining the ROC curve, the optimal threshold