
I have proposed and implemented a novel SNR-based multivariate filter method of variable selection, the MFS algorithm (Section 3.8), and compared it with several existing variable selection methods using simulated normal data (Chapter 4), simulated non-normal data (Chapter 5) and four real ophthalmological datasets (Chapter 6). The existing methods comprised three univariate filter methods (based on chi-square statistics, information gain and the Relief-F algorithm), a multivariate filter method utilising an SVM classifier and an embedded method using random forests (RF).

As expected, the MFS-SNR method (a multivariate filter method) was better at the variable selection task than the univariate filter methods in the multivariate normal simulations. All simulated datasets were composed of ten variables (Sections 4.2 & 5.2), including two variables that can discriminate when used alone (𝑋1 and 𝑋2) and one non-discriminating variable (𝑋3) that can enhance the performance of the discriminating variables. In simulations of normal data I found that the MFS-SNR algorithm outperformed the univariate filter methods in all 12 scenarios. For the filter methods using chi-square statistics, information gain and the Relief-F algorithm, this difference in performance is attributed to the univariate nature of each method: each failed to take sufficient account of the correlations between the variables 𝑋1, 𝑋2 and 𝑋3.
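The role of a variable such as 𝑋3 can be illustrated with a small population-level calculation. The sketch below is not taken from the thesis simulations: the mean difference, the correlation `rho` and the helper `mahalanobis_sq` are hypothetical values and names chosen for illustration, and a common covariance matrix is assumed for simplicity. It shows a variable with no mean difference of its own increasing the separation achieved by a correlated discriminating variable, which is exactly what a univariate score cannot see.

```python
import numpy as np

def mahalanobis_sq(delta, sigma):
    """Squared Mahalanobis distance between two group means,
    assuming both groups share the covariance matrix sigma."""
    delta = np.asarray(delta, dtype=float)
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    return float(delta @ np.linalg.solve(sigma, delta))

# Hypothetical population values (not the thesis parameters).
delta = np.array([1.0, 0.0])            # X1 differs between groups, X3 does not
rho = 0.8                               # assumed correlation between X1 and X3
sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

d2_x1 = mahalanobis_sq(delta[:1], sigma[:1, :1])    # X1 on its own
d2_x3 = mahalanobis_sq(delta[1:], sigma[1:, 1:])    # X3 on its own
d2_pair = mahalanobis_sq(delta, sigma)              # X1 and X3 together

print(f"X1 alone: {d2_x1:.2f}")
print(f"X3 alone: {d2_x3:.2f}")
print(f"X1 + X3 : {d2_pair:.2f}")
```

Under these assumed values the separation for 𝑋3 alone is zero, while adding 𝑋3 to 𝑋1 raises the squared distance from 1 to 1/(1 − 0.8²) ≈ 2.78. A univariate filter would rank 𝑋3 last; a multivariate criterion can reward it.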

The multivariate MFS-SNR filter method was also found to be better than the multivariate MFS-T2 filter method in all simulated multivariate normal scenarios, i.e. when variance-covariance matrices are heterogeneous. The poor performance of the MFS-T2 algorithm (relative to the MFS-SNR algorithm) can be attributed to the inability of Hotelling’s T2 statistic to accommodate heterogeneous variance-covariance matrices. The MFS-SNR algorithm is a multivariate method, so it took proper account of the correlations between the variables 𝑋1, 𝑋2 and 𝑋3, and, unlike Hotelling’s T2 statistic, the SNR metric can accommodate heterogeneous variance-covariance matrices. Thus the superior performance of the MFS-SNR algorithm was not surprising.
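To make the limitation of Hotelling’s T2 concrete, the following is a minimal sketch of the standard two-sample statistic. It is the textbook form, not necessarily the exact implementation used in the MFS-T2 algorithm; the function name `hotelling_t2` and the toy data are assumptions made for illustration. The point to note is the single pooled covariance matrix: two groups with different covariance structures are averaged into one matrix before the distance is computed.

```python
import numpy as np

def hotelling_t2(x1, x2):
    """Textbook two-sample Hotelling T2 with a pooled covariance matrix.

    x1, x2: arrays of shape (n_samples, n_variables), one per group.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    s1 = np.cov(x1, rowvar=False)       # group 1 sample covariance
    s2 = np.cov(x2, rowvar=False)       # group 2 sample covariance
    # Pooling assumes both groups share one covariance matrix.
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    return float(n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(pooled, diff))

# Toy data (hypothetical): group_b is group_a shifted by one unit in the first variable.
group_a = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
group_b = group_a + np.array([1.0, 0.0])

t2 = hotelling_t2(group_a, group_b)
print(f"T2 = {t2:.2f}")
```

When the two group covariance matrices differ substantially, `pooled` describes neither group well, which is one way of reading the weaker MFS-T2 results reported above.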

I also found that MFS-SNR performed at least as well as computationally intensive methods such as SVM and RF in the simulated multivariate normal scenarios: it was comparable in terms of variable selection frequencies and performance estimates across all 12 scenarios. Although non-discriminating variable selection frequencies were generally lower for the multivariate filter SVM method, I attribute this, at least in part, to the cap on the number of variables selected by this method. However, the embedded RF method and the multivariate filter SVM method took longer than the MFS-SNR method to return selections (8:50 and 4:48 versus 2:36, min:sec, respectively) and had greater computational requirements. All 3 methods identified the importance of 𝑋3 in enhancing the discriminatory performance of 𝑋1 and 𝑋2.

A very important property of MFS-SNR is that it does not require the user to specify a priori the number of variables to be selected. Nor was this number required by the univariate filter methods; it was, however, required by the SVM and RF methods.

In scenarios where the assumption of multivariate normality was violated, the MFS-SNR algorithm still selected all 3 discriminating variables, although the selection frequencies fell. In the three simulated scenarios of non-normal data (Chapter 5) the MFS-SNR algorithm showed worse variable selection performance than in scenarios with normally distributed data, but at group sizes of 𝑛 = 500 its performance was similar to that with normally distributed data. The best performance was observed for log-normal transformed data, followed by dichotomised and then trichotomised data. Following transformation of the variable 𝑋1, the MFS-SNR algorithm still proved capable of identifying the importance of 𝑋3 in addition to 𝑋1 and 𝑋2. I attribute the loss in performance relative to normally distributed data, at least in part, to the loss of information caused by the transformation of the data.

In summary, the analysis of the performance of the MFS-SNR algorithm in each of the simulated scenarios described demonstrated that the MFS-SNR algorithm is capable of selecting those variables with the greatest discriminatory potential:

• whether data are normally distributed or not
• over a range of group sizes
• when groups are imbalanced.

The MFS-SNR algorithm achieves similar performance to the RF method without the need to analyse 5,000 variable subsets, selecting the optimal subset of variables in a quarter of the time it took the RF method. The multivariate filter SVM method had a smaller workload than the RF method; however, it still took nearly twice as long as the MFS-SNR algorithm to identify the optimal subset of variables. The MFS-SNR algorithm achieved similar performance to the SVM method without this computational burden, as the SNR metric is capable of quantifying the discriminatory potential of a variable without training and evaluating a classifier. It is also not necessary to have separate training and validation data when using the MFS-SNR algorithm. The MFS-SNR algorithm also functions without any tuning parameters and without a priori knowledge of the number of variables to be selected. The SNR metric is multivariate, so correlations between variables are taken into account by the MFS-SNR algorithm.
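The relative run times quoted here can be checked against the times reported earlier in this chapter (2:36, 4:48 and 8:50 min:sec). The short calculation below assumes those times belong to MFS-SNR, SVM and RF in ascending order, which is the reading consistent with the ratios stated in the text; the helper `to_seconds` is introduced only for this illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'min:sec' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

# Assumed mapping of the reported times to methods (see lead-in).
run_time = {
    "MFS-SNR": to_seconds("2:36"),
    "SVM": to_seconds("4:48"),
    "RF": to_seconds("8:50"),
}

snr_vs_rf = run_time["MFS-SNR"] / run_time["RF"]
svm_vs_snr = run_time["SVM"] / run_time["MFS-SNR"]

print(f"MFS-SNR / RF  = {snr_vs_rf:.2f}")   # prints 0.29
print(f"SVM / MFS-SNR = {svm_vs_snr:.2f}")  # prints 1.85
```

Under this mapping MFS-SNR needed a little over a quarter of the RF run time, and the SVM method needed about 1.85 times the MFS-SNR run time.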

As part of the stopping criterion, the user must specify the minimum PCC change they wish to see after each variable selection. However, this threshold affects only when selection stops and has nothing to do with the variable selection process itself (i.e. the order of variable selection is not changed by altering the minimum PCC change).
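The separation between ranking and stopping can be sketched with a generic greedy forward selection. This is an illustration under stated assumptions, not the MFS-SNR implementation: `rank_score` is a hypothetical stand-in for the SNR metric, `performance` is a hypothetical stand-in for the PCC estimate, the toy weights are invented, and whether the thesis algorithm discards or keeps the variable that fails the threshold is not stated here (this sketch discards it).

```python
from typing import Callable, List, Sequence

def forward_select(
    candidates: Sequence[str],
    rank_score: Callable[[List[str]], float],
    performance: Callable[[List[str]], float],
    min_gain: float,
) -> List[str]:
    """Greedy forward selection with a minimum-improvement stopping rule."""
    selected: List[str] = []
    remaining = list(candidates)
    current = performance(selected)
    while remaining:
        # Ranking step: depends only on rank_score, never on min_gain.
        best = max(remaining, key=lambda v: rank_score(selected + [v]))
        trial = performance(selected + [best])
        # Stopping step: the only place where min_gain is used.
        if trial - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = trial
    return selected

# Toy stand-ins (hypothetical): additive weights for four variables.
weights = {"X1": 3.0, "X2": 2.0, "X3": 1.0, "X4": 0.1}

def toy_score(subset: List[str]) -> float:
    return sum(weights[v] for v in subset)

def toy_pcc(subset: List[str]) -> float:
    return 50.0 + 8.0 * toy_score(subset)

loose = forward_select(list(weights), toy_score, toy_pcc, min_gain=1.0)
strict = forward_select(list(weights), toy_score, toy_pcc, min_gain=10.0)
print(loose)
print(strict)
```

With the looser threshold three variables are selected and with the stricter one two, but the stricter result is a prefix of the looser one: changing `min_gain` shortens or lengthens the list without reordering it.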

In the simulated scenarios I studied, it is evident that the MFS-SNR algorithm is capable of identifying the variables with the greatest discriminatory potential whether or not the data are normally distributed. Similarly, for the real datasets that were analysed, the MFS-SNR algorithm identified the optimal subset of variables from each dataset regardless of whether the variables were normally distributed. Based on these results, the SNR metric does not appear to make any assumptions about the distribution of the data (although it must be noted that this may not be generalisable to every dataset).

Lastly, the MFS-SNR algorithm does not exhaustively analyse every possible variable subset in the course of identifying the optimal subset. In the simulated scenarios the MFS-SNR algorithm achieved similar performance to the RF method, even though the RF method is an embedded method that takes a brute force approach to variable selection.
