

MS provides a means of biomarker detection that benefits many applications, including the early detection of diseases and the discovery of new drugs. Analysing MS data for biomarker detection is challenging because of the high feature-to-sample ratio and the presence of noise in the data. Biomarker detection is performed through feature manipulation.

Due to the flexibility and capability of GP to automatically select and construct features, GP can be a promising choice for biomarker detection.

There has been rapid growth in MS biomarker detection research. However, major open issues remain to be investigated:

Feature Ranking in Biomarker Discovery: Most previous methods used a single ranking method to rank features and used the top-ranked features for classification. However, different feature ranking metrics assign different ranks to the same features according to their evaluation criteria. None of these methods tested the use of more than one ranking metric together. Combining several ranking metrics yields subsets of top-ranked and low-ranked features, and has the potential to provide a better collection of features, leading to better classification results. In [147], GP was used to combine four metrics and was applied successfully to datasets with a small number of features, but it was not tested on datasets with a large number of features such as the MS datasets. Also, single ranking schemes ignore the interactions and relationships between features.
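The disagreement between ranking metrics can be illustrated with a minimal sketch. The feature names, the two score tables, and the cut-off `k` below are made-up values for illustration only; they do not come from any dataset or metric discussed here.

```python
# Toy relevance scores for five hypothetical features under two
# hypothetical ranking metrics (higher = more relevant). Made-up values.
scores_a = {"f1": 0.90, "f2": 0.40, "f3": 0.75, "f4": 0.10, "f5": 0.60}
scores_b = {"f1": 0.30, "f2": 0.80, "f3": 0.70, "f4": 0.20, "f5": 0.65}


def top_k(scores, k):
    """Return the k highest-scoring feature names, best first."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ordered[:k]


k = 3
top_a = top_k(scores_a, k)
top_b = top_k(scores_b, k)

# Features both metrics place in their top k, and the pooled candidate set.
agreed = sorted(set(top_a) & set(top_b))
pooled = sorted(set(top_a) | set(top_b))

print("metric A top-k:", top_a)
print("metric B top-k:", top_b)
print("agreed:", agreed, "pooled:", pooled)
```

The two metrics share only part of their top-k lists, so the pooled set is larger than either list alone; a further selection step over that pool is where a combined approach could add value.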

Feature Construction: Most GP-based feature construction approaches construct a single feature and either use that feature alone for classification or use it along with the original set of features. Using the single constructed feature alone might not achieve acceptable classification accuracy, and combining a single constructed feature with the original set of features increases the dimensionality [60, 107]. The second approach is therefore inappropriate for high-dimensional data such as MS data, where the number of features runs into the thousands. Moreover, none of these methods investigates the effect of constructing multiple features from a single tree during the evolutionary process of GP. Also, feature construction in MS data has not been considered before.
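The difference between these options can be sketched with a small expression tree. The tree encoding, the function set, and the idea of taking one feature per function-rooted subtree are assumptions made for illustration; they are one possible reading of "multiple features from a single tree", not necessarily the scheme proposed in later chapters.

```python
import operator

# Assumed encoding: a tree is either a feature index (int) or a tuple
# (operator symbol, left subtree, right subtree).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, sample):
    """Evaluate an expression tree on one sample (a list of feature values)."""
    if isinstance(tree, int):
        return sample[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, sample), evaluate(right, sample))


def subtrees(tree):
    """Yield every function-rooted subtree, root first."""
    if isinstance(tree, int):
        return
    yield tree
    _, left, right = tree
    yield from subtrees(left)
    yield from subtrees(right)


def construct_features(tree, sample):
    """One constructed feature per function-rooted subtree (an assumption)."""
    return [evaluate(sub, sample) for sub in subtrees(tree)]


tree = ("+", ("*", 0, 1), ("-", 2, 3))   # (x0 * x1) + (x2 - x3)
sample = [2.0, 3.0, 10.0, 4.0]

single = evaluate(tree, sample)              # one constructed feature
augmented = sample + [single]                # original features plus one more
multiple = construct_features(tree, sample)  # several features, one tree

print(single, len(augmented), multiple)
```

Appending the constructed feature makes the representation one dimension larger than the original, whereas the subtree outputs give a compact representation whose size depends on the tree rather than on the number of original features.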

Multi-objective GP for Biomarker Detection: Multi-objective GP optimisation for MS biomarker detection using both feature selection and construction has not been considered before, and this direction needs to be investigated.
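As background, multi-objective optimisation compares candidate solutions by Pareto dominance rather than by a single score. The sketch below assumes two objectives to minimise, classification error and number of features; both the objectives and the candidate values are hypothetical and are not stated in the text above.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical candidates: (classification error, number of features).
candidates = [(0.10, 40), (0.12, 15), (0.10, 25), (0.20, 15), (0.30, 5)]
front = pareto_front(candidates)
print(front)
```

Each point on the front represents a different trade-off; none can be improved on one objective without getting worse on the other.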

Biomarker Verification using Peptide Detection: The computational approaches for peptide detection are effective, but they do not consider the class imbalance of the peptide datasets, which lowers their sensitivity. Moreover, previous approaches mostly selected features using ranking methods, which ignore the dependence and interactions between features. Hence, using GP to perform these multiple tasks is worth trying and can potentially improve detectability. Furthermore, verifying the detected biomarkers using the peptide detection method needs more investigation.
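A small worked example shows why class imbalance hurts sensitivity even when accuracy looks good. The class sizes and the trivial always-negative predictor are invented for illustration and do not describe any of the peptide datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of positive samples that were detected: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Hypothetical imbalanced set: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f}")
```

The majority-class predictor is right on 95 of 100 samples yet detects none of the positives, which is why a method tuned only for accuracy can have poor sensitivity on imbalanced data.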

The next four chapters propose new GP algorithms that can address the above issues.

Chapter 3

Ensemble Feature Ranking

3.1 Introduction

Feature selection is an important technique for biomarker discovery in MS data because many classification techniques cannot easily handle such a large number of features. Feature ranking is a type of feature selection in which each feature is given a rank according to its relevance to the classification task [147]. Different feature ranking approaches usually give different ranks to the same features. Clearly, some of the top features may be highly relevant or powerful while other features may be weakly relevant or redundant [147]. Further selection and ranking of features based on the sets of features produced by different feature ranking methods has the potential to provide a new and smaller set of features with less redundancy and more relevance to classification.
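Ranking features by relevance can be sketched with information gain, a standard relevance score. The sketch assumes each feature is split into two bins at a fixed threshold; the feature names, values, labels, and threshold are hypothetical, and the discretisation used in practice may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical toy data: four samples, three features, binary labels.
labels = [0, 0, 1, 1]
features = {
    "peak_a": [1.0, 2.0, 8.0, 9.0],   # separates the classes perfectly
    "peak_b": [1.0, 8.0, 2.0, 9.0],   # uninformative at this threshold
    "peak_c": [1.0, 2.0, 3.0, 9.0],   # partially informative
}
threshold = 5.0  # assumed fixed split point, for illustration only

gains = {name: information_gain(vals, labels, threshold)
         for name, vals in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)
```

A different relevance score applied to the same data could order the features differently, which is the behaviour the paragraph above describes.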

3.1.1 Chapter Goals

The overall goal of this chapter is to investigate the capability of GP for improving feature ranking performance. To achieve this goal, a new ensemble-based feature ranking GP algorithm has been developed. The algorithm combines two well-known feature ranking metrics, namely information gain (IG) and relief-f (RF), to select a new and smaller set of features. Each selected feature is given a new rank that can effectively improve the classification performance of the selected features. GP is also used as a classifier, so the proposed algorithm takes an embedded approach. Specifically, we will investigate the following objectives:

• what ranking scheme is suitable for selecting good features;

• whether a small number of top-ranked features obtained by the proposed GP method can achieve better classification performance than using all the original features;

• whether the smaller set of top-ranked features can outperform a relatively large number of the top-ranked features obtained by IG and RF, respectively;

• whether different classifiers using the 20 top-ranked features can achieve better performance than the 100 top-ranked features obtained by IG and RF, respectively.

Chapter Organisation: The rest of the chapter is organised as follows. The second section describes the new ensemble GP method for feature ranking in MS data. The experimental design is presented in the third section. The fourth section describes the datasets and preprocessing. The fifth section presents the results and discussions. Further discussions are presented in the sixth section. The seventh section gives a summary of the chapter.
