

The remainder of this chapter focuses on FS: this section introduces the datasets used in this study, and the following sections evaluate the FS algorithms on these datasets. Table 5.3 summarizes the datasets; all are publicly available, and most have been used previously in the FS literature. Where a dataset had missing entries, the corresponding sample (row) in the data matrix was discarded. To test the FS algorithms we used three artificial datasets (where the true features and the probes are known in advance) and ten real datasets. Here, we give a very brief description of each dataset and refer to the original studies and the publicly available repositories cited in Table 5.3 for further details.
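The row-deletion rule for missing entries can be illustrated with a minimal sketch. It assumes that missing values are encoded as `None` or NaN, which the text does not specify; the function names are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing entries (an assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_rows(X, y):
    """Keep only the samples (rows) of X with no missing entry, together with their responses."""
    kept = [(row, label) for row, label in zip(X, y)
            if not any(is_missing(v) for v in row)]
    X_clean = [row for row, _ in kept]
    y_clean = [label for _, label in kept]
    return X_clean, y_clean

# Toy design matrix: rows 1 and 2 each contain one missing entry.
X = [[1.0, 2.0], [float("nan"), 3.0], [4.0, None], [5.0, 6.0]]
y = [0, 1, 1, 0]
X_clean, y_clean = drop_incomplete_rows(X, y)
print(X_clean, y_clean)
```

Deleting whole rows keeps the design matrix rectangular without imputing values, at the cost of discarding some samples.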

The first artificial dataset we use is the well-known MONK-1. It consists of 124 samples and six features: three features are predictive of the response, and three features are probes. The relationship of the features to the response is based on logical operators, a setting that some FS algorithms find difficult to handle. We also used Isabelle Guyon's artificial dataset generator to obtain two datasets, which we call Artificial1 and Artificial2. The Artificial1 dataset consists of 500 samples and 150 features: there are 50 independent, 50 dependent, and 50 repeated features, and only 10 are "true" features. Moreover, 10% of the responses in this binary classification problem are flipped. The Artificial2 dataset consists of 1000 samples and 100 features: there are 50 independent, 25 dependent, and 25 repeated features, and 20 are the "true" features. The dataset is balanced across its 10 classes, with 100 samples per class.
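To make the logical structure of MONK-1 concrete, the sketch below enumerates the full attribute space and applies the target concept as given in the standard UCI description of the MONK's problems, (a1 = a2) OR (a5 = 1). The attribute domains and the rule come from that description rather than from this chapter, so treat them as assumptions here.

```python
from itertools import product

# Attribute domains a1..a6 of the MONK's problems (assumed from the UCI description).
DOMAINS = [(1, 2, 3), (1, 2, 3), (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2)]

def monk1_label(a):
    """MONK-1 target concept: (a1 == a2) OR (a5 == 1); a3, a4 and a6 act as probes."""
    a1, a2, _a3, _a4, a5, _a6 = a
    return int(a1 == a2 or a5 == 1)

# Enumerate every possible instance and label it with the target concept.
instances = list(product(*DOMAINS))
labels = [monk1_label(a) for a in instances]
print(len(instances), sum(labels))
```

The response depends on an equality between two features, so neither a1 nor a2 is informative when examined alone, which is one reason this setting is hard for FS algorithms that score features individually.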

One real dataset widely used in FS algorithm comparisons is the hepatitis dataset (Diaconis and Efron, 1983). It includes 155 patients, and the binary outcome (healthy control subject versus subject with hepatitis disease) depends on 19 features. This dataset has been studied in detail by Breiman (2001), who concluded that two features were highly predictive of the response (and highly correlated with each other). Breiman suggested that either of those two features individually carries almost as much information as the entire feature set. Feature 9 and at least one other feature were also identified as conveying some additional information for predicting the response. More recently, Tuv et al. (2009) identified a feature subset that includes feature 9, using a scheme based on random forests. That study contrasted the proposed FS algorithm with three alternative FS algorithms, which unanimously selected the same variable.

The Parkinson's dataset (Little et al., 2009) uses 22 dysphonia measures obtained from 195 sustained vowel phonations. In the original study, the optimal feature subset was selected using a two-step approach. In the first step, a simple filter eliminated one feature from each pair of highly correlated features (absolute correlation coefficient larger than 0.95). In the second step, a brute-force search determined the best feature subset out of the remaining 10 features using a wrapper approach with SVM. The optimal feature subset is reported in Little et al. (2009).
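The first step of that two-step approach, the correlation filter, can be sketched as follows. This is a minimal illustration and not the procedure of Little et al. (2009): the greedy rule that keeps the first feature of each highly correlated pair is an assumption, as are the function names and the toy data. Features are assumed to be non-constant, since otherwise the correlation is undefined.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_filter(columns, threshold=0.95):
    """Greedily keep a feature only if |r| <= threshold with every feature already kept."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept

# Toy data: each inner list is one feature (column) measured on five samples.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly a multiple of feature 0
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
kept = correlation_filter(columns)
print(kept)
```

Which feature of a correlated pair is retained depends on the column order in this sketch; a different tie-breaking rule would give a different, equally valid reduced set.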

The Sonar dataset (Gorman and Sejnowski, 1988) consists of sonar signals (frequency-modulated chirps rising in frequency), and the task is to predict whether the targeted object is a mine or a rock. Each of the 60 features represents the energy within a particular frequency band integrated over a period of time.

The wine dataset comes from the chemical analysis of wines grown in the same region, where 13 features (such as alcohol, magnesium and colour intensity) are used to differentiate the three cultivars.

The image segmentation dataset uses 19 features extracted from images to identify which of the following seven classes the investigated region of the image (each sample) belongs to: brickface, sky, foliage, cement, window, path, grass. Its developers have split it into two subsets: a training set with 210 samples and a testing set with 2100 samples. We use the training set to select the features, and then use 10-fold cross validation with 100 repetitions to evaluate the performance of the learners on the testing set.
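The evaluation protocol on the 2100-sample testing set (10-fold cross validation repeated 100 times) can be sketched as an index generator. The shuffling scheme, the seed and the function name are assumptions for illustration; the text does not specify how the folds were drawn.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        for fold in range(n_folds):
            test_idx = idx[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            yield rep, fold, train_idx, test_idx

splits = list(repeated_kfold(2100))
print(len(splits), len(splits[0][2]), len(splits[0][3]))
```

Each repetition partitions the 2100 samples into ten folds of 210, so every sample is held out exactly once per repetition and the learner is trained and tested 1000 times in total.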

The cardiotocography dataset (Ayres-de Campos et al., 2000) has 2129 fetal cardiotocograms, which were processed and classified by three expert obstetricians. The class label refers to a morphological pattern and takes one of 10 possible values, assigned by consensus among the three experts.

Finally, we use two datasets where the number of features is larger than the number of samples (also known as 'fat' datasets): these problems are inherently difficult for many machine learning algorithms, and have attracted the dedicated attention of researchers (Hastie et al., 2009). The ovarian cancer dataset consists of 72 samples and 592 features. There are many ovarian cancer datasets in the machine learning literature; here we use the dataset from Guan et al. (2009). In that study, the focus was on developing a biomarker of ovarian cancer based on metabolic changes in biological systems in order to differentiate subjects into the binary classes "healthy" versus "cancer". We used leave-one-sample-out to obtain 72 candidate feature subsets for each FS algorithm; then we used the voting scheme described in Table 4.4 to select the final features for each FS algorithm. The performance of the FS algorithms was assessed using leave-one-sample-out validation with RF.
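Combining the 72 leave-one-sample-out candidate subsets into one final subset can be illustrated with a simple frequency vote. Table 4.4 is not reproduced in this section, so the rule below (keep features selected in at least half of the candidate subsets) is a hypothetical stand-in and not the scheme used in this study; the function name and the `min_fraction` parameter are likewise assumptions.

```python
from collections import Counter

def vote_features(candidate_subsets, min_fraction=0.5):
    """Keep features that appear in at least min_fraction of the candidate subsets."""
    counts = Counter(f for subset in candidate_subsets for f in set(subset))
    n = len(candidate_subsets)
    return sorted(f for f, c in counts.items() if c / n >= min_fraction)

# Toy example with four candidate subsets of feature indices (one per held-out sample).
candidate_subsets = [[3, 17, 42], [3, 42, 101], [3, 17, 250], [3, 42, 17]]
selected = vote_features(candidate_subsets)
print(selected)
```

Features that are selected only when a particular sample is held out receive few votes and are dropped, which favours features that are stable across the leave-one-sample-out runs.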

The Small Round Blue-Cell Tumors (SRBCT) dataset (Khan et al., 2001) has 88 samples and 2308 features, which in this application are gene expression profiles. It is one of the most widely used datasets for validating FS algorithms in bioinformatics. The four-class response denotes the type of the tumor. We used 63 samples for selecting the features and training the classifier, and tested the performance of RF using the selected feature subsets on the remaining 25 samples. We used the partitioning of the samples into training and testing sets suggested by Hastie et al. (2009).

Table 5.3: Summary of the datasets

| Dataset | Design matrix | Associated task | Type |
|---|---|---|---|
| MONK-1³⁵ | 124×6 | Classification (2 classes) | D (6) |
| Artificial 1 | 500×150 | Classification (2 classes) | C (150) |
| Artificial 2 | 1000×100 | Classification (10 classes) | C (100) |
| Hepatitis³⁵ | 155×19 | Classification (2 classes) | C (17), D (2) |
| Parkinson's³⁵ | 195×22 | Classification (2 classes) | C (22) |
| Sonar³⁵ | 208×60 | Classification (2 classes) | C (60) |
| Wine³⁵ | 178×13 | Classification (3 classes) | C (13) |
| Image segmentation³⁵ | 2310×19 | Classification (7 classes) | C (16), D (3) |
| Cardiotocography³⁵ | 2129×21 | Classification (10 classes) | C (14), D (7) |
| Ovarian cancer³⁶ | 72×592 | Classification (2 classes) | C (592) |
| SRBCT³⁷ | 88×2308 | Classification (4 classes) | C (2308) |

The size of each design matrix is given as the number of instances (samples) × the number of features. The last column denotes the type of the variables in the design matrix: continuous (C) or discrete (D), with the number of variables of each type in parentheses. In cases of missing entries, the entire row in the design matrix was deleted.
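As a quick check on the table, the sketch below records the design matrix sizes listed in Table 5.3 and flags the 'fat' datasets, in which features outnumber samples. The dictionary layout and variable names are illustrative only.

```python
# (number of samples, number of features) as listed in Table 5.3
DESIGN_MATRICES = {
    "MONK-1": (124, 6),
    "Artificial 1": (500, 150),
    "Artificial 2": (1000, 100),
    "Hepatitis": (155, 19),
    "Parkinson's": (195, 22),
    "Sonar": (208, 60),
    "Wine": (178, 13),
    "Image segmentation": (2310, 19),
    "Cardiotocography": (2129, 21),
    "Ovarian cancer": (72, 592),
    "SRBCT": (88, 2308),
}

# A dataset is 'fat' when it has more features than samples.
fat = [name for name, (n_samples, n_features) in DESIGN_MATRICES.items()
       if n_features > n_samples]
print(fat)
```

Only the ovarian cancer and SRBCT datasets satisfy this condition, in line with the description of the two 'fat' datasets in the text.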

35 Downloaded from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html

36 Downloaded from http://www.biomedcentral.com/1471-2105/10/259/additional

37
