Table 4.41: Summary of the overall average ranking (AR) of gmiML-CFS and other multi-label feature selection methods across four feature space sizes, using BPMLL as the classifier

Overall Average Rank (AR) across 4 feature space sizes

Dataset      NoFS         BR(BPNN)     CFS-U        RFML         gmiML-CFS
CAL500       3(3)         3.6(5)       2(2)         3.2(4)       1.8(1)
Scene        4.6(5)       3.2(4)       1.4(1)       1.8(2)       3(3)
Emotions     3.2(4)       1.2(1)       2.4(3)       5(5)         1.8(2)
Yeast        4(5)         1.2(2)       3.4(3)       3.6(4)       1(1)
Enron        4.2(4)       1.4(1)       4.8(5)       3(3)         1.6(2)
Medical      5(5)         3.6(4)       1.6(1)       3(3)         1.8(2)
Business     5(5)         1(1)         2.7(2)       3.6(4)       2.7(2)
Art          4.8(5)       1(1)         4.2(4)       2.3(2)       2.7(3)
Education    5(5)         1(1)         3.8(4)       3(3)         2.2(2)
Recreation   5(5)         1(1)         4(4)         2.7(3)       2.3(2)
Health       5(5)         1(1)         3.9(4)       3.1(3)       2(2)
Ent.ment     5(5)         1(1)         3.8(4)       3.2(3)       2(2)
Computer     5(5)         1(1)         4(4)         2.7(3)       2.3(2)
Science      5(5)         1(1)         4(4)         2.1(2)       2.9(3)
Average      4.56(4.71)   1.59(1.79)   3.29(3.21)   3.02(3.14)   2.15(2.07)

This Chapter presented four versions of the Multi-Label Correlation-Based Feature Selection (ML-CFS) method, all based on hill-climbing search. The first version of ML-CFS [57] extends the single-label CFS method to the more complex multi-label classification scenario by computing the correlation between a feature and each of the multiple class labels. Three further extensions of ML-CFS were then proposed [58], namely: (1) ML-CFS with the Absolute Value of the Correlation Coefficient (ML-CFSabs); (2) the ML-CFS version where class labels with greater mutual information (with respect to other labels) are assigned greater weights when computing feature-label correlations (gmiML-CFS); and (3) the ML-CFS version where class labels with greater mutual information are assigned smaller weights (smiML-CFS). Importantly, both gmiML-CFS and smiML-CFS also use the absolute value of the correlation coefficient, since ML-CFSabs in general obtained substantially better results than the first version of ML-CFS.

We ran experiments with these four versions of ML-CFS and with other multi-label feature selection methods, comparing the predictive accuracy obtained when the selected features are used by two well-known multi-label classification algorithms: ML-kNN and BPMLL. In the experimental results reported in this Chapter, gmiML-CFS in general clearly outperforms ML-CFS, ML-CFSabs and smiML-CFS. Moreover, when compared with other multi-label feature selection methods, gmiML-CFS still shows good predictive performance with both classifiers (it obtained the second-best predictive accuracy out of five feature selection approaches). In addition, gmiML-CFS selects substantially smaller feature subsets than the other methods that obtained the best predictive accuracy with both classifiers.

Chapter 5

Multi-Label Correlation-Based Feature Selection Methods that Exploit Biological Knowledge

In Chapter 4, we proposed several versions of the Multi-Label Correlation-Based Feature Selection method (ML-CFS) and applied them to 14 multi-label datasets from a number of different application domains. In this Chapter we present extended versions of ML-CFS that exploit cancer-related information in order to select a better set of genes (features) for cancer-related microarray datasets. This Chapter is organized as follows. Section 5.1 gives general information about KEGG pathways. Section 5.2 describes three different versions of ML-CFS that use KEGG pathway information. Section 5.3 describes the multi-label microarray datasets used in our experiments, and Section 5.4 describes the experimental methodology. Section 5.5 reports the experimental results, and Section 5.6 presents this Chapter’s conclusions.

5.1 A Feature Subset Evaluation Function for Exploiting Biological Knowledge

Recall that the original ML-CFS method evaluates the quality of a candidate feature subset by using a merit function, which rewards features that are highly correlated with the class attributes and have a low degree of redundancy with respect to other features. The merit function was thus designed to be independent of the application domain. Consequently, in the context of the microarray datasets analyzed in this Chapter (described in Section 5.3), the merit function has the limitation that it does not incorporate any biological knowledge about cancer-related genes. To improve the predictive accuracy and the potential for biological interpretation in the context of cancer-related microarray datasets, we propose to extend the ML-CFS method with an evaluation function that uses biological knowledge about cancer-related pathways.
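As a reminder of what a domain-independent merit of this kind looks like, the following is a minimal sketch of the standard single-label CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff). Averaging the feature-label correlation over all labels is an assumption made here for illustration; the exact ML-CFS merit is the one defined in Chapter 4. The function name `cfs_merit` and its inputs (precomputed absolute correlations) are hypothetical.

```python
import math

def cfs_merit(feature_label_corr, feature_feature_corr):
    """Standard CFS merit: k * r_cf / sqrt(k + k*(k-1) * r_ff).

    feature_label_corr: one list per selected feature, holding that feature's
        absolute correlation with each class label (assumed precomputed).
    feature_feature_corr: absolute correlations for every unordered pair of
        selected features (empty when only one feature is selected).
    """
    k = len(feature_label_corr)
    if k == 0:
        return 0.0
    # Assumption: the feature-class term is averaged over all labels.
    r_cf = sum(sum(row) / len(row) for row in feature_label_corr) / k
    # Mean feature-feature correlation measures redundancy within the subset.
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

With two features whose mean feature-label correlation is 0.5, the merit falls as the correlation between the two features rises, which is the redundancy penalty described above.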

Intuitively, the use of such biological knowledge would allow the ML-CFS method’s search to focus on genes which are already known to be cancer-related, which could help to improve the predictive performance associated with the ML-CFS method or help to select genes whose role in cancer-related drug resistance or sensitivity is more likely to be meaningful to biologists.

More precisely, we use knowledge about cancer-related KEGG pathways, a well-known type of biological pathway [61, 84, 85], as part of the function that evaluates a candidate feature subset.

A KEGG pathway is a set of genes or proteins and their interactions, broadly represented in the form of a graph. Each node typically represents a gene or protein, and an edge represents a type of interaction between genes or proteins. Some edges denote that a gene activates another, other edges denote that a gene or protein inhibits the activity of another, etc.
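The graph view described above can be sketched as a small directed graph with typed edges. This is only an illustration: the gene names, the interaction labels and the helper names `build_pathway_graph` and `pathway_genes` are hypothetical and do not come from any real KEGG pathway or KEGG tool.

```python
from collections import defaultdict

# Hypothetical toy pathway: gene names and interactions are illustrative only.
edges = [
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
    ("GENE_A", "GENE_C", "activation"),
]

def build_pathway_graph(edge_list):
    """Return {source: [(target, interaction_type), ...]} for a directed graph."""
    graph = defaultdict(list)
    for source, target, interaction in edge_list:
        graph[source].append((target, interaction))
    return dict(graph)

def pathway_genes(edge_list):
    """Return the set of all genes (nodes) that occur in the pathway."""
    return {gene for source, target, _ in edge_list for gene in (source, target)}

graph = build_pathway_graph(edges)
genes = pathway_genes(edges)
```

For the feature selection method discussed in this Chapter, only the node set (which genes occur in a pathway) is needed; the edge types are kept here purely to mirror the description of the graph.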

Moreover, KEGG pathways cover a wide range of organisms and are easy to use because each pathway is stored in well-known formats such as XML files and text files. KEGG pathways are widely used in the literature [7, 41, 63].

Note that we utilize only 16 cancer-related KEGG pathways, which were selected based on current knowledge about the biology of cancer. The selection was made by Prof. Michaelis (School of BioSciences at the University of Kent), an expert in cancer biology. Since our experiments aim to select genes which are relevant for predicting drug sensitivity/resistance in cancer patients, it would not be effective to employ all pathways in the KEGG database. The selected 16 cancer-related KEGG pathways are:

• DNA replication

• Base excision repair

• Nucleotide excision repair

• Mismatch repair

• Homologous recombination

• Non-homologous end-joining

• Fanconi anemia pathway

• ABC transporters

• Wnt signaling pathway

• Notch signaling pathway

• Hedgehog signaling pathway

• Cell cycle

• Apoptosis

• p53 signaling pathway

• Pathways in cancer

Detailed information about these cancer-related pathways is provided on the KEGG website (http://www.genome.jp/kegg/). We assume that if some genes are related to cancer-related drug resistance/sensitivity, they are likely to occur in some of the above cancer-related pathways.

In order to quantify the strength of the relationship between the genes in a candidate feature subset and the aforementioned cancer-related pathways, we propose to compute “the Average Relative Frequency of Pathways per gene” (AvgRFP):

$$AvgRFP_{FSS_i} = \frac{\sum_{f=1}^{k} RFP_f}{k} \tag{5.1}$$

where the average is computed over all the $k$ features selected to be included in the $i$-th candidate feature subset ($FSS_i$), as shown in Equation (5.1).

For each selected feature $f$ in $FSS_i$, the relative frequency of pathways for $f$, denoted by $RFP_f$, is the number of cancer-related KEGG pathways in which the gene corresponding to $f$ occurs, divided by the number of user-specified pathways (16 in our case). Each $RFP_f$ has a value in $[0, 1]$, so $AvgRFP_{FSS_i}$ also has a value in $[0, 1]$. Hence, the AvgRFP term rewards feature subsets where most genes in the subset are involved in several cancer-related pathways, and penalizes feature subsets where most genes do not occur in any cancer-related pathway.
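The computation in Equation (5.1) can be sketched as follows. This is a minimal illustration under stated assumptions: the pathway membership data are hypothetical (four toy pathways named P1 to P4 rather than the 16 used in this Chapter), the function names are invented for the example, and returning 0 for an empty subset is an assumption not stated in the text.

```python
def relative_pathway_frequency(gene, pathways):
    """RFP_f: fraction of the user-specified pathways that contain the gene."""
    if not pathways:
        raise ValueError("at least one pathway is required")
    hits = sum(1 for members in pathways.values() if gene in members)
    return hits / len(pathways)

def avg_rfp(feature_subset, pathways):
    """AvgRFP of a candidate feature subset, following Equation (5.1)."""
    if not feature_subset:
        return 0.0  # Assumption: an empty subset scores 0.
    total = sum(relative_pathway_frequency(g, pathways) for g in feature_subset)
    return total / len(feature_subset)

# Hypothetical membership data: pathway name -> set of genes in that pathway.
toy_pathways = {
    "P1": {"G1", "G2"},
    "P2": {"G1", "G3"},
    "P3": {"G1"},
    "P4": {"G2"},
}

# G1 occurs in 3 of 4 pathways, G2 in 2 of 4, and G4 in none.
score = avg_rfp(["G1", "G2", "G4"], toy_pathways)
```

Here the subset scores (0.75 + 0.5 + 0) / 3, so a gene that occurs in no pathway, such as G4, pulls the average down, which is the penalty described above.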