
In addition to Proposal1, which replaces NSGA-II in the Original method with controlled elitism NSGA-II, this section describes a second version called Proposal2. Proposal2 changes the way the chromosomes of the initial population are generated. Specifically, it changes Step2 (2) of the initial population generation, which replaces selected antecedent conditions with “don’t care” fuzzy sets. Figures 4.4 and 4.5 show the pseudocode of the “don’t care” replacement procedure for Original (also used in Proposal1) and for Proposal2, respectively. In the Original method, an antecedent is replaced with “don’t care” only if a randomly generated number 𝑃_𝑟𝑛𝑑 is greater than or equal to 𝑃𝑑𝑜𝑛𝑡 𝑐𝑎𝑟𝑒. This procedure is a kind of feature selection: the attributes whose antecedents are replaced with “don’t care” are the “non-selected features”, whereas those that keep their fuzzy sets are “the selected ones”. In this case, all attributes have the same probability of being selected or replaced with “don’t care” antecedents.

In our proposed method, however, we give each attribute its own probability of selection based on its importance or relevance. The more relevant a feature is, the more likely it is to be selected; in other words, the less likely it is to be replaced by the “don’t care” antecedent condition. To calculate the feature probabilities, we use the following feature selection procedure.

University


Figure 4.4 Pseudocode for Step2 (2) of the initial population generation, which replaces some fuzzy antecedent conditions with “don’t care” fuzzy sets in Original and Proposal1

Figure 4.5 Pseudocode for Step2 (2) of the initial population generation, which replaces some fuzzy antecedent conditions with “don’t care” fuzzy sets in Proposal2
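The contrast between the two replacement procedures can be sketched in Python as follows. This is only an illustrative sketch and not the pseudocode of Figures 4.4 and 4.5: the names `DONT_CARE`, `replace_original` and `replace_guided` are hypothetical, the uniform rule follows the condition as stated in the text (replace when 𝑃_𝑟𝑛𝑑 is greater than or equal to the threshold), and the guided rule assumes that each attribute keeps its fuzzy set with its own selection probability.

```python
import random

DONT_CARE = 0  # hypothetical code for the "don't care" fuzzy set


def replace_original(antecedents, p_dont_care, rng=random):
    """Uniform rule: every attribute is compared with the same threshold."""
    out = []
    for fuzzy_set in antecedents:
        p_rnd = rng.random()
        # Condition as stated in the text: replace when p_rnd >= p_dont_care.
        out.append(DONT_CARE if p_rnd >= p_dont_care else fuzzy_set)
    return out


def replace_guided(antecedents, p_attribute, rng=random):
    """Per-attribute rule (assumed form): attribute i keeps its fuzzy set
    with probability p_attribute[i]; otherwise it becomes "don't care"."""
    out = []
    for fuzzy_set, p_keep in zip(antecedents, p_attribute):
        out.append(fuzzy_set if rng.random() < p_keep else DONT_CARE)
    return out
```

With `p_attribute = [1.0, 0.0, 1.0, 0.0]`, for instance, the guided version always keeps the first and third antecedents and always replaces the other two, which is the extreme case of relevance-driven selection.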


Feature selection procedure

Feature selection is the process of reducing the number of features by selecting a meaningful smaller subset of these features.

Feature selection methods are generally divided into two approaches: wrapper and filter methods (Das, 2001; Kohavi & John, 1997). Wrapper methods are classifier-dependent: the features are chosen specifically for a particular classifier, so they are generally more accurate but have some limitations. In addition to their higher computational cost, they are overly specific to the classifier used, which is likely to make the selected features a suboptimal solution for other learning methods (G. Brown, Pocock, Zhao, & Luján, 2012). Filter methods, on the other hand, are classifier-independent; that is, they evaluate the features independently of any particular classifier, usually by a scoring criterion, which makes them generic and applicable to any classifier. In addition, filter methods are faster and less likely to overfit than wrapper methods (G. Brown et al., 2012).

Filter methods generally rank features according to their individual predictive power using statistical measures (G. Brown et al., 2012). A common approach is to use the Mutual Information between the feature and class label (Gavin Brown, 2009).

Two different feature selection methods may produce different sets of features. In this case, presenting only the feature set selected by a single method can be misleading (Ludmila I Kuncheva, 2007).

To increase the probability of selecting the appropriate features, we use seven feature selection methods rather than one, producing seven candidate feature sets. We then calculate the average rank of each feature over the seven methods. Finally, we assign each feature a probability value that corresponds to its average rank. Figure 4.6 shows the pseudocode for calculating the average probability 𝑃𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒(𝑖) for each feature. Compared with the approach used in Original, this method favors features which have good average ranks over the seven feature ranking methods. This is a kind of guided selection, in contrast to the random selection applied in Original and Proposal1. Factor in the pseudocode listed in Figure 4.6 is an integer that can be used to control the number of features selected. The larger the value of Factor, the fewer antecedent conditions are selected.
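The rank-averaging step can be illustrated with a short Python sketch. The exact formula of Figure 4.6 is not reproduced here, so the mapping from average rank to probability below is an assumption chosen only to match the described behavior: better average ranks give higher probabilities, and a larger `factor` lowers all probabilities so that fewer antecedents are selected. The function names are hypothetical.

```python
def average_ranks(rankings):
    """rankings: one list per ranking method, each ordering the feature
    indices from best to worst. Returns the mean rank (1 = best) of
    every feature over all methods."""
    n_features = len(rankings[0])
    totals = [0.0] * n_features
    for ranking in rankings:
        for position, feature in enumerate(ranking, start=1):
            totals[feature] += position
    return [total / len(rankings) for total in totals]


def selection_probabilities(avg_ranks, factor=1):
    """Assumed mapping (not taken from Figure 4.6): an average rank of 1
    gives 1/factor and an average rank of n gives 1/(n * factor)."""
    n = len(avg_ranks)
    return [(n - rank + 1) / (n * factor) for rank in avg_ranks]


# Three hypothetical ranking methods over three features.
rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
ranks = average_ranks(rankings)
probabilities = selection_probabilities(ranks, factor=1)
```

In this toy example feature 0 has the best average rank and therefore the highest selection probability; doubling `factor` halves every probability.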

For the sonar data set, we modify our method because the number of features is 60, which is much higher than in the other data sets. We calculate the feature probabilities in the same way as in Figure 4.6, but we apply guided selection only to the top half of the features and random selection to the remaining features (see Figure 4.7). In addition, we also apply this method, with probability 𝑃_{𝑀𝑖𝑐ℎ}, to the rules generated in Step3 of the Michigan-style iteration. The reason for this hybrid approach for data sets with a very high number of features is that, if we applied guided selection only, some attributes would never be selected, which results in less diverse solutions, whereas applying random selection only may cost the GA a lot of time to converge to the optimal solutions.
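A minimal Python sketch of this hybrid scheme is given below. It is an assumption-laden illustration rather than the pseudocode of Figure 4.7: `replace_hybrid` and `DONT_CARE` are hypothetical names, the top half is taken as the half of the features with the best average ranks, guided selection is assumed to keep attribute i with probability `p_attribute[i]`, and random selection uses the threshold condition stated for the Original method.

```python
import random

DONT_CARE = 0  # hypothetical code for the "don't care" fuzzy set


def replace_hybrid(antecedents, p_attribute, avg_ranks, p_dont_care, rng=random):
    """Guided selection on the best-ranked half of the features,
    uniform random selection on the remaining half."""
    n = len(antecedents)
    order = sorted(range(n), key=lambda i: avg_ranks[i])  # best rank first
    top_half = set(order[: n // 2])
    out = list(antecedents)
    for i in range(n):
        p_rnd = rng.random()
        if i in top_half:
            keep = p_rnd < p_attribute[i]  # guided: per-attribute probability
        else:
            keep = p_rnd < p_dont_care     # random: same threshold for all
        if not keep:
            out[i] = DONT_CARE
    return out
```

Because the lower-ranked half is still sampled uniformly, every attribute retains some chance of appearing in a rule, which is the diversity argument made above.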

Figure 4.6 Pseudocode for calculating the average probability 𝑃𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒(𝑖) for each feature


Feature selection methods

The following is a brief description of the seven feature selection methods used for ranking the attributes in the data sets. For more details about these methods, please refer to (G. Brown et al., 2012). The authors have also provided a MATLAB toolbox that includes these methods.

Mutual Information Maximisation (MIM)

This method, which was applied in (Lewis, 1992), scores each feature independently of the others and then ranks the features according to their mutual information with the class label. The user then selects a subset of the top-ranked features based on some criterion. This method has been used frequently in the literature (G. Brown et al., 2012).
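The MIM ranking can be sketched in a few lines of Python for discrete features. This is an illustrative sketch using only the standard library, not the MATLAB toolbox mentioned above; `mutual_information` and `mim_ranking` are hypothetical helper names, and continuous features are assumed to have been discretized beforehand.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mim_ranking(features, labels):
    """features: list of feature columns. Returns indices, best first."""
    scores = [mutual_information(column, labels) for column in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


labels = [0, 0, 1, 1]
features = [
    [0, 0, 1, 1],  # identical to the class label
    [0, 1, 0, 1],  # independent of the class label
    [0, 0, 0, 1],  # partially informative
]
ranking = mim_ranking(features, labels)
```

Each feature is scored in isolation, so two redundant copies of the same informative feature would both be ranked highly; the criteria described next address this limitation.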

Figure 4.7 Pseudocode for Step2 (2), which replaces selected fuzzy antecedent conditions with “don’t care” fuzzy sets in the Proposal2 algorithm for the sonar data set


Conditional Mutual Info Maximisation (CMIM)

CMIM employs a trade-off between relevancy (the predictive power of a feature) and redundancy (the lack of independence between features) in the search for the most discriminative features. The selection process iteratively picks the feature that maximizes the mutual information with the class to be predicted, under the condition that the newly added feature is not similar to the previously selected features (Fleuret, 2004). In other words, a feature that carries no additional information is not selected, even if it is individually powerful.
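The greedy CMIM selection can be sketched as follows for discrete features. The sketch uses the criterion as summarized in (G. Brown et al., 2012), where a candidate is scored by the minimum of its conditional mutual information with the class given each already selected feature; it is a simple illustrative implementation, not Fleuret's optimized algorithm, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    groups = defaultdict(lambda: ([], []))
    for x, y, z in zip(xs, ys, zs):
        groups[z][0].append(x)
        groups[z][1].append(y)
    n = len(xs)
    return sum(len(gx) / n * mutual_information(gx, gy)
               for gx, gy in groups.values())


def cmim_select(features, labels, k):
    """Greedily pick k feature indices using the CMIM criterion."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            if not selected:  # first pick: plain relevance
                return mutual_information(features[f], labels)
            return min(conditional_mi(features[f], labels, features[s])
                       for s in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # strong feature
    [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of the strong feature
    [0, 1, 1, 0, 1, 1, 1, 1],  # weaker alone, but complementary
]
chosen = cmim_select(features, labels, 2)
```

In this toy data the duplicate feature is as relevant as the first one, yet its conditional mutual information given the first feature is zero, so the weaker but complementary third feature is chosen instead.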

Joint Mutual Information (JMI)

Another approach to reducing redundancy is the Joint Mutual Information (JMI) measure used by (Yang & Moody, 1999). The idea is to increase the complementary information between features (G. Brown et al., 2012). This method is simple but effective in reducing redundancy among the features, which overcomes one of the limitations of the Mutual Information approach.
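A short Python sketch of JMI-based greedy selection for discrete features follows. It assumes the form of the criterion given in (G. Brown et al., 2012), in which a candidate is scored by the sum, over the already selected features, of the mutual information between the class and the pair (candidate, selected feature); the function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def jmi_select(features, labels, k):
    """Greedily pick k feature indices using the JMI criterion."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            if not selected:  # first pick: plain relevance
                return mutual_information(features[f], labels)
            # Treat each (candidate, selected) pair as one joint variable.
            return sum(
                mutual_information(list(zip(features[f], features[s])), labels)
                for s in selected
            )
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # strong feature
    [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of the strong feature
    [0, 1, 1, 0, 1, 1, 1, 1],  # weaker alone, but complementary
]
chosen = jmi_select(features, labels, 2)
```

Pairing the duplicate with the first feature adds no information about the class, whereas pairing the third feature with the first determines the class completely in this toy data, so the complementary feature is preferred.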

Double Input Symmetrical Relevance (DISR)

This method, which was proposed in (Meyer & Bontempi, 2006), uses the same concept of complementary information between features as JMI to avoid redundancy, but with a different criterion called the symmetrical relevance.

The method is called double input symmetrical relevance (DISR) because it measures the symmetrical relevance on all combinations of two features (Meyer & Bontempi, 2006).
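For reference, the DISR criterion can be written as below, using the notation of (G. Brown et al., 2012), which is an assumption relative to this text: X_k is the candidate feature, S is the set of already selected features, Y is the class label, I is mutual information and H is joint entropy. Each term is the JMI term normalized by the joint entropy of the pair and the class.

```latex
J_{\mathrm{DISR}}(X_k) = \sum_{X_j \in S} \frac{I(X_k X_j; Y)}{H(X_k X_j Y)}
```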

Interaction Capping (ICAP)

ICAP was proposed by (Jakulin, 2005) and originates from the idea that a single attribute can be considered irrelevant to the class on its own but become very relevant when combined with other features. The author proposed using interaction gain as a measure for detecting attribute interaction. Thus, in the absence of its interacting features, a feature could lose its relevance.
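As summarized in (G. Brown et al., 2012), the ICAP criterion can be written as follows. The notation is an assumption relative to this text: X_k is the candidate feature, S is the set of already selected features, Y is the class label and I is mutual information. The max operator caps each interaction term so that a selected feature can only penalize a candidate and never reward it.

```latex
J_{\mathrm{ICAP}}(X_k) = I(X_k; Y) - \sum_{X_j \in S} \max\bigl[0,\; I(X_k; X_j) - I(X_k; X_j \mid Y)\bigr]
```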


Conditional redundancy (condred)

This method was proposed in (G. Brown et al., 2012) for comparison purposes.

Conditional Informative Feature Extraction (CIFE)

CIFE was introduced by (D. Lin & Tang, 2006) based on two key concepts: class-relevance and redundancy. It aims to maximize the joint class-relevant information by reducing the class-relevant redundancies among features (D. Lin & Tang, 2006).
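In the notation of (G. Brown et al., 2012), which is assumed here (X_k is the candidate feature, S the set of already selected features, Y the class label and I mutual information), the CIFE criterion takes the following form. The first term is the class-relevance of the candidate, and the sum is its class-relevant redundancy with the selected features.

```latex
J_{\mathrm{CIFE}}(X_k) = I(X_k; Y) - \sum_{X_j \in S} \bigl[ I(X_k; X_j) - I(X_k; X_j \mid Y) \bigr]
```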

