
Feature selection refers to the process of selecting relevant features from text, where typically each term (word/phrase) in the text represents a feature. The aims are to improve both the effectiveness of the classification and its computational efficiency (by reducing the dimensionality) [84]. Miner et al. [74] categorised text mining feature selection approaches according to whether they were based on:

1. Information theory: Addressing the best way to process signals and compress and communicate data.

2. Statistics: Determining the statistical correlation between the terms and the class labels of the documents.

3. Frequency: Determining the importance of the terms based on their frequency and on the document frequency.

Compared with the Information Theory and Statistics methods, Frequency methods are less computationally expensive. The most relevant approaches with respect to these categories and the work described in this thesis are described in the following subsections, namely: (i) Information Gain (IG), (ii) Chi-squared (χ2), (iii) Correlation-based Feature Selection (CFS) and (iv) Term Frequency-Inverse Document Frequency (TF-IDF). A justification of which feature selection methods were used for the work described in this thesis, and how they were used, is presented at the end of this section.

2.4.1.1 Information Gain (IG)

Information Gain (IG) is based on Information Theory, which is concerned with the processing and compression of signal and communication data, and which was introduced in 1948 by Claude Shannon [91], who is considered to be the “father” of information theory. According to Miner et al. [74], IG “measures how much the uncertainty about the target variable, called entropy, is reduced when the feature is used”. In other words, given that entropy is a measure of uncertainty with respect to a training set (or the amount of information required to assign a class label to an instance), IG indicates how much information is gained when moving from the initial entropy to the entropy that remains once a feature is used. It is calculated as follows:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{2.1}$$

where D is a data partition comprising the instances (tuples) held at a node N. The information required to assign a class label to an instance in D, in other words the entropy of D, is given by:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{2.2}$$

where a class label can have m different values and $p_i$ is the probability that an instance belonging to D is related to the ith class. The information still required to produce a correct classification once D has been partitioned on feature A into v partitions $D_1, \dots, D_v$ is given by:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j) \tag{2.3}$$

where $|D_j| / |D|$ represents the weight of the jth partition. A feature with high IG is a better predictor of the target variable. Typically, features are ranked according to their IG and the features with higher values (which have a better prediction capability with respect to the class labels) are chosen.
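
As a rough illustration of Equations (2.1) to (2.3), the following sketch computes IG for a nominal feature from paired lists of feature values and class labels. The function names (`info`, `info_gain`) and the toy data are hypothetical and chosen only for this example; this is a minimal sketch, not the implementation used in this thesis.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy Info(D) of a list of class labels, following Eq. (2.2)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D), following Eqs. (2.1) and (2.3)."""
    total = len(labels)
    # Group the class labels by the value feature A takes (the partitions D_j).
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the partitions, with weights |D_j| / |D|.
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a


# Hypothetical toy data: presence (1) or absence (0) of a term in six documents.
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
term_a = [1, 1, 1, 0, 0, 0]  # separates the two classes perfectly
term_b = [1, 0, 1, 0, 1, 0]  # carries little information about the class

gain_a = info_gain(term_a, labels)
gain_b = info_gain(term_b, labels)
print(gain_a, gain_b)
```

Under these assumptions `term_a` obtains the larger gain and would therefore be ranked above `term_b`.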

2.4.1.2 Chi-squared (χ2) statistic

The Chi-squared (χ2) statistic measures the lack of independence between a feature and a class to which a document is related [111]. The more independent a feature is of a certain class, the less relevant it is with respect to that class. The Chi-squared statistic is calculated for each term and, after ranking all the features, the most relevant are chosen. In the formal definition of Chi-squared, two features A and B are considered; their values are paired as $(A_i, B_j)$, where A can take any of c values $a_1, \dots, a_c$ and B can take any of r values $b_1, \dots, b_r$. As explained in [41], the Chi-squared statistic is then calculated as:

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \tag{2.4}$$

where, with respect to $(A_i, B_j)$, $o_{ij}$ is the observed frequency and $e_{ij}$ is the expected frequency. $e_{ij}$ is calculated as:

$$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N} \tag{2.5}$$

where N is the number of instances, $\mathrm{count}(A = a_i)$ is the number of instances where the value for A is $a_i$ and $\mathrm{count}(B = b_j)$ is the number of instances where the value for B is $b_j$.
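
The calculation of Equation (2.4) can be sketched as follows for two paired nominal variables, for example term presence and class label. The function name `chi_squared` and the toy data are hypothetical; this is an illustrative sketch under those assumptions rather than the procedure used in this thesis.

```python
from collections import Counter


def chi_squared(a_values, b_values):
    """Chi-squared statistic of Eq. (2.4) for two paired nominal variables."""
    n = len(a_values)                            # N, the number of instances
    count_a = Counter(a_values)                  # count(A = a_i)
    count_b = Counter(b_values)                  # count(B = b_j)
    observed = Counter(zip(a_values, b_values))  # o_ij
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n  # e_ij
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2


# Hypothetical toy data: eight documents, two classes.
classes = ["x", "x", "x", "x", "y", "y", "y", "y"]
term_dependent = [1, 1, 1, 1, 0, 0, 0, 0]    # occurs only in class "x"
term_independent = [1, 1, 0, 0, 1, 1, 0, 0]  # occurs equally often in both classes

chi_dep = chi_squared(term_dependent, classes)
chi_ind = chi_squared(term_independent, classes)
print(chi_dep, chi_ind)
```

A term whose occurrence is independent of the class obtains a statistic of zero, while a term tied to one class obtains a large value and is ranked higher.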

In the comparative study of feature selection methods presented by Yang and Pedersen [111], the performance of the Chi-squared statistic was similar to that of IG when used as a ranking metric. Miner et al. [74] point out that the computational cost of the Chi-squared statistic is correlated with the size of the vocabulary.

2.4.1.3 Correlation Feature Selection (CFS)

Correlation Feature Selection (CFS) is used to identify and select sets of features which are “highly correlated with the class but with low intercorrelation” [107] in order to remove redundant or irrelevant features. Redundancy in this context is given by a feature being highly correlated with one or more other features. As presented by Witten et al. in [107], considering two nominal attributes A and B, their correlation is measured using symmetric uncertainty, which is defined as:

$$U(A, B) = 2\,\frac{H(A) + H(B) - H(A, B)}{H(A) + H(B)} \tag{2.6}$$

where H represents the entropy function and H(A, B) the joint entropy of A and B. Symmetric uncertainty can take values between 0 and 1. The relevance of a set of features using CFS is determined by:

$$\mathrm{CFS} = \frac{\sum_{j=1}^{m} U(A_j, C)}{\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{m} U(A_i, A_j)}} \tag{2.7}$$

where C in the numerator denotes the class and $(A_i, A_j)$ denotes a pair of attributes in the set of features. If all m attributes in a selected set of features are perfectly correlated with the class and with one another, the numerator (the total symmetric uncertainty with respect to the class) is m and the denominator is $\sqrt{m^2} = m$; thus the CFS value will be 1, which is the maximum value that can be obtained. In other words, the attributes in such a set are fully redundant. It is therefore better to focus on smaller subsets of features in order to find subsets whose features are highly correlated with the class label but have a low symmetric uncertainty with respect to one another.
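
The behaviour described above can be checked with a small sketch of Equations (2.6) and (2.7). The function names (`entropy`, `symmetric_uncertainty`, `cfs_merit`), the convention of returning 0 when the entropies sum to zero, and the toy data are all assumptions made for this illustration; they are not taken from [107] or from the implementation used in this thesis.

```python
from collections import Counter
from math import log2, sqrt


def entropy(values):
    """Entropy H of a list of nominal values (tuples give the joint entropy)."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def symmetric_uncertainty(a_values, b_values):
    """U(A, B) of Eq. (2.6); returns 0 if both variables are constant (assumed convention)."""
    h_a, h_b = entropy(a_values), entropy(b_values)
    h_ab = entropy(list(zip(a_values, b_values)))
    if h_a + h_b == 0:
        return 0.0
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)


def cfs_merit(features, labels):
    """CFS value of Eq. (2.7) for a list of feature columns and the class column."""
    numerator = sum(symmetric_uncertainty(f, labels) for f in features)
    denominator = sqrt(sum(symmetric_uncertainty(fi, fj)
                           for fi in features for fj in features))
    return numerator / denominator if denominator else 0.0


# Hypothetical toy data: four instances, two classes.
labels = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]  # perfectly correlated with the class
f2 = [0, 0, 1, 1]  # duplicate of f1, hence redundant
f3 = [0, 1, 0, 1]  # independent of the class and of f1

merit_single = cfs_merit([f1], labels)
merit_redundant = cfs_merit([f1, f2], labels)
merit_irrelevant = cfs_merit([f1, f3], labels)
print(merit_single, merit_redundant, merit_irrelevant)
```

In this toy case the redundant pair scores no higher than `f1` alone, which is why smaller subsets are preferred when values tie, and adding the irrelevant feature `f3` lowers the value.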

2.4.1.4 Feature Weighting with Term Frequency-Inverse Document Frequency (TF-IDF)

The Term Frequency-Inverse Document Frequency (TF-IDF) statistic weights terms by combining how frequent a term is in a document (TF) with how rare the term is with respect to the entire document set (IDF) [8]. TF-IDF is calculated as:

$$\text{TF-IDF}(d, t) = \mathrm{TF}(d, t) \times \mathrm{IDF}(t) \tag{2.8}$$

where d represents a document, t represents a term, TF is the term frequency and IDF is the inverse document frequency. Term Frequency (TF) is the number of occurrences of a term (feature) in a document and is calculated as:

$$\mathrm{TF}(d, t) = \sum_{i \in d}^{|d|} \mathbf{1}\{d_i = t\} \tag{2.9}$$

Document Frequency (DF) is the number of documents that contain a particular term. Inverse Document Frequency (IDF) [63], on the other hand, addresses the issue of DF not being a good discriminator by considering the importance of terms in relation to the total number of documents and to the number of documents in which the term is contained. IDF is calculated as:

$$\mathrm{IDF}(t) = \log \frac{1 + |d|}{|d_t|} \tag{2.10}$$

where $|d|$ is the total number of documents and $|d_t|$ is the number of documents in which the term t is contained. The resulting TF-IDF weight is assigned to each unique term in the document set and all the terms are ranked from the highest to the lowest weight value, indicating their relevance. A user-defined threshold k is used to select the top k terms.
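
The ranking step can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: IDF is read as the logarithm of (1 + number of documents) divided by the number of documents containing the term, and each term's single weight is taken as its maximum TF-IDF over all documents, since the aggregation across documents is not specified above. The function names and the toy documents are likewise hypothetical.

```python
from math import log


def tf(document, term):
    """Number of occurrences of the term in a tokenised document, as in Eq. (2.9)."""
    return sum(1 for token in document if token == term)


def idf(documents, term):
    """Assumed reading of Eq. (2.10): log((1 + |d|) / |d_t|)."""
    containing = sum(1 for doc in documents if term in doc)
    return log((1 + len(documents)) / containing)


def top_k_terms(documents, k):
    """Rank the vocabulary by TF-IDF weight and keep the top k terms."""
    vocabulary = {token for doc in documents for token in doc}
    weights = {}
    for term in vocabulary:
        term_idf = idf(documents, term)
        # Assumption: one weight per term, the maximum over all documents.
        weights[term] = max(tf(doc, term) * term_idf for doc in documents)
    # Highest weight first; ties are broken alphabetically for reproducibility.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]


# Hypothetical toy document set, already tokenised.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked", "dog"],
    ["the", "cat", "purred"],
]
ranking = top_k_terms(docs, len({t for d in docs for t in d}))
print(ranking)
```

With these toy documents the term that occurs in every document is ranked last, while a term that is frequent within a single document is ranked first.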

2.4.1.5 Feature selection methods used

For comparison purposes with respect to the summarisation techniques proposed in this thesis, two alternative feature selection techniques were considered: (i) Term Frequency-Inverse Document Frequency (TF-IDF) [63] plus Chi-squared [113] and (ii) TF-IDF plus Correlation-based Feature Selection (CFS) [40]. Both combine TF-IDF with another feature selection technique, namely Chi-squared and CFS respectively. Chi-squared was chosen because it is an established and widely used feature selection method that calculates the Chi-squared statistic of each feature in relation to a given class in order to identify the features that have relevance with respect to the class. CFS, on the other hand, identifies subsets of features that are uncorrelated with each other but that, as a subset, are highly correlated with a class. CFS is not as widely used as Chi-squared but presents an interesting and different idea with respect to the selection of relevant features. Information Gain was not used because experiments (not reported in this thesis) were found to produce a very similar performance to that obtained using Chi-squared, thus corroborating the study by Yang and Pedersen [111].

The reasons for using two different feature selection techniques in conjunction with TF-IDF were: (i) to see how well they performed in conjunction, (ii) to demonstrate the effectiveness of combining TF-IDF with other feature selection methods to improve the selection of relevant attributes and (iii) to compare how well they performed with data sets containing different types of data. Due to the different nature of the feature selection techniques used in conjunction with TF-IDF, different search methods were used in each case: (i) a ranking search method in the case of Chi-squared and (ii) a genetic search method in the case of CFS.
