
Several different attributes have been considered in previous works for capturing the characteristics of datasets. A common baseline approach is presented in (Lindner & Studer 1999) where several statistical measures are used to characterise datasets. Note that no motivation is given for the choice of characteristics or meta-attributes. The meta-attributes used include number of instances, number of features of the dataset, ratio of symbolic features, number of classes, default error rate, standard deviation of class distribution, relative probability of defective instances, number of records with missing values, relative probability of missing values and number of missing values.

3.2. Predicting When to use Semantic Indexing 61

| Attribute Name | Description |
|---|---|
| AveTermCount | Average number of terms per document |
| MaxDocFreq | Maximum term document frequency |
| AveDocFreq | Average term document frequency |
| MaxIDF | Maximum term Inverse Document Frequency |
| AveIDF | Average term Inverse Document Frequency |
| Nearest Neighbour Similarity | Average similarity of nearest neighbours |
| AveNSim | Average neighbourhood similarity |
| MinNSim | Minimum neighbourhood similarity |
| MaxNSim | Maximum neighbourhood similarity |

Table 3.4: Summary of dataset attributes used for meta-case representation.

Note that not all of these meta-attributes are useful for our task of predicting the performance of semantic indexing on text datasets. For example, term-document matrices are typically sparse, with most feature values missing in any one document. Thus, it is unlikely that a measure of missing values is a good indicator of the performance of semantic indexing. Also, the numbers of instances and attributes are the same for all datasets except the incident report datasets. Thus, these are also excluded from consideration as features. The authors also propose additional information-theoretic features, which are only applicable to symbolic dataset features and thus are not applicable to text datasets.

The authors in (Peng, Flach, Soares & Brazdil 2002) propose using meta-attributes created by measuring the characteristics of decision trees generated from the datasets. Here also, no justification is given for this choice of meta-attributes. This approach involves generating a decision tree from the dataset and then measuring attributes such as the number of nodes, the number of branches and the height of the decision tree. Given that our classifier of choice is kNN, it is not clear how useful the characteristics of a decision tree will be for predicting the performance of semantic indexing used with kNN.

The work in (Cummins & Bridge 2011) presents a meta-learning approach for the selection of case-base maintenance algorithms. The meta-attributes used to characterise case-bases were chosen to model the complexity of these case-bases, as case-base complexity is seen as the important predictor of the performance of case-base maintenance algorithms. The meta-attributes considered are divided into three categories: Measures of Overlap of Attribute Values, Measures of Separability of Classes, and Measures of Geometry, Topology and Density of Manifolds. Note that all the meta-attributes in the three categories are supervised, meaning that the class labels of data instances (documents in our case) need to be considered. However, recall that the VSM and semantic indexing are not limited to supervised tasks. On the contrary, both the VSM and semantic indexing were originally designed for unsupervised document retrieval. Accordingly, it is highly desirable to consider unsupervised meta-attributes that are applicable to both supervised and unsupervised tasks.

Considering the limitations of the meta-attributes proposed in previous works, and the lack of strong motivation behind them, we propose a new set of meta-attributes. Recall that semantic indexing is applied to the term-document space representation of a document collection and not to the actual document collection itself. Thus, when selecting meta-attributes, we choose the types of attributes that are typically used for creating vector representations of documents, e.g. term frequency and inverse document frequency. Also, because our classifier of choice is kNN, we use attributes that describe the neighbourhood structure of the datasets. A summary of the attributes we consider is presented in Table 3.4. We describe these attributes in detail in the following sections. A table of the attributes and corresponding values used in our experiments is provided in Appendix D.

Average Terms Per Document

This is a measure of the average number of terms per document, calculated after text preprocessing: stopword removal, term normalisation and feature selection. Thus, the count of terms in a document is restricted to the terms from the indexing vocabulary. The term count of a single document is calculated as shown in equation 3.3.

$$\mathrm{TermCount}(d_i) = \sum_{t_j \in T,\; t_j \in d_i} 1 \tag{3.3}$$

where $t_j$ is a term in document $d_i$ and $T$ is the entire indexing vocabulary. The average term count for the entire dataset $D$ is calculated by taking the average term count over all documents in the dataset, as in equation 3.4.

$$\mathrm{AveTermCount} = \frac{\sum_{d_i \in D} \mathrm{TermCount}(d_i)}{|D|} \tag{3.4}$$
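The two equations above can be illustrated with a minimal Python sketch. It assumes that documents are already preprocessed into token lists and that the term count refers to distinct vocabulary terms; if token occurrences were intended instead, the set intersection would be replaced by a count of matching tokens. The names `term_count`, `ave_term_count`, `docs` and `vocabulary` are hypothetical and do not come from the original experiments.

```python
from typing import Iterable, List, Set


def term_count(doc_tokens: Iterable[str], vocabulary: Set[str]) -> int:
    """Count the distinct terms of one document that belong to the vocabulary.

    Assumption: distinct terms are counted, not token occurrences.
    """
    return len(set(doc_tokens) & vocabulary)


def ave_term_count(docs: List[List[str]], vocabulary: Set[str]) -> float:
    """Average the per-document term count over the whole dataset."""
    return sum(term_count(doc, vocabulary) for doc in docs) / len(docs)


# Toy dataset (illustrative only): "sat" and "the" are outside the vocabulary.
docs = [["cat", "sat", "mat"], ["cat", "dog"], ["dog", "the", "dog"]]
vocabulary = {"cat", "dog", "mat", "barked"}

counts = [term_count(doc, vocabulary) for doc in docs]
average = ave_term_count(docs, vocabulary)
print(counts, average)
```

Terms outside the indexing vocabulary are ignored, which mirrors the restriction to terms that survive feature selection.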

Document Frequency

The document frequency of a term $t_i$ is a count of the number of documents in which $t_i$ occurs. Document frequency is often used as a feature selection technique under the premise that very rare terms are not informative and thus do not contribute much to document retrieval. At the same time, terms that appear in almost all documents are also not very discriminatory and can be considered noisy in the term-document space. Such high-frequency terms are also likely to co-occur with almost every other term, thus polluting the generalisation process. Hence, we utilise two metrics to measure the effect of document frequency: Maximum DF (MaxDocFreq), which is the maximum document frequency over all terms, and Average DF (AveDocFreq), which is the average document frequency over all terms.
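As a rough illustration of MaxDocFreq and AveDocFreq, the following Python sketch computes both from tokenised documents. It assumes the token lists contain only terms from the indexing vocabulary; the function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def document_frequencies(docs: List[List[str]]) -> Dict[str, int]:
    """Map each term to the number of documents in which it occurs."""
    df: Counter = Counter()
    for tokens in docs:
        # A term is counted once per document, however often it repeats.
        df.update(set(tokens))
    return dict(df)


def max_and_ave_doc_freq(docs: List[List[str]]) -> Tuple[int, float]:
    """Return (MaxDocFreq, AveDocFreq) over all terms in the dataset."""
    df = document_frequencies(docs)
    return max(df.values()), sum(df.values()) / len(df)


# Toy dataset (illustrative only).
docs = [["cat", "mat"], ["cat", "dog"], ["dog", "barked", "dog"]]
df = document_frequencies(docs)
max_df, ave_df = max_and_ave_doc_freq(docs)
print(df, max_df, ave_df)
```

Converting each document to a set before counting is what separates document frequency from raw term frequency.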

Inverse Document Frequency

Inverse Document Frequency (IDF) is a function designed to give a weighting inversely proportional to the document frequency of terms. IDF captures the premise that terms with very high document frequency are less informative than terms that occur less often. The formula for IDF is given in equation 3.5, where $N$ is the total number of documents and $df(t)$ is the document frequency of $t$.

$$\mathrm{IDF}(t) = \log_2 \frac{N}{df(t)} \tag{3.5}$$

We use the Maximum IDF (MaxIDF) and the Average IDF (AveIDF) to obtain a measure of rare terms in our datasets.
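A minimal Python sketch of MaxIDF and AveIDF, following equation 3.5, is given below. It assumes tokenised documents restricted to the indexing vocabulary; `idf_values` and the toy data are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_values(docs: List[List[str]]) -> Dict[str, float]:
    """Compute IDF(t) = log2(N / df(t)) for every term in the dataset."""
    n_docs = len(docs)
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log2(n_docs / freq) for term, freq in df.items()}


# Toy dataset (illustrative only): "cat" occurs in every document.
docs = [["cat", "mat"], ["cat", "dog"], ["cat", "barked"], ["cat", "dog"]]
idf = idf_values(docs)
max_idf = max(idf.values())
ave_idf = sum(idf.values()) / len(idf)
print(idf, max_idf, ave_idf)
```

A term that occurs in every document receives an IDF of zero, while a term that occurs in a single document receives the maximum value, $\log_2 N$.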

Nearest Neighbour Similarity

We measure the tightness of the clustering of documents in a dataset using the distance between each document and the other documents in its neighbourhood, as shown in Figure 3.2. The Nearest Neighbour Similarity of a document $d_j$ is calculated by iteratively retrieving successively larger neighbourhoods $k$ of $d_j$, up to the neighbourhood size $K$ (we use $K = 10$), and computing the similarity between $d_j$ and all documents in its neighbourhood. This is shown in equation 3.6.

$$P_k(d_j) = \frac{\sum_{i=1}^{k} \mathrm{Sim}(d_j, d_i)}{k} \tag{3.6}$$

where $\mathrm{Sim}(d_j, d_i)$ is the cosine similarity between documents $d_j$ and $d_i$. The final Nearest Neighbour Similarity measure for the entire dataset is computed as the average Nearest Neighbour Similarity of all documents $d_j$.

Figure 3.2: Nearest Neighbour Similarity calculated using the distance of a target document $d_j$ to its k nearest neighbours.
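The computation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the experiments: documents are given as dense term-weight vectors, each $P_k(d_j)$ is taken to be the mean similarity to the $k$ nearest neighbours, and the per-document score is assumed to be the mean of $P_k(d_j)$ over $k = 1, \dots, K$, since the text does not state how the successive neighbourhoods are combined. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbour_similarity(vectors: List[Sequence[float]], j: int,
                                 max_k: int = 10) -> float:
    """Score document j from its neighbourhoods of size 1..max_k.

    Assumption: the per-document score is the mean of P_k over all k.
    Requires at least two documents.
    """
    # Similarities to every other document, most similar first.
    sims = sorted(
        (cosine(vectors[j], v) for i, v in enumerate(vectors) if i != j),
        reverse=True,
    )
    max_k = min(max_k, len(sims))
    # P_k: mean similarity between document j and its k nearest neighbours.
    p_values = [sum(sims[:k]) / k for k in range(1, max_k + 1)]
    return sum(p_values) / len(p_values)


def dataset_nn_similarity(vectors: List[Sequence[float]], max_k: int = 10) -> float:
    """Average the per-document score over all documents in the dataset."""
    scores = [nearest_neighbour_similarity(vectors, j, max_k)
              for j in range(len(vectors))]
    return sum(scores) / len(scores)


# Toy dataset (illustrative only): two identical documents and one orthogonal one.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = dataset_nn_similarity(vectors, max_k=2)
print(score)
```

If only the largest neighbourhood were intended, the per-document score would instead be the last element of `p_values`.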

Neighbourhood Similarity

Rather than the similarity of a document to its nearest neighbours, this metric calculates the average pair-wise similarity between all $k$ nearest neighbours of the target document $d_j$, as shown in Figure 3.3. We use a neighbourhood size of $k = 10$. We then calculate the average, minimum and maximum neighbourhood similarity over all documents to obtain the Average Neighbourhood Similarity (AveNSim), Minimum Neighbourhood Similarity (MinNSim) and Maximum Neighbourhood Similarity (MaxNSim) respectively for that dataset.

Figure 3.3: Neighbourhood similarity of document $d_j$, measured using the distance between the k nearest neighbours of $d_j$.

The average similarity between the nearest neighbours of a document tells us how tightly clustered the neighbourhood of that document is. In turn, the aggregation over all documents provides us with information about how tightly clustered documents are in the entire term-document space.
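A minimal Python sketch of AveNSim, MinNSim and MaxNSim is given below. It assumes dense term-weight vectors, cosine similarity as the pair-wise measure, and that the target document itself is excluded from its own neighbourhood; the function names and toy data are hypothetical. At least three documents and a neighbourhood size of at least two are needed for a pair of neighbours to exist.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighbourhood_similarity(vectors: List[Sequence[float]], j: int,
                             k: int = 10) -> float:
    """Average pair-wise similarity among the k nearest neighbours of document j."""
    others = [i for i in range(len(vectors)) if i != j]
    # Rank the other documents by similarity to document j, most similar first.
    neighbours = sorted(
        others, key=lambda i: cosine(vectors[j], vectors[i]), reverse=True
    )[:k]
    pairs = list(combinations(neighbours, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def nsim_summary(vectors: List[Sequence[float]], k: int = 10) -> Tuple[float, float, float]:
    """Return (AveNSim, MinNSim, MaxNSim) over all documents."""
    scores = [neighbourhood_similarity(vectors, j, k) for j in range(len(vectors))]
    return sum(scores) / len(scores), min(scores), max(scores)


# Toy dataset (illustrative only): a tight group of three and a pair apart from it.
vectors = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ave_nsim, min_nsim, max_nsim = nsim_summary(vectors, k=2)
print(ave_nsim, min_nsim, max_nsim)
```

In the toy data, the three identical documents have neighbourhoods with pair-wise similarity 1, while each of the remaining two has a neighbourhood that spans both groups and therefore scores 0.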
