
\[ \mathrm{rel}(t_i, t_j) = \vec{t}_i \cdot \vec{t}_j \tag{2.20} \]

Therefore, the term vectors $\vec{t}_i$ and $\vec{t}_j$ need not be known so long as the similarity between terms $t_i$ and $t_j$, $\mathrm{rel}(t_i, t_j)$, is known (Tsatsaronis & Panagiotopoulou 2009). Accordingly, equation 2.19 can be rewritten as follows:

\[ \mathrm{Sim}(\vec{q}, \vec{d}) = \sum_{i=1}^{|q|} \sum_{j=1}^{|d|} w_i \, w_j \, \mathrm{rel}(t_i, t_j) \tag{2.21} \]

Thus, equation 2.21 allows any approach to be used for obtaining $\mathrm{rel}(t_i, t_j)$. In this way, the GVSM provides a convenient framework in which the computation of semantic relatedness is separated from semantic indexing, so that any effective approach for computing semantic relatedness can be utilised for semantic indexing.
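The separation expressed by equation 2.21 can be illustrated with a minimal Python sketch in which the relatedness function is passed in as a parameter. The function names, the toy relatedness table and the term weights below are illustrative assumptions, not taken from the cited work.

```python
from typing import Callable, Dict


def gvsm_similarity(
    query: Dict[str, float],
    doc: Dict[str, float],
    rel: Callable[[str, str], float],
) -> float:
    """Equation 2.21: sum of w_i * w_j * rel(t_i, t_j) over all
    query terms t_i and document terms t_j."""
    return sum(
        w_i * w_j * rel(t_i, t_j)
        for t_i, w_i in query.items()
        for t_j, w_j in doc.items()
    )


# Hypothetical relatedness scores, invented purely for illustration.
_TOY_REL = {("car", "automobile"): 0.9, ("car", "engine"): 0.5}


def toy_rel(t_i: str, t_j: str) -> float:
    """Symmetric lookup in the toy table; identical terms score 1."""
    if t_i == t_j:
        return 1.0
    return _TOY_REL.get((t_i, t_j), _TOY_REL.get((t_j, t_i), 0.0))


def exact_match_rel(t_i: str, t_j: str) -> float:
    """Relatedness that only credits identical terms (plain VSM behaviour)."""
    return 1.0 if t_i == t_j else 0.0


query = {"car": 1.0}
doc = {"automobile": 2.0, "engine": 1.0}

# With toy_rel the sum is 1*2*0.9 + 1*1*0.5; with exact matching the
# query and document share no term, so the similarity is zero.
semantic_score = gvsm_similarity(query, doc, toy_rel)
lexical_score = gvsm_similarity(query, doc, exact_match_rel)
```

Swapping `toy_rel` for any other relatedness function changes the scores without touching `gvsm_similarity`, which is the decoupling the text describes.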

Semantic indexing using the GVSM has been widely applied to text classification, albeit sometimes without explicit reference to the name GVSM, e.g. (Chakraborti, Wiratunga, Lothian & Watt 2007, Gabrilovich & Markovitch 2009, Nasir, Karim, Tsatsaronis & Varlamis 2011). Note also that LSI can be used with the GVSM: SVD is used for acquiring semantic relatedness between terms and the GVSM is used for semantic indexing. This further demonstrates the advantage of separating semantic relatedness computation from semantic indexing. In the next sub-section, we present a detailed review of several approaches that have been proposed for semantic relatedness computation.
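As a rough illustration of combining LSI with the GVSM, the sketch below derives a term-term relatedness matrix from a truncated SVD and then plugs it into the double sum of equation 2.21, written in matrix form. The use of cosine similarity between latent term vectors, the toy term-document matrix and the function names are all assumptions made for this example.

```python
import numpy as np


def lsi_term_relatedness(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Return a (terms x terms) relatedness matrix from a rank-k SVD.

    Assumption: relatedness is the cosine between term vectors in the
    k-dimensional latent space.
    """
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]                # terms in latent space
    norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0                   # avoid division by zero
    term_vecs = term_vecs / norms
    return term_vecs @ term_vecs.T


def gvsm_similarity(q: np.ndarray, d: np.ndarray, rel: np.ndarray) -> float:
    """Matrix form of equation 2.21: q^T R d over the whole vocabulary."""
    return float(q @ rel @ d)


# Toy corpus (rows: car, automobile, engine, banana; columns: documents).
# "car" and "automobile" never co-occur but both co-occur with "engine".
term_doc = np.array([
    [1.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0],   # automobile
    [1.0, 1.0, 0.0],   # engine
    [0.0, 0.0, 2.0],   # banana
])

rel = lsi_term_relatedness(term_doc, k=2)

q = np.array([1.0, 0.0, 0.0, 0.0])   # query containing only "car"
d = np.array([0.0, 1.0, 0.0, 0.0])   # document containing only "automobile"

lexical = float(q @ d)                # plain dot product, no shared terms
semantic = gvsm_similarity(q, d, rel)
```

In this toy corpus the plain dot product is zero because the query and document share no term, whereas the SVD-derived relatedness links "car" and "automobile" through their shared co-occurrence with "engine".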

2.3 Supervised Semantic Indexing

The main limitation of conventional semantic indexing approaches for supervised tasks is that these techniques are agnostic to class knowledge. This means that the semantic representations produced using these approaches are not necessarily the best fit for the class distribution of the document collection (Aggarwal & Zhai 2012). This is a well-recognised problem, and a number of supervised extensions to traditional semantic indexing approaches have been proposed. We discuss the most popular of these approaches in the following subsections.

2.3.1 Supervised LSI

An extension of LSI called supervised LSI (SLSI), which iteratively computes SVD on term similarity matrices of separate classes, is presented in (Sun, Chen, Zeng, Lu, Shi & Ma 2004). A separate term-document matrix is constructed for each class and, in each iteration, SVD is performed on each class-specific term-document matrix. The most discriminative eigenvector across all categories is selected as the basis vector in the current iteration. The effect of the selected eigenvector is then subtracted from the original term-document matrix. The iteration continues until the dimension of the resulting space reaches a predefined threshold. The evaluation compared three types of representations, standard BOW without semantic indexing, unsupervised LSI and SLSI, using kNN and SVM classifiers. Results show that SLSI performs better than LSI. However, SLSI only achieved marginal gains over BOW using kNN, while both SLSI and LSI failed to perform better than SVM.
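The iterative procedure described above can be sketched roughly as follows. The text does not state how discriminativeness is measured, so the scoring rule used here (in-class projection energy minus out-of-class projection energy) is an assumed placeholder rather than the criterion of Sun et al.; the function name and toy data are likewise hypothetical, and at least two classes are assumed.

```python
import numpy as np


def slsi_basis(term_doc: np.ndarray, labels, n_dims: int) -> np.ndarray:
    """Greedy class-aware basis selection loosely following the SLSI outline.

    term_doc: (terms x docs) matrix; labels: one class label per document.
    Returns a (terms x n_dims) matrix whose columns are the chosen vectors.
    """
    residual = term_doc.astype(float).copy()
    labels = np.asarray(labels)
    classes = np.unique(labels)
    basis = []
    for _ in range(n_dims):
        best_score, best_vec = -np.inf, None
        for c in classes:
            in_class = residual[:, labels == c]
            out_class = residual[:, labels != c]
            # SVD of the class-specific term-document matrix.
            U, _, _ = np.linalg.svd(in_class, full_matrices=False)
            for u in U.T:
                # ASSUMED criterion (not from the source): how much more
                # strongly in-class documents project onto u than the rest.
                score = np.mean((u @ in_class) ** 2) - np.mean((u @ out_class) ** 2)
                if score > best_score:
                    best_score, best_vec = score, u
        basis.append(best_vec)
        # Subtract the effect of the selected vector from the matrix.
        residual = residual - np.outer(best_vec, best_vec @ residual)
    return np.stack(basis, axis=1)


# Toy data: two classes with disjoint vocabularies (columns are documents).
toy_term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
toy_labels = [0, 0, 1, 1]

B = slsi_basis(toy_term_doc, toy_labels, n_dims=2)
projected_docs = B.T @ toy_term_doc   # documents in the 2-dimensional space
```

On this toy data each selected vector captures one class, so documents of different classes end up on different axes of the reduced space.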

2.3.2 Sprinkled LSI

A more promising supervised extension to LSI uses an approach called sprinkling, in which class-specific artificial terms are appended to the representations of documents of the corresponding class (Chakraborti, Lothian, Wiratunga & Watt 2006). LSI is then applied to the sprinkled term-document space, resulting in a concept space that better reflects the underlying class distribution of documents. An overview of the sprinkling process is shown in Figure 2.7.

Figure 2.7: Sprinkling.

Sprinkling involves generating a set of artificial terms for each class in the training corpus. Document representations in the term-document matrix D are then augmented with the artificial terms that correspond to their respective class. A higher-order term-relatedness approach, e.g. LSI, is then applied to the augmented term-document space, which results in stronger associations between terms that occur more often within documents of the same class. An important consideration for sprinkling is the number of artificial terms to sprinkle. In (Chakraborti et al. 2006), the authors found that sprinkling 16 terms per class gave optimal performance. A more sophisticated approach called adaptive sprinkling, which optimises the number of sprinkled terms for each individual dataset based on dataset complexity, is presented in (Chakraborti, Wiratunga, Lothian & Watt 2007). Adaptive sprinkling exploits the confusion matrix produced by a classifier on each dataset. A confusion matrix records the performance of the classifier such that the columns of the matrix represent the classes predicted by the classifier and the rows represent the actual classes of the instances. The non-diagonal entries of the confusion matrix therefore represent the instances that are misclassified by the classifier. The larger the entry in a non-diagonal cell, the harder that class is for the classifier. In this way, adaptive sprinkling allocates more artificial terms to the harder classes.
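The sprinkling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the proportional allocation rule in `allocate_terms`, the removal of the artificial rows after the SVD, and all function names are choices made for this example and are not confirmed by the text.

```python
import numpy as np


def sprinkle(term_doc: np.ndarray, labels, terms_per_class: dict) -> np.ndarray:
    """Append class-specific artificial term rows to a (terms x docs) matrix.

    Each artificial term occurs in every document of its class and in no
    document of any other class.
    """
    labels = np.asarray(labels)
    rows = []
    for c, n_terms in terms_per_class.items():
        indicator = (labels == c).astype(float)   # 1 for documents of class c
        rows.extend([indicator] * n_terms)
    if not rows:
        return term_doc.astype(float).copy()
    return np.vstack([term_doc.astype(float)] + rows)


def allocate_terms(confusion, total_terms: int) -> dict:
    """Split a budget of artificial terms across classes.

    ASSUMED rule: proportional to each class's misclassified instances,
    i.e. the off-diagonal sum of its row (rows = actual classes).
    """
    confusion = np.asarray(confusion, dtype=float)
    errors = confusion.sum(axis=1) - np.diag(confusion)
    if errors.sum() == 0:
        share = np.full(len(errors), 1.0 / len(errors))
    else:
        share = errors / errors.sum()
    return {c: int(round(total_terms * s)) for c, s in enumerate(share)}


def sprinkled_lsi(term_doc, labels, terms_per_class, k: int) -> np.ndarray:
    """Rank-k LSI on the sprinkled matrix, keeping only the original terms."""
    augmented = sprinkle(term_doc, labels, terms_per_class)
    U, s, Vt = np.linalg.svd(augmented, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    return approx[: term_doc.shape[0]]   # drop artificial rows (assumption)


# Toy example: 3 terms, 4 documents, 2 classes.
D = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
doc_labels = [0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
toy_confusion = [[8, 2],
                 [6, 4]]

allocation = allocate_terms(toy_confusion, total_terms=8)
augmented = sprinkle(D, doc_labels, allocation)
smoothed = sprinkled_lsi(D, doc_labels, allocation, k=2)
```

In this toy confusion matrix class 1 has more misclassified instances than class 0, so it receives the larger share of the artificial terms.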

Sprinkled LSI was compared with unsupervised LSI and SVM on a number of classification tasks. Results showed sprinkled LSI to significantly outperform both unsupervised LSI and SVM. However, a major limitation of sprinkling and adaptive sprinkling is that both techniques are only applicable to higher-order term relations. This is because the 'sprinkled' term-document space has no effect on first-order term relations. Therefore, there is a need for a more general approach for utilising class knowledge for semantic indexing. In particular, we need a method that is independent of the type and order of semantic relatedness. Furthermore, adaptive sprinkling requires the number of artificial terms used for sprinkling to be optimised for each individual class, which introduces a significant overhead if the number of classes is large.


Figure 2.8: Graphical model of LDA and sLDA

2.3.3 Supervised LDA

A supervised version of LDA called sLDA is presented in (Blei & McAuliffe 2008). Here, a response variable (class label, real value, cardinal or ordinal integer value) associated with each document is added to the LDA model. Thus, the topic model is learned jointly for the documents and responses such that the resultant topics are good predictors of the response variables. The difference between the LDA and sLDA topic modelling approaches is illustrated in Figure 2.8.

Figure 2.8 is a graphical model representation of LDA (top) and sLDA (bottom). As can be observed, the main difference between the two models is that sLDA includes a response variable $Y_d$ which is conditioned on the response parameters $\eta$ and $\delta$. This means that prediction is also built into sLDA, i.e. given any document, it is possible to predict the response variable $Y_d$ directly from the sLDA model without the need to use any classifier. Thus, sLDA is more than simply a semantic indexing technique.
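To make the built-in prediction step concrete, the sketch below covers only the simplest case, a real-valued response with a Gaussian link, where the expected response is the inner product of the coefficients $\eta$ with the document's average topic proportions. It assumes that per-word topic responsibilities have already been inferred for the document; the function and variable names are hypothetical, and the dispersion parameter $\delta$ is omitted because it does not affect the mean in this case.

```python
import numpy as np


def slda_predict_gaussian(phi: np.ndarray, eta: np.ndarray) -> float:
    """Expected response for one document under a Gaussian response model.

    phi: (n_words x n_topics) per-word topic responsibilities, assumed to
         come from a previously fitted model's inference step.
    eta: (n_topics,) response coefficients.
    """
    z_bar = phi.mean(axis=0)       # average topic proportions of the document
    return float(eta @ z_bar)


# Toy document of four words over two topics (values invented).
toy_phi = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 0.0],
])
toy_eta = np.array([2.0, -2.0])

# z_bar = [0.75, 0.25], so the prediction is 2*0.75 - 2*0.25.
prediction = slda_predict_gaussian(toy_phi, toy_eta)
```

No separate classifier or regressor is trained here: the prediction is read directly from the model parameters, which is the property the paragraph above highlights.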

