
evaluation methodology and comments on the results. Section 4.4 describes the related work and, finally, conclusions are reported in Section 4.5.

4.2 Cluster Based Relevance Modelling

To obtain a better pseudo-relevant set, we formulated a new cluster based re-ranking function. Every re-ranking approach shares a set of common steps: obtaining the initial retrieval, selecting the set of documents to re-rank, the re-ranking process itself and, optionally, further processing based on the re-ranked results.

For the initial retrieval, and in line with the topic of this thesis, we chose the high-performance LM framework. More precisely, we performed the initial retrieval assuming a multinomial model with Dirichlet smoothing (see Eq. 2.7). The second step is to fix the set of top documents to be re-ranked ($d_{init}$); hereinafter we refer to the size of this set as the parameter $N$. The next phase in a cluster based technique is to perform the document clustering. For our proposal, we chose a clustering algorithm with overlapping. Once the top documents are clustered, we calculate the cluster query likelihood for every resulting cluster. The cluster query likelihoods and the document query likelihoods are then combined by our proposed retrieval formula, and the documents are re-ranked according to the new scores. Finally, the top documents of the new ranking are used as the pseudo-relevant set (PRS) to feed a query expansion process based on RM.
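Eq. 2.7 is not reproduced in this section, so the following is only a minimal sketch of the initial scoring step, assuming the standard Dirichlet-smoothed estimate $P(w|d) = (tf(w,d) + \mu P(w|Col)) / (|d| + \mu)$. The function name, the argument names and the default value of `mu` are illustrative assumptions, not values taken from the thesis.

```python
import math
from collections import Counter


def dirichlet_log_query_likelihood(query_terms, doc_terms, collection_tf,
                                   collection_len, mu=1000.0):
    """Return log P(q|d) under a multinomial model with Dirichlet smoothing.

    collection_tf maps each term to its total count in the collection and
    collection_len is the total number of term occurrences (both assumed
    to be precomputed by the caller).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for term in query_terms:
        # Background (collection) probability of the term.
        p_background = collection_tf.get(term, 0) / collection_len
        p_term = (tf[term] + mu * p_background) / (doc_len + mu)
        if p_term == 0.0:
            # Term unseen in both document and collection.
            return float("-inf")
        log_p += math.log(p_term)
    return log_p
```

Working in log space avoids underflow for long queries; the later steps of the method can consume these log scores directly.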

For the discussion of how to perform the initial retrieval we refer the reader to Chapter 2. Next, we address those aspects of our proposal not discussed before, namely: the clustering algorithm, the estimation of the cluster query likelihood, the re-ranking process and the application of RM.

4.2.1 Clustering Algorithm

Once the initial ranking is obtained, clustering is performed over the top $N$ documents. The use of clustering algorithms with overlapping has already proved successful in cluster based retrieval (Kurland and Lee, 2004). Indeed, initial approaches to query-specific clustering (Liu and Croft, 2004) were not conclusive, and it was only after incorporating clustering algorithms with overlapping (Liu and Croft, 2008) that the results improved. As we explained before, one of the main points of our method is to use the information provided by bad clusters to avoid non-relevant documents in the pseudo-relevant set. To do this we used a clustering algorithm that supports overlapping, i.e. one document can belong to one or more clusters.

The straightforward choice based on previous work would be a k-nearest neighbours (k-NN) algorithm, but k-NN forces each document to have exactly k neighbours. This is undesirable in our approach because we exploit the fact that a document belongs to a low-scored cluster. If we had used k-NN, a non-relevant document with low query likelihood and no close neighbours would still attract other documents that, although not close to it, are simply the closest ones available.

We therefore decided to cluster the documents according to a given threshold $t$, grouping with each document those neighbours whose similarity to it is at least $t$. We call this way of grouping thr-NN. Given a document $d_i$, its neighbourhood is the set of documents $d_j$ such that $sim(d_i, d_j) \geq t$. The purpose of this algorithm is to allow non-relevant documents to be isolated in singletons (Lu et al., 1996).
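The thr-NN grouping can be sketched as follows. This is a minimal illustration that assumes a precomputed, symmetric similarity matrix; the function name `thr_nn_clusters` is hypothetical, and the choice to keep one neighbourhood per document without merging identical ones is an assumption, since the text does not specify it.

```python
def thr_nn_clusters(sim, t):
    """Build one overlapping cluster (neighbourhood) per document.

    sim is a square list of lists where sim[i][j] is the similarity
    between documents i and j. The cluster of document i contains i
    itself plus every j with sim[i][j] >= t.
    """
    n = len(sim)
    clusters = []
    for i in range(n):
        neighbourhood = {j for j in range(n) if j == i or sim[i][j] >= t}
        clusters.append(neighbourhood)
    return clusters


# Toy example: documents 0 and 1 are similar, document 2 is isolated.
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_clusters = thr_nn_clusters(toy_sim, t=0.5)
```

In the toy example, document 2 has no neighbour above the threshold, so it ends up in a singleton, which is the behaviour that k-NN would not allow.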

Term Frequency-Inverse Document Frequency (tf·idf) was used as the document representation in the clustering algorithm. tf·idf measures the importance of a term for describing a document based not only on the number of times it appears in the document, but also on the number of documents in which it appears. A term that appears very rarely in the collection should be given more weight when describing a document, as it is very specific. The weight of the term $w_i$ in the document $d_j$ was therefore computed as in Eq. 4.1:

$$weight(w_i, d_j) = tf(w_i, d_j) \cdot \log \frac{|C|}{df(w_i)} \tag{4.1}$$

where $tf(w_i, d_j)$ is the raw term frequency of the term $w_i$ in the document $d_j$, $|C|$ is the number of documents in the collection and $df(w_i)$ is the document frequency of the term $w_i$.
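As an illustration of Eq. 4.1, the following sketch builds a sparse tf·idf vector for one document. The function name and the dictionary-based representation are assumptions made for the example; skipping terms with no recorded document frequency is also an assumption, since the text does not discuss that case.

```python
import math
from collections import Counter


def tfidf_vector(doc_terms, df, n_docs):
    """Return {term: tf(term, doc) * log(n_docs / df(term))} for one document.

    df maps each term to the number of collection documents containing it,
    and n_docs is the number of documents in the collection.
    """
    tf = Counter(doc_terms)
    vector = {}
    for term, count in tf.items():
        doc_freq = df.get(term, 0)
        if doc_freq > 0:  # terms without a document frequency are skipped
            vector[term] = count * math.log(n_docs / doc_freq)
    return vector
```

Note that a term occurring in every document receives weight zero, which matches the intuition that such a term is not specific to any document.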

For the similarity measure between documents ($sim(d_i, d_j)$) we chose the traditional cosine similarity, as in Eq. 4.2:

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{|V|} weight(w_k, d_i) \cdot weight(w_k, d_j)}{\sqrt{\sum_{k=1}^{|V|} weight(w_k, d_i)^2} \, \sqrt{\sum_{k=1}^{|V|} weight(w_k, d_j)^2}} \tag{4.2}$$

where $|V|$ is the number of different terms in the collection.
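A minimal sketch of the cosine measure of Eq. 4.2 over sparse term-weight vectors is given below. Representing each document as a dictionary from term to weight, and returning 0.0 when either vector has zero norm, are assumptions of the example rather than details stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)
```

Because the sum runs only over the terms stored in `u`, the cost depends on the document length rather than on the vocabulary size $|V|$.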

4.2.2 Cluster Query Likelihood

To exploit the cluster information in our retrieval approach we need a way of estimating the cluster query likelihood. Originally, the first approaches to cluster retrieval considered the clusters as meta-documents, i.e. a cluster is represented as the concatenation of the documents that belong to it (Kurland and Lee, 2004; Liu and Croft, 2004; Lee et al., 2008), or as the centroid of the cluster (Voorhees, 1985). These representations, however, suffer from several problems caused by differences in document and cluster sizes. As demonstrated in Liu and Croft (2008), the geometric mean is a better cluster representation for calculating the cluster query likelihood, so it was chosen in our approach. The cluster query likelihood based on the geometric mean representation was calculated by combining Eqs. 4.3 and 4.4.

$$P(q|C) = \prod_{i=1}^{n} P(q_i|C) \tag{4.3}$$

$$P(w|C) = \prod_{j=1}^{|C|} P(w|d_j)^{\frac{1}{|C|}} \tag{4.4}$$

where $n$ is the number of query terms, $|C|$ is the number of documents in the cluster $C$, and $P(w|d_j)$ was computed using a Dirichlet estimate. Applying logarithmic identities, the cluster query likelihood can finally be calculated as in Eq. 4.5:

$$P(q|C) \stackrel{rank}{=} \prod_{i=1}^{n} e^{\frac{\sum_{j=1}^{|C|} \log P(q_i|d_j)}{|C|}} \tag{4.5}$$
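The geometric-mean estimate can be sketched in log space as follows. The input layout (one row per cluster document, one column per query term, each entry a smoothed log probability) and the function name are assumptions made for the illustration.

```python
import math


def cluster_log_query_likelihood(doc_term_log_probs):
    """Return log P(q|C) under the geometric-mean cluster representation.

    doc_term_log_probs[j][i] holds log P(q_i | d_j) for document j of the
    cluster and query term i. For each query term, the log probabilities
    are averaged over the cluster documents (the log of a geometric mean),
    and the per-term results are summed over the query.
    """
    size = len(doc_term_log_probs)
    if size == 0:
        raise ValueError("a cluster must contain at least one document")
    n_terms = len(doc_term_log_probs[0])
    total = 0.0
    for i in range(n_terms):
        total += sum(row[i] for row in doc_term_log_probs) / size
    return total
```

Under this representation the result equals the arithmetic mean of the document log query likelihoods in the cluster, so cluster size does not inflate or deflate the score by itself.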

4.2.3 Cluster Based Reranking

Previous approaches to cluster based re-ranking only used the presence of a document in a good cluster as an indicator of its relevance. As previously explained, when these approaches are used to construct pseudo-relevant sets for further processing with query expansion, they suffer from the problem that even the good clusters are not composed entirely of relevant documents. The inclusion of non-relevant documents in the relevance set can lead to poor performance of the query expansion process, degrading effectiveness for that query.

The final objective of our approach is to reduce the number of non-relevant documents in the pseudo-relevant set. To achieve this we decided to use the information given by the bad clusters. Our hypothesis is the following. Let $d_1$ and $d_2$ be two documents, and let $C_1^{max}$, $C_1^{min}$, $C_2^{max}$ and $C_2^{min}$ be the clusters with the best and worst query likelihood to which $d_1$ and $d_2$, respectively, belong. If $P(q|C_1^{max}) = P(q|C_2^{max})$ and $P(q|d_1) = P(q|d_2)$, then $P(q|C_1^{min}) > P(q|C_2^{min})$ should indicate that $d_1$ is likely to be more relevant than $d_2$. In other words, belonging to clusters that are low in the cluster ranking should act as a pseudo-negative indicator of a document's relevance.

To produce a document ranking, we therefore decided to combine the document query likelihood with the pseudo-positive information given by the best cluster and the pseudo-negative information given by the worst cluster to which the document belongs. The query likelihood combination is presented in Eq. 4.6.

$$P'(q|d) = P(q|d) \times \max_{d \in C_i} P(q|C_i) \times \min_{d \in C_i} P(q|C_i) \tag{4.6}$$

where $P(q|d)$ was estimated as in Eq. 2.7 and $P(q|C_i)$ was estimated as in Eq. 4.5. This estimation alleviates, to some extent, the problem of previous approaches that leave the cluster re-ranking as is, trusting in the relevance of every document inside highly ranked clusters.
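The combination in Eq. 4.6 can be illustrated with a short sketch that works in log space, where the product becomes a sum. The function name and data layout are assumptions; the sketch also assumes every document belongs to at least one cluster, which holds when each document forms part of its own neighbourhood.

```python
def rerank_with_clusters(doc_log_scores, clusters, cluster_log_scores):
    """Re-rank documents by log P(q|d) + best cluster + worst cluster.

    doc_log_scores maps document id to log P(q|d); clusters is a list of
    sets of document ids; cluster_log_scores[i] is log P(q|C_i).
    Returns (document id, combined log score) pairs, best first.
    """
    combined = []
    for doc, log_p_doc in doc_log_scores.items():
        member_scores = [cluster_log_scores[i]
                         for i, cluster in enumerate(clusters)
                         if doc in cluster]
        if not member_scores:
            raise ValueError(f"document {doc!r} belongs to no cluster")
        score = log_p_doc + max(member_scores) + min(member_scores)
        combined.append((doc, score))
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


# Toy case mirroring the hypothesis: equal document scores, equal best
# cluster, but d2 also belongs to a worse cluster than d1 does.
toy_ranking = rerank_with_clusters(
    {"d1": -5.0, "d2": -5.0},
    [{"d1", "d2"}, {"d1"}, {"d2"}],
    [-4.0, -6.0, -9.0],
)
```

In the toy case both documents share the same best cluster, so the worst cluster alone decides the order and `d1` is ranked above `d2`.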

Ideally, removing all the non-relevant documents from the relevance set would greatly improve the expanded queries and, as a consequence, the final retrieval effectiveness. Although some relevant documents could be penalised because they group with others that appear low in the ranking, we expect this effect to be more than compensated by the benefit of removing non-relevant documents from the relevance set.

Once we have computed the cluster based re-ranking of the top $N$ documents of the initial retrieval, we can use this altered ranking to feed traditional pseudo-relevance feedback methods. In this case, we test RM3 (as explained in Chapter 2), in keeping with the objective of this thesis.
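The RM3 formulation used in the thesis is given in Chapter 2 and is not reproduced here, so the following is only a hedged sketch of a commonly used form: an RM1 estimate $P(w|R) \propto \sum_{d} P(w|d) P(q|d)$ over the feedback documents, truncated to the top terms and interpolated with the maximum-likelihood query model. The function name, the parameter names `lam` and `n_terms`, their defaults, and the side of the interpolation that receives `lam` are all assumptions.

```python
from collections import Counter


def rm3_query_model(query_terms, feedback_docs, lam=0.5, n_terms=10):
    """Return an expanded query model {term: probability}.

    feedback_docs is a list of (term_probs, p_q_d) pairs, where term_probs
    maps term to P(w|d) and p_q_d is the document score P(q|d).
    """
    # RM1: accumulate P(w|d) * P(q|d) over the feedback documents.
    rm1 = {}
    for term_probs, p_q_d in feedback_docs:
        for term, p_w_d in term_probs.items():
            rm1[term] = rm1.get(term, 0.0) + p_w_d * p_q_d
    # Keep the highest-weighted terms and renormalise them.
    top = sorted(rm1.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]
    norm = sum(score for _, score in top)
    rm1 = {term: score / norm for term, score in top} if norm > 0 else {}
    # Maximum-likelihood model of the original query.
    counts = Counter(query_terms)
    q_ml = {term: c / len(query_terms) for term, c in counts.items()}
    # Interpolate the original query model with the relevance model.
    vocab = set(q_ml) | set(rm1)
    return {term: (1 - lam) * q_ml.get(term, 0.0) + lam * rm1.get(term, 0.0)
            for term in vocab}
```

In the setting described above, `feedback_docs` would be built from the top documents of the cluster based re-ranking rather than from the initial retrieval.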
