
A recent scheme for list reorganisation is impact-ordering [Anh et al., 2001; Anh and Moffat, 2002a;b; Anh, 2004; Anh and Moffat, 2005b]. In the impact-ordered indexing scheme, postings are ordered within a list by decreasing effect on the ranking function.

Two general approaches to impact-ordering have been proposed. The first takes a collection-oriented approach, where each impact is based on the contribution of the posting to the similarity metric. The second takes a document-centric approach, where each impact is based on the significance of the term within the document in which it occurs. In both cases, impacts reflect the contribution of the term to the similarity metric; however, for collection-oriented impacts the relationship is direct, whereas for document-centric impacts it is implicit.

In Chapter 3 we compare our novel index ordering approach to impact-ordering.

Collection-Oriented Impacts

In a collection-oriented impact-ordered index, each posting is a tuple $\langle d, m_{d,t} \rangle$, where $m_{d,t}$ is the impact of term $t$ in document $d$, and is defined as:

$$m_{d,t} = w_{d,t}.$$

The value $w_{d,t}$ is the weight of term $t$ in document $d$ as determined by the document similarity measure employed in the search engine. Anh and Moffat [2002b] found that the pivoted cosine metric as defined by Singhal et al. [1996b] worked best with impact-ordering. Using this metric, the impact of a term in a document is given as:

$$m_{d,t} = \frac{1 + \log_e f_{d,t}}{(1 - s) + s \cdot W_d / W_A}.$$

The constant $s$ is usually set to 0.7, $W_d$ is the weight of document $d$, and $W_A$ is the average value of $W_d$ across all documents in the collection. The value of $m_{d,t}$ as defined above is a floating point value that can be used directly in the similarity measure as follows:
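As a minimal sketch of the pivoted cosine impact, the function below evaluates the formula for one posting. The function and parameter names are illustrative and do not come from the cited work; returning zero for an absent term is an assumption made only for convenience.

```python
import math


def pivoted_impact(f_dt: int, W_d: float, W_A: float, s: float = 0.7) -> float:
    """Impact m_{d,t} of a term that occurs f_dt times in document d.

    W_d is the weight of document d and W_A is the average document
    weight over the collection; s is the pivot constant (0.7 by default).
    """
    if f_dt <= 0:
        # Assumption: a term that does not occur has no impact.
        return 0.0
    return (1.0 + math.log(f_dt)) / ((1.0 - s) + s * W_d / W_A)
```

For a document of average weight the denominator is approximately one, so a single occurrence of a term yields an impact of approximately one. Documents that are shorter than average have a smaller denominator and therefore receive higher impacts for the same term frequency.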

$$s_{d,q} = \sum_{t \in d \cap q} m_{q,t} \cdot m_{d,t},$$

where the similarity between a document $d$ and query $q$ is the sum of the products of the document-term impact $m_{d,t}$ and the query-term impact $m_{q,t}$ for each term that appears in both the document and the query. For compactness, we do not define query-term impacts in detail here, but the principles follow those of document impacts.
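The similarity computation can be sketched as a sum over the terms shared by the document and the query. The dictionary representation and the name `similarity` are assumptions made for illustration; a search engine would instead accumulate these products while traversing inverted lists.

```python
from typing import Dict


def similarity(doc_impacts: Dict[str, float], query_impacts: Dict[str, float]) -> float:
    """Sum of m_{q,t} * m_{d,t} over terms present in both document and query."""
    return sum(
        m_qt * doc_impacts[term]
        for term, m_qt in query_impacts.items()
        if term in doc_impacts
    )
```

Terms that occur in only one of the two dictionaries contribute nothing, which mirrors the restriction of the sum to $t \in d \cap q$.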

fairy     ⟨642, 24⟩ ⟨649, 12⟩
farewell  ⟨434, 31⟩ ⟨47, 23⟩ ⟨464, 1⟩
fish      ⟨386, 25⟩ ⟨430, 21⟩ ⟨750, 13⟩ ⟨283, 1⟩ ⟨436, 1⟩ ⟨480, 1⟩
state     ⟨77, 18⟩ ⟨606, 13⟩ ⟨79, 1⟩ ⟨471, 1⟩
storm     ⟨408, 21⟩ ⟨386, 19⟩ ⟨22, 1⟩
swords    ⟨566, 15⟩ ⟨565, 14⟩
sycorax   ⟨159, 17⟩ ⟨511, 17⟩ ⟨128, 1⟩ ⟨132, 1⟩ ⟨136, 1⟩
vision    ⟨612, 1⟩ ⟨624, 1⟩ ⟨718, 1⟩

Figure 2.6: Postings of selected terms in an impact-ordered document-level index for the Tempest collection.

The impact-ordered approach encompasses much more than a method of organising postings. As part of their body of work, the authors propose normalising impacts so that all terms in short web queries contribute significantly towards similarity computations. In addition, they propose mapping the normalised impact values into fixed precision integers using a uniform quantisation scheme. Together, these steps have three advantages: first, quantisation allows integers to be stored instead of floating point values; second, the uniformly quantised impacts can be used directly in the similarity measure as surrogates for the actual impacts; and last, normalisation was found to improve search accuracy.
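To illustrate uniform quantisation, the sketch below maps a normalised impact onto one of $2^b$ equal-width integer levels. The exact scheme used by the authors is not reproduced here; the bounds `lo` and `hi`, the default of five bits, and the choice to number levels from one are all assumptions made for the example.

```python
def quantise(value: float, lo: float, hi: float, bits: int = 5) -> int:
    """Map value in [lo, hi] to an integer level in 1..2**bits (equal-width bins)."""
    levels = 2 ** bits
    if hi <= lo:
        # Degenerate range: every value falls in the lowest level.
        return 1
    # Position of the value within the range, clamped to [0, 1].
    fraction = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    # The top of the range would otherwise overflow into level 2**bits + 1.
    return min(levels, int(fraction * levels) + 1)
```

Because the bins have equal width, the ordering of impacts is preserved up to ties, which is what allows the integer levels to stand in for the real-valued impacts at query time.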

As with frequency-ordering, because postings are ordered by decreasing impact value, the differences between adjacent values can be stored. In addition, when postings share the same impact, they are sorted by document identifier, and the differences between postings in each equi-impact block are stored instead of the full document identifiers. Finally, as the normalised quantised impacts are used directly in the similarity measure, there is no need to store the within-document term frequency $f_{d,t}$ for each posting.
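The block layout described above can be sketched as follows. The function name and the in-memory representation (a list of `(impact difference, document gaps)` pairs) are assumptions for illustration; an actual index would additionally compress these integers with a variable-length code.

```python
from itertools import groupby
from typing import List, Tuple


def encode_impact_ordered(postings: List[Tuple[int, int]]) -> List[Tuple[int, List[int]]]:
    """Encode (doc_id, impact) postings as equi-impact blocks of differences.

    Each block stores the drop in impact from the previous block (the
    first block stores its impact in full) and the document identifiers
    in the block as gaps (the first identifier is stored in full).
    """
    # Decreasing impact, then increasing document identifier within a block.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    blocks = []
    previous_impact = None
    for impact, group in groupby(ordered, key=lambda p: p[1]):
        docs = [doc for doc, _ in group]
        gaps = [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]
        delta = impact if previous_impact is None else previous_impact - impact
        blocks.append((delta, gaps))
        previous_impact = impact
    return blocks
```

Applied to the list for "fish" in Figure 2.6, the final block holds documents 283, 436, and 480 as the values 283, 153, and 44, and the impacts 25, 21, 13, and 1 are stored as 25, 4, 8, and 12.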

Figure 2.6 shows the quantised impact-ordered lists for selected terms in the Tempest collection. Note that in each list the postings are ordered by decreasing impact, so that at query time the postings that are likely to contribute the most to the similarity metric are processed first. Further, although the impact of a term in a document is correlated with its within-document frequency, a higher frequency does not always mean a higher impact. For example, although document 386 contains the term “storm” three times, its impact in that document is less than in document 408, which contains the term “storm” only twice.

To reduce query processing time, impact-ordering utilises a heuristic that determines when to abandon list processing. Anh and Moffat proposed two new schemes: first, term-fine, where a penalty is applied to the impact of each query term, and this penalty is increased as each new term is processed; and, second, block-fine, where a penalty is applied to posting contributions, and the penalty is increased as each new impact block is processed. For both schemes, the penalty is applied to the posting contribution before updating the relevant accumulator. Processing is abandoned when the penalised contribution of a posting falls below zero.
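The two fine schemes can be sketched as below. The description above does not fix the form of the penalties, so this sketch makes several assumptions: both penalties are subtractive and grow linearly, query terms are processed in decreasing query impact, and a negative contribution abandons only the current list. The function name, the parameters `term_fine` and `block_fine`, and the index layout (term mapped to a list of `(impact, documents)` blocks in decreasing impact order) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Index = Dict[str, List[Tuple[float, List[int]]]]


def rank_with_fines(index: Index, query_impacts: Dict[str, float],
                    term_fine: float = 0.0, block_fine: float = 0.0) -> Dict[int, float]:
    """Accumulate penalised contributions, abandoning a list once they go negative."""
    accumulators = defaultdict(float)
    # Assumption: terms are processed in decreasing query impact.
    terms = sorted(query_impacts, key=query_impacts.get, reverse=True)
    for term_no, term in enumerate(terms):
        # term-fine: the penalty on the query-term impact grows with each term.
        m_qt = query_impacts[term] - term_fine * term_no
        for block_no, (m_dt, docs) in enumerate(index.get(term, [])):
            # block-fine: the penalty on the contribution grows with each block.
            contribution = m_qt * m_dt - block_fine * block_no
            if contribution < 0:
                # Later blocks have lower impact and a larger penalty.
                break
            for doc in docs:
                accumulators[doc] += contribution
    return dict(accumulators)
```

With both fines set to zero every posting is processed. Raising `block_fine` causes the low-impact tail of each list to be skipped, which is the source of the savings in processing time.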

The benefits of these heuristics are two-fold: first, list processing can now be abandoned before processing an entire postings list; and second, for long queries with many terms, the penalties re-establish the differences between posting contributions that are diminished by the normalisation step. This is important in long queries, as terms that do not discriminate well between documents (that is, traditionally those that have a low impact) may have high impacts after normalisation, and without penalties may dominate the accumulator contributions for a document.

A combination of term-fine, block-fine, and the continue strategy discussed in the previous section produced the best compromise between speed and accuracy. On the same web collection that we use, Anh and Moffat [2002b] report relative accuracy improvements of 23%–43% over a pivoted cosine baseline.

There are three disadvantages to collection-oriented impact-ordering. First, Anh notes that impact-ordering is as yet ineffective for the state-of-the-art Okapi BM25 measure, due to the inability to separate query-term and document-term impacts [Anh, 2004]. Second, impact-ordering is dependent on the ranking function, making it difficult to change or tune the ranking function without rebuilding the index. Finally, there is an unclear relationship between normalisation and the fine strategies: one increases the impact of terms, while the other lowers it.

Document-Centric Impacts

The document-centric approach to impacts simplifies the assignment of impact values to each posting. Each posting remains a tuple $\langle d, m_{d,t} \rangle$, where $m_{d,t}$ is the impact of term $t$ in document $d$, but now the impact $m_{d,t}$ is assigned based on the importance of the term within the document [Anh and Moffat, 2005b]. Terms within a document are ranked by importance, and the number of terms that are assigned a given impact is determined by a geometric distribution given by:

$$x_i = B \cdot x_{i+1},$$

where:

$$B = (|d| + 1)^{1/k}.$$

The equation makes use of the number of terms in the document, $|d|$, and the number of impact buckets that are allowed, $k$. Anh and Moffat suggest eight impact buckets, that is, $k = 8$.

Figure 2.7: Process of impact value allocation to the terms in a document. First, the frequency of occurrence of each unique term in a document is established. Second, terms are ordered by decreasing frequency within the document. Third, impacts are assigned using a geometric distribution.

Anh and Moffat explored various ranking techniques to assign impact values to each term, ranging from ranking by the value of the product of document term frequency, $f_{d,t}$, and inverse document frequency, $N/f_t$; to ranking by term frequency alone; and variants combining both, with one as the primary sort key and the other as a secondary key. They suggest that an ordering based on a primary key of document term frequency, with a secondary key of inverse document frequency, performs best, but note that a primary sort key of document term frequency alone is comparable. A problem with the dual key approach is that it requires some collection-oriented information to produce impacts, while the single key term frequency ranking does not.

Figure 2.7 illustrates the process of impact assignment with a document-centric approach. Here we see that impacts are assigned based on the importance of the term to the single document. Unlike collection-oriented impacts, collection-wide statistics are not necessary to assign the impacts. With all the terms in a document assigned an impact, the postings can be added to the inverted lists. Once impact values have been assigned to each term in each document, the inverted lists are reordered by impact, and query processing continues in the same manner as for the collection-oriented impact-ordered index.
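The three steps of the allocation process (count, rank, assign) can be sketched for a single document as follows. This sketch uses the single-key ranking by term frequency and makes several assumptions that are not stated above: $|d|$ is taken to be the number of distinct terms, the bucket sizes are taken to be $x_i = (B - 1) B^{k-i}$ so that they satisfy $x_i = B \cdot x_{i+1}$ and sum to $|d|$, fractional bucket boundaries are rounded up, and ties in frequency are broken alphabetically. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def document_centric_impacts(tokens: List[str], k: int = 8) -> Dict[str, int]:
    """Assign each distinct term an integer impact in 1..k by frequency rank."""
    # Step 1: frequency of each unique term in the document.
    freqs = Counter(tokens)
    # Step 2: order terms by decreasing frequency (ties alphabetical: assumption).
    ranked = sorted(freqs, key=lambda term: (-freqs[term], term))
    if not ranked:
        return {}
    # Step 3: geometric allocation with B = (|d| + 1)^(1/k).
    B = (len(ranked) + 1) ** (1.0 / k)
    impacts = {}
    impact = k
    for rank, term in enumerate(ranked, start=1):
        # Under the assumed bucket sizes, B^(k - impact + 1) - 1 terms may
        # receive this impact or higher; the small epsilon guards against
        # floating point error pushing an exact boundary upwards.
        while impact > 1 and rank > math.ceil(B ** (k - impact + 1) - 1 - 1e-9):
            impact -= 1
        impacts[term] = impact
    return impacts
```

Only the document itself is consulted, so the impacts of existing documents are unaffected when new documents are added. The geometric growth means few terms receive the highest impact, while the large low-frequency tail of the document shares the lowest impacts.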

The benefits of document-centric impact-ordering are that collection-wide statistics such as inverse document frequency are not necessarily required to assign impacts. Further, the addition of new documents to the collection does not require the recalculation of all the existing impacts.
