2.4 Relevance Models
Relevance-Based Language Models, commonly called Relevance Models or RM for short (Lavrenko and Croft, 2001), are among the best-performing ranking techniques in text retrieval. They were devised to introduce the concept of relevance, intrinsic to the probabilistic model of IR, explicitly into statistical Language Models (LM). In fact, LM and probabilistic models have been directly connected by assuming, within the probabilistic framework, that $P(d, q \mid R = \bar{r}) = P(d \mid R = \bar{r})\,P(q \mid R = \bar{r})$ and $P(d, R) = P(d)\,P(R)$ (Lafferty and Zhai, 2002). Relevance Models achieve state-of-the-art effectiveness in the pseudo-relevance feedback (PRF) task, showing great improvements over the results obtained with the initial ranking. Since the approach was originally presented by Lavrenko and Croft (2001), it has been combined with other methods such as query variants (Collins-Thompson and Callan, 2007), cluster-based retrieval (Lee et al., 2008), passage retrieval (Li and Zhu, 2008) and sentence retrieval (Balasubramanian et al., 2007).
The RM approach builds better query models using the information given by the pseudo-relevant documents. Formally, a relevance model is a mechanism that determines the probability $P(w \mid R)$ of observing a word $w$ in the documents relevant to a particular information need (Lavrenko and Croft, 2001). Given an accurate model of relevance $R$, if we want to rank a set of documents to be presented to the user according to the Probability Ranking Principle (PRP) (Robertson, 1997), the best ranking is obtained by sorting the documents according to the posterior probability of their belonging to the relevant class $R$. This is equivalent to ranking the documents by the odds of their being observed in the relevant class, $P(d \mid R)/P(d \mid \bar{R})$. Under the word independence assumption the ranking can be computed as:
$$\frac{P(d \mid R)}{P(d \mid \bar{R})} \sim \prod_{w \in d} \frac{P(w \mid R)}{P(w \mid \bar{R})} \tag{2.8}$$
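As a minimal sketch of Eq. 2.8, the ranking can be computed in log space, where the product of ratios becomes a sum of log-ratios. The function names, the dictionary representation of the two word distributions and the `floor` value used for unseen words are illustrative assumptions, not part of the original formulation. The product is taken here over every word occurrence in the document.

```python
import math


def log_odds_score(doc_tokens, p_rel, p_nonrel, floor=1e-12):
    """Log of Eq. 2.8: sum over word occurrences of log P(w|R) - log P(w|not R).

    p_rel and p_nonrel map words to probabilities. `floor` is a hypothetical
    guard that avoids log(0) for words missing from either distribution.
    """
    score = 0.0
    for w in doc_tokens:
        score += math.log(p_rel.get(w, floor)) - math.log(p_nonrel.get(w, floor))
    return score


def rank_by_odds(docs, p_rel, p_nonrel):
    """Return document ids sorted by decreasing log-odds of relevance.

    docs maps a document id to its list of tokens.
    """
    return sorted(
        docs,
        key=lambda d: log_odds_score(docs[d], p_rel, p_nonrel),
        reverse=True,
    )
```

Working in log space does not change the ordering, because the logarithm is monotonic, and it avoids numerical underflow for long documents.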
Only one question remains to be answered: how to learn the relevance model $R$. This is equivalent to answering the following question: given an unknown process $R$ from which we have sampled $n$ times (where $n$ is the query length), obtaining the query words $q_1 \ldots q_n$, what is the probability that the next word we sample will be $w$?
$$P(w \mid R) \approx P(w \mid q_1 \ldots q_n) = \frac{P(w, q_1 \ldots q_n)}{P(q_1 \ldots q_n)} \tag{2.9}$$
The objective now is to estimate the joint probability of observing the word $w$ and the query terms together (the numerator of Eq. 2.9). The denominator of Eq. 2.9 can be computed as $P(q_1 \ldots q_n) = \sum_{w} P(w, q_1 \ldots q_n)$.
Two estimations were originally presented (Lavrenko and Croft, 2001). RM1 assumes that the words in the relevant documents and the query words are sampled identically and independently from the relevance model (i.i.d. sampling). The steps of the derivation can be found in the original paper; the result is an estimation in which the query likelihood of each document is used as the weight of that document, and the probability of a word is averaged over all the document language models. In contrast, RM2 assumes that the query words are independent of each other but dependent on the words of the relevant documents (conditional sampling). As a result, relevant documents containing query words can be used to compute the association of their words with the query terms. A detailed explanation of RM for PRF is given in Chapter 7 of Croft et al. (2009).
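For reference, the two estimates of the joint probability in the numerator of Eq. 2.9 can be summarised as below. This is a restatement of the results of Lavrenko and Croft (2001) rather than a derivation. It is written in the notation of this section, with document models $d \in C$ in place of the models used in the original paper, and the subscripts RM1 and RM2 are labels added here only to tell the two estimates apart.

```latex
\begin{align*}
% RM1 (i.i.d. sampling): w and q_1..q_n are all drawn from the same document model d
P_{\mathrm{RM1}}(w, q_1 \ldots q_n)
  &= \sum_{d \in C} P(d)\, P(w \mid d) \prod_{i=1}^{n} P(q_i \mid d) \\
% RM2 (conditional sampling): each q_i is drawn conditionally on w
P_{\mathrm{RM2}}(w, q_1 \ldots q_n)
  &= P(w) \prod_{i=1}^{n} \sum_{d \in C} P(q_i \mid d)\, P(d \mid w) \\
% Bayes' rule and the word prior used by RM2
P(d \mid w) &= \frac{P(w \mid d)\, P(d)}{P(w)},
\qquad
P(w) = \sum_{d \in C} P(w \mid d)\, P(d)
\end{align*}
```

In the first estimate each document contributes $P(w \mid d)$ weighted by its query likelihood, which is the weighted average described above. In the second, the association between $w$ and each query term is computed separately through the documents that contain both.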
In RM the original query is considered a very short sample of words obtained from the relevance model $R$. If more words from $R$ are desired, it is reasonable to choose those words with the highest estimated probability given the words already seen from the distribution. The terms in the lexicon of the collection are therefore sorted according to that estimated probability which, under the assumptions of the RM1 method, is estimated as in Eq. 2.10.
$$P(w \mid R) \propto \sum_{d \in C} P(d) \cdot P(w \mid d) \cdot \prod_{i=1}^{n} P(q_i \mid d) \tag{2.10}$$
Usually $P(d)$ is assumed to be uniform. $\prod_{i=1}^{n} P(q_i \mid d)$ is the query likelihood given the document model, which is traditionally computed using Dirichlet smoothing (see Eq. 2.7). To assign a probability to the terms in the relevance model we then have to estimate $P(w \mid d)$; it is also common to use Dirichlet smoothing for this. The final retrieval is obtained in four steps:
1. Initially the documents in the collection $C$ are ranked using their query likelihood. This query likelihood is usually estimated with some kind of smoothing, commonly Dirichlet smoothing as in Eq. 2.7.
2. The top $r$ documents from the initial retrieval are taken for the estimation instead of the whole collection $C$; let us call this pseudo-relevance set $RS$.
3. The relevance model probabilities $P(w \mid R)$ are calculated using the estimate presented in Eq. 2.10, with $RS$ instead of $C$.
4. To build the expanded query, the $e$ terms with the highest estimated $P(w \mid R)$ are selected. The expanded query is used to produce a second document ranking using negative cross entropy as in Eq. 2.11. In this second retrieval Dirichlet smoothing is commonly used.
$$\sum_{i=1}^{e} P(w_i \mid R) \cdot \log P(w_i \mid d) \tag{2.11}$$
RM3 (Abdul-Jaleel et al., 2004) is a later extension of RM that performs better than RM1 in terms of effectiveness. Instead of using the terms selected by RM1 directly, RM3 interpolates them with the original query as in Eq. 2.12. The final query is then used in the same way as in RM1 to produce a second ranking using negative cross entropy.
$$P(w \mid q') = (1 - \lambda) \cdot P(w \mid q) + \lambda \cdot P(w \mid R) \tag{2.12}$$
Since RM3 has been shown to be the best-performing estimation of RM to date (Lv and Zhai, 2009a), we centre the work in this thesis on it. Although for some tasks we will also consider other estimations and other PRF methods for comparison, we firmly believe that RM3 is a very effective and quite robust starting point.
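The four retrieval steps and the RM3 interpolation of Eq. 2.12 can be sketched as follows. This is an illustrative sketch under several assumptions, not the implementation used in this thesis:

- Eq. 2.7 is taken to be the standard Dirichlet estimate $(tf(w,d) + \mu P(w \mid C)) / (|d| + \mu)$.
- $P(d)$ is uniform.
- $P(w \mid q)$ is the maximum-likelihood query model.
- The $e$ selected terms are renormalised to sum to one.
- Every query term occurs somewhere in the collection.

All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def dirichlet(w, doc_tf, doc_len, coll_prob, mu):
    """Dirichlet-smoothed P(w|d), assumed to be the form of Eq. 2.7."""
    return (doc_tf.get(w, 0) + mu * coll_prob.get(w, 0.0)) / (doc_len + mu)


def rm3_rank(query, docs, r=10, e=10, lam=0.5, mu=1000.0):
    """Rank docs (id -> token list) for query (token list) with RM3.

    Returns (ranking, expanded_query_model). Setting lam=1.0 gives the
    RM1 expanded query. Assumes all query terms occur in the collection.
    """
    tfs = {d: Counter(toks) for d, toks in docs.items()}
    lens = {d: len(toks) for d, toks in docs.items()}
    coll = Counter(t for toks in docs.values() for t in toks)
    total = sum(coll.values())
    coll_prob = {w: c / total for w, c in coll.items()}

    def p(w, d):
        return dirichlet(w, tfs[d], lens[d], coll_prob, mu)

    # Step 1: initial ranking by query likelihood prod_i P(q_i|d).
    ql = {d: math.prod(p(q, d) for q in query) for d in docs}

    # Step 2: the top r documents form the pseudo-relevance set RS.
    rs = sorted(docs, key=ql.get, reverse=True)[:r]

    # Step 3: Eq. 2.10 over RS; the uniform P(d) is a constant and is dropped.
    rm = {w: sum(p(w, d) * ql[d] for d in rs) for w in coll_prob}

    # Step 4: keep the e most probable terms and renormalise them.
    top = sorted(rm, key=rm.get, reverse=True)[:e]
    z = sum(rm[w] for w in top)
    p_rel = {w: rm[w] / z for w in top}

    # Eq. 2.12: interpolate with the maximum-likelihood query model.
    p_q = {w: c / len(query) for w, c in Counter(query).items()}
    final = {
        w: (1 - lam) * p_q.get(w, 0.0) + lam * p_rel.get(w, 0.0)
        for w in set(p_q) | set(p_rel)
    }

    # Eq. 2.11: second ranking by negative cross entropy.
    score = {
        d: sum(pw * math.log(p(w, d)) for w, pw in final.items())
        for d in docs
    }
    return sorted(docs, key=score.get, reverse=True), final
```

For realistic query lengths the product in step 1 is better accumulated as a sum of logarithms to avoid underflow; the direct product is kept here only to mirror Eq. 2.10.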