The development of xQuAD aimed for an effective and general ranking objective for search result diversification, by encompassing successful features of past research in a principled manner. In particular, the explicit aspect representation adopted by xQuAD was inspired by the proportional coverage (PC) approach of Radlinski & Dumais (2006). As formalised in Equation (3.17), their approach seeks to balance the coverage of multiple reformulations of the initial query among the documents ranked in response to this query. Although query reformulations provide a meaningful alternative for representing the multiple possible information needs underlying a query as sub-queries, our framework caters for several dimensions of the diversification problem, which are not addressed by the approach of Radlinski & Dumais (2006), such as the relative importance of different sub-queries and the redundancy of covering already well covered sub-queries.
As a matter of fact, xQuAD can emulate the approach of Radlinski & Dumais (2006) as well as other coverage-based approaches, by assuming that the identified sub-queries do not lose their utility as more documents that cover these sub-queries are selected. In practice, as will be discussed in Section 8.2.2, this can be achieved by dropping xQuAD’s novelty component, $p(\bar{D}_q \mid q, s)$, from the expanded formulation in Equation (4.4). Furthermore, a proportional coverage of sub-queries, similar to the one deployed by approaches like PC (Equation (3.17)) and WPC (Equation (3.18)), can also be enforced within xQuAD, by conditioning the scoring of documents that cover a particular sub-query $s$ on the total number of documents already covering this sub-query, such that:
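\[
p(d \mid q, s) =
\begin{cases}
p(d \mid q, s), & \text{if } \Big[ \sum_{d_j \in D_q} \mathbf{1}\big(p(d_j \mid q, s) > 0\big) \Big] < p(s \mid q)\,\tau, \\
0, & \text{otherwise,}
\end{cases}
\tag{4.17}
\]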
where $\mathbf{1}$ is the indicator function, returning 1 if $p(d_j \mid q, s) > 0$ (i.e., if the document $d_j$ covers the sub-query $s$), or 0 otherwise. On the right-hand side of the inequality, $p(s \mid q)$ and $\tau$ denote the importance of $s$ and the diversification cutoff, respectively, in which case the product $p(s \mid q)\,\tau$ determines the fraction of the final ranking that should be dedicated to the sub-query $s$.
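The gating rule in Equation (4.17) can be illustrated with a minimal Python sketch. The function name `gated_coverage` and its argument names are hypothetical and chosen only for this illustration; the inputs are assumed to be precomputed estimates of $p(d \mid q, s)$, of $p(d_j \mid q, s)$ for the already selected documents, and of $p(s \mid q)$.

```python
from typing import Sequence


def gated_coverage(p_d_given_qs: float,
                   selected_coverage: Sequence[float],
                   p_s_given_q: float,
                   tau: int) -> float:
    """Illustrative gate for Equation (4.17); all names are hypothetical.

    p_d_given_qs      -- p(d|q,s) for the candidate document d
    selected_coverage -- p(d_j|q,s) for every already selected document d_j
    p_s_given_q       -- importance p(s|q) of the sub-query s
    tau               -- diversification cutoff
    """
    # Indicator sum: how many selected documents cover s, i.e., p(d_j|q,s) > 0.
    n_covering = sum(1 for p in selected_coverage if p > 0)
    # Share of the top-tau positions that sub-query s is entitled to.
    quota = p_s_given_q * tau
    # Keep the original estimate while s is under its quota, else zero it.
    return p_d_given_qs if n_covering < quota else 0.0
```

For instance, with $p(s \mid q) = 0.5$ and $\tau = 4$, the sub-query is entitled to two positions: a candidate keeps its score while only one selected document covers $s$, and is zeroed once two do.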
With respect to its diversification strategy, the xQuAD framework can be seen as a generalisation of the IA-Select approach of Agrawal et al. (2009). As defined in Equation (3.23), this hybrid approach seeks to maximise the overall utility of the ranked documents in light of the multiple categories associated with the query. In particular, Agrawal et al. (2009) proposed to approximate the marginal utility $f(c \mid q, D_q)$ of any document covering each category $c$, given the query $q$, and the already selected documents in $D_q$, according to:
\[
f(c \mid q, D_q) \approx f(c \mid q) \prod_{d_j \in D_q} \big(1 - f(d_j \mid q, c)\big).
\tag{4.18}
\]
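As a rough numerical illustration of Equation (4.18), the sketch below computes the discounted utility of a category from the estimates for the already selected documents. The function name `marginal_utility` and its arguments are assumptions made for this example and are not taken from Agrawal et al. (2009).

```python
from math import prod
from typing import Sequence


def marginal_utility(f_c_given_q: float,
                     f_selected_given_qc: Sequence[float]) -> float:
    """Approximate f(c|q, D_q) as in Equation (4.18); names are hypothetical.

    f_c_given_q         -- f(c|q), the weight of category c for the query q
    f_selected_given_qc -- f(d_j|q,c) for every already selected document d_j
    """
    # Each selected document d_j discounts the category by (1 - f(d_j|q,c)).
    # With no selected documents, the empty product is 1 and f(c|q) is returned.
    return f_c_given_q * prod(1.0 - f for f in f_selected_given_qc)
```

With $f(c \mid q) = 0.5$ and two selected documents that each satisfy the category with $f(d_j \mid q, c) = 0.5$, the remaining utility of the category is $0.5 \times 0.5 \times 0.5 = 0.125$.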
Contrasting Equation (4.18) with the definition of xQuAD in Equation (4.7), we note the similarity between the two components on the right-hand side of Equation (4.18) and xQuAD’s sub-query importance and novelty components, respectively. In particular, with xQuAD, not only do we provide a formal probabilistic argument for maximising the utility of a ranking, but we also devise this formalisation in light of an aspect representation that better reflects the multiple information needs (as opposed to multiple categories) underlying the query.
Besides formalising the notion of utility in probabilistic terms, we extend this notion to cater for queries with different levels of ambiguity, by mixing relevance and diversity estimates through the diversification trade-off $\lambda$, as described in Equation (4.1). The resulting mixture is in turn inspired by the maximal marginal relevance (MMR) approach of Carbonell & Goldstein (1998), described in Section 3.3.1. As we will show in Chapter 9, our generalised formulation enables a selective diversification approach, which automatically adapts itself to diversify more or less aggressively, given the predicted ambiguity of each query. However, a fundamental difference from MMR is our adoption of an explicit aspect representation, enabling the combination of coverage and novelty into a hybrid strategy, which outperforms the pure novelty-based strategy deployed by MMR, as we will show in Chapter 8. Also note that an implicit version of xQuAD can be trivially derived by adopting a document-oriented aspect representation, e.g., by letting $S_q = V$, where the lexicon $V$ comprises all unique terms in the underlying corpus.
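The interplay between the trade-off $\lambda$, coverage, and novelty can be sketched as a greedy re-ranking loop. Since Equations (4.1) and (4.7) are not reproduced in this section, the sketch assumes the commonly published form of the xQuAD objective, $(1-\lambda)\,p(d \mid q) + \lambda \sum_{s} p(s \mid q)\, p(d \mid q, s) \prod_{d_j \in D_q} (1 - p(d_j \mid q, s))$. The function name `xquad_rerank`, the dictionary-based inputs, and the alphabetical tie-breaking are illustrative assumptions, not part of the framework's definition.

```python
from typing import Dict, List


def xquad_rerank(relevance: Dict[str, float],
                 importance: Dict[str, float],
                 coverage: Dict[str, Dict[str, float]],
                 lam: float,
                 tau: int) -> List[str]:
    """Greedy sketch of an xQuAD-style re-ranking; all names are hypothetical.

    relevance  -- p(d|q) for each candidate document id
    importance -- p(s|q) for each sub-query id
    coverage   -- coverage[s][d] = p(d|q,s); missing entries count as 0
    lam        -- diversification trade-off (0 = relevance only)
    tau        -- number of documents to select
    """
    selected: List[str] = []
    remaining = set(relevance)
    # novelty[s] holds the product of (1 - p(d_j|q,s)) over selected d_j.
    novelty = {s: 1.0 for s in importance}

    def score(d: str) -> float:
        diversity = sum(importance[s] * coverage[s].get(d, 0.0) * novelty[s]
                        for s in importance)
        return (1.0 - lam) * relevance[d] + lam * diversity

    while remaining and len(selected) < tau:
        # Sorting first makes ties resolve alphabetically (an assumption).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        # Sub-queries covered by the chosen document lose part of their utility.
        for s in importance:
            novelty[s] *= 1.0 - coverage[s].get(best, 0.0)
    return selected
```

In this sketch, setting `lam` to 0 reduces the ranking to plain relevance ordering, whereas higher values increasingly favour documents that cover sub-queries not yet satisfied by the selected documents. Holding `novelty[s]` fixed at 1 would correspond to the coverage-only behaviour obtained by dropping the novelty component.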
4.5 Summary
This chapter introduced a novel approach to the search result diversification problem, described in Chapter 3. The proposed Explicit Query Aspect Diversification (xQuAD) framework models multiple dimensions of the diversification problem in a principled manner, under the formalism of probability theory.
In Section 4.1, we identified three limitations of different families of related approaches from the literature, in terms of their reliance solely on the documents initially retrieved for a query, their arbitrarily defined representation of the multiple information needs underlying this query, and their heuristic ranking objectives. In order to overcome these limitations, Section 4.2 introduced the xQuAD framework with the goal of pursuing a diversification driven by the users’ information needs. Besides formalising xQuAD’s ranking objective in probabilistic terms, we introduced the several components that naturally emerge from this formulation. A complete example of the operation of the framework was provided in Section 4.3, where its underlying computations were defined in terms of basic matrix operations. Finally, Section 4.4 highlighted the key features of related approaches from the literature that inspired the development of xQuAD. In particular, the framework can be seen as a principled generalisation of the most prominent representatives of the three families of diversification approaches described in Section 3.3, namely, novelty-based, coverage-based, and hybrid.
At this stage, perhaps the most distinguishing feature of the xQuAD framework is its generality, as a result of modelling all dimensions of the diversification problem, as introduced in Chapter 3. An immediate advantage of such a general formulation is the possibility of instantiating each of the components of the framework in different ways, with each instantiation having the potential to contribute to an overall effective diversification performance. Experimenting with multiple such instantiations will be the goal of the next chapters. In particular, Chapter 5 will thoroughly assess the xQuAD framework by contrasting it to state-of-the-art representatives of the various families of diversification approaches described in Section 3.3. Chapter 6 will introduce a novel learning to rank approach for generating effective sub-queries, mined as query suggestions from a query log. In turn, Chapter 7 will introduce a supervised approach to predict the effectiveness of multiple intent-aware ranking models for estimating the coverage of each document with respect to each sub-query, as well as the novelty of the document, given the sub-queries covered by the already retrieved documents. The role of novelty as a diversification strategy will be further analysed in Chapter 8. Lastly, Chapter 9 will introduce a supervised mechanism for selectively diversifying the retrieved documents, by automatically adapting the diversification trade-off given the predicted ambiguity level of each individual query.
Framework Validation
As introduced in Chapter 4, the xQuAD framework provides a principled and general formulation for tackling the search result diversification problem. Indeed, different components of the framework model different dimensions of this problem, such as the identification of multiple query aspects and the estimation of the relevance of each retrieved document with respect to each identified aspect. Naturally, the effectiveness of the framework depends on the effectiveness of the particular choices for instantiating each of these components. Before introducing effective alternative instantiations for each of these components in the subsequent chapters, in this chapter, we validate the xQuAD framework as a whole, by contrasting it to the current state-of-the-art in search result diversification.
The goals of this chapter are twofold. Firstly, in Section 5.1, we introduce the basic experimental methodology that is used throughout the experimental part of this thesis, which comprises this chapter and Chapters 6 through 9. Secondly, in Section 5.2, we thoroughly validate the effectiveness of the xQuAD framework in comparison to state-of-the-art representatives of different families of diversification approaches in the literature. In addition, we break down this evaluation along the complementary dimensions of aspect representation and diversification strategy, as introduced in Section 3.3. The results of this evaluation not only attest to the effectiveness of xQuAD when compared to the current state-of-the-art, but also validate our choice of a hybrid, user-driven diversification.