2.1 Literature review
2.1.3 Theoretical foundations
2.1.3.5 Financial solvency
A diverse tool kit of techniques rooted in statistics (e.g., time series analysis in finance and economics), physical sciences (e.g., signal processing) and data mining has emerged to explain and model stochastic processes characterised by time series data (Fu, 2011; Radinsky et al., 2013b). Common tasks include time series modelling for future forecasting, periodicity detection and correlation. This thesis employs several of these techniques; I describe the related approaches in the context of each relevant chapter.
3.14 Chapter Summary
The definition of time is simultaneously diverse and multi-faceted, as shown by its involvement in all user- and system-oriented aspects of the conceptual IR map. I outlined the nature of explicit temporal clues (i.e., temporal expressions) found in information content, as well as implicit temporal clues (i.e., temporal dynamics) arising from streams of information behaviour occurring over time. I explored the nature of time in information behaviour and time-aware IR. To that end, I proposed a conceptual map of time-aware IR, and used it to organise the diverse body of relevant theoretical and practical literature. As part of this model, I discussed and illustrated several real collection-based (e.g. word usage) and information seeking (e.g. query and intent popularity) temporal dynamics. Finally, I detailed the practical representation of temporal dynamics used by experiments later in this thesis. Overall, time poses many challenges, but also yields many opportunities for IR. Work presented in the remainder of this thesis serves as an exploration into methods to support and exploit one particular element of time – temporal dynamics – in IR systems.
Part II
Supporting Temporal Dynamics in Information Seeking
Chapter 4
Recent and Robust Query Auto-completion
In Chapter 3, I explored several temporal dynamics evident in information seeking activity, in particular, patterns and trends in query popularity over time. Consistently supporting users who are engaging in real-time information seeking driven by events and phenomena requires approaches which are sensitive to temporal dynamics. One of the foremost challenges for users during retrieval is formulating an adequate query to express their information need sufficiently to a retrieval system – and thus increase the chance of receiving satisfactory search results. In this chapter, in order to explicitly assist users to formulate queries which can better satisfy their information needs over time, I propose and experiment with novel time-aware approaches for query auto-completion (QAC) in web search – a ubiquitous activity performed hundreds of millions of times every day.∗
4.1 Introduction
Cognitively formulating and physically typing search queries is an especially time-consuming and error-prone process. Spelling mistakes, forgetfulness and information need uncertainty often make textual query input laborious (Moshfeghi and Jose, 2013). In response, search engines have widely adopted QAC as a means of reducing the effort required to submit a query (Bar-Yossef and Kraus, 2011; Shokouhi and Radinsky, 2012). Indeed, beyond query input in search systems, text input auto-completion has also become popular in many other applications in which there is likely to be common input between users, such as inaccurate touch-screen text input, content tag selection, text ‘hashtagging’ (e.g., “#topic”) and domain-specific search (e.g. maps, jobs and people). As the user types their query into the search box, QAC suggests possible queries the user may have in mind (which I refer to as completion suggestions), beginning with the currently input character sequence (i.e., prefix). Recent work has examined approaches for making QAC robust to spelling mistakes (Duan and Hsu, 2011) and term re-ordering.
The primary objective of effective QAC is to present the user’s intended query (i) after the fewest possible keystrokes, and (ii) at the highest possible rank in the list of completion suggestions. The most common approach to QAC is to extract past queries with each prefix from a query log, and rank them by their past popularity (Bar-Yossef and Kraus, 2011); this assumes that current query popularity is the same as past query popularity. Although this approach provides satisfactory QAC on average, it is far from optimal since it fails to take into account clues such as time or user context, which often influence the queries most likely to be typed. As a result, this chapter explores QAC approaches which are sensitive to changing query popularity – where popularity is not predictable from long-term observations of past query popularity.
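The popularity-based baseline described above can be sketched in a few lines. This is only a minimal illustration, not the implementation used in this chapter or in the cited work: the function names `most_popular_completions` and `reciprocal_rank`, the tie-breaking rule and the toy query log are all assumptions introduced here. Reciprocal rank is used simply as one convenient way to express objective (ii), the rank at which the intended query appears.

```python
from collections import Counter


def most_popular_completions(query_log, prefix, k=4):
    """Rank past queries starting with `prefix` by their frequency in the log."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    # Descending frequency; ties broken alphabetically (an arbitrary, assumed rule).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:k]]


def reciprocal_rank(suggestions, intended_query):
    """Return 1/rank of the intended query, or 0.0 if it is not suggested."""
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == intended_query:
            return 1.0 / rank
    return 0.0


# Hypothetical toy log, invented purely for illustration.
toy_log = (
    ["kmart"] * 5
    + ["kim kardashian"] * 4
    + ["kenya"] * 2
    + ["kayak"] * 1
    + ["java"] * 3
)

one_char = most_popular_completions(toy_log, "k", k=3)
two_char = most_popular_completions(toy_log, "ke", k=3)
print(one_char, reciprocal_rank(one_char, "kenya"))
print(two_char, reciprocal_rank(two_char, "kenya"))
```

In this toy log, a user intending ‘kenya’ sees it only in third position after one keystroke, and in first position after two. Because the ranking depends only on aggregate past counts, nothing in this baseline can react to a sudden change in query popularity.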
As the web increasingly becomes a platform for real-time activity, news and media, time plays a central role in information behaviour. A substantial proportion of the daily query volume is the result of users turning to search engines for information about recent and ongoing events and phenomena (Adar et al., 2007; Kairam et al., 2013; Kulkarni et al., 2011). Indeed, 20% of daily Google queries have not been seen in the past 90 days¹, with 15% never having been seen before². While the long tail will inevitably account for a large proportion of these queries, many will be the result of short-term temporal events. We illustrate this with the following example.
Figure 4.1: Google auto-completion suggestions for the query prefix ‘k’. Screenshot taken September 23rd 2013, during the ongoing Westgate shopping mall terrorist attack in Kenya. Persistent browser cookies were cleared to avoid any individual personalisation effects.
¹ http://googleblog.blogspot.co.uk/2009/12/this-week-in-search-121809.html

Figure 4.1 shows the four completion suggestions offered by Google for the single-character query prefix ‘k’ on September 23rd, 2013. The list of completion suggestions indicates the historically most likely queries to be submitted with the prefix, possibly in the context of some undisclosed ranking features such as user location. Despite the recency and prominence of the Kenya Westgate mall terrorist attacks, the query ‘kenya’ ranks very low in the completion suggestions. In Figure 4.2, I show the dramatic change in query popularity caused by the events – ‘kenya’ becomes by far the most popular query. Yet, despite the fact that ‘kenya’ is trending because of the ongoing events, Google’s QAC fails to support users searching for information about the event, as it ranks completion suggestions based on the past query distribution – which is no longer appropriate. Further compounding this issue, QAC for short prefixes (i.e. 1-2 characters) is often unsuccessful as there are such a large number of possible completion suggestions (Bar-Yossef and Kraus, 2011). It is typically consistently popular ‘head’ queries that are provided as completion suggestions for such short prefixes (evidenced by the celebrity queries shown in this example). Therefore, there is a need to take the temporal aspect into account for effective QAC. This work is an attempt towards this objective.

Figure 4.2: Google Search Trends indicating temporal popularity of the completion suggestions in Figure 4.1 during August and September 2013.

Furthermore, queries that need to be included in completion suggestions fall into two main categories. The first category corresponds to predictably popular queries which are: (i) consistently popular, (ii) temporally recurring (e.g. at Christmas, in January, etc.) or (iii) related to known or foreseeable events and phenomena (e.g. TV episodes, sporting events, expected weather etc.). The second category corresponds to unpredictably popular queries related to entirely unforeseeable current events and phenomena (e.g. breaking news). Of course, although these may be unpredictable prior to the event occurring, once the query popularity is trending, further popularity may become predictable based on short-range trends. Indeed, queries are likely to switch between these categories over time, making longer-term predictions problematic.
Therefore, achieving optimal QAC effectiveness for all users, on average, is a trade-off between opposing objectives: (i) time-sensitivity, or recency, and (ii) robustness. Recency requires that completion suggestions include emerging and increasingly popular queries. Conversely, robustness requires that completion suggestions also reliably include long-term and consistently popular queries. These two goals are at odds – completion suggestions comprised only of short-term popular queries (e.g. in the last hour) might lead to lower ranking of many consistently popular queries. Alternatively, completion suggestions comprised of long-term popular queries (e.g. in the last year) will likely exclude the most recently popular queries. We propose an approach to address this trade-off by developing models which take recent evidence into account when possible.
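The tension between recency and robustness can be made concrete with a small sketch that ranks completions using only the queries observed inside a time window. This is an illustrative assumption, not one of the models developed in this chapter: the function `rank_in_window`, the hour-based timestamps and the synthetic log are all hypothetical.

```python
from collections import Counter


def rank_in_window(timed_log, prefix, now, window, k=2):
    """Rank completions for `prefix` using only queries in (now - window, now]."""
    counts = Counter(
        query
        for timestamp, query in timed_log
        if now - window < timestamp <= now and query.startswith(prefix)
    )
    # Descending frequency; ties broken alphabetically (an assumed rule).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:k]]


# Hypothetical log of (hour, query) pairs, invented for illustration.
# "kmart" is issued once every hour, "kayak" once every second hour,
# and "kenya" bursts to four queries per hour in the final five hours.
timed_log = [(hour, "kmart") for hour in range(1, 101)]
timed_log += [(hour, "kayak") for hour in range(2, 101, 2)]
timed_log += [(hour, "kenya") for hour in range(96, 101) for _ in range(4)]

now = 100
long_term = rank_in_window(timed_log, "k", now, window=100)
short_term = rank_in_window(timed_log, "k", now, window=5)
print("long window: ", long_term)
print("short window:", short_term)
```

With two suggestion slots, the long window omits the bursting query ‘kenya’ entirely, while the short window promotes it at the cost of dropping the steadily popular ‘kayak’. Neither fixed window satisfies both objectives, which is the trade-off described above.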