tweets themselves contain Twitter-specific vocabulary. For instance, when Twitter users post, hashtags, i.e., words beginning with the character ‘#’, are used to denote topics and concepts. These link together many tweets about the same topic. Indeed, it has been reported that over 15% of tweets contain hashtags (Efron, 2011). Similarly, mentions, i.e., user names prefixed with the ‘@’ symbol, are used to indicate replies or direct messages to the user in question. Additionally, Twitter allows a user to retweet another’s tweets, i.e., post an exact copy of another user’s tweet, normally with a reference to the source user (Boyd et al., 2010). Finally, one of the key characteristics of tweets is the inclusion of links to related content. For example, in a news vertical context, a tweet with little textual content might be considered relevant if it links to a useful news article. Hence, to rank tweets, a more advanced approach than using a document weighting model is needed.
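As an illustration of the tweet characteristics described above, the following minimal sketch extracts hashtags, mentions, links and a retweet flag from raw tweet text. The names `TweetFeatures` and `extract_tweet_features` are hypothetical, and the regular expressions are simplifying assumptions (in particular, retweets are detected only through the common "RT @user" prefix convention); this is not the feature extraction used in this thesis.

```python
import re
from dataclasses import dataclass
from typing import List

# Simplified patterns (assumptions): real tweets need more careful tokenisation.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"^RT\s+@\w+", re.IGNORECASE)


@dataclass
class TweetFeatures:
    hashtags: List[str]
    mentions: List[str]
    urls: List[str]
    is_retweet: bool


def extract_tweet_features(text: str) -> TweetFeatures:
    """Extract Twitter-specific vocabulary from the text of a single tweet."""
    urls = URL_RE.findall(text)
    # Remove links first so that '#' or '@' inside a URL is not miscounted.
    stripped = URL_RE.sub(" ", text)
    return TweetFeatures(
        hashtags=HASHTAG_RE.findall(stripped),
        mentions=MENTION_RE.findall(stripped),
        urls=urls,
        is_retweet=bool(RETWEET_RE.match(text.strip())),
    )


example = extract_tweet_features(
    "RT @bbcnews: Flooding in #Glasgow http://example.com/a#top"
)
```

Counts or indicators derived from such fields (e.g. whether a tweet contains a link) are the kind of signal that a learned ranking model could use alongside a document weighting model score.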
In Chapter 8 of this thesis, we investigate how to effectively rank both blog posts and tweets. In particular, we propose learning to rank approaches (see Section 2.5.2) that leverage the unique characteristics of tweets and blog posts. These approaches aim to better capture the aspects of blog posts and tweets that may make them relevant, by representing them as sets of novel real-time stream features.
4.8 News-Related Content Integration
Figure 4.9: News-Related Content Integration component within our news search framework.
The News-Related Content Integration component is responsible for combining news-related content into the Web search ranking. Figure 4.9 illustrates the News-Related Content Integration component within our news search framework. From Figure 4.9, we observe that the News-Related Content Integration component has three inputs. First, the Web search ranking is provided by the Web vertical. Second, news-related content from newswire and user-generated sources that was ranked for the user query, i.e. newswire articles, blogs, tweets, Wikipedia pages and diggs in our case, is provided by the Ranking News-Related Content component. Third, a ranking of the most important events of the moment, represented by newswire articles, is provided by the Top Events Identification component. The News-Related Content Integration component selects some of the top-ranked news-related content, possibly in addition to the top news articles that represent currently important events, and merges them into the Web search ranking for the user. The aim is to increase the coverage of the event that the user is searching for. The News-Related Content Integration component is vital to the search process. In particular, the idea behind a vertical within a universal Web search engine is to enhance the Web search ranking with specialist content (Arguello et al., 2009). Meanwhile, the majority of search users never look beyond the first few Web search results (Brin & Page, 1998). Therefore, we require a strategy to select the best of the content that we have ranked for display to the user. Indeed, in our previous ‘olympic swimming’ example (see Figure 4.1), even though there was a high probability that the user was looking for news-related results, only three newswire articles were integrated into the search ranking within a news results box.
The real-time nature of news reporting motivates the integration of user-generated content in addition to newswire content for some queries. In particular, as news stories are now being reported first in user-generated content sources such as Twitter, for a short period those sources will be the only ones that contain relevant content. Meanwhile, in cases where there are newswire articles that can be returned, returning user-generated content instead of, or in addition to, them may still be advantageous. For instance, by returning a selection of blog posts for a query relating to a U.S. political event, we might be able to provide both Democratic and Republican viewpoints. Furthermore, for fast-paced events, e.g. sporting matches, user-generated content might allow a live stream or timeline to be added to the results. It is of note that not all news-related queries are related to a specific event. Recall that in Section 4.3, we identified a class of generic news queries, e.g. ‘cnn news’. For this class of queries, we might not want to add any news-related content at all. Instead, it might be valuable to enhance the Web search results with the ranking of current news events from the Top Events Identification component, in a similar manner to that shown previously in Figure 4.4.
In Chapter 9 of this thesis, we examine how to best integrate news-related content into the Web search results. In particular, we perform a novel large-scale user study evaluating whether the integration of user-generated content into the Web search ranking can better satisfy end-users than returning unaltered Web search rankings. We adapt the CORI resource selection/federated search (Craswell, 2000) approach to unify the document scores for documents ranked from different sources. We also develop a novel framework for comparative ranking evaluation that simulates the ranking presentation by major Web search engines and facilitates preference assessment across rankings by workers (see Section 9.3). Using workers recruited from the crowdsourcing platform Amazon’s Mechanical Turk, we evaluate whether the integration of top events, newswire articles, blog posts, Digg posts, Twitter tweets and Wikipedia pages better satisfies end-users than unaltered Web search rankings for news-related queries.
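To make the idea of unifying document scores across sources concrete, the sketch below applies the commonly described CORI result-merging heuristic, in which a normalised document score D' is combined with a normalised source score C' as (D' + 0.4 · D' · C') / 1.4. This is only an illustration: the adaptation used in this thesis is described in Chapter 9 and may differ, the function names are hypothetical, and normalising by the minimum and maximum of the returned scores is a simplifying assumption.

```python
from typing import Dict, List, Tuple


def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1]; return 0.0 when all values are identical."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def cori_merge(
    rankings: Dict[str, List[Tuple[str, float]]],
    source_scores: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Merge per-source rankings into one list of (doc_id, source, score).

    rankings maps a source name to its (doc_id, score) list; source_scores
    maps a source name to a score expressing how useful that source is for
    the query (how this is estimated is outside the scope of this sketch).
    """
    c_lo, c_hi = min(source_scores.values()), max(source_scores.values())
    merged = []
    for source, docs in rankings.items():
        if not docs:
            continue
        c_norm = min_max(source_scores[source], c_lo, c_hi)
        d_lo = min(score for _, score in docs)
        d_hi = max(score for _, score in docs)
        for doc_id, score in docs:
            d_norm = min_max(score, d_lo, d_hi)
            # Standard CORI merge heuristic with the usual 0.4 weight.
            merged.append((doc_id, source, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged


merged = cori_merge(
    {"news": [("n1", 12.0), ("n2", 6.0)], "tweets": [("t1", 0.9), ("t2", 0.3)]},
    {"news": 0.8, "tweets": 0.2},
)
```

The point of the normalisation is that raw scores from different sources (here, hypothetical newswire and tweet rankings on very different scales) become comparable before they are interleaved.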
4.9 Conclusions
In this chapter, we proposed a new news search framework to describe the search process within a universal Web search engine for news-related queries. This news search framework comprises four components. For each component, we have described its functionality, identified the challenges involved and motivated the addition of user-generated content. In particular, in Section 4.2, we first discussed the real-time nature of news-related queries and the consequences for universal Web search engines. In Section 4.3, we presented a taxonomy to describe the types of news-related query that we consider in this thesis. Section 4.4 presented an overview of our news search framework and its four components, while Sections 4.5 to 4.8 detailed the challenges involved in each. In the next chapter, we describe the datasets that we use later to evaluate the approaches that we develop for each component of our news search framework.
Chapter 5
Evaluation Datasets and Crowdsourcing
5.1 Introduction
To investigate the proposed news search framework described in Chapter 4, we first need datasets upon which we can evaluate each component. In a search setting, a typical evaluation dataset comprises three components, namely: one or more document corpora from a particular time-frame that are the subject of the search task; a set of search topics that describe the user information need; and a set of assessments that define the ‘correct’ answer to each topic, providing a ground truth to facilitate evaluation. In this thesis, we require one or more evaluation datasets to evaluate each of the four components of our news search framework.
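The three-part structure of an evaluation dataset described above can be sketched as a small data structure. The class and field names below are hypothetical and chosen only for illustration, and precision at rank k is used merely as one example of a measure that the assessments make computable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Topic:
    topic_id: str
    query: str  # textual description of the user information need


@dataclass
class EvaluationDataset:
    # corpus name -> {document id -> document text}
    corpora: Dict[str, Dict[str, str]]
    topics: List[Topic]
    # (topic id, document id) -> relevance label (0 means not relevant)
    assessments: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def relevant_docs(self, topic_id: str) -> Set[str]:
        """Return the documents assessed as relevant for a topic."""
        return {
            doc_id
            for (tid, doc_id), label in self.assessments.items()
            if tid == topic_id and label > 0
        }


def precision_at_k(
    dataset: EvaluationDataset, topic_id: str, ranking: List[str], k: int
) -> float:
    """Fraction of the top-k ranked documents that are assessed as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = dataset.relevant_docs(topic_id)
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


dataset = EvaluationDataset(
    corpora={"blogs": {"d1": "...", "d2": "...", "d3": "..."}},
    topics=[Topic("T1", "olympic swimming")],
    assessments={("T1", "d1"): 1, ("T1", "d2"): 0, ("T1", "d3"): 1},
)
```

With assessments acting as the ground truth, any system ranking for topic `T1` can be scored in the same way, which is what makes comparison between approaches possible.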
Information retrieval (IR) tasks like ad hoc Web search typically use standard datasets for evaluation, such as those used in IR evaluation workshops and forums like TREC (see Section 2.4.3). These datasets are advantageous as they enable comparison between different approaches to the same task using the same data. Where possible, we use standard evaluation datasets, e.g. the TREC 2011 Microblog track dataset (Ounis et al., 2011), to evaluate the components of our news search framework.
However, the news search process of a universal search engine has not previously been the focus of research by IR evaluation forums and similar venues, while prior work that has investigated similar tasks has predominantly used proprietary or private datasets. In fact, one of the main contributions of this thesis is the development and evaluation of approaches that leverage multiple temporally aligned content streams (corpora) simultaneously, while most existing datasets examine a single stream only. Hence, except when investigating the Top Events Identification and Ranking News-Related Content components, where we use TREC-derived datasets, we develop new datasets for evaluation.
Dataset generation is a time-consuming and expensive process due to the need to employ human assessors. Crowdsourcing has been championed as a fast and cheap means to generate new datasets (Alonso et al., 2008). Crowdsourcing in general is the act of outsourcing tasks, traditionally performed by a specialist person or group, to a large undefined group of people or community (referred to as the “crowd”), through an open call (Howe, 2010). We extensively use crowdsourcing in this thesis to generate datasets where none are available.
The aims of this chapter are two-fold. First, we describe each of the evaluation datasets that we use in the subsequent four experimental chapters in terms of the topics, corpora and assessments that comprise them. Second, for the subset of those datasets that include assessments that we have developed ourselves using crowdsourcing, we provide a short description of how those assessments were developed, tested and validated.
The remainder of this chapter is structured as follows:
• In Section 5.2, we provide a listing of each dataset that we use in this thesis and detail their statistics.
• Section 5.3 discusses the field of crowdsourcing and the challenges to be overcome when developing datasets using the medium.
• In Section 5.4, we describe the creation of the first of our crowdsourced assessments. These assessments facilitate the evaluation of the Top Events Identification component later in Chapter 6.
• Section 5.5 details the creation of our second set of crowdsourced assessments that facilitate the evaluation of the News Query Classification component of our news search framework, which is investigated in Chapter 7.
• In Section 5.6, we describe our third set of crowdsourced assessments for the evaluation of blog post ranking. We use these assessments later in Chapter 8.
• Section 5.7 details the creation of our final set of crowdsourced assessments that we use in Chapter 9.