However, there are challenges in leveraging user-generated content for news query classification. Foremost of these challenges is how to relate postings from different newswire and user-generated content streams into a unified estimate of a query’s news-relatedness. Another challenge arises from the temporal aspects of news query classification. In particular, news-related queries tend not to occur in isolation, as many users use the same query to search for information about an event over time. Hence, an important challenge is to develop a classifier that can maintain or even increase its accuracy over time. In Chapter 7, we propose a novel news query classification approach and investigate both of these challenges.

4.7 Ranking News-Related Content

Figure 4.8: Ranking News-Related Content component within our news search framework.

The Ranking News-Related Content component aims to find relevant content that can then be integrated into the Web search results for display to the user. In particular, assuming that the universal search engine has access to one or more sources of continually updating news content, the goal is to find, for each of those sources, content that is relevant to the user’s news-related query. Figure 4.8 illustrates the Ranking News-Related Content component within our news search framework.

In Section 2.3, we described how an information retrieval (IR) system uses document weighting models to rank documents with respect to a query. Document weighting models have been shown to be effective, although they can fail for some queries (Savoy, 2007). In this thesis, we rank five different types of newswire and user-generated content, namely: newswire articles, blogs, tweets, Wikipedia pages and diggs. Prior works have shown that newswire articles can be effectively ranked using document weighting models, like those described in Section 2.3 (Voorhees et al., 2005). Moreover, the content of an individual digg consists of a newswire article title and snippet, hence it is reasonable to assume that document weighting models will also be effective for ranking diggs. No prior work explicitly examines the use of document weighting models for ranking Wikipedia pages for a user query. However, pseudo-relevance feedback techniques that are based upon document weighting models have been shown to be effective when using Wikipedia (Xu et al., 2009). Therefore, for these three sources (newswire articles, diggs and Wikipedia pages), we use the DPH document weighting model (Equation 2.9) to rank them. In contrast, blogs and tweets have some unique characteristics that are relevant to their ranking for news-related queries but are not accounted for by document weighting models. We describe these characteristics in the following two subsections.
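The division of sources described above can be sketched as a small routing step. This is an illustrative sketch only, not the framework's implementation: the function and variable names are hypothetical, and `weighting_model_score` is a simple term-count placeholder standing in for a document weighting model such as DPH (it does not implement Equation 2.9).

```python
from typing import List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)

# Sources that the text ranks with a document weighting model alone.
WEIGHTING_MODEL_SOURCES = {"newswire", "wikipedia", "digg"}


def weighting_model_score(query: str, text: str) -> float:
    """Placeholder score (assumption): total occurrences of query terms.

    A real system would plug in a document weighting model here.
    """
    terms = text.lower().split()
    return float(sum(terms.count(q) for q in query.lower().split()))


def rank_source(source: str, query: str, docs: List[Doc]) -> List[str]:
    """Rank one source's documents for the query, best first."""
    if source not in WEIGHTING_MODEL_SOURCES:
        # Blogs and tweets have characteristics that a weighting
        # model alone does not capture, so they are routed elsewhere.
        raise NotImplementedError(f"{source} needs a source-specific ranker")
    scored = [(weighting_model_score(query, text), doc_id) for doc_id, text in docs]
    matching = [(score, doc_id) for score, doc_id in scored if score > 0]
    matching.sort(key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in matching]
```

Each source is ranked independently, so the per-source rankings can later be merged or displayed separately.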

4.7.1 Blog Posts

The blogosphere is a prime example of user-generated content. The term blogosphere refers to all of the blogs on the Web. The term blog is a contraction of the word ‘weblog’ that describes the act of someone using the Web to record their thoughts on a particular subject. Importantly, one blog may contain multiple blog posts in chronological order, where each blog post is normally a statement of opinion or viewpoint on a given subject by the blogger.

The large volume of blog posts published each day may make the blogosphere a valuable source of information about current news stories, as bloggers post about the stories that interest them. Indeed, a poll by Technorati has shown that 30% of their respondents blogged on news-related topics (Sussman, 2009). Work by Mishne & de Rijke (2006) also showed a strong link between blog searches and recent news: almost 20% of searches for blogs were news-related, indicating that news is popular in the blogosphere. Moreover, Thelwall (2006) explored how bloggers reacted to the London bombings, showing that bloggers respond quickly to news as it happens.

Ranking blogs or blog posts can be challenging, however. In particular, typical document weighting models only consider the relatedness between the query and the document (see Section 2.3), while in a blog setting it might be effective to account for the unique characteristics of blog posts when ranking (see Section 3.2.2). For instance, for news-related queries, end-users might be interested in opinionated blog posts or those that are more authoritative. Hence, to rank blogs or blog posts, we require more advanced ranking approaches that can account for these characteristics.

4.7.2 Real-time Tweet Search

Twitter is a popular microblogging service that provides an easy means for users to publicly and instantly post messages, known as tweets, of no more than 140 characters on the Web. The content posted to Twitter varies greatly, from informative news snippets to spam and scams (Kwak et al., 2010). Twitter provides a search service to access relatively recent tweets. In general, tweets retrieved using Twitter search are presented in reverse chronological order (Nagmoti et al., 2010), with the exception of ‘promoted’ tweets paid for by companies (Rickns, 2010).

Ranking in a microblog setting differs markedly from traditional information retrieval search tasks. In particular, rather than being ranked in order of relevance (Manning et al., 2008), microblog posts are often returned in reverse chronological order (Nagmoti et al., 2010). The reason behind this different ranking approach is that information needs posed to microblog search engines hold a strong temporal component. Indeed, search in a microblog setting can be considered as answering the question ‘find me the most recent information about X’. It follows that, if we consider Twitter as a source of content to display for news-related user queries, its main value would lie in very recent content.

In a universal Web search setting, it might be advantageous to return groups of related tweets or provide live tweet updates as they are posted. Indeed, one of Google’s search enhancements was a ‘Latest Results’ feature that would display scrolling updates in reverse-chronological order for a news-related query, including tweets (Singhal, 2010). Hence, for breaking news stories, it may be useful to display tweets about the story in reverse-chronological order.

However, even when displaying tweets in reverse-chronological order, tweets need to be ranked for the user query. In particular, for any given query, many of the tweets that contain one or more of that query’s terms will be irrelevant. Hence, tweets need to be ranked for the query and the most relevant of those selected (Duan et al., 2010). Indeed, Twitter provides similar functionality with its ‘top tweets’ feature.
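The two-stage idea described here (first select the most relevant tweets for the query, then display the selected tweets newest first) can be sketched as follows. This is a minimal sketch under stated assumptions: the `Tweet` type, the function names, and the term-overlap `relevance` score are all hypothetical stand-ins, not the ranking approach used by Twitter or by this thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tweet:
    tweet_id: str
    text: str
    timestamp: int  # seconds since epoch


def relevance(query: str, tweet: Tweet) -> float:
    """Hypothetical score: fraction of distinct query terms found in the tweet."""
    query_terms = set(query.lower().split())
    tweet_terms = set(tweet.text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & tweet_terms) / len(query_terms)


def select_and_order(query: str, tweets: List[Tweet], k: int) -> List[Tweet]:
    """Keep the k most relevant matching tweets, then order them newest first."""
    candidates = [t for t in tweets if relevance(query, t) > 0]
    top_k = sorted(candidates, key=lambda t: relevance(query, t), reverse=True)[:k]
    return sorted(top_k, key=lambda t: t.timestamp, reverse=True)
```

Note that a tweet matching only one query term can be more recent than every selected tweet and still be excluded, which is the point of ranking before applying the chronological ordering.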

Ranking tweets effectively for a query is challenging, however, since tweets have a unique structure and characteristics that affect their relevance and quality. Firstly, tweets are by design very short, with a maximum length of only 140 characters. This short length may make document weighting models such as BM25 (Equation 2.4) less effective, since the term frequency component of such models provides little information when each term is likely to appear only once in a tweet. Furthermore, the shorter tweet length may make vocabulary mismatch between the query and relevant tweets more acute, reducing the recall of the tweet rankings produced with standard document weighting models. Next, the