Recall from Section 4.5 that the purpose of the Top Events Identification (TEI) component of our news search framework is to identify the most important events that are occurring at any given moment, where events are represented by newswire articles. We propose to investigate whether the top events of the moment can be identified using real-time discussions in user-generated content sources. To evaluate this, we use datasets that contain three elements: newswire articles to represent events, a stream of user-generated content for estimating the importance of each newswire article, and assessments identifying those newswire articles that represent important events.

Table 5.3 summarises the contents of each dataset that we use in this thesis to evaluate the TEI component of our news search framework, in addition to the time-frame that each covers. As we can see from Table 5.3, we use five datasets. The first and second datasets are TREC datasets designed for

5.2 Datasets Overview

| Dataset | TREC Dataset? | Crowdsourced Assessments? | # of Assessments | Time-Range | Newswire Articles | User-Generated Content |
|---|---|---|---|---|---|---|
| BlogTrackTopNews_2009 | ✓ | ✗ | 10,887 | 01/01/08 → 28/02/09 | NYT08 | Blogs08 |
| BlogTrackTopNews-Phase1_2010 | ✓ | ✓* | 8,000 | 01/01/08 → 28/02/09 | TRC2 | Blogs08 |
| TwitterTopNews-NYT_Dec2011 | ✗ | ✗ | 1,456 | 17/12/11 → 31/12/11 | NYT_Dec2011 | Tweets_Dec2011/Jan2012 |
| TwitterTopNews-NYT_Jan2012 | ✗ | ✗ | 2,310 | 05/01/12 → 12/01/12 | NYT_Jan2012 | Tweets_Dec2011/Jan2012 |
| TwitterTopNews-Reuters_Jan2012 | ✗ | ✗ | 3,102 | 05/01/12 → 12/01/12 | Reuters_Jan2012 | Tweets_Dec2011/Jan2012 |

Table 5.3: Top Events Identification datasets used in this thesis. Datasets that were produced by TREC or that contain crowdsourced relevance assessments are denoted with a ✓ in the associated column. Where crowdsourced assessments are marked with a ✓*, the dataset assessments were crowdsourced by ourselves on behalf of TREC.

the Blog track 2009 and 2010 top news stories identification task (see Section 3.2.3). We denote these datasets BlogTrackTopNews_2009 and BlogTrackTopNews-Phase1_2010, respectively. They contain newswire articles from the New York Times (NYT08) and Reuters (TRC2) news providers that were published during 2008. The newswire articles published on specific ‘topic’ days from these news providers were used to represent events to be ranked. The datasets also contain the Blogs08 blog post corpus, which systems use as evidence to rank the newswire articles for each topic day. These two datasets come with pre-provided newswire article importance assessments for each topic day: for a day of interest, each newswire article from that day was assessed as important or not for that day. In particular, the BlogTrackTopNews_2009 dataset provides 10,887 assessments for 55 topic days (Macdonald, Soboroff & Ounis, 2009), while the BlogTrackTopNews-Phase1_2010 dataset contains 8,000 assessments for 50 topic days (Ounis et al., 2010). Recall that in the preface to this section, we discussed how, for the TREC 2010 Blog track top news stories identification task datasets, we crowdsourced the relevance assessments on behalf of TREC. BlogTrackTopNews-Phase1_2010 is one of these two datasets. Later, in Section 5.4, we describe our methodology for creating these assessments and the validation strategies that we employ, and we evaluate the quality of the assessments produced. This is the first of the four datasets that use crowdsourcing to generate assessments.
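To make the layout of these two datasets concrete, the following is a minimal sketch of how one topic day might be represented in code. It is an illustration only: the class name `TopicDay`, its fields, the helper `count_assessments` and the toy data are all assumptions introduced here, and do not reflect the file formats distributed by TREC.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class TopicDay:
    """One 'topic' day: candidate newswire articles plus binary importance labels.

    Hypothetical structure for illustration; not the TREC distribution format.
    """
    day: date
    # article id -> headline, for each newswire article published on this day
    articles: Dict[str, str]
    # article id -> True if the article was assessed as important for this day
    important: Dict[str, bool] = field(default_factory=dict)

    def judged(self) -> List[str]:
        """Ids of articles from this day that have an assessment, sorted."""
        return sorted(a for a in self.articles if a in self.important)

    def important_ids(self) -> List[str]:
        """Ids of articles assessed as important for this day."""
        return [a for a in self.judged() if self.important[a]]


def count_assessments(topic_days: List[TopicDay]) -> int:
    """Total number of (topic day, article) assessments across a dataset."""
    return sum(len(td.judged()) for td in topic_days)


# Hypothetical toy data, not taken from either TREC dataset.
example_day = TopicDay(
    day=date(2008, 5, 1),
    articles={"a1": "Headline one", "a2": "Headline two", "a3": "Headline three"},
    important={"a1": True, "a2": False},
)
```

Under this representation, a system would rank the articles of each `TopicDay` using the blog post corpus as evidence, and the `important` labels would serve as the ground truth for that day.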

Returning to Table 5.3, we see that the remaining three datasets instead contain a Twitter corpus for systems to leverage when estimating the importance of events, i.e. the Tweets_Dec2011/Jan2012 corpus. These datasets once again use newswire articles from the New York Times and Reuters to represent the events to be ranked. Note that these datasets were not developed by TREC; rather, we developed them ourselves, since no other datasets for the task existed. The aim of producing these additional datasets was to facilitate the evaluation of top events identification on tweets. We denote these three datasets as TwitterTopNews-NYT_Dec2011, TwitterTopNews-NYT_Jan2012 and TwitterTopNews-Reuters_Jan2012, respectively. For these datasets, we generated newswire article importance assessments for different points in time based upon

the ordering of articles present on the homepages of the news providers whose newswire articles are to be ranked. In effect, this means that our importance assessments are created by the newspaper editor who selected each article for display. In particular, from each homepage, we scrape the set of current newswire articles for our system to rank and also the ground truth ranking of stories against which we will compare. We assign each newswire article present on the homepage a score based upon its placement and prominence. These scores range from ‘3’ to ‘0’, where a score of ‘3’ is the most important and a score of ‘0’ is the least important. Figure 5.2 illustrates this process for the New York Times homepage on the 10th of January 2012. The primary feature that is used to determine the score of a story is the font size of the headline: the larger the font, the higher the score that is assigned. Stories that appear further down the page, editorials and special interest stories are also demoted. Also note that there is only ever one score ‘3’ story, which is considered to be the ‘top’ story of the moment. TwitterTopNews-NYT_Dec2011 uses 45 homepages downloaded from the New York Times between the 17th and the 31st of December 2011 for assessments. TwitterTopNews-NYT_Jan2012 uses 55 homepages from the New York Times between the 5th and 12th of January 2012, while TwitterTopNews-Reuters_Jan2012 uses 63 homepages from Thomson Reuters over the same period.
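The homepage-based scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used to build the datasets: the font-size thresholds (`large_px`, `medium_px`), the page-position cut-off (`fold_position`), the one-level demotion and the tie-breaking rule for the single score ‘3’ story are all hypothetical choices, as are the class and function names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HomepageStory:
    """A story scraped from a homepage (hypothetical fields for illustration)."""
    article_id: str
    headline_font_px: float          # rendered headline font size
    position: int                    # 0 = first story on the page
    is_editorial: bool = False
    is_special_interest: bool = False


def score_homepage(stories: List[HomepageStory],
                   large_px: float = 24.0,    # assumed threshold
                   medium_px: float = 16.0,   # assumed threshold
                   fold_position: int = 10    # assumed cut-off
                   ) -> Dict[str, int]:
    """Assign each story a score from 0 (least important) to 3 (top story)."""
    if not stories:
        return {}
    scores: Dict[str, int] = {}
    for s in stories:
        # Font size is the primary feature: a larger headline earns a higher score.
        if s.headline_font_px >= large_px:
            base = 2
        elif s.headline_font_px >= medium_px:
            base = 1
        else:
            base = 0
        # Demote stories far down the page, editorials and special interest stories.
        if s.position >= fold_position or s.is_editorial or s.is_special_interest:
            base = max(0, base - 1)
        scores[s.article_id] = base
    # Exactly one story is the 'top' story: best score, then largest font,
    # then earliest position (an assumed tie-breaking rule).
    top = max(stories, key=lambda s: (scores[s.article_id],
                                      s.headline_font_px,
                                      -s.position))
    scores[top.article_id] = 3
    return scores


# Hypothetical homepage, not real scraped data.
example_scores = score_homepage([
    HomepageStory("a", 36.0, 0),
    HomepageStory("b", 24.0, 1),
    HomepageStory("c", 18.0, 3),
    HomepageStory("d", 24.0, 2, is_editorial=True),
    HomepageStory("e", 12.0, 12),
])
```

In this toy example, story `a` has the largest headline and becomes the single score ‘3’ story, while the editorial `d` is demoted by one level despite having the same headline size as `b`.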