
| Interface | Query Set | Accuracy | κfree | κfleiss |
|---|---|---|---|---|
| Basic | NQC2006test | 0.9684 | 0.7373 | 0.3525 |
| Headline Inline | NQC2006test | 0.9474 | 0.7866 | 0.3018 |
| Headline+Summary | NQC2006test | 0.9684 | 0.83 | 0.5327 |
| Link-Supported | NQC2006test | 0.9684 | 0.8367 | 0.5341 |
| Link-Supported | NQC2006full | 0.9748 | 0.8358 | 0.7677 |

Table 5.11: Quality measures for news query classification on the approximately 100 query NQC2006test with varying interfaces, in addition to the over 1000 query NQC2006full using the Link-Supported interface.
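To make the two agreement columns of Table 5.11 concrete, the following is a minimal sketch of how the two statistics can be computed from per-query label counts. It assumes that κfree denotes a free-marginal multirater kappa, in which chance agreement is fixed at 1/k for k categories, and that κfleiss is the standard Fleiss' kappa. The function names and the toy counts are illustrative and are not taken from the evaluation itself.

```python
from typing import List

def observed_agreement(counts: List[List[int]]) -> float:
    """Mean pairwise agreement; counts[i][j] = number of raters
    assigning item i to category j (same rater count for every item)."""
    n = sum(counts[0])  # raters per item, assumed constant
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    return sum(per_item) / len(per_item)

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Chance agreement is estimated from the observed category proportions."""
    n = sum(counts[0])
    total = n * len(counts)
    k = len(counts[0])
    proportions = [sum(row[j] for row in counts) / total for j in range(k)]
    p_e = sum(p * p for p in proportions)
    # Undefined (division by zero) if every label falls in one category.
    return (observed_agreement(counts) - p_e) / (1 - p_e)

def free_marginal_kappa(counts: List[List[int]]) -> float:
    """Chance agreement is fixed at 1/k, independent of the label distribution."""
    p_e = 1.0 / len(counts[0])
    return (observed_agreement(counts) - p_e) / (1 - p_e)

# Toy example: 4 items, 3 raters, 2 categories, heavily skewed labels.
toy_counts = [[3, 0], [3, 0], [3, 0], [2, 1]]
kappa_fleiss = fleiss_kappa(toy_counts)
kappa_free = free_marginal_kappa(toy_counts)
```

On skewed label distributions such as the toy example, the Fleiss estimate of chance agreement is high, so κfleiss can be far lower than κfree even when raters mostly agree, which is the same direction of difference seen in the table.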

5.6 Crowdsourcing Task: Ranking Blog Posts for a News Story

In this section, we describe how we crowdsource the blog post ranking assessments as part of the BlogTrack2010TopNews-Phase2 dataset. As before, we structure the section as follows. We begin by describing the task to be crowdsourced in Section 5.6.1. Section 5.6.2 details the interface that we use to collect blog post ranking assessments. In Section 5.6.3, we describe how we validate the assessments produced by our workers. Section 5.6.4 discusses the configuration of our crowdsourced tasks. Finally, in Section 5.6.5, we report on the quality of the assessments produced.

5.6.1 Task Description

Recall from Table 5.5 that BlogTrack2010TopNews-Phase2 is a TREC dataset, developed for the 2010 top news stories identification task, rather than specifically developed to evaluate a component of our framework. The TREC task investigated both blog post ranking and diversification for news article headlines (Ounis et al., 2010). Hence, the crowdsourced assessment task is to judge each of the pooled blog posts as relevant, possibly relevant or not relevant to a newswire article (facilitating relevance evaluation), and also to suggest perspectives that describe each blog post (facilitating diversity evaluation)¹. Here, perspectives come from a set of nine categories that a blog post might be considered to belong to, e.g. a blog post might be considered to contain Republican or Democrat viewpoints for political stories. The nine perspectives that were defined by TREC for the task are: Factual Account; Opinionated Positive; Opinionated Negative; Opinionated Mixed; Short summary/Quick bites; Live Blog; In-depth analysis; Aftermath; and Predictions.

¹ The approaches for blog post ranking that we investigate in Chapter 8 focus on relevancy and do not attempt diversification,

Figure 5.7: A screenshot of the external judging interface shown to workers within the instructions.

In total, 7,975 blog posts from the Blogs08 corpus are to be assessed (see Table 5.5). These blog posts were selected via a pooling strategy for 68 newswire articles from the TRC2 Thomson Reuters newswire corpus (see Section 5.2.3). In particular, for all 68 newswire articles, a series of blog post ranking systems (submitted to TREC 2010) produced three rankings of blog posts using a representation of those newswire articles as queries: one ranking containing posts published before each newswire article, one containing posts published up to and including the following day, and one containing posts published up to a week afterwards. Together, these represent a system ranking for a newswire article at different times, i.e. as the story that the newswire article refers to matures. The top 20 posts from each of these three rankings were combined to form the 7,975 blog post pool to be assessed.
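The pooling step for a single newswire article can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the TREC pooling code: the function name `build_pool`, the key structure (system, time cutoff) and the post identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

def build_pool(rankings: Dict[Tuple[str, str], List[str]], depth: int = 20) -> Set[str]:
    """Union of the top `depth` posts from every ranking of one article.

    `rankings` maps a hypothetical (system, time cutoff) pair to a list of
    post identifiers ordered from most to least relevant.
    """
    pool: Set[str] = set()
    for ranked_posts in rankings.values():
        pool.update(ranked_posts[:depth])  # duplicates across rankings collapse
    return pool

# Toy example with one system, three time cutoffs and a pool depth of 2.
toy_rankings = {
    ("sysA", "before"): ["p1", "p2", "p3"],
    ("sysA", "+1 day"): ["p2", "p4"],
    ("sysA", "+1 week"): ["p4", "p5", "p1"],
}
toy_pool = build_pool(toy_rankings, depth=2)
```

Because posts that appear in several rankings are pooled only once, the pool for an article is usually smaller than the number of rankings multiplied by the pool depth.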

5.6.2 Interface Design for Blog Post Ranking

To crowdsource assessments for each of the 7,975 blog posts, we develop an assessment interface that both renders the blog posts to the user and records that user’s assessment. To develop this interface, we follow the iterative design methodology proposed by Alonso & Baeza-Yates (2011). In particular, we first create a small test set comprising 200 blog posts returned for two news stories that we manually assess. We then iteratively develop different prototype interfaces, evaluate worker performance and subsequently make improvements. For brevity, we omit the intervening iterations of the interface. Figure 5.7 illustrates the final interface produced. From Figure 5.7, we see that the interface is divided into three components: the instructions and newswire article headline at the top; the assessment options down the left-hand side; and the rendering of the post to be assessed on the right-hand side. In this example, the instructions at the top of the interface are hidden. Clicking on the instructions within the interface reveals them to the user. See Appendix B.3 for the full instruction set provided to each worker.


Notably, one outcome of the iterative design process was that we increased the number of assessments that each worker makes in one sitting to 20. Indeed, when assessing the 200 blog posts in our small test set, we found that the completion time when assessing 20 posts was reduced by 79.2% over assessing a single post at a time. We also observed that the rendering of blog posts from the Blogs08 corpus can be difficult. In particular, the blog posts within that corpus only contain the raw HTML. When rendering, additional content such as images and CSS files needs to be loaded from the original website, which may have changed or might have been removed. For example, Figure 5.8 (a) illustrates an example blog post when presented in original HTML form. As we can see, the page template has changed markedly, such that the main page content is no longer visible (one would need to scroll down to get to the main article content). To counteract this, as well as to decrease loading delays, by default we show workers a cleaned version of the Web page, created by extracting only the text within <h> or <p> tags. Indeed, Figure 5.8 (b) shows the cleaned version of the same page. However, we do provide a full HTML rendering that can be loaded by pressing the ‘Full Post’ button on the left component of the interface (see Figure 5.7), in case the text cleaning mistakenly removes the main content of the article.
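The text cleaning described above can be approximated with Python's standard `html.parser` module. This is a minimal sketch rather than the cleaning code actually used: it assumes that the heading tags are h1 to h6, and the names `TextCleaner` and `clean_post` are illustrative. It also does not repair malformed markup, so an unclosed <p> tag would cause the text that follows it to be kept as well.

```python
from html.parser import HTMLParser

# Assumption: "<h>" in the description stands for the heading tags h1..h6.
KEPT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextCleaner(HTMLParser):
    """Collects the text found inside heading and paragraph elements."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0     # number of currently open kept tags
        self.blocks = []   # one entry per extracted heading or paragraph
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in KEPT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in KEPT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.flush()

    def handle_data(self, data):
        if self.depth > 0:  # text outside kept tags (menus, scripts) is dropped
            self._buffer.append(data)

    def flush(self) -> None:
        text = " ".join("".join(self._buffer).split())  # normalise whitespace
        if text:
            self.blocks.append(text)
        self._buffer = []

def clean_post(html: str) -> str:
    parser = TextCleaner()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text left over from an unclosed tag
    return "\n".join(parser.blocks)

sample = ("<html><head><style>p{}</style></head><body><div>nav menu</div>"
          "<h2>Title</h2><p>First <b>bold</b> para.</p>"
          "<script>var x;</script></body></html>")
cleaned = clean_post(sample)
```

Inline markup nested inside a paragraph, such as the bold element in the sample, contributes its text to the surrounding block, whereas navigation, style and script content outside headings and paragraphs is discarded.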

5.6.3 Validation of Worker Assessments

As before, to ensure the quality of the relevance assessments produced, we employ validation strategies. In this case, we use a form of gold judgement validation. In particular, typical gold standard validation involves the prior creation of a gold-standard judgement set with which to test workers, known as a ‘honey pot’ (see Section 5.3.2). Our approach is similar, except that we wait until all of the work is completed, such that we can better identify those documents that workers found difficult, i.e. disagreed on. For each set of 20 blog posts, we select three posts to be validated against a gold standard, i.e. one judged relevant, one judged possibly relevant and one judged not relevant. For each of these selected posts, the author assessed its relevance to the associated newswire article, forming the gold standard. If more than one of these three assessments did not match the gold standard, then all 20 assessments were rejected and re-posted for another worker to complete, on the grounds that the work was not of sufficient quality. Overall, this resulted in a gold standard set of roughly 15% of all blog posts to be assessed (1197/7975), which took in the region of 8 hours to create. This is naturally longer than it would take to create a normal 5% gold standard set. However, by using a larger and more evenly distributed gold standard, we have greater confidence in the reliability of the gold standard, and hence the quality of the resulting assessments.
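The acceptance rule for a single HIT can be expressed in a few lines. The sketch below is illustrative only: the function name `accept_hit`, the dictionary-based data layout and the post identifiers are assumptions, while the threshold of at most one mismatch among the validated posts follows the description above.

```python
from typing import Dict

def accept_hit(worker_labels: Dict[str, str],
               gold_labels: Dict[str, str],
               max_mismatches: int = 1) -> bool:
    """Accept a HIT unless more than `max_mismatches` gold posts disagree.

    `gold_labels` holds the author's judgement for the validated posts of
    this HIT; `worker_labels` holds the worker's judgement for every post.
    """
    mismatches = sum(
        1 for post_id, gold in gold_labels.items()
        if worker_labels.get(post_id) != gold  # a missing label counts as a mismatch
    )
    return mismatches <= max_mismatches

# Hypothetical gold standard for the three validated posts of one HIT.
gold = {"post-03": "relevant", "post-11": "possibly relevant", "post-17": "not relevant"}

one_error = {"post-03": "relevant", "post-11": "relevant", "post-17": "not relevant"}
two_errors = {"post-03": "not relevant", "post-11": "relevant", "post-17": "not relevant"}

accepted = accept_hit(one_error, gold)    # one mismatch: the HIT is kept
rejected = accept_hit(two_errors, gold)   # two mismatches: all 20 assessments are re-posted
```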


Figure 5.8: An example blog post rendering both (a) as HTML and (b) when cleaned. The HTML rendering contains the same content as the cleaned version; however, due to a missing CSS template, one would need to scroll down to get to it.


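| Batch | # Stories | # HITs | # Judgements | Pay Per HIT ($) | Hourly-rate ($) |
|---|---|---|---|---|---|
| Batch 1 | 1→5 | 27 | 540 | 0.50 | 2.27 |
| Batch 2 | 6→15 | 66 | 1320 | 0.50 | 3.85 |
| Batch 3 | 16→28 | 78 | 1560 | 0.50 | 3.90 |
| Batch 4 | 29→50 | 137 | 2740 | 0.50 | 3.58 |
| Batch 5 | 51→68 | 102 | 2040 | 0.50 | 2.27 |
| Batch 6 | 68 | 68 | 80 | 0.25 | 6.07 |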

Table 5.12: Average amount paid per hour to workers and work composition for each batch of HITs.

5.6.4 Crowdsourcing Configuration

For crowdsourcing, we use the Amazon Mechanical Turk (MTurk) marketplace. Recall that in total we assess 7,975 blog posts for 68 newswire articles. We spread these over 433 MTurk HIT instances, each containing approximately 20 posts. We pay our workers $0.50 (US dollars) per HIT for the 20 assessments. The total cost is $238.70 (including Amazon’s 10% fees). Due to the larger and more thorough gold standard validation set employed, we have only a single worker assess each HIT. We did not restrict worker selection based on geography; however, only workers with a prior 95% acceptance rate were accepted for this task.

Continuing with an iterative methodology (Alonso et al., 2008), we submitted our HITs in six distinct batches, allowing for feedback to be accumulated and HIT improvements to be made. The first five batches comprised HITs containing 20 blog posts to be judged. The sixth batch contained all of the remaining blog posts for each news story. Table 5.12 reports the statistics and hourly rate paid to workers during each of the six batches.

5.6.5 Assessment Quality

For the blog post ranking task, we had a single worker judge each blog post. As such, we cannot use inter-worker agreement to estimate final quality (at least three workers per blog post would be required). Instead, to determine the quality of our resulting relevance assessments, we compare them against assessments produced by the author as a ground truth. In particular, we randomly sampled approximately 5% of the blog post set judged (360/7975) and manually assessed each post in terms of its relevancy to the associated news story. Note that this is not the same as the gold standard used to validate the workers during the production of the relevance assessments (see Section 5.6.3), but is a different set used to evaluate the quality of the final relevance assessments produced. Table 5.13 reports the accuracy of the crowdsourced blog post relevance judgements in comparison to the aforementioned ground truth, both overall (All), and in terms of each relevance grade.
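The comparison against the ground truth can be sketched as follows. This is a minimal illustration, assuming that per-grade accuracy is computed over the posts that the ground truth places in each grade; the function name `accuracy_by_grade` and the toy labels are hypothetical and do not reproduce any reported figures.

```python
from collections import defaultdict
from typing import Dict

def accuracy_by_grade(crowd: Dict[str, str], truth: Dict[str, str]) -> Dict[str, float]:
    """Fraction of sampled posts whose crowdsourced label matches the ground
    truth, overall ("All") and grouped by the ground-truth grade (assumption)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for post_id, true_grade in truth.items():
        hit = int(crowd.get(post_id) == true_grade)
        for key in ("All", true_grade):
            total[key] += 1
            correct[key] += hit
    return {key: correct[key] / total[key] for key in total}

# Toy sample of four posts (hypothetical labels).
toy_truth = {"a": "relevant", "b": "relevant",
             "c": "not relevant", "d": "possibly relevant"}
toy_crowd = {"a": "relevant", "b": "possibly relevant",
             "c": "not relevant", "d": "possibly relevant"}
toy_scores = accuracy_by_grade(toy_crowd, toy_truth)
```

Grouping by the crowdsourced label instead of the ground-truth grade would give a precision-style figure per grade, so the choice of grouping should be stated alongside any reported numbers.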
