the probability of agreement occurring by chance. Free-Marginal Multirater Kappa assumes that the chance of selecting a class is equal to one over the number of classes, while Fleiss Multirater Kappa takes into account the relative size of the classes. We report agreement as a measure of assessment quality for the three datasets for which we use redundant judging (see Table 5.7).
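The difference between the two chance-agreement models can be made concrete with a minimal sketch. It assumes that every item is judged by the same number of raters and that judgements are supplied as per-item class counts; the function names and the example data are illustrative and are not taken from the assessments reported here.

```python
from typing import Sequence


def observed_agreement(counts: Sequence[Sequence[int]]) -> float:
    """Mean proportion of agreeing rater pairs per item.

    counts[i][j] is the number of raters who put item i in class j.
    Assumes every item has the same number of raters (at least 2).
    """
    n = sum(counts[0])  # raters per item
    per_item = [
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ]
    return sum(per_item) / len(per_item)


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss kappa: chance agreement uses the observed class sizes."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_classes = len(counts[0])
    p_o = observed_agreement(counts)
    class_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_classes)
    ]
    p_e = sum(p * p for p in class_share)
    # Undefined (division by zero) when all judgements fall in one class.
    return (p_o - p_e) / (1 - p_e)


def free_marginal_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Free-marginal kappa: chance agreement is 1 / number of classes."""
    p_o = observed_agreement(counts)
    p_e = 1.0 / len(counts[0])
    return (p_o - p_e) / (1 - p_e)


# Illustrative data: 4 items, 3 raters, 3 classes.
example = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]
print(round(fleiss_kappa(example), 4), round(free_marginal_kappa(example), 4))
```

Both measures share the same observed agreement and differ only in the chance term, so they diverge most when the class distribution is heavily skewed.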
5.3.3 Structured Crowdsourced Evaluation
In the following four sections, we describe how we crowdsource assessments for the four datasets listed in Table 5.7. The primary aim of each of these sections is to show that the assessments produced for each dataset are of good quality. The secondary aim is to describe how we crowdsourced each of the assessments as a contribution in the field of crowdsourcing for IR.
A crowdsourced assessment task can be divided into five stages, namely: define the task; develop an assessment interface; prepare assessment validation; crowdsource the assessments; and quality assure those assessments. As such, we structure each of the following four sections in line with these stages. In particular, each section comprises five subsections. The first describes the task to be crowdsourced. The second details the assessment interface developed. The third describes how we validate the assessments. The fourth defines how we configured each crowdsourcing task. The fifth reports the quality of the assessments produced. Note that there is a difference between assessment validation and quality assurance. Assessment validation is employed to increase the quality of the resulting assessments, while quality assurance measures the quality of the assessments after validation has been applied (normally in terms of agreement between workers, see Section 5.3.2).
5.4 Crowdsourced Task: Identifying Top Events
In this section, we describe how we crowdsource importance assessments for the newswire articles contained within the BlogTrackTopNews2009 dataset. We structure this section as described in Section 5.3.3. In particular, in Section 5.4.1, we describe the crowdsourcing task. Section 5.4.2 details the interface designed to facilitate the judging of newswire articles by workers. In Section 5.4.3, we describe how we validate the assessments produced by the workers. Section 5.4.4 lists the configuration of the crowdsourcing task. In Section 5.4.5, we discuss how accurate the resultant assessments are in terms of inter-worker agreement (see Section 5.3.2).
5.4.1 Task Description
The crowdsourcing task that we examine in this section is to judge whether a set of newswire articles describe important events on a specific day, referred to as a topic day¹. In particular, for a given newswire article, a day of interest and a news category (U.S./World/Sport/Business/Technology news), each crowdsourced worker assesses whether that newswire article is newsworthy or not for that day and category. Only stories that belong to the named category can be newsworthy. We use these news categories to reduce the assessment load on the workers. For example, it is easier to assess whether a newswire article is related to an important political story than to assess whether it is important in general. Each news story is assessed by workers as belonging to one of three classes:
• Newsworthy and correct category: The story is newsworthy for the topic day and news category.
• Not newsworthy but correct category: The story is not particularly newsworthy for the topic day, but does match the news category.
• Incorrect category: The story belongs to a different news category.
Notably, for the purposes of the final assessments produced, the ‘Not newsworthy but correct category’ and ‘Incorrect category’ classes are both considered non-newsworthy. As shown earlier in Table 5.3, there are 8,000 newswire articles to be assessed, spread over 50 topic days. These articles were selected using a pooling strategy (see Section 2.4.2) over multiple news article rankings from different systems. In this case, statMAP sampling (Aslam & Pavlu, 2007) is used to a depth of 32 stories per day and category, resulting in 160 stories per day to be judged, with 8,000 stories in total (50 topic days * 5 news categories * 32 newswire articles) (Ounis et al., 2010).
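As a minimal sketch of the label scheme and pool size described above, the three assessment classes can be collapsed into a binary newsworthiness label and the pool arithmetic checked directly. The enum, constant and function names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Label(Enum):
    """The three classes a worker can assign (names are illustrative)."""
    NEWSWORTHY = "newsworthy and correct category"
    NOT_NEWSWORTHY = "not newsworthy but correct category"
    INCORRECT_CATEGORY = "incorrect category"


def is_newsworthy(label: Label) -> bool:
    """Both non-newsworthy classes collapse to False in the final assessments."""
    return label is Label.NEWSWORTHY


TOPIC_DAYS = 50
NEWS_CATEGORIES = ["U.S.", "World", "Sport", "Business", "Technology"]
POOL_DEPTH = 32  # stories sampled per topic day and category


def stories_per_day() -> int:
    return len(NEWS_CATEGORIES) * POOL_DEPTH


def pool_size() -> int:
    return TOPIC_DAYS * stories_per_day()


print(stories_per_day(), pool_size())
```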
5.4.2 Interface Design for News Article Assessment
To enable workers to assess each newswire article, we develop a new assessment interface. Figure 5.3 illustrates an example of the interface that we designed for the task. From Figure 5.3, we see that the current news category and day of interest are shown at the top left. Meanwhile, down the left hand side, we provide a listing of the newswire articles that are to be assessed. This summary is colour coded. A blue question mark in square brackets indicates that the newswire article is yet to be assessed, a green plus symbol in square brackets indicates that the newswire article has been judged as newsworthy for the stated category, an orange minus symbol in square brackets denotes that the newswire article was judged as not important but part of the correct category, while a red x in square brackets indicates that the newswire article does not belong to this category. On the right hand side, the current newswire article to be assessed is shown, including the headline and content of the article. One worker assesses all 32 of the newswire articles listed.
Figure 5.3: A screenshot of the external judging interface shown to workers within the instructions.
¹ Author's Note: The day-centric nature of the assessments produced was due to a lack of granularity in the timestamps
Notably, we have workers assess 32 newswire articles from a single day to provide them with some context regarding the day that they are assessing. This is critical, since newswire article importance is to some extent relative in nature. For instance, a news story that would be considered ‘front-page’ news on most days can be buried by unexpected and/or high impact events, such as a celebrity death. To this end, we asked workers to make two passes over the newswire articles. During the first and longer pass, the worker assesses each newswire article based on the headline and content of that article and the articles previously assessed. During the second pass, the worker can change their assessment for any article, now that they have knowledge of more newswire articles from that day. We use workers from Amazon’s Mechanical Turk (MTurk) marketplace, hence each task instance is an MTurk HIT containing 32 newswire articles to be assessed.
5.4.3 Validation of Worker Assessments
One of the criticisms of crowdsourcing is that it is susceptible to poor quality or malicious work. Best practices in crowdsourcing indicate that one or more validation strategies should be used to counteract this (Snow et al., 2008). As such, we have three individual workers perform each HIT. From these three judgements, we take the majority vote for each story to create the final newsworthiness assessment for that news story. Furthermore, to assure the quality of the resulting judgements, we manually validate the submitted HITs using the colour-coded summaries of the stories and of each worker's judgements described in the previous section. In particular, the summaries of the judgements for the three workers were displayed side-by-side, as illustrated in Figure 5.4 (a). We examined each set of 32 assessments produced by the workers based upon three criteria:
1. Are all 32 stories judged?
2. Are the judgements similar across the 3 redundant judgements?
3. Are the stories marked important sensible?
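The first two criteria, together with the majority vote, lend themselves to a simple automated pre-check, sketched below. This is an illustration rather than the procedure used here, where validation was performed manually; the label strings and function names are hypothetical, and the vote is assumed to be taken over the collapsed newsworthy/non-newsworthy labels. The third criterion requires human judgement and is not modelled.

```python
from itertools import combinations
from typing import List, Optional

HIT_SIZE = 32  # stories per HIT

# Hypothetical label strings; None marks a story that was left unjudged.
Judgement = Optional[str]


def is_complete(judgements: List[Judgement]) -> bool:
    """Criterion 1: every one of the 32 stories has a judgement."""
    return len(judgements) == HIT_SIZE and all(j is not None for j in judgements)


def pairwise_agreement(workers: List[List[Judgement]]) -> float:
    """Criterion 2: fraction of story-level matches over all worker pairs."""
    pairs = list(combinations(workers, 2))
    matches = sum(a == b for w1, w2 in pairs for a, b in zip(w1, w2))
    total = sum(len(w1) for w1, _ in pairs)
    return matches / total


def majority_newsworthy(labels: List[str]) -> bool:
    """Final label: newsworthy only if more than half of the workers say so."""
    votes = sum(label == "newsworthy" for label in labels)
    return votes > len(labels) / 2


print(majority_newsworthy(["newsworthy", "newsworthy", "incorrect_category"]))
```

With three workers and a binary final label, the majority vote can never tie, which is one practical benefit of using an odd number of redundant judgements.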
Figure 5.4 (a) illustrates how the assessments produced by three workers for a single set of 32 newswire articles for the ‘world news’ category can be viewed as a colour-coded summary. We observe that in clear-cut cases, like ‘NSE details S&P CNX Nifty Inde...’ shown in Figure 5.4 (a), which clearly belongs to a different category (Business/Finance), there is strong agreement between workers. In less clear cases, such as ‘Indian shares up as settlement...’, where the story could belong to either the World or Business/Finance categories, we observe some disagreement. However, in this case, it is clear that the workers were completing the task in ‘good faith’ and hence the work was approved and paid for. On the other hand, Figure 5.5 (b) shows an example that we believe has been attempted by a bot, as only the first judgement was made before the HIT was submitted.
Figure 5.4: Displayed summary of three workers' judgements for a single task instance.
Figure 5.5: Task instance possibly completed by a bot.
Furthermore, we note that our assessment interface provides passive protection against spammers and automatic bots that might attempt our task. In particular, our task requires the worker to select each news story in turn, assess that story, and then submit those assessments only when all 32 have been completed. It is unlikely that an automatic bot would be able to complete this successfully, since it would likely select submit before all of the stories had been assessed.
Category                Important   Not Important   Wrong Category   Agreement (Fleiss Kappa)
U.S. News               21%         39%             40%              63.53%
World News              24%         38%             38%              51.69%
Sport                   21%         29%             49%              77.67%
Business/Finance News   24%         43%             33%              66.88%
Science/Technology      4%          10%             86%              82.97%
Average                 19%         31%             49%              68.55%
Table 5.8: Judgement distribution and agreement on a per category basis.
5.4.4 Crowdsourcing Configuration
The crowdsourcing task that we have described totals 24,000 story judgements (8,000 newswire articles * 3 workers per HIT) spread over 750 HIT instances. We paid our workers $0.50 (US dollars) per HIT (32 judgements), totalling $412.50 (including Amazon’s 10% fees). For this task, we only used workers from the U.S. Our reasoning is that international workers would likely not be able to accurately judge the newswire articles, since they come from the TRC2 Reuters newswire corpus (see Table 5.3). Any incomplete HITs were rejected. As such, we collect exactly 3 judgements per story.
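The totals above follow from simple arithmetic, reproduced in the sketch below as a check. The function and parameter names are hypothetical, and the fee is assumed to be a flat 10% applied to the worker payments, as stated in the text.

```python
from decimal import Decimal


def task_budget(
    articles: int = 8000,
    articles_per_hit: int = 32,
    workers_per_hit: int = 3,
    pay_per_hit: Decimal = Decimal("0.50"),
    fee_rate: Decimal = Decimal("0.10"),
) -> dict:
    """Recompute judgement, HIT and cost totals for the task."""
    unique_hits = articles // articles_per_hit      # 250 distinct sets of 32
    hit_instances = unique_hits * workers_per_hit   # each set judged 3 times
    judgements = articles * workers_per_hit
    worker_pay = pay_per_hit * hit_instances
    total_cost = worker_pay * (1 + fee_rate)        # flat fee on worker pay
    return {
        "hit_instances": hit_instances,
        "judgements": judgements,
        "worker_pay": worker_pay,
        "total_cost": total_cost,
    }


budget = task_budget()
print(budget["hit_instances"], budget["judgements"], budget["total_cost"])
```

Decimal arithmetic is used so that the currency totals are exact rather than subject to floating-point rounding.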
Following an iterative design methodology (Alonso et al., 2008), we submitted our HITs in 6 distinct batches, allowing for feedback to be accumulated and HIT improvements to be made. We made minor modifications to the judging interface and updated the instructions based upon feedback from the workers. In the next section, we empirically evaluate the story ranking judgements produced. Screenshots of the instructions given to each worker are provided in Appendix B.
5.4.5 Assessment Accuracy
Recall that we have three individual workers assess each newswire article. To determine the quality of our assessments, we measure the agreement between our workers. Table 5.8 reports the percentage of judgements for each relevance label and the between-worker agreement in terms of Fleiss Kappa (Fleiss, 1971), on average, as well as for each of the five news categories. From Table 5.8, we observe that agreement on average is reasonable (69%). Meanwhile, we also observe that agreement varies markedly over news categories. For instance, the Science/Technology and Sport categories exhibit the highest agreement, with 83% and 78% respectively, while the U.S. and World categories show less agreement. Based upon the class distribution for these categories, the disparity in agreement indicates that distinguishing science from non-science stories is easier than making the equivalent distinction for the U.S. or World categories. This is intuitive, as the U.S. and World categories suffer from a much higher story overlap. For example, for the story “Presi-