We focus on the frequent occurrence of recall errors – where one annotator marks an event that the other misses – and compare it to the ace05 annotations. Although there are simple cases
where an annotator’s failure to mark an event is not easily explained (perhaps the investigation in sentence 1 of Figure 3.4 is one such instance), such failures more often seem attributable to atypical event-referential language or to typologically/ontologically borderline events.
Recall of atypical references Cases where annotators agree on an event at the document
level but disagree with regard to sentences often correspond to sub-salient, if not oblique, references:
(20) Speaking at ANZ Stadium, where the Bulldogs winger will be hoping to score the seven points he needs to break the record in Saturday’s [Sports match game]A,Ba against Manly,
Greenberg said El Masri deserved a share of the spotlight no matter what else was going on. . . “I’m trying not to think about it too much, but it would be nice if it all fell into place this [Sports match week]Ba, at our home stadium,” El Masri said.
(21) When asked if he knew what had happened to Albert and Mario Frisoli, who were found [Lifecycle, Conflict murdered]A,Bb in their Rozelle home last week, Mr Di Cianni’s son Robert
said he did not know. . . Yesterday police said their inquiries were focused on three or four lines of inquiry and the family appealed for help in finding the brothers’ [Lifecycle killer]Bb.
In the former example, annotators agree on the reference to a sports match in the article. However, annotator B alone understands this week, at our home stadium as referring to the game mentioned in an earlier sentence. The typical characterisation of an event as having a particular time and place of occurrence is applied here to the extent that the what of the event is elided in context. The latter example similarly includes an antecedent marked by both annotators, but the use of an agentive nominalisation, killer, arguably assumes the death event rather than asserting it. Such indirectness may cause A to miss references.
3.2. Type-driven annotation experiment 61
Recall of borderline events A large portion of the disagreement results from the difficulty
of determining whether a fact is to be considered a specific event or of a particular type. Annotators are particularly prone to disagree when the referent in question diverges from the prototypical “event” in not having a clear time or place of occurrence. For example, compare annotation of the following contiguous sentences:
(22) Shares in the retailer [Finance lost]A,B $1.86, or 7 per cent, to close at $26.14.
(23) The company has been able to [Finance deliver]A double-digit profit growth every year since
1999 and its share price is based on this.
(24) Mr Luscombe assured investors he could continue to [Finance deliver]A double-digit growth
over the medium term in all businesses.
The latter two sentences were not annotated by B as referring to Finance events, perhaps because they are not clearly specific (or not events). Similarly, while B sees sanctions as anchoring both Conflict and Governance events in the following, A probably disregards the sanctions as an event:
(25) In return for a tougher array of United Nations [Conflict, Governance sanctions]B against Iran
targeting the country’s vast oil and gas reserves, . . .
Referents such as these make event extraction difficult: there is a long and heavy tail – in comparison to named entity recognition, for instance20 – of references and referents that are not prototypical events, or not prototypical to their type. When performing type-driven annotation, it is therefore easy for annotators to drift in leniency towards atypical events and atypical type instances, resulting in disagreement.
Comparison to ace05 The caveat of sample size notwithstanding, we may compare these
results to annotator agreement on the English newswire portions of ace05. We have previously
reviewed inter-annotator agreement in ace05 in relation to subtype homogeneity and
predictability (Section 2.5.1), and to identify features of low-salience references (Section 2.4). Here we more directly consider agreement and adjudication of coarse event type identification at the document and sentence levels. The ace05 corpus is annotated by two independent,
“first-pass” annotators, fp1 and fp2, and these are adjudicated by adj. We hence report the chance-corrected agreement (Cohen’s κ) between the independent annotations, and F measure between all pairs in Table 3.4.21
20 Even in that task, a long but lighter tail of named entities is accounted for in the conll task through a misc label (Tjong Kim Sang, 2002). Due to a lack of syntactic and orthographical cues, a similar type for events would be difficult to scope.
21
             |             Documents             |             Sentences
             |    κ       F1       F1       F1   |    κ       F1       F1       F1
Type         | fp1/fp2  fp1/fp2  fp1/adj  fp2/adj | fp1/fp2  fp1/fp2  fp1/adj  fp2/adj
Business     |   0.6      0.7      0.8      0.8   |   0.4      0.5      0.6      0.7
Conflict     |   0.7      0.9      1.0      1.0   |   0.7      0.7      0.8      0.9
Contact      |   0.7      0.8      0.9      0.9   |   0.6      0.6      0.7      0.7
Justice      |   0.8      0.9      0.9      1.0   |   0.8      0.8      0.9      0.9
Life         |   0.8      0.9      0.9      0.9   |   0.6      0.7      0.8      0.8
Movement     |   0.6      0.8      0.9      0.9   |   0.5      0.6      0.7      0.8
Personnel    |   0.8      0.8      1.0      0.9   |   0.5      0.5      0.7      0.8
Transaction  |   0.5      0.6      0.9      0.9   |   0.4      0.5      0.7      0.6
Table 3.4: Inter-annotator event type agreement in newswire portions of the ace05 corpus.
We report Cohen’s κ between the two first-pass annotators (fp1 and fp2) and F measure (F1) between all pairs including the adjudicator (adj).
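As a minimal sketch of how the two measures in Table 3.4 can be computed for a single event type, assume each annotator's output is reduced to one binary label per unit (document or sentence) indicating whether that type was marked. The function names and the toy labels below are hypothetical and are not taken from ace05; the formulas are the standard definitions of Cohen's κ and of F1 between two annotation sets.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: proportion of units given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's marginal rate of marking.
    p_a, p_b = sum(a) / n, sum(b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1:
        return 1.0  # both annotators are constant and identical
    return (p_o - p_e) / (1 - p_e)


def f1(a, b):
    """Symmetric F1 between two annotators: 2|A ∩ B| / (|A| + |B|)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * both / total if total else 1.0


# Hypothetical labels for ten sentences and one event type.
fp1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
fp2 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(fp1, fp2)
score = f1(fp1, fp2)
print(round(kappa, 2), round(score, 2))
```

In this toy case the annotators agree on eight of ten sentences, yet κ is noticeably lower than raw agreement because most sentences are unmarked by both, which is the situation chance correction is meant to account for.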
[Figure 3.5: horizontal bar chart. x-axis: sentences with annotated event reference (0 to 350); y-axis: annotated event type (Conflict, Movement, Justice, Contact, Life, Personnel, Transaction, Business); legend: J∩A∩B, A∩B\J, J∩A\B, J∩B\A, J\A\B, A\J\B, B\J\A.]
Figure 3.5: Sentence-level event-type annotation contingency for the most frequent types annotated in the newswire portion of the ace05 evaluation corpus. A, B and J are the sets
of sentences labelled by each annotator. A and B correspond to first-pass annotators fp1 and fp2 respectively. J corresponds to adjudication (adj).
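The seven regions named in the legend of Figure 3.5 can be derived with plain set operations. The sketch below is illustrative only: the function name, the region keys (written in Python set-operator syntax) and the sentence identifiers are hypothetical, and A, B and J are assumed to be sets of sentence identifiers labelled with one event type by fp1, fp2 and adj respectively.

```python
def contingency(A, B, J):
    """Partition the sentences marked by any annotator into seven disjoint regions."""
    return {
        "J&A&B": J & A & B,    # marked by both first-pass annotators and kept
        "A&B-J": (A & B) - J,  # marked by both first-pass annotators, rejected
        "J&A-B": (J & A) - B,  # adopted from fp1 alone
        "J&B-A": (J & B) - A,  # adopted from fp2 alone
        "J-A-B": J - A - B,    # added by the adjudicator only
        "A-J-B": A - J - B,    # marked by fp1 alone, rejected
        "B-J-A": B - J - A,    # marked by fp2 alone, rejected
    }


# Hypothetical sentence identifiers for a single event type.
A = {1, 2, 3, 4, 5}
B = {1, 2, 6, 7}
J = {1, 3, 6, 8}

counts = {region: len(ids) for region, ids in contingency(A, B, J).items()}
# Sentences on which the first-pass annotators disagreed and which adj rejected.
rejected_in_disagreement = counts["A-J-B"] + counts["B-J-A"]
print(counts, rejected_in_disagreement)
```

Comparing `rejected_in_disagreement` with the `J&A-B` and `J&B-A` counts gives a per-type view of how often adjudication rejects, rather than adopts, an annotation made by only one first-pass annotator.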
Document-level agreement between first-pass annotators is quite high, with the lowest agreement for Transaction, which is infrequent in the corpus.22 In contrast to our annotation, the high agreement in ace05 reflects the annotators being more thoroughly trained to a detailed schema, and only marking events that fall under sought subtypes and with marked participants. The news text being annotated is also much more topically homogeneous, so annotators are able to focus on the types of event relevant to the topic, as evidenced by the outlying high frequency of Conflict events.
At the document level, the adjudicator almost entirely agrees with the first-pass annotators. However, agreement drops substantially when considered at the sentence level, both for independent annotation and for adjudication. This suggests the difficulty of identifying all references of an event type (or their correct classification), and indeed Figure 3.5 shows that for the Conflict type, the adjudicator adopts many annotations from fp2 not marked by fp1, and a few in the opposite direction. This recall problem, which we have identified in our own annotation, was further discussed with respect to ace05 in Section 2.4.
More surprisingly, in all types, more sentences are rejected in adjudication than are adopted from either single first-pass annotator in a case of disagreement: both first-pass annotators substantially overgenerate annotations with respect to the schema. This is suggestive of the annotation drift noted in our task.
We proposed the use of a pre-specified typology as a way to constrain the referent space to make the event detection task feasible, reliable and useful. Despite this, in our work and in ace05, it seems that the complexity and variability among events and their references
result in both undergeneration – we suggest due to atypical, low-salience references – and overgeneration – due to borderline, perhaps unsought referents – of manual annotations.