anchoring in time and for around a fifth of these cases in no temporal anchoring at all while we do anchor them.
We can conclude that even a dense TLINK annotation gives suboptimal information on when events happened and that, due to the restriction that TLINKs are only annotated within the same and adjacent sentences, a lot of relevant temporal information is lost.
4.4 Automatic Event Time Extraction
While the new annotation scheme is simple for humans to perform, it poses several challenges for automatic approaches:
1. The number of possible labels is infinite, as date values are part of the labels.
2. Due to the diverse types of events and the varying temporal information available for them, the structure of the labels varies. We have to distinguish between Multi-Day Events and Single Day Events. Further, a text might provide a precise date for the event or only a rough estimate of when the event happened.
3. Temporal information from the whole document must be taken into account to derive the event time. For 58.72% of the events, the most informative time expression is neither in the same nor in a neighboring sentence (Figure 4.3), but several sentences away from the event mention.
4. For 7.3% of the events, the event time label is a combination of several temporal clues. In the example of Figure 4.1, the rendezvous event between Fidel Castro and the Pope will happen on the 21st of January. However, this date (1998-01-21) is not mentioned in any temporal expression in the document. Instead, the annotator inferred this date from the document creation time (January 20th) and the phrase “this is the eve of the Pope’s visit”.
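The label structure from challenge 2 and the inference from challenge 4 can be illustrated with a minimal Python sketch. The class names (`TimePoint`, `SingleDayEvent`, `MultiDayEvent`), the `after`/`before` fields used to represent a rough estimate, and the helper `resolve_eve` are hypothetical and chosen for illustration only; they are not the representation used by the proposed system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class TimePoint:
    """A precise date, or a rough estimate bounded by after/before (assumed encoding)."""
    exact: Optional[date] = None
    after: Optional[date] = None
    before: Optional[date] = None


@dataclass(frozen=True)
class SingleDayEvent:
    day: TimePoint


@dataclass(frozen=True)
class MultiDayEvent:
    begin: TimePoint
    end: TimePoint


def resolve_eve(document_creation_time: date) -> date:
    """'This is the eve of X' places X one day after the document creation time."""
    return document_creation_time + timedelta(days=1)


# Example from Figure 4.1: the document was created on January 20th, 1998.
dct = date(1998, 1, 20)
visit = SingleDayEvent(day=TimePoint(exact=resolve_eve(dct)))
print(visit.day.exact.isoformat())  # 1998-01-21
```

Because every distinct date yields a distinct label, the label space of such a structure is unbounded, which is why the task cannot be treated as classification over a fixed label set.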
To address these challenges, we propose a combination of a decision tree and neural network classifiers. The system works on the complete document and can extract long-range relations between events and temporal expressions. Further, it can extract begin and end points for events that span multiple days.
Existing systems often use complex, handcrafted features. The CAEVO system (Chambers et al., 2014), for example, uses typed dependencies to identify dominant events. Further, it uses WordNet as an external resource to identify synonyms. Systems based on handcrafted features and external resources are often difficult to transfer to new languages or new domains. Our proposed system works end-to-end and does not incorporate any handcrafted features or external resources. Hence, it is simple to train the system on new datasets.
We evaluate the proposed system on our annotated data, where it achieves an accuracy of 42.0% compared to an inter-annotator agreement (IAA) of 56.7%. Compared to the state-of-the-art CAEVO system (Chambers et al., 2014), we observe a substantial improvement of 33.1 percentage points in accuracy for events that happened on a single day. We observe that the system performs better for Single Day Events than for Multi-Day Events. For Single Day Events, the accuracy is 74.6% (IAA of 80.5%), and for Multi-Day Events, the accuracy is 24.5% (IAA of 52.0%). We show that the proposed model generalizes well to new tasks and textual domains. We applied it without re-training to the SemEval-2015 Task 4 on automatic timeline generation. There, it achieves an improvement of 4.01 points in F1-score compared to the state of the art.
The proposed system has been published in the Transactions of the Association for Computational Linguistics (TACL) (Reimers et al., 2018).
Existing Event Time Extraction Systems
The architecture of approaches usually depends on how temporal information for events is provided in a corpus. The ACE 2005 corpus (Walker et al., 2005) defined time as a general argument for an event and annotated the span that expresses when the event happened (cf. section 4.1.2). Consequently, systems trained and evaluated on the ACE 2005 corpus extract the event time like other event arguments. A common approach is to formulate this as a pair classification task: Given an event mention and a noun phrase, a classifier is trained to decide whether the phrase is the time argument for the event. The limitation of such approaches is that the event arguments must be in the same sentence as the event trigger. For only 23.8% of the events in the ACE 2005 corpus, the event time is mentioned in the same sentence. For the remaining 76.2% of the events, no temporal information is provided.
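The same-sentence limitation can be made concrete with a small Python sketch of candidate generation for such a pair classifier. The toy document, the dictionary layout, and the function names `candidate_pairs` and `events_without_candidates` are hypothetical; the sketch only shows which (event, phrase) pairs a sentence-level classifier would ever see, not how any particular ACE 2005 system is implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical toy document: per sentence, the event triggers and the
# candidate noun phrases that could serve as a time argument.
sentences: List[Dict[str, List[str]]] = [
    {"events": ["visit"], "phrases": ["the Pope", "Wednesday"]},
    {"events": ["met", "said"], "phrases": []},
]


def candidate_pairs(doc: List[Dict[str, List[str]]]) -> List[Tuple[str, str]]:
    """Build (event, phrase) pairs, restricted to phrases in the event's own sentence."""
    return [
        (event, phrase)
        for sentence in doc
        for event in sentence["events"]
        for phrase in sentence["phrases"]
    ]


def events_without_candidates(doc: List[Dict[str, List[str]]]) -> List[str]:
    """Events whose sentence offers no candidate phrase; the classifier never sees them."""
    return [
        event
        for sentence in doc
        if not sentence["phrases"]
        for event in sentence["events"]
    ]


print(candidate_pairs(sentences))
print(events_without_candidates(sentences))
```

In this toy document, the events in the second sentence receive no time argument at all, regardless of how accurate the classifier is on the pairs it is given.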
Instead of extracting the event time as an argument within the sentence, a large number of systems use the previously introduced TLINKs (cf. section 4.1.1). A TLINK is defined as the relation between two events, two temporal expressions, or between an event and a temporal expression. The number of possible links grows quadratically with the number of event mentions. As a consequence, corpora can only provide annotated links for a small subset of possible relations. For example, for the TempEval-3 dataset, 98.2% of the links between events are left unspecified by the annotators (Ning et al., 2017). Typical restrictions for the annotation are to annotate only salient links and/or to annotate only links within the same sentence or neighboring sentences. As a consequence, automatic approaches that are trained and evaluated on such corpora produce labels only for a small subset of links. In a post-processing step, those TLINKs can be used to infer the calendar date for events.
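The quadratic growth, and the effect of restricting annotation to the same and neighboring sentences, can be checked with a short Python sketch. The document size (20 sentences with 3 mentions each) is a made-up example, and the function names are hypothetical; the numbers are not taken from any of the cited corpora.

```python
from typing import Sequence


def all_pairs(n: int) -> int:
    """Number of unordered pairs among n mentions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def windowed_pairs(mentions_per_sentence: Sequence[int]) -> int:
    """Pairs whose two mentions lie in the same sentence or in adjacent sentences."""
    total = 0
    for i, k in enumerate(mentions_per_sentence):
        total += k * (k - 1) // 2  # pairs within sentence i
        if i + 1 < len(mentions_per_sentence):
            total += k * mentions_per_sentence[i + 1]  # pairs with sentence i + 1
    return total


# Hypothetical document: 20 sentences with 3 mentions each (60 mentions).
counts = [3] * 20
print(all_pairs(sum(counts)))   # 1770
print(windowed_pairs(counts))   # 231
```

In this example, the sentence-window restriction covers 231 of 1770 possible pairs, roughly 13%, and the share shrinks further as documents get longer, since the total grows quadratically while the windowed count grows only linearly in the number of sentences.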
Extracting the relations is often formulated as a pairwise classification task. Each pair of events and/or temporal expressions is examined and classified according to the available relational classes. A recent system for dense TLINK extraction was proposed by Chambers et al. (2014). The CAscading EVent Ordering system (CAEVO) has a sieve-based architecture and was trained and evaluated on the TimeBank-Dense Corpus (Cassidy et al., 2014). Chambers et al. describe seven rule-based classifiers