
While several temporally annotated corpora of news documents are publicly available (e.g., the TimeBank corpus), as is a corpus of narrative-style documents (the WikiWars corpus), neither a corpus of colloquial documents nor a corpus of autonomic documents has been available so far. In this section, we present the two corpora that we annotated in the context of this work: Time4SMS and Time4SCI.

19 The excerpts are from the abstract of the paper “Supplementation with all three macular carotenoids: response, stability, and safety” by Conelli et al. (2011), http://www.ncbi.nlm.nih.gov/pubmed/21979997/ [last accessed April 8, 2014].

20 In the BMBF-funded interdisciplinary project heureCLÉA, we are currently working with narratologists from Hamburg University on the automatic identification of temporal phenomena in literary text. The basis of this work is to identify and normalize temporal expressions, i.e., to perform the task of temporal tagging on literary text documents.

3 Cross-domain Temporal Tagging

Colloquial Corpus Creation

Although some SMS corpora are publicly available, an SMS corpus has to meet four main requirements to serve as the basis of a published, temporally annotated corpus: (i) it has to be freely available to allow others to reproduce the corpus, (ii) the language of the messages has to be English, since English annotated corpora are most valuable for the research community when a corpus for a new domain is developed, (iii) the corpus has to be large, since the single messages are short and thus cannot contain many temporal expressions, and (iv) the document creation time (i.e., the time when the message was sent) has to be available for the messages. The availability of the sending time is crucial for normalizing underspecified and relative temporal expressions, which we expect to occur frequently in SMS texts.
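To illustrate why the sending time matters, the following minimal sketch resolves a few relative expressions against a document creation time. The lookup table and the function name `normalize_relative` are assumptions made for this illustration only; they are not taken from any temporal tagger discussed in this work.

```python
from datetime import date, timedelta

# Hypothetical lookup table: day offsets for a few relative expressions.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_relative(expression, sending_date):
    """Resolve a relative expression against the sending date.

    Returns an ISO 8601 date string, or None if the expression is not
    covered by the lookup table.
    """
    offset = RELATIVE_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    return (sending_date + timedelta(days=offset)).isoformat()


# The same word yields different values for different sending dates.
print(normalize_relative("tomorrow", date(2011, 6, 30)))
print(normalize_relative("tomorrow", date(2011, 12, 31)))
```

Without a sending date, an expression such as "tomorrow" can be detected but not assigned a calendar value, which is why requirement (iv) is needed.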

Due to these requirements, we used the NUS SMS corpus (Chen and Kan, 2013) as the basis of our colloquial corpus. However, the 2004 version of the corpus does not satisfy all our requirements, since its documents do not contain information about the sending time. Without the documents of the 2004 version, the corpus contains 28,268 messages (June 2011 version).21 For privacy reasons, the developers of the corpus anonymized all short messages automatically, substituting placeholders for sensitive data. Unfortunately, multi-digit numbers and some specific time information were treated as sensitive data. To overcome this problem, we replaced the placeholders for digits and times with standard values in the original format.22 Then, we randomly selected 1,000 documents as our SMS corpus, called Time4SMS, in which we manually annotated all temporal expressions.
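The two preparation steps described above, filling placeholders and drawing a random sample, can be sketched as follows. This is only an illustration: the placeholder tokens, the standard values, and the fixed random seed are assumptions made here, and the actual replacement function was provided by the NUS SMS corpus developers.

```python
import random

# Hypothetical placeholder tokens and standard values; the real corpus
# may use different tokens and formats.
PLACEHOLDER_DEFAULTS = {"<TIME>": "12:00", "<DECIMAL>": "10"}


def fill_placeholders(message):
    """Replace each known placeholder token with its standard value."""
    for placeholder, value in PLACEHOLDER_DEFAULTS.items():
        message = message.replace(placeholder, value)
    return message


def sample_messages(messages, k=1000, seed=0):
    """Draw k distinct messages; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(messages, k)


messages = [f"meet at <TIME>, message {i}" for i in range(5000)]
selected = [fill_placeholders(m) for m in sample_messages(messages)]
print(len(selected))
```

Fixing the seed is a design choice of this sketch, so that the same 1,000 documents can be drawn again from the same input list.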

Scientific Corpus Creation

As the second new domain for our temporal analysis, we chose scientific documents. However, temporal expressions are not frequent in all kinds of scientific literature. Texts from the biomedical domain, e.g., publications about clinical trials, are a good representative of scientific documents with many temporal expressions. For selecting documents, we used PubMed,23 which contains citations with abstracts and metadata, such as publication dates, of more than 20 million publications from the biological and biomedical domain. Using the PubMed search interface, we queried for “clinical trials” and downloaded the abstracts and metadata of the 50 most recent publications. These documents form our scientific corpus called Time4SCI.

Annotation Procedure

As for the annotation of WikiWarsDE (cf. Section 3.4.5), we followed the developers of WikiWars (Mazur and Dale, 2010), i.e., we formatted the corpora in SGML, the format of the ACE TERN corpora. This makes it possible to evaluate temporal taggers on our newly annotated corpora using the publicly available TERN evaluation scripts.24 The documents contain “DOC”, “DOCID”, “DOCTYPE”, “DATETIME”, and “TEXT” tags, and the document creation time (DATETIME) was set to the publication date that is part of the metadata of each PubMed article. The “TEXT” tag surrounds the text that is to be annotated. For the annotation of temporal expressions, we used the TIDES TIMEX2 format (Ferro et al., 2005) with its attributes described in Section 3.2.1. Similar to Mazur and Dale (2010), we performed a three-phase annotation process: (i) automatic pre-annotation, (ii) manual annotation, i.e., correcting wrong expressions and adding missing ones, and (iii) manual merging and validation of the annotations. For automatic pre-annotation, we used HeidelTime. Its output was then imported into the annotation tool Callisto25 for the second annotation phase, the manual annotation and correction of wrong annotations.
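The document layout described above can be sketched with a small helper that wraps a text in the five tags named in this section. The function name, the document identifier, the document type value, and the exact line layout are assumptions made for illustration; the ACE TERN corpora may use additional tags or attributes that are not reproduced here.

```python
def build_tern_document(doc_id, doc_type, creation_time, annotated_text):
    """Wrap a text (with in-line TIMEX2 tags) in TERN-style SGML tags."""
    lines = [
        "<DOC>",
        f"<DOCID>{doc_id}</DOCID>",
        f"<DOCTYPE>{doc_type}</DOCTYPE>",
        f"<DATETIME>{creation_time}</DATETIME>",
        "<TEXT>",
        annotated_text,
        "</TEXT>",
        "</DOC>",
    ]
    return "\n".join(lines)


# Hypothetical example message; VAL carries the normalized value.
document = build_tern_document(
    doc_id="sms-0001",
    doc_type="SMS",
    creation_time="2011-06-30",
    annotated_text='See you <TIMEX2 VAL="2011-07-01">tomorrow</TIMEX2>.',
)
print(document)
```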

21 http://wing.comp.nus.edu.sg/SMSCorpus/ [last accessed April 8, 2014].

22 The NUS SMS corpus developers kindly provided their function to replace sensitive data, so that we were able to reproduce standard values for the placeholders in the original format.

23 http://www.ncbi.nlm.nih.gov/pubmed/ [last accessed April 8, 2014].

24 We provide all necessary evaluation scripts at http://code.google.com/p/heideltime/ [last accessed April 8, 2014]. In Section 3.6, further details about the evaluation will be provided.

25

3.3 Temporal Tagging Documents of Different Domains

Note that the use of our own temporal tagger for automatic pre-annotation should be taken into account when comparing evaluation results of our temporal tagger with those of other taggers. However, as mentioned by Mazur and Dale (2010), this procedure is motivated by two facts. Firstly, “annotator blindness is reduced to a minimum” (Mazur and Dale, 2010), i.e., the risk that annotators miss temporal expressions is reduced. Secondly, the annotation effort is reduced significantly, since one does not have to create a TIMEX2 tag for the expressions already identified by the tagger (Mazur and Dale, 2010). In addition, this procedure is justifiable for our purpose because the main goals of creating the corpora are to study the differences between documents from different domains and to analyze domain-dependent challenges.

During the second annotation phase, the documents were examined for temporal expressions missed by the temporal tagger, and the annotations created by the temporal tagger were manually corrected. This task was performed by two annotators, although Annotator 2 only annotated the extents of temporal expressions. The more difficult task of normalization was performed by Annotator 1 only, since it requires a lot of experience in temporal annotation. In the third annotation phase, the results of both annotators were merged, and all normalizations of the expressions were checked and corrected by Annotator 1.
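The merging step can be pictured as combining the extents marked by both annotators and flagging those marked by only one of them for validation. The sketch below is an assumption about how such a comparison could be done, not a description of the tooling actually used; extents are represented as hypothetical (start, end) character offsets.

```python
def merge_extents(annotator_1, annotator_2):
    """Merge two lists of (start, end) extents.

    Returns the sorted union of all extents and the sorted list of
    extents that only one of the two annotators marked.
    """
    spans_1, spans_2 = set(annotator_1), set(annotator_2)
    merged = sorted(spans_1 | spans_2)
    disputed = sorted(spans_1 ^ spans_2)
    return merged, disputed


merged, disputed = merge_extents([(8, 16), (20, 24)], [(8, 16), (30, 35)])
print(merged)
print(disputed)
```

Only exact matches count as agreement in this sketch; extents that overlap but differ in their boundaries end up in the disputed list and would be resolved manually.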

Finally, the annotated files, which contain in-line annotations, were transformed into the ACE APF XML format, a stand-off markup format used by the ACE evaluations. Thus, the Time4SMS and Time4SCI corpora can be made available in the same two formats as the WikiWars corpus and the evaluation tools of the ACE TERN evaluations can be used with the new corpora.
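The difference between in-line and stand-off markup can be illustrated with a small converter that strips TIMEX2 tags and records character offsets into the remaining plain text. This is a simplified sketch under stated assumptions: it does not produce ACE APF XML, it assumes that TIMEX2 tags are not nested, and it uses Python-style offsets with an exclusive end. The function name and the output structure are hypothetical.

```python
import re

# Matches a non-nested TIMEX2 element: attributes, then the extent.
TIMEX2_PATTERN = re.compile(r"<TIMEX2([^>]*)>(.*?)</TIMEX2>", re.DOTALL)


def inline_to_standoff(annotated):
    """Return the plain text and a list of stand-off annotations."""
    plain_parts = []
    annotations = []
    cursor = 0      # position in the annotated string
    plain_len = 0   # length of the plain text built so far
    for match in TIMEX2_PATTERN.finditer(annotated):
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        extent = match.group(2)
        annotations.append({
            "start": plain_len,
            "end": plain_len + len(extent),  # exclusive end offset
            "text": extent,
            "attributes": match.group(1).strip(),
        })
        plain_parts.append(extent)
        plain_len += len(extent)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), annotations


text, spans = inline_to_standoff(
    'See you <TIMEX2 VAL="2011-07-01">tomorrow</TIMEX2> at noon'
)
print(text)
print(spans)
```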

During the manual annotation process, we were faced with domain-specific difficulties. Due to many unresolvable temporal expressions in the scientific corpus, we suggest a new way to normalize these expressions. However, since the normalization of unresolvable expressions is one of the main challenges of temporal tagging autonomic documents, the details of this issue and how it can be addressed are described in Section 3.3.7 and Section 3.3.8, respectively. Furthermore, in contrast to news- and narrative-style documents, colloquial and scientific texts are very challenging to annotate, since deep domain knowledge is needed to fully understand such documents. For this reason, we regard our newly developed annotated corpora as preliminary versions of a gold standard.
