We carried out an extrinsic evaluation of EuroSense by mapping its refined sense annotations for English to WordNet and using them as a training set for the same supervised WSD system used in Section 4.1.3.2: It Makes Sense (Zhong and Ng, 2010, IMS). Following Taghipour and Ng (2015b), we started with SemCor (Section 3.1.1.1) as the initial training dataset, and then subsampled EuroSense to add up to 500 training examples per word sense. Crucially, instead of sampling randomly as in Taghipour and Ng (2015b), we sorted sense annotations by decreasing coherence score and kept the top occurrences of each word sense. We then
26
Both Babelfy and the baseline always attempt an answer for every possible disambiguation target, hence they achieve maximum coverage in each configuration. Note that in Table 4.13 we consider coverage (i.e. number of content words covered) in place of recall, since the number of ‘correct’ answers is not clearly defined in many cases, e.g. with overlapping mentions (as discussed
4.3 SenseDefs: A Multilingual Disambiguation of Textual Definitions 85
trained IMS on this augmented training set and tested it on the two most recent standard benchmarks for all-words WSD, SemEval-2013 and SemEval-2015, from the standardized framework of Raganato et al. (2017a). As baselines we considered IMS trained on SemCor only and on OMSTI (Section 3.1.2.2). Table 4.14 also includes two knowledge-based systems, Babelfy and UKB (Agirre et al., 2014), the MFS baseline, and the current state of the art (SOTA) on both datasets (Raganato et al., 2017a). As the table shows, IMS trained on the EuroSense-augmented training set consistently outperforms all baseline models, and is competitive even with IMS trained on semi-automatic sense annotations (Taghipour and Ng, 2015b). Although the F-score increase is not statistically significant on these specific benchmarks, it suggests that our fully automatic method can perform on par with semi-automatic approaches in extracting high-quality sense annotations.
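The coherence-based subsampling described above can be sketched as follows. This is only an illustration under stated assumptions: the `Annotation` record, its field names, and the function `top_k_per_sense` are hypothetical and do not reflect the actual EuroSense release format or the original implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical record for one sense annotation (not the EuroSense format)."""
    sentence_id: str
    sense: str        # sense identifier after mapping to WordNet (assumed)
    coherence: float  # coherence score attached to the annotation


def top_k_per_sense(annotations, k=500):
    """Keep, for each sense, the k annotations with the highest coherence score."""
    by_sense = defaultdict(list)
    for ann in annotations:
        by_sense[ann.sense].append(ann)

    selected = []
    for group in by_sense.values():
        # Sort by decreasing coherence instead of sampling at random.
        group.sort(key=lambda a: a.coherence, reverse=True)
        selected.extend(group[:k])
    return selected
```

The selected annotations would then be appended to the initial training set before training the supervised system; the cap `k` plays the role of the 500-example limit mentioned in the text.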
Final Remarks. Our experimental evaluation shows, once again, that making the best use of the features of the target text is crucial to achieving high-quality disambiguation in a fully automatic fashion. Specifically, with EuroSense we explored the effectiveness of multilinguality in the disambiguation process: instead of relying on external translations or pre-computed alignments, we let semantic coherence across languages emerge naturally at disambiguation time, thanks to the flexibility of a language-independent sense inventory and its multilingual lexicalizations. In contrast to the disambiguation pipeline of Section 4.1, building EuroSense required two external tools, Babelfy and Nasari, and a structured pipeline to cope with their respective shortcomings. The proven benefits of this solution are: (1) the release of two different versions of EuroSense, complementary with respect to the downstream applications they are most suitable for; (2) the fact that each sense annotation is associated with multiple confidence scores (Section 4.2.2), which makes it possible to further tune EuroSense for a specific task, application, or use.
4.3 SenseDefs: A Multilingual Disambiguation of Textual Definitions
In this third and final disambiguation scenario our target is definitional text. We focus on a large definitional corpus that shares some features with the Wikipedia corpus of Section 4.1 (i.e. the encyclopedic nature), as well as some features with the parallel corpus of Section 4.2 (i.e. equivalent sentences in multiple languages), with, however, an important difference: the short and concise nature of definitions.
Why Definitions? In addition to lexicography, where their use is of paramount importance, textual definitions (or glosses) drawn from dictionaries or encyclopedias have been widely used in various NLP tasks and applications. Definitional knowledge is effective inasmuch as it conveys the crucial semantic information and the distinguishing features of a given subject (definiendum): this means that, on the one hand, a definition often provides a fair amount of discriminative power that can be leveraged to automatically represent and disambiguate the definiendum; on the other, definitions are usually concise and encode “dense”, virtually noise-free information that can be best exploited with knowledge acquisition techniques. To date, some of
the areas where the use of definitional knowledge has proved to be key in achieving state-of-the-art results are Word Sense Disambiguation (Lesk, 1986; Banerjee and Pedersen, 2003; Navigli and Velardi, 2005; Agirre and Soroa, 2009; Faralli and Navigli, 2012; Fernandez-Ordonez et al., 2012; Chen et al., 2014; Basile et al., 2014; Camacho Collados et al., 2015b), Taxonomy and Ontology Learning (Velardi et al., 2013; Flati et al., 2016; Espinosa Anke et al., 2016c), Information Extraction (Richardson et al., 1998; Delli Bovi et al., 2015b), Plagiarism Detection (Franco-Salvador et al., 2016), and Question Answering (Hill et al., 2016). In fact, textual definitions are today widely available in knowledge resources of various kinds, ranging from lexicons and dictionaries, such as WordNet (Section 2.1.1) or Wiktionary, to encyclopedic Wikipedia-derived knowledge bases (Section 2.1.2). Interestingly enough, sources of definitional knowledge also include Wikipedia: despite its purely encyclopedic nature, and although the format of a Wikipedia article does not include an explicit gloss or definition, the first sentence of each article is generally regarded as the definition of its subject.
Related Work. Disambiguating definitions has attracted a considerable amount of interest over the years. WordNet has been by far the most popular and most exploited target resource in this respect, and WordNet glosses are still used successfully in recent work (Khan et al., 2013; Chen et al., 2015). A first attempt to disambiguate WordNet glosses automatically was proposed as part of the eXtended WordNet project (Novischi, 2002). However, this attempt’s estimated coverage did not reach 6% of the total number of sense-annotated instances. Moldovan and Novischi (2004) proposed an alternative disambiguation approach, specifically targeted at the WordNet sense inventory and based on a supervised model trained on SemCor (Section 3.1.1.1); another disambiguation task focused on WordNet glosses was presented as part of the Senseval-3 workshop (Litkowski, 2004). However, the best reported system obtained precision and recall figures below 70%, which is arguably not enough to provide high-quality sense-annotated data for current state-of-the-art NLP systems. In addition to annotation reliability, another issue that arises when producing a corpus of textual definitions is coverage. In fact, reliable corpora of sense-annotated definitions produced to date, such as the Princeton WordNet Gloss Corpus (Section 3.1.2.1), have usually been obtained by employing human annotators and, as we discussed extensively in previous sections, human supervision becomes increasingly expensive and time-consuming as the size of the sense inventory grows. Furthermore, new encyclopedic knowledge about the world is constantly being harvested, and WordNet’s definitions fail to capture many up-to-date concepts and entities.
With a view to tackling this problem, a great deal of research has recently focused on the automatic extraction of definitions from unstructured text (Navigli and Velardi, 2010; Benedictis et al., 2013; Espinosa Anke and Saggion, 2014; Espinosa Anke et al., 2015; Dalvi et al., 2015); as a consequence, disambiguating definitional text has to be framed necessarily as a large-scale task.
Motivation. Irrespective of the nature of the knowledge source, an accurate semantic analysis of textual definitions is made difficult by the short and concise
nature of definitional text, a crucial issue for automatic disambiguation systems that rely heavily on local context. Furthermore, the majority of approaches making use of definitions are restricted to corpora where each concept or entity is associated with a single definition; instead, definitions coming from different resources are often complementary and might give different perspectives on the definiendum. Moreover, equivalent definitions of the same concept or entity may vary substantially according to the language, and be more precise or self-explanatory in some languages than others. In fact, the way a certain concept or entity is defined in a given language is sometimes strictly connected to the social, cultural and historical background associated with that language, a phenomenon that also affects the lexical ambiguity of the definition itself. This difference in the degree of ambiguity when moving across languages is especially valuable in the context of disambiguation, as we demonstrated in the previous disambiguation scenario (Section 4.2).
In light of this, in the present section we adapt the disambiguation pipeline designed for EuroSense to a definitional setting. The underlying disambiguation idea is almost the same: bringing together definitions drawn from different resources and different languages, and exploiting their cross-lingual and cross-resource complementarities at disambiguation time. As in the case of EuroSense, a large-scale, high-quality disambiguation requires us to use off-the-shelf techniques which, for flexibility and scalability purposes, are based on a single multilingual disambiguation model. In fact, while language- and resource-specific techniques can certainly be used for disambiguation, the number of models required would be in the order of hundreds, without even considering the need for large amounts of sense-annotated data for each language and resource. Therefore, we first gather a target corpus of textual definitions in multiple languages from BabelNet (Section 4.3.1); then we apply the two-stage disambiguation pipeline described in Sections 4.2.1 and 4.2.2 to each group of definitions referring to the same definiendum (Section 4.3.2). As a result we obtain SenseDefs (Camacho Collados et al., 2016a), a multilingual corpus of textual definitions featuring over 38 million definitions in 263 languages, with almost 250 million sense annotations for both concepts and named entities drawn from the BabelNet sense inventory. Following the same methodology as in Sections 4.1 and 4.2, we examine some global statistics about the corpus in Section 4.3.3, and then we carry out an experimental evaluation in Section 4.3.4, including both intrinsic and extrinsic experiments.
4.3.1 Gathering Definitional Knowledge across Resources and Languages
We construct a target corpus of definitional knowledge by collecting all textual definitions associated with every concept or named entity inside BabelNet, for all the languages available. Being a merger of various different knowledge resources (cf. Section 2.1.3), BabelNet provides a very heterogeneous set of definitions. Specifically, the definitional knowledge inside BabelNet comes from the following sources:
• WordNet: being hand-crafted by expert annotators, definitional knowledge from WordNet is among the most accurate available and includes non-nominal parts of speech rarely covered by other resources (e.g. adjectives and adverbs). However, given its considerably smaller scale, WordNet provides less than 1% of the overall number of definitions in BabelNet, and covers only the English language;
• Wikipedia: Wikipages do not provide explicit glosses or definitions; however, according to the style guidelines of Wikipedia,29 a Wikipage should begin with a short declarative sentence defining what (or who) the subject is and why it is notable. Following previous literature, we therefore consider the first sentence of a Wikipage to be a valid definition of the corresponding concept or entity. Furthermore, text snippets drawn from the associated disambiguation pages can also be regarded as definitions.30 Wikipedia provides by far the largest proportion of definitional knowledge (∼77%), including many definitions in languages other than English;
• Wikidata: Wikidata is the second largest individual contribution to SenseDefs (more than 8 million items and ∼22% of the total), even though, given its strictly computational nature, it often provides minimal definition phrases containing only the superclass of the definiendum;
• Wiktionary, OmegaWiki: beyond WordNet, Wikipedia and Wikidata, the remaining definitions (∼1% of the total) are provided by two collaborative multilingual dictionaries: Wiktionary and OmegaWiki. Wiktionary31 is a Wikimedia project designed to represent lexicographic knowledge that would not be well suited to an encyclopedia (e.g. verbal and adverbial senses). It is available for over 500 languages, typically with very high coverage, including domain-specific terms and descriptions that are not found in WordNet. Similar to Wiktionary, OmegaWiki is a large multilingual dictionary based on a relational database, designed with the aim of merging the various language-specific Wiktionaries into a unified lexical repository.
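As an illustration of the first-sentence convention used for Wikipedia in the list above, the following sketch extracts the opening sentence of a plain-text article with a naive regular-expression splitter. The function name `first_sentence` and the splitting heuristic are assumptions made for this example; they are not the extraction procedure used to build BabelNet or SenseDefs.

```python
import re

# Naive boundary: sentence-final punctuation, whitespace, then an uppercase letter.
# This is an assumption for illustration; abbreviations such as "St. Peter"
# would be split incorrectly.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def first_sentence(article_text: str) -> str:
    """Return the first sentence of a plain-text article, used here as its definition."""
    stripped = article_text.strip()
    return _SENTENCE_BOUNDARY.split(stripped, maxsplit=1)[0]
```

A production system would more plausibly rely on a proper sentence segmenter for each language, since this heuristic only makes sense for scripts with capitalization and Western punctuation.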
Overall, the corpus of definitional knowledge obtained from BabelNet comprises more than 38 million definitions associated with more than 8 million synsets, covering both concepts and named entities (see Section 4.3.3). The key feature of this corpus, which we will leverage at disambiguation time, is that BabelNet’s inter-resource and inter-language mappings enable us to combine multiple definitions (drawn from different resources and in different languages) of the same concept or named entity. Thus, if we re-arrange the corpus by grouping all the definitions by definiendum, we can view it as a collection of around 8 million multilingual definitional texts.
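The re-arrangement by definiendum can be sketched as a simple grouping step. The tuple layout `(synset_id, language, source, text)`, the identifiers in the usage example, and the function name `group_by_definiendum` are hypothetical and chosen only for illustration; they do not correspond to the BabelNet API or to the SenseDefs release format.

```python
from collections import defaultdict


def group_by_definiendum(definitions):
    """Group definitions by the synset they define.

    `definitions` is an iterable of (synset_id, language, source, text) tuples
    (a hypothetical layout). The result maps each synset identifier to the list
    of its definitions across languages and resources.
    """
    grouped = defaultdict(list)
    for synset_id, language, source, text in definitions:
        grouped[synset_id].append((language, source, text))
    return dict(grouped)


if __name__ == "__main__":
    toy_corpus = [
        ("syn-001", "EN", "WN", "a large natural stream of water"),
        ("syn-002", "EN", "WIKI", "Rome is the capital city of Italy."),
        ("syn-001", "IT", "WIKI", "Un fiume è un corso d'acqua perenne."),
    ]
    for synset_id, defs in group_by_definiendum(toy_corpus).items():
        print(synset_id, len(defs))
```

Each value in the resulting mapping corresponds to one multilingual definitional text, which is the unit the disambiguation pipeline would then process as a whole.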
29 https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style
30 The release format of SenseDefs (cf. Section 6.3) specifies two distinct attribute values for definitions extracted from the first sentence of Wikipedia articles (WIKI) and definitions extracted from disambiguation pages (WIKIDIS).
31 https://www.wiktionary.org