In this chapter, we focused on persons as an entity type with highly ambiguous names and proposed entity linking models based on topic models. We evaluated topics as semantic labels for the disambiguation of person names in German and, inspired by the promising results, generalized the use of topic models to derive thematic context distances over describing contexts. Because it relies on the distance between topic distributions rather than descriptive word vectors, this method can inherently handle synonymy and polysemy, which is not the case for methods based on direct word comparison. While overly sparse text representations such as WTC or WCC may often perform well, such approaches cannot grasp the similarity between terms like splendid and terrific, and they often take longer to learn.
We evaluated our method on reference data from Wikipedia in English, German and French and showed that similarity measures computed over latent topics are especially suitable for linking mentions of persons to their underlying entities. Being more general than word-based distances, the proposed thematic distances make it possible to exploit the thematic overlap between referring contexts and the biographic content of articles describing persons.
We compared our approach to the most closely related method, that of Bunescu and Pasca [2006], and showed in detail that our method can significantly (p < 0.05) increase performance and improve the assignment of name mentions to the underlying articles in Wikipedia. We also treated mentions of entities that are not covered by an article in Wikipedia and showed that our method handles this problem very accurately. This is a crucial aspect: when we retrieve information for a known entity, we do not want to assign false facts to it. Compared to the Wikipedia category-based approaches of Bunescu and Pasca [2006] and Cucerzan [2007], our approach is furthermore more flexible and applicable to different languages without expensive manual category analysis. At the time of publication, this method was the first to approach entity linking in multiple languages.
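The contrast between direct word comparison and topic-based distances can be illustrated with a minimal sketch. It assumes that topic distributions have already been inferred by some topic model; the three-topic vectors below are invented for illustration, and the Hellinger distance is only one possible choice of distance over distributions, not necessarily the one used in this work.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))


# Two contexts that use synonymous but different words.
context_a = ["a", "splendid", "concert"]
context_b = ["terrific", "show"]

# Hypothetical topic distributions (assumed output of a topic model).
theta_a = [0.8, 0.1, 0.1]   # context_a: mostly topic 0
theta_b = [0.7, 0.2, 0.1]   # context_b: also mostly topic 0
theta_c = [0.1, 0.1, 0.8]   # an unrelated context: mostly topic 2

word_sim = cosine_similarity(context_a, context_b)
dist_ab = hellinger_distance(theta_a, theta_b)
dist_ac = hellinger_distance(theta_a, theta_c)
```

Here the two contexts share no token, so the word-based similarity is zero, while their assumed topic distributions are much closer to each other than to the unrelated context.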
3.8 Summary
In this chapter, we focused on person name disambiguation in a purely contextual approach with simple matching techniques for candidate retrieval. As described in the overview on named entity linking, a straightforward match of mentions against Wikipedia titles or redirects can yield more than satisfactory results. Especially in edited newspaper articles, persons are often mentioned with canonical names, which may render candidate retrieval less crucial for persons. However, when generalizing to other entities, we need more elaborate candidate retrieval techniques, for example to handle abbreviations. This is the subject of the next chapter, where we will extend from a context-based approach to a more collective, relational method.
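A straightforward title and redirect match of the kind mentioned above can be sketched as follows. The titles, redirects, and helper names (`normalize`, `build_index`, `retrieve_candidates`) are hypothetical and serve only as an illustration; they do not reproduce the implementation used in this work.

```python
import re


def normalize(name):
    """Lowercase, drop a trailing '(qualifier)', and collapse whitespace."""
    name = re.sub(r"\s*\([^)]*\)$", "", name)
    return " ".join(name.lower().split())


def build_index(titles, redirects):
    """Map normalized surface forms to the set of article titles they may denote."""
    index = {}
    for title in titles:
        index.setdefault(normalize(title), set()).add(title)
    for alias, target in redirects.items():
        index.setdefault(normalize(alias), set()).add(target)
    return index


def retrieve_candidates(mention, index):
    """Return all candidate titles whose normalized form equals the mention."""
    return sorted(index.get(normalize(mention), set()))


# Illustrative toy data, not taken from an actual Wikipedia dump.
titles = ["John Smith (explorer)", "John Smith (actor)", "Angela Merkel"]
redirects = {"Merkel": "Angela Merkel"}
index = build_index(titles, redirects)

ambiguous = retrieve_candidates("john smith", index)
via_redirect = retrieve_candidates("Merkel", index)
abbreviated = retrieve_candidates("J. Smith", index)
```

The abbreviated mention yields no candidates under exact matching, which illustrates why more elaborate retrieval is needed once mentions are not given in canonical form.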
Chapter 4
Local and Global Search for Entity Linking
Outline
In the previous chapter, we focused on the consolidation part of entity linking, especially for mentions of persons, and treated each mention instance individually. In this chapter, we generalize entity linking to arbitrary entity types and introduce a global view on the document level by collectively linking the mentions in a document. In doing so, we focus more on the candidate retrieval part of entity linking.
We first introduce general entity linking, which considers both named entities and abstract concepts (Section 4.1), and give an overview of related work with a focus on recent collective approaches that investigate linking to Wikipedia (Section 4.2). We then describe our approach, a data-driven method that exploits the structured and unstructured information encoded in Wikipedia by means of a carefully constructed search index (Section 4.3). The description of the proposed multi-stage algorithm starts with a brief summary (Section 4.4) that outlines the subsequent sections. Having described how mentions are enriched with various attributes used for linking (Section 4.5), we detail the stages of our entity linking algorithm. We propose a novel candidate retrieval method that collectively uses all mentions in a document and exploits the co-occurrence of links in Wikipedia. We assess relatedness through the collective fitness of candidate entities in the document in a novel coherence measure. Based on this coherence, we compute the best-fitting candidate for each mention and combine this prioritization with local, contextual information in a second stage (Section 4.6). Finally, candidates are consolidated by a supervised ranking SVM (Section 4.7). The method is evaluated in an unsupervised (Section 4.8.2) as well as in a supervised variant (Section 4.8.4) on five different benchmark corpora.
This chapter covers the ideas and findings published in Pilz and Paaß [2012] and provides additional experimental evaluation to demonstrate the performance of the proposed method.
4.1 General Entity Linking
In this chapter, we aim at linking mentions of both concrete named entities and abstract entities or concepts. In doing so, we generalize from named entity linking or person name disambiguation to general entity linking. Note that even though conceptual entities are usually referenced by proper nouns, this task overlaps closely with word sense disambiguation. The latter aims at resolving ambiguity for all common words in text, e.g. adjectives, verbs and nouns, but does not necessarily include proper nouns and named entities.
Mallery [1988] termed word sense disambiguation an AI-complete problem that requires not only deep linguistic knowledge but often also world knowledge. For illustration, we give the following example, a modified version of the one given in Navigli [2009].
Example 8 (Word sense disambiguation)
Take the following two mentions of bass that denote two distinct concepts:
I can hear bass sounds. Bass (sound)
Paul liked the grilled bass. Bass (fish)
In the first sentence, the mention bass denotes low-frequency tones, i.e. the concept Bass (sound). The second mention refers to a type of fish, i.e. Bass (fish).
For a human reader, the hints provided in the short sentences above are sufficient to grasp the intended meaning of each mention. The respective sense of each mention is implied by the co-occurring context terms: hear and sounds hint at the concept Bass (sound), while grilled hints at the concept Bass (fish). For the automatic inference of the intended senses, however, the available contextual evidence is rather poor. A model would need to reason about the relation between hear, sounds and bass to infer the concept of sound, and likewise about grilled and the concept of fish. These relations are not explicitly given in the text but need to be inferred or learned from background or world knowledge, for example from statistics over co-occurring terms.
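How statistics over co-occurring terms can supply the missing evidence is sketched below for the bass example. The co-occurrence counts are invented for illustration, and the simple additive score is an assumption made for this sketch rather than a method proposed in this work.

```python
# Hypothetical counts of how often a context term co-occurs with each sense
# in some background corpus (invented numbers, for illustration only).
COOCCURRENCE = {
    "Bass (sound)": {"hear": 12, "sounds": 20, "guitar": 15},
    "Bass (fish)": {"grilled": 9, "fishing": 18, "lake": 7},
}


def tokenize(sentence):
    """Very simple tokenizer: lowercase, strip the final period, split on spaces."""
    return sentence.lower().rstrip(".").split()


def disambiguate(sentence, cooccurrence):
    """Pick the sense whose co-occurring terms best match the context.

    Returns (best_sense, scores); best_sense is None if no context term
    provides any evidence for any sense.
    """
    tokens = tokenize(sentence)
    scores = {
        sense: sum(counts.get(tok, 0) for tok in tokens)
        for sense, counts in cooccurrence.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


sense_1, scores_1 = disambiguate("I can hear bass sounds.", COOCCURRENCE)
sense_2, scores_2 = disambiguate("Paul liked the grilled bass.", COOCCURRENCE)
sense_3, _ = disambiguate("Paul saw a bass.", COOCCURRENCE)
```

With these assumed counts, the first sentence is assigned Bass (sound) and the second Bass (fish), while a context without any known co-occurring term remains undecided.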
Mihalcea and Csomai [2007] define word sense disambiguation as the automatic assignment of the most appropriate sense to a word within a given context. This sense is taken from an inventory that is often assumed to be complete. Originally, the major sense inventory in word sense disambiguation was WordNet, and ambiguity was resolved by assigning a word to a specific set of synonyms (i.e. a synset) in WordNet