In addition to the user-generated thesauri and knowledge bases, many resources are generated automatically or at least semi-automatically (with manual interaction). In this context, Wikipedia plays a crucial part. Most of the approaches exploit it because of the vast amount of information it comprises, its relatively high data quality, its up-to-date content and its free availability.
DBpedia³⁹ is among the most prominent and successful knowledge bases [11]. It is based on Wikipedia articles and exploits their structural and syntactic specifics to extract valuable information about entities. The main focus is on the Wikipedia infoboxes, which provide structured entity data (especially for persons, locations, albums, movies and the like, as well as chemical elements and biological or medical classifications), but URLs, geo-coordinates and categories are also detected and extracted. The authors use a pattern matching approach based on recursive regular expressions to find relevant templates in the raw article texts. If such templates are found, the page fragment is parsed and the relevant information is extracted and post-processed to ensure high data quality [12].
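The template detection step can be illustrated with a minimal sketch. It is not the DBpedia extraction code: Python's standard `re` module has no recursive patterns, so the sketch swaps in a regular expression that finds the template opener plus an explicit brace-depth scan. The function names (`find_templates`, `split_params`, `extract_infoboxes`) and the sample article text are hypothetical and serve only as illustration.

```python
import re

def find_templates(wikitext: str, name: str = "Infobox") -> list[str]:
    """Return the inner text of every balanced {{name ...}} template."""
    opener = re.compile(r"\{\{\s*" + re.escape(name), re.IGNORECASE)
    bodies, pos = [], 0
    while (match := opener.search(wikitext, pos)) is not None:
        depth, i = 0, match.start()
        # Scan forward, tracking the nesting depth of {{ ... }} pairs.
        while i < len(wikitext):
            if wikitext.startswith("{{", i):
                depth, i = depth + 1, i + 2
            elif wikitext.startswith("}}", i):
                depth, i = depth - 1, i + 2
                if depth == 0:
                    bodies.append(wikitext[match.start() + 2 : i - 2])
                    break
            else:
                i += 1
        pos = i  # continue after this template (or stop if it was unbalanced)
    return bodies

def split_params(body: str) -> dict[str, str]:
    """Split a template body on top-level '|' and return its key=value pairs."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i : i + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("}}", "]]"):
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    params = {}
    for part in parts[1:]:  # parts[0] holds the template name
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params

def extract_infoboxes(wikitext: str) -> list[dict[str, str]]:
    return [split_params(body) for body in find_templates(wikitext)]

# Hypothetical article fragment, for illustration only.
SAMPLE = """'''Lima''' is a city.
{{Infobox settlement
| name = Lima
| country = [[Peru|Republic of Peru]]
| coordinates = {{coord|12|2|S|77|1|W}}
}}
More text."""

if __name__ == "__main__":
    print(extract_infoboxes(SAMPLE))
```

Nested templates and piped links are kept intact as raw values here. A post-processing step such as the one the authors describe would then have to resolve them.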
Information extracted from Wikipedia is stored in an RDF framework and is interlinked with other data sources on the Web. This makes DBpedia a central information hub in the Linked Open Data community. DBpedia provides SPARQL endpoints to retrieve information from its knowledge base and a web interface for manual information lookup, and additionally allows downloading some content in SQL format. DBpedia is provided for 125 different languages. The current version of the English DBpedia describes about 4.5 million concepts, and about 580 million facts were extracted from the English Wikipedia.⁴⁰
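As an illustration of retrieval via SPARQL, the following sketch builds a request against a SPARQL endpoint over HTTP. The endpoint URL `https://dbpedia.org/sparql`, the `format` parameter and the example resource `dbr:Berlin` are assumptions made for this sketch and are not taken from the text above. The helper names are hypothetical, and `run_query` requires network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint URL

# Ask for the English abstract of one example resource.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?abstract WHERE {
  dbr:Berlin dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
LIMIT 1
"""

def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as an HTTP GET request URL."""
    params = {
        "query": query,
        # Assumption: the endpoint honors this parameter and returns JSON.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urllib.parse.urlencode(params)

def run_query(query: str) -> list[dict]:
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request_url(query), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]

if __name__ == "__main__":
    print(build_request_url(QUERY))
```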
YAGO⁴¹ is a similar project to DBpedia, but uses different procedures to generate knowledge. Instead of extracting structural information from article texts, YAGO matches Wikipedia articles (mostly entity data) to WordNet synsets. It thus combines a lexicographic resource with a knowledge base and contains both lexicographic relations (like subclass-of) and individual relations like wasBornIn, hasWonPrize or locatedIn. Relations between Wikipedia pages and WordNet synsets are derived by analyzing the categories in which a specific Wikipedia page is located: the category title is analyzed, its most relevant concept term is extracted and this term is finally matched with WordNet. The category names are also analyzed for relation detection, e.g., an article appearing in the category 1879 births would lead to the relation "Person" bornIn 1879; category names starting with Cities in or Attractions in indicate that the respective article describes a place located in the specified region. Furthermore, redirect pages are used to detect synonymy (sameAs) relations. The accuracy of YAGO relations is between 90.8 and 98.7 %, depending on the relation type [142]. YAGO currently consists of about 10 million entities organized within about 350,000 conceptual classes. It contains about 120 million facts (relations) in 10 different languages.
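The category-name heuristics mentioned above can be sketched with a few regular expressions. This is only an illustration under simplifying assumptions, not YAGO's actual rule set: the two rules cover just the examples named in the text, and the names `CATEGORY_RULES` and `relations_from_categories` are hypothetical.

```python
import re

# (pattern on the category name, relation it indicates); illustrative rules only.
CATEGORY_RULES = [
    (re.compile(r"^(\d{1,4}) births$"), "bornIn"),
    (re.compile(r"^(?:Cities|Attractions) in (.+)$"), "locatedIn"),
]

def relations_from_categories(article: str, categories: list[str]) -> list[tuple[str, str, str]]:
    """Derive (subject, relation, object) triples from Wikipedia category names."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_RULES:
            match = pattern.match(category)
            if match:
                facts.append((article, relation, match.group(1)))
                break  # at most one relation per category
    return facts

if __name__ == "__main__":
    print(relations_from_categories("Albert Einstein", ["1879 births", "German physicists"]))
    print(relations_from_categories("Cusco", ["Cities in Peru"]))
```

Categories that match no rule, such as "German physicists" here, yield no relation in this sketch. In YAGO they would instead feed the class-matching step against WordNet.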
Focusing more on the linguistic context, BabelNet⁴² is a sophisticated multilingual approach that links Wikipedia pages to WordNet senses [102]. Given a Wikipedia page title like balloon (aeronautics), the difficulty of this approach is to find the corresponding WordNet sense, which may be balloon1. To find the correct correspondence, the categories and external links of the Wikipedia page are analyzed, as well as the synonyms, hypernyms and hyponyms of each possible WordNet sense. The WordNet sense with the largest intersection is chosen as match partner. Subsequently, WordNet synsets are enriched with the article titles in the different languages in which a Wikipedia article is provided, e.g., Balloon (EN), Ballon (DE), Aerostato (ES). These are easily obtained, as Wikipedia pages contain lists of equivalent articles in other languages. BabelNet can also retrieve translations if a link to a specific language is not available, by using the SemCor⁴³ corpus and machine translation techniques. The algorithm for mapping Wikipedia pages to the respective WordNet senses achieved a precision of 82 % and a recall of 78 %. However, since BabelNet links Wikipedia pages to WordNet senses, it does not generate new semantic relations within a language and is mainly useful when ontologies of different languages are to be mapped.

³⁹ http://dbpedia.org
⁴⁰ http://wiki.dbpedia.org/about/facts-figures
⁴¹ http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
⁴² http://babelnet.org/
The most recent version, BabelNet 3.0, covers 271 different languages and stores 6.4 million concepts in 13.8 million synsets. It contains more than 354 million lexicographic-semantic relations.⁴⁴ In addition to WordNet, further background knowledge resources have been automatically integrated, such as Wiktionary and Wikidata.
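The idea of choosing the WordNet sense with the largest intersection can be sketched as a simple set-overlap comparison. This is a strongly simplified illustration, not BabelNet's mapping algorithm: the context terms, the sense identifiers (`balloon#1`, `balloon#2`) and the function name `best_sense` are hypothetical.

```python
from typing import Optional

def best_sense(page_context: set[str], sense_contexts: dict[str, set[str]]) -> Optional[str]:
    """Return the sense whose context shares the most terms with the page context.

    Returns None if no sense shares any term with the page.
    """
    best, best_overlap = None, 0
    for sense, context in sense_contexts.items():
        overlap = len(page_context & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Terms taken from the categories and links of the page "balloon (aeronautics)"
# (made-up values for illustration).
PAGE = {"aeronautics", "aircraft", "airship", "hot-air balloon"}

# Synonyms, hypernyms and hyponyms per candidate sense (made-up values).
SENSES = {
    "balloon#1": {"lighter-than-air craft", "aircraft", "airship", "hot-air balloon"},
    "balloon#2": {"plaything", "toy"},
}

if __name__ == "__main__":
    print(best_sense(PAGE, SENSES))
```

Returning `None` when nothing overlaps is a design choice of this sketch: it leaves a page unmapped rather than forcing an arbitrary match.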
Sumida and Torisawa exploit the structure of Wikipedia pages (such as headings, sub-headings, sub-sub-headings, etc.) together with pattern matching and linguistic features to extract hyponym relations from the Japanese Wikipedia [143]. They were able to retrieve 1.4 million relations with a precision of about 75 %.
While the previous approaches focus rather on structural or semi-structural techniques for information extraction from Wikipedia, some approaches try to extract semantic relations directly from the article texts. Hearst patterns are an important prerequisite to extract semantic relations from unstructured texts. They comprise expressions like "A, such as B and C" or "A, which is a specific B" and indicate synonym and hyponym relations (for instance, B and C are hyponyms of A in the first example). Several approaches to learn such Hearst patterns are based on Wikipedia texts [124, 125] or on newswire corpora [135]. In [72], Hearst patterns are used to derive an ontology from biomedical Wikipedia articles. They obtain a rather poor recall (20 %), but excellent precision (89 %).
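To make the first pattern concrete, the following minimal sketch matches "A, such as B and C" with a regular expression and emits (hyponym, hypernym) pairs. It rests on strong simplifying assumptions: the hypernym is approximated by the single word before "such as", and the enumeration is assumed to run to the end of the clause. Practical systems would rely on part-of-speech tagging or noun phrase chunking instead. The names `SUCH_AS`, `LIST_SEPARATOR` and `extract_hyponyms` are hypothetical.

```python
import re

# "<hypernym>, such as <enumeration>" up to the next '.' or ';'.
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")

# Separators inside the enumeration: ", ", ", and ", " and ", " or ".
LIST_SEPARATOR = re.compile(r",\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

def extract_hyponyms(text: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for item in LIST_SEPARATOR.split(match.group(2)):
            item = item.strip()
            if item:
                pairs.append((item, hypernym))
    return pairs

if __name__ == "__main__":
    sentence = "The museum exhibits instruments, such as guitars, violins and cellos."
    print(extract_hyponyms(sentence))
```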
In [18], the authors developed patterns for part-of relations based on Hearst patterns. Using a newspaper corpus comprising 100 million words, they achieved an accuracy of 55 %, meaning that 55 % of the extracted part-of relations were considered correct. Other approaches focus on machine learning of part-of relations [55] or on detecting part-of relations in corpora by means of word similarity measurement [89].
In [23], the authors developed an approach to extend WordNet by so-called telic relations by searching for specific patterns in the WordNet glosses. Telic relations describe the purpose between concepts like wood – fire or wood – furniture ("used for" relations) and are especially relevant for text analysis or question answering.
Furthermore, Wikipedia is widely used for related linguistic tasks that do not primarily focus on semantic relation extraction. For example, Flati and Navigli parse Wikipedia articles to find the collocations of specific words, e.g., "break *". From the results they obtain, they derive more general concepts like "break <Agreement>" or "break <Body Part>" [50]. Wikipedia is also used to determine the semantic relatedness between concepts or expressions [139, 51] and for word sense disambiguation [114].

⁴³ http://www.gabormelli.com/RKB/SemCor_Corpus
⁴⁴ http://babelnet.org/stats
WikiTaxonomy is a further approach that extracts semantic relations from Wikipedia. It builds upon an earlier approach to Wikipedia taxonomy extraction [115]. The authors parse the whole category system of Wikipedia and could extract some 100,000 is-a relations using lexicographic, syntax-based and inference-based techniques [116]. The extracted taxonomy comprises both entities and concepts, and the authors reached an F-measure of 88 % on the extracted relations. They also use approaches to designate entities and concepts within the extracted taxonomy [158].

Wu et al. use Hearst patterns based on a large web corpus. Parsing 1.7 billion web pages, the authors could extract more than 2.6 million concepts and about 20.7 million relations between concepts and sub-concepts as well as between concepts and instances. Though their focus is more on language-specific techniques, like named entity recognition and question answering, the gathered knowledge also seems suitable for schema and ontology mapping [152].
MultiWordNet⁴⁵ is a multilingual thesaurus similar to the manually developed EuroWordNet. Instead of interlinking existing word nets, the authors propose the development of new word nets based on the semantic structure of the English Princeton WordNet. They obtain such word nets by semi-automatically deriving synsets from the Princeton WordNet using bilingual dictionaries [111]. As a result of this project, an Italian word net consisting of 28,120 synsets and 36,514 words has been derived, whose semantic structure is mostly identical to WordNet. No further word nets have been developed so far, but the project has interlinked some already existing resources. Altogether, it allows access to 7 languages (English, Italian, Spanish, Portuguese, Hebrew, Romanian and Latin).