cies between chunks are usually identified by the grammatical functions (subject, object), whereas the dependencies between the components of a chunk are determined by annotating the head, complements and modifiers of the chunk. At the sentence level, the dominating node of the sentence tree is the predicate[4]. At the chunk level, the dominating node of the tree is the identified head of the chunk. The complements within a chunk are the necessary qualifiers of the head, and the modifiers are its optional qualifiers. Like chunk parsers, dependency parsers can be either rule-based, such as SCHUG, which we used, or statistical, such as LoPar (Schmid, 2000) and Minipar (Lin, 1998).
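The two levels of annotation described above can be pictured as a small data structure. The following is a minimal sketch, not SCHUG's actual output format: the class names, field names and the example annotation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A chunk whose dominating node is its head (hypothetical structure)."""
    head: str
    complements: list = field(default_factory=list)  # necessary qualifiers of the head
    modifiers: list = field(default_factory=list)    # optional qualifiers of the head


@dataclass
class Sentence:
    """A sentence tree whose dominating node is the predicate."""
    predicate: Chunk
    # Grammatical function (e.g. "subject", "object") -> dependent chunk.
    dependents: dict = field(default_factory=dict)

    def root(self):
        # The root of the sentence tree is the head of the predicate chunk.
        return self.predicate.head


# Illustrative annotation of "The recent destruction of the city shocked residents".
sentence = Sentence(
    predicate=Chunk(head="shocked"),
    dependents={
        "subject": Chunk(
            head="destruction",
            complements=["of the city"],
            modifiers=["recent"],
        ),
        "object": Chunk(head="residents"),
    },
)

print(sentence.root())
print(sorted(sentence.dependents))
```

Keeping the grammatical functions as dictionary keys mirrors the distinction drawn in the text: functions label the links between chunks, while head, complement and modifier label the structure inside each chunk.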

2.2 Semantic Resources

Although not all NLP tasks require semantic tagging, such tagging has proved helpful for information extraction in general and for the more specific domain of ontology learning. Applications in these NLP fields require semantic analysis, which is performed on the basis of available semantic resources, typically semantic lexicons, thesauri and semantic networks. The next sections describe some of these resources. Although this thesis uses only the semantic lexicon GermaNet (Kunze and Lemnitzer, 2002), we also present the closely related thesauri and semantic networks, because resources of these types could easily be integrated into the work presented here. We decided to use only GermaNet because we considered it the most appropriate semantic resource for the method presented in this thesis.


2.2.1 Semantic Lexicons

Semantic lexicons are semantic resources that group together words according to lexical semantic relations like synonymy, hyponymy, meronymy and antonymy (Buitelaar and Declerck, 2003). WordNet, EuroWordNet and GermaNet are semantic lexicons. Semantic lexicons are in fact lexicons enhanced with semantic information.

WordNet

WordNet[5] is a lexical reference system developed by the Cognitive Science Laboratory at Princeton. It is available online, and its design is inspired by psycholinguistic theories of human lexical memory. Although it is linguistically motivated, many groups have used it as a general ontology of concepts.

Within WordNet, English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept. It currently covers over 90000 synsets. Synsets are collections of synonyms that group lexical items according to meaning similarity, and different relations (e.g. antonymy, generalization) link the synsets to one another. For example, board and plank are grouped together in the synset [board, plank] because they are similar lexical items. At the same time, a board may also refer to a group of people, which is captured by the separate synset [board, committee]. The synsets in WordNet range from very specific to very general: specific synsets cover a small number of items, general ones a large number of items.
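The synset organization can be illustrated with a toy lookup table. This is a minimal sketch under stated assumptions: the synset identifiers are hypothetical, only the two "board" synsets come from the text, and the code does not use or reproduce the real WordNet database or its API.

```python
from collections import defaultdict

# Toy synsets keyed by hypothetical identifiers; the word sets follow the
# [board, plank] and [board, committee] example from the text.
SYNSETS = {
    "board-1": {"board", "plank"},
    "board-2": {"board", "committee"},
}


def build_index(synsets):
    """Map each word to the identifiers of all synsets that contain it."""
    index = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return index


def synonyms(word, synsets, index):
    """Return every word that shares at least one synset with `word`."""
    related = set()
    for synset_id in index.get(word, ()):
        related |= synsets[synset_id]
    related.discard(word)
    return related


INDEX = build_index(SYNSETS)

# "board" is ambiguous, so it belongs to two synsets, one per sense.
print(sorted(INDEX["board"]))
print(sorted(synonyms("board", SYNSETS, INDEX)))
```

Because a word maps to a set of synsets rather than to a single one, ambiguity such as the two senses of "board" is represented directly in the index.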


GermaNet

GermaNet[6] is a semantic lexicon that relates German nouns, verbs and adjectives by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet has much in common with the English WordNet and might be viewed as an online thesaurus or a light-weight ontology. GermaNet contains 57776 synsets, 81773 lexical units, 72057 literals, 12042 lexical relations and 68997 conceptual relations.

EuroWordNet

EuroWordNet[7] is a multilingual database for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). EuroWordNet is structured in the same way as WordNet in terms of synsets (sets of synonymous words) with basic semantic relations between them. Each language is represented with a unique internal system of lexicalisations. In addition, the languages are linked to an Inter-Lingual-Index, which is based on WordNet 1.5. Via this index, the languages are interconnected so that it is possible to go from the words in one language to similar words in any other language.
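The role of the Inter-Lingual-Index as a pivot between languages can be sketched as follows. This is only an illustration: the record identifier, the language codes and the word lists are assumptions chosen for the example and are not taken from the EuroWordNet database.

```python
# Hypothetical Inter-Lingual-Index record linking the synsets of several
# languages that express the concept "committee".
ILI_TO_WORDS = {
    "ili-0001": {
        "de": ["Ausschuss", "Komitee"],
        "es": ["comité", "comisión"],
        "nl": ["commissie", "comité"],
    },
}


def build_word_index(ili_to_words):
    """Map each (language, word) pair to the ILI records it is linked to."""
    index = {}
    for ili_id, words_by_language in ili_to_words.items():
        for language, words in words_by_language.items():
            for word in words:
                index.setdefault((language, word), set()).add(ili_id)
    return index


def similar_words(word, source, target, ili_to_words, index):
    """Go from a word in `source` to similar words in `target` via the index."""
    results = []
    for ili_id in sorted(index.get((source, word), ())):
        for candidate in ili_to_words[ili_id].get(target, []):
            if candidate not in results:
                results.append(candidate)
    return results


WORD_INDEX = build_word_index(ILI_TO_WORDS)
print(similar_words("Ausschuss", "de", "es", ILI_TO_WORDS, WORD_INDEX))
```

With a shared index, each language only needs links to the index itself, so no direct language-pair dictionaries are required.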

FrameNet

FrameNet (Fillmore, 1982) is an online lexical semantic resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatoric possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results. The major product of this work, the FrameNet[8] lexical database, currently contains more than 11600 English lexical units, more than 6800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150000 annotated sentences. The creation of the German FrameNet was also part of the SALSA (The SAarbrücken Lexical Semantics Annotation and Analysis) project[9].

[6] http://www.sfs.uni-tuebingen.de/GermaNet/
[7] http://www.illc.uva.nl/EuroWordNet/

2.2.2 Thesauri

According to Bußmann (2008), a thesaurus is a dictionary in which the lexical items of a language are arranged systematically. A more specific definition is given by Buitelaar and Declerck (2003), who describe thesauri as semantic resources that group together similar words or terms according to a standard set of relations like broader term, narrower term, etc. A thesaurus may also include language equivalents and translation terms. In the following we present the Roget thesaurus and the Medical Subject Headings (MeSH) thesaurus.

Roget

Roget[10] is a thesaurus of English that groups words into synonym and antonym categories. First published in 1852, the Roget thesaurus has evolved into a widely used dictionary.

MeSH

MeSH[11] is the United States National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.

[8] http://framenet.icsi.berkeley.edu/
[9] http://www.coli.uni-saarland.de/projects/salsa/page.php?id=index-salsa1
[10] http://machaut.uchicago.edu/rogets
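Searching a hierarchical thesaurus at various levels of specificity can be sketched with a small descriptor tree. The example below is a simplified illustration: the excerpt of the hierarchy is deliberately tiny, the document identifiers and their indexing are invented, and the code is not an interface to the actual MeSH data.

```python
# Simplified broader-term -> narrower-terms excerpt (illustrative only).
NARROWER = {
    "Cardiovascular Diseases": ["Heart Diseases", "Vascular Diseases"],
    "Heart Diseases": ["Arrhythmias, Cardiac", "Heart Failure"],
}

# Hypothetical documents, each indexed with a set of descriptors.
DOCUMENTS = {
    "doc1": {"Heart Failure"},
    "doc2": {"Vascular Diseases"},
    "doc3": {"Arrhythmias, Cardiac"},
}


def expand(descriptor, narrower):
    """Return the descriptor together with everything below it in the tree."""
    result = [descriptor]
    for child in narrower.get(descriptor, []):
        result.extend(expand(child, narrower))
    return result


def search(descriptor, narrower, documents):
    """Find documents indexed with the descriptor or any narrower descriptor."""
    wanted = set(expand(descriptor, narrower))
    return sorted(doc for doc, tags in documents.items() if tags & wanted)


# A broad descriptor retrieves more documents than a specific one.
print(search("Cardiovascular Diseases", NARROWER, DOCUMENTS))
print(search("Heart Diseases", NARROWER, DOCUMENTS))
print(search("Heart Failure", NARROWER, DOCUMENTS))
```

Expanding a query to its narrower descriptors is what lets a single hierarchy serve both general and very specific searches.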

2.2.3 Semantic Networks

Bußmann (2008) defines semantic networks as graphs in which the nodes are connected to each other by relations. The same definition is given more explicitly by Buitelaar and Declerck (2003), who define semantic networks as semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (e.g. the UMLS Semantic Network, CYC).
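The graph view of a semantic network can be made concrete with labelled edges. The following is a minimal sketch: the node names and relation labels are invented for illustration and are not taken from the UMLS Semantic Network or CYC.

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples from a biomedical domain.
TRIPLES = [
    ("Drug", "treats", "Disease"),
    ("Virus", "causes", "Disease"),
    ("Disease", "affects", "Organ"),
]


def build_network(triples):
    """Store the network as node -> list of (relation, target) edges."""
    graph = defaultdict(list)
    for source, relation, target in triples:
        graph[source].append((relation, target))
    return graph


def related(graph, node, relation):
    """Return the nodes linked to `node` by the given relation."""
    return [target for label, target in graph.get(node, []) if label == relation]


def reachable(graph, start):
    """Return every node reachable from `start` by following any relation."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


NETWORK = build_network(TRIPLES)
print(related(NETWORK, "Virus", "causes"))
print(sorted(reachable(NETWORK, "Drug")))
```

Unlike the fixed relations of a thesaurus (broader term, narrower term), the relation labels here are free to follow whatever the domain of application requires.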

UMLS

The UMLS[12] is a compilation of more than 60 controlled vocabularies in the biomedical domain. It is being created by the National Library of Medicine under an ongoing research initiative that supports applications in processing, retrieving, and managing biomedical text (Rindflesch and Aronson, 2002). Some of the medical terminologies integrated in UMLS are the Medical Subject Headings (MeSH), Systematized Nomenclature of Medicine (SNOMED), International Statistical Classification of Diseases and Related Health Problems (ICD), Physicians' Current Procedural Terminology (CPT), and Clinical Terms Version 3 (Read Codes).

The UMLS Knowledge Source is structured around three separate components: the Metathesaurus, the Semantic Network, and the SPECIALIST lexicon. The UMLS Metathesaurus is a multilingual thesaurus which contains semantic information about more than 8000000 biomedical concepts, each concept having variant terms with synonymous meaning.
