
For semantic applications the hierarchy of terms or concepts is of great importance.

Relationships such as is-a (hypernymy) and part-of (meronymy) are defined in a broader sense as subsumption relations of implication, which relate to more general concepts in conceptual taxonomies. Subsumption can be defined as follows:

Definition 2.7 (subsumption). A subsumption defines a lattice (partial ordering), possibly represented as a directed acyclic graph (DAG). In a DAG, child nodes may have more than one parent node, and hence the graph does not necessarily have to be a tree. The subsumption relation may be seen as a generalisation relation, where the subsumer states a generalisation over the subsumed.
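The definition can be made concrete with a small sketch. The class below is a hypothetical illustration, not part of the original work: it stores is-a edges as a DAG in which a node may have several parents, and it answers subsumption queries transitively. The term "mitochondrial part" is a made-up example label.

```python
from collections import defaultdict


class Taxonomy:
    """Toy subsumption hierarchy stored as a DAG (child -> set of parents)."""

    def __init__(self):
        self.parents = defaultdict(set)

    def add_is_a(self, child, parent):
        # Reject edges that would make the graph cyclic.
        if child == parent or self.subsumes(child, parent):
            raise ValueError(f"{child} -> {parent} would create a cycle")
        self.parents[child].add(parent)

    def subsumes(self, general, specific):
        """Return True if `general` is a (transitive) ancestor of `specific`."""
        stack, seen = [specific], set()
        while stack:
            node = stack.pop()
            for parent in self.parents.get(node, ()):
                if parent == general:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False


t = Taxonomy()
t.add_is_a("inner membrane", "membrane")
t.add_is_a("mitochondrial inner membrane", "inner membrane")
# A second parent for the same child: the hierarchy is a DAG, not a tree.
t.add_is_a("mitochondrial inner membrane", "mitochondrial part")
```

Because a child may have more than one parent, `subsumes` has to follow every parent edge rather than a single path to the root.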

Table 2.14 lists eight methods capable of obtaining subsumption (taxonomic) relationships. To extract such relationships from text, two classes of approaches have been described in the literature, namely

• lexico-syntactic methods: 4 of the 9 listed methods extract relationships from text using manual or learned patterns of hyponymy.

• statistical methods: 4 of the 9 listed methods employ statistical measures to determine the existence of taxonomic relationships.

Both classes rely on the distributional hypothesis introduced by Harris (1968), which states that two words that appear in many similar linguistic contexts are semantically similar. Lexico-syntactic methods analyse features of words and how they are composed or modified. Statistical methods analyse the occurrence, co-occurrence, and distribution of words within and between documents.

In the following section a number of relevant publications are presented. Reflecting the nature of the methods, the section is divided into methods relying on syntactic patterns and methods relying on (statistical) similarity measures.

Taxonomy learning methods using syntactic patterns

Hearst (1992) compiled a set of lexico-syntactic patterns that are commonly used to describe subsumption (mainly hypernymy) in text; these Hearst patterns are a first example of lexico-syntactic methods. Examples are “A is a B” or “B such as A”. With these patterns one can infer, for example, from the text fragment “organelles such as mitochondria” that mitochondria are organelles. To show how widely such patterns are used, the author analysed corpora and found, for example, a total of 3178 sentences containing “such as” in the New York Times news corpus (20 million words). In general, the application of Hearst patterns leads to high precision but low recall, since many relationships are not made explicit in text.
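As a minimal sketch of how the “B such as A” pattern can be applied, the following code uses a regular expression over plain text. The pattern and the function name `extract_hyponyms` are illustrative assumptions; the sketch handles only single-word terms and is not the pattern set used by Hearst.

```python
import re

# Illustrative pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Terms are restricted to single words to keep the sketch short.
SUCH_AS = re.compile(
    r"(?P<hyper>\w+) such as "
    r"(?P<hypos>\w+(?:, (?!(?:and|or)\b)\w+)*(?:,? (?:and|or) \w+)?)"
)


def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group("hyper")
        # Split the enumeration on ", ", " and ", " or " (optionally with comma).
        for hyponym in re.split(r",? (?:and|or) |, ", match.group("hypos")):
            pairs.append((hyponym, hypernym))
    return pairs


print(extract_hyponyms("organelles such as mitochondria"))
# prints [('mitochondria', 'organelles')]
```

A surface pattern of this kind only fires when the relation is stated explicitly, which is consistent with the high precision and low recall noted above.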

Faure and Nédellec (1998, 1999) and Faure and Poibeau (2000) proposed a different technique called conceptual clustering. After the acquisition of syntactic frames in a text, the learning method relies on the observation of syntactic regularities in the context of words. An example of such an instantiated syntactic frame is <to travel> <subject: [father, neighbour, friend]> <by: [car, train]>. The concepts found are grouped according to their semantic distance and are thereby ordered in a hierarchy. No manual curation is needed beforehand, but the result is validated manually, which is time-consuming. A pattern-based learning approach, in contrast, uses labelled examples for extracting instances from texts. While the annotation of the learning examples is time-consuming, the quality of the learning results is predictable and can be validated automatically.

| Method | Characteristics (syntactic patterns, statistics) | Comment |
|---|---|---|
| Hearst (1992) | ✓ | In an example, only 42/3178 (Grolier’s Encyclopedia) and 152/7067 (New York Times) sentences containing “such as” were found to contain a hypernym relation. Hearst patterns have high precision (>0.90) but low recall (<<0.10). |
| Caraballo (1999) | ✓ ✓ | use of Hearst patterns; 0.33 precision (strict), 0.60 precision (by one human judge) |
| Sanderson and Croft (1999) | ✓ | co-occurrence measure; no clustering; no learning; 0.48 precision (baseline 0.28) |
| Faure and Poibeau (2000) | ✓ | learning of patterns for relations from labelled examples |
| Cimiano et al. (2005) | | F1=0.41 (Tourism), F1=0.33 (Finance) |
| Snow et al. (2004) | ✓ | 132% improved F-measure compared to classification with WordNet, but generally low maximal F-measure: 0.14 (Hearst patterns), 0.23 (WordNet), 0.27 (TREC hypernyms), 0.33 (TREC hypernyms + coordinate terms), 0.36 (TREC + Wikipedia hypernyms + coordinate terms) |
| Heymann and Garcia-Molina (2006) | ✓ | centrality-driven creation of noun hierarchies |
| Snow et al. (2006) | ✓ | machine learning for patterns using WordNet, TREC, Wikipedia; 0.58 precision, 0.20 recall |
| Witschel (2005) | ✓ | co-occurrence in large corpora; 11% to 14% accuracy |
| Ryu and Choi (2006) | ✓* | *review of four methods with recall and precision below 0.50 |

Table 2.14. Overview of the quality of taxonomy generation. The F-measure is usually below 0.50.

In information retrieval, quality is often measured as F-measure (F), the harmonic mean of precision and recall.
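For reference, the balanced F-measure is F = 2PR / (P + R) for precision P and recall R. The helper below is a small sketch of that formula (the function name and the optional `beta` weight are illustrative additions); applied to the precision and recall reported for Snow et al. (2006) in Table 2.14, it gives roughly 0.30.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision 0.58 and recall 0.20 as listed for Snow et al. (2006).
print(round(f_measure(0.58, 0.20), 2))  # prints 0.3
```

Because the harmonic mean is dominated by the smaller of the two values, a method with high precision but very low recall still receives a low F-measure.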

Ogren et al. (2004) analysed, in an ontology-centric approach to taxonomy generation, the compositional structure of Gene Ontology (GO) terms and found that many GO terms contain each other and many GO terms are derived from each other. For example, the term membrane [GO:0016020] has inner membrane [GO:0019866] as a direct sub-concept. This and similar knowledge can be used to automatically generate new candidate terms following the observed patterns and to induce the structure. We evaluated this in a small experiment in Section 5.3 (Pattern-based relation extraction – Superstring prediction). Lee et al. (2006) used the taxonomic structure of the ontology to predict new terms, including the parent-child relations.
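The observation that terms contain each other suggests a simple heuristic, sketched below under the assumption of head-final English noun phrases: a multi-word term is proposed as a child of the longest known term that forms its suffix. The function name `superstring_candidates` is hypothetical, and the sketch is not the procedure of Ogren et al. or of Section 5.3.

```python
def superstring_candidates(terms):
    """Propose (child, parent) pairs where the child term ends with the parent term.

    Assumes head-final noun phrases, e.g. 'inner membrane' is-a 'membrane'.
    """
    term_set = set(terms)
    pairs = []
    for term in sorted(term_set):
        words = term.split()
        # Drop leading modifiers one at a time and keep the longest known suffix.
        for start in range(1, len(words)):
            suffix = " ".join(words[start:])
            if suffix in term_set:
                pairs.append((term, suffix))
                break
    return pairs


terms = ["membrane", "inner membrane", "mitochondrial inner membrane", "nucleus"]
print(superstring_candidates(terms))
# prints [('inner membrane', 'membrane'), ('mitochondrial inner membrane', 'inner membrane')]
```

Matching whole words rather than raw substrings avoids spurious links between terms that merely share characters.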

Snow et al. (2004, 2006) Instead of creating definitional patterns by hand, machine learning techniques can learn these syntactic patterns from examples. Snow et al. illustrate that machine learning leads to an improvement of over 132% in finding hypernym relationships compared to simple Hearst patterns. The best setup finds hypernyms with an F-measure of 0.36 and uses training data from WordNet, TREC, and Wikipedia. Snow et al. (2006) extend their previous work with a conditional model that judges the likelihood of generated relations by maximising the conditional probability of relations. A relation is likely to be true if a syntactic pattern exists that supports the assignment. The approach was used to extend WordNet version 2.1 and reports 0.20 recall at 0.58 precision.

Taxonomy learning methods using statistical information

Caraballo (1999) creates hierarchies using syntactic information (Hearst, 1992) as well as statistical information, here co-occurrence. The algorithm produces correct hyponyms in 33% of all cases. The evaluation is based on a sample of 10 nodes, each dominating at least 20 nouns. The total tree contained 20,014 nouns, which were structured by 654 nodes. Up to three hypernyms were listed as “best” hypernyms for each node. Three human judges had to assess for each noun whether the hypernyms assigned to the corresponding nodes were correct. For 60% of the tested nouns, at least one judge judged one hypernym as correct. Given the small test set, the evaluation by Caraballo (1999) is not comprehensive. It was not evaluated how many of the nouns within a cluster were correct, only whether the hypernyms assigned to each cluster hold true for the nouns assigned to the cluster. The conclusions drawn by Caraballo are therefore not generalisable to learning taxonomic relations. With the conclusion “.. that hypernym hierarchies of nouns can be constructed automatically from text with similar performance to semantic lexica built automatically for hand selected hypernyms.”, the author draws a comparison to the pattern-based approach by Hearst (1992).

Sanderson and Croft (1999) avoided the use of clustering or training data and created concept hierarchies using co-occurrences of concepts (their lexical representations) in text. Half of the pairs obtained by co-occurrence testing fulfil some subsumption criterion.

Heymann and Garcia-Molina (2006) used statistical information for the extraction of subsumption relations from text corpora. In this method two terms are linked if the cosine similarity of their document vectors is above a threshold. The term that is more central in the whole graph becomes the parent, the other the child. The cosine similarity is a measure often used to compare texts, where the similarity is the cosine of the angle between two n-dimensional vectors representing the texts. The algorithm has been described, but not evaluated, by the authors. To assess the usage of co-occurrence data, this algorithm has been evaluated within this thesis in Section 5.4 (Results: Algorithm by Heymann et al.).
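The two steps described above, thresholding on cosine similarity and orienting each link by centrality, can be sketched as follows. This is a simplified illustration, not the authors' algorithm: centrality is approximated by node degree in the similarity graph, which is an assumption, and the names `cosine` and `link_terms` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_terms(doc_vectors, threshold):
    """Return (parent, child) pairs for term pairs whose similarity exceeds threshold.

    Centrality is approximated by degree in the similarity graph (an assumption);
    ties are resolved in favour of the alphabetically first term.
    """
    terms = sorted(doc_vectors)
    edges = [
        (a, b)
        for i, a in enumerate(terms)
        for b in terms[i + 1:]
        if cosine(doc_vectors[a], doc_vectors[b]) > threshold
    ]
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # The more central term of each linked pair becomes the parent.
    return [(a, b) if degree[a] >= degree[b] else (b, a) for a, b in edges]
```

Each term is represented here by a dictionary mapping document identifiers to occurrence counts, so that two terms are similar when they occur in largely the same documents.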

Witschel (2005) In one of the first large-scale evaluations, Witschel evaluated to what extent noun phrases can be related to a hierarchy via subsumption relations. The method identifies noun phrases with a pattern-based approach using part-of-speech tags, selects candidate terms based on frequency, and locates them in a hierarchy by utilising co-occurrence features from large corpora. In the evaluation it achieves a low accuracy of 14%. Even though a huge learning corpus (ca. 5 GB) was used, the classification data was sparse. For only 60% of the chosen examples could a minimum of 10 similar words be associated.

Formal concept analysis (FCA) uses similarity measures to arrange concepts in a hierarchy (see also Ganter et al., 2005). An FCA approach was evaluated on two domain examples, Tourism and Finance, by Cimiano et al. (2005) and compared with KMeans and hierarchical clustering. With F1 = 0.41 (Tourism) and F1 = 0.33 (Finance), FCA outperformed all clustering methods in terms of F-measure. This is due to higher re[...] compared to only $O(n^2)$ or $O(n^2 \log n)$ for KMeans and agglomerative clustering methods. Because a partner in a subsumption relationship extracted with FCA can consist of a set of terms, the F-measure is calculated as defined by Maedche and Staab (2002), who use Semantic Cotopy, a measure which averages the similarity between the sets of two terms in different ontologies, where a set contains all ancestors and descendants of the term. The authors average this over all terms in the learned and the reference ontology. Therefore the F-measure is mostly higher and not directly comparable to methods extracting explicit term-term relationships.

Ryu and Choi (2006) compare four taxonomy learning methods and analyse the features for specificity and similarity used in previous methods in order to select optimal features for taxonomy learning. Term specificity is a necessary condition for taxonomy learning, because specific terms tend to be located at low levels of a domain taxonomy. Term similarity is a necessary condition for taxonomy learning, because similar terms group close together in a taxonomy. Therefore it is highly probable that term $t_1$ is an ancestor of $t_2$ in a taxonomy $T_D$ if both are semantically similar and $t_2$ is more specific than $t_1$ in the domain $D$.

Features for specificity of terms:

• $Spec_{adj}$ – a term t (a noun) is specific if there are few adjectives modifying it (Caraballo, 1999; Ryu and Choi, 2005).

• $Spec_{varg}$ – the verb-argument distribution is based on the co-occurrence of terms with special verbs. A term is more specific if it co-occurs frequently with the same verbs, e.g. "protein" with "increase", "activate", "inhibits", "binds", etc. (Cimiano et al., 2005).

• $Spec_{coldoc}$ – the conditional probability of term co-occurrence regards a term $t_a$ as subsuming $t_b$ if $P(t_a \mid t_b) > P(t_b \mid t_a)$. Hence $t_b$ is more specific than $t_a$.

• $Spec_{in}$ – inside-word information is used to measure specificity for multiword terms. It indicates which component word that is highly associated with a term contributes specificity to the term.

• $Spec_{in/adj}$ – a harmonised measure combining $Spec_{in}$ and $Spec_{adj}$ to take both inside and outside information into account.

Features for similarity of terms:

• If terms co-occur in similar documents, they are similar (Sanderson and Croft, 1999).

• If vectors of adjective patterns of terms are similar, the terms are similar (Yamamoto et al., 2005).

• If vectors of verb-argument dependencies are similar, the terms are similar (Cimiano et al., 2005).
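Of the features listed above, the conditional co-occurrence criterion is the most directly computable. The sketch below reads the criterion as "a subsumes b if P(a|b) > P(b|a)", which is an assumption about the intended formula, and estimates both probabilities from the sets of documents that mention each term. The function name `cooccurrence_subsumption` is hypothetical.

```python
def cooccurrence_subsumption(docs_a, docs_b):
    """Decide the direction of subsumption between terms a and b.

    docs_a, docs_b: sets of ids of the documents that mention each term.
    Returns 'a' if a subsumes b, 'b' if b subsumes a, None if undecided.
    """
    both = len(docs_a & docs_b)
    if both == 0:
        return None  # the terms never co-occur, so no relation is proposed
    p_a_given_b = both / len(docs_b)
    p_b_given_a = both / len(docs_a)
    if p_a_given_b > p_b_given_a:
        return "a"
    if p_b_given_a > p_a_given_b:
        return "b"
    return None
```

Under this reading, the term that occurs in more documents overall is treated as the more general one whenever the two terms co-occur at all.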

Ryu and Choi compared four taxonomy learning methods and reported recall and precision of 0.50 or lower. They tested whether the assumption holds that, in a valid parent-child relationship, the specificity of the parent is lower than the specificity of the child. While $Spec_{adj}$ showed the highest precision, recall was very low, as nouns are usually modified by few adjectives. Regarding similarity, it was observed that taxonomy-based similarity ratings are closest to human similarity ratings (correlation coefficient of 0.85).
