Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation
Eneko Agirre, Aitor Soroa
IXA NLP Group University of Basque Country
Donostia, Basque Contry [email protected]
Abstract
This paper presents the results of a graph-based method for performing knowledge-based Word Sense Disambiguation (WSD). The technique exploits the structural properties of the graph underlying the chosen knowledge base. The method is general, in the sense that it is not tied to any particular knowledge base, but in this work we have applied it to the Multilingual Central Repository (MCR, (Atserias et al., 2004)). The evaluation has been performed on the Senseval-3 all-words task (Snyder and Palmer, 2004). The main contributions of the paper are twofold: (1) We have evaluated the separate and combined performance of each type of relation in the MCR, and thus indirectly validated the contents of the MCR and their potential for WSD. (2) We obtain state-of-the-art results, and in fact yield the best results that can be obtained using publicly available data.
1. Introduction
Word Sense Disambiguation (WSD) is a key enabling- technology, the aim of which is to automatically choose the correct sense of a word in a context. Supervised WSD sys- tems are the best performing in public evaluations but they need large amounts of hand-tagged data, which is typically very expensive to build. On the other hand, knowledge- based WSD systems exploit the information present on a lexical knowledge base (LKB) to perform WSD, without using any further corpus evidence.
Traditional knowledge-based WSD system assign a sense to an ambiguous word by comparing each of its senses with those of the surrounding context. Typically, some seman- tic similarity metric is used for calculating the relatedness among senses(Lesk, 1986; McCarthy et al., 2004). One of the major drawbacks on these systems comes from the fact that the WSD process is done in a word-by-word basis, i.e., the words are disambiguated individually.
Recently, graph-based methods for knowledge-based WSD have gained much attention in the NLP community (Sinha and Mihalcea, 2007; Navigli and Lapata, 2007; Mihalcea, 2005). These methods use well-known graph-based tech- niques to find and exploit the structural properties of the graph underlying the LKB. Because the graph is analyzed as a whole, these techniques can help to find globally opti- mal solutions given the relations between entities, and dis- ambiguate large portions of text in one go.
In this paper, we present a novel graph-based method for performing unsupervised WSD. The method is general, in the sense that it is not tied to any particular lexical resource, but in this work we have applied it to the Multilingual Cen- tral Repository (MCR, (Atserias et al., 2004)). The MCR is based on the EuroWordNet architecture to represent tightly connected wordnets for various languages. It also contains thousands of automatically acquired relations. The main goal of this work is to show that the relations in the MCR are valuable for WSD, and that our WSD algorithm ob- tains state-of-the-art results compared to other more com- plex graph-based algorithms.
The WSD method works as follows. Given an input con- text, the method first explores the whole LKB and finds a subgraph which is particularly relevant for the words of the context. Then, a random-walk algorithm is executed over the subgraph for obtaining a ranking of vertices. As a re- sult, every word of the context is attached to the highest ranking concept among its possible senses.
The paper is organized as follows. We first describe our graph based method in Section 2. Section 3 presents the Multilingual Central Repository. Section 4 show experi- mental setting and results. Section 5 compares our system with related experiments on graph-based WSD. Finally, we draw conclusions and further work.
2. Graph-based techniques for WSD
In this section we will describe our general framework for graph-based WSD. The method is general and none of the processes of the method are particular to any particular knowledge base. However, in our work we have used the MCR as the source of lexical-semantic knowledge.
In a first step, we build a graph GMCRwhich represents the MCR knowledge base: each synset (concept) of the MCR is represented by a node vion the graph, and each relation between concepts viand vjis represented by an edge ei,j. Given an input text, we extract the list Wi i = 1 . . . m of content words (i.e. nouns, verbs, adjectives and ad- verbs), where each of them will have a set of synsets as- sociated in the MCR. Let Synsetsi= {v1, . . . , vim} be the im associated synsets of word Wi in the MCR. Note that monosemous words will be related to only one synset in the MCR, whereas polysemous words may be attached to several synsets. As a result of the disambiguation process, for each word in the sentence we will select the sense in the MCR with the highest ranking synset in the graph. We will now explain how we create subgraph and the method to rank the synsets contained in it.
Name Relations #synsets #relations M16 WN1.6, REL2.0, XNET, sPref, sCooc 99,634 1,651,445 M16 wout sPref WN1.6, REL2.0, XNET, sCooc 99,634 1,519,833 M16 wout sCooc WN1.6, REL2.0, XNET, sPref 99,632 798,453 M16 wout Xnet WN1.6, REL2.0, sPref, sCooc 99,238 1,169,300
M16 wout Semcor WN1.6, REL2.0, XNET 99,632 637,290
M16 wout WXnet sPref, sCooc 27,336 1,024,698
M17 WN1.7, XNET 109,359 620,396
Table 1: Groupings of the relations in the MCR used in the paper.
Creating a disambiguation subgraph for a given context
The main idea of our method is that to extract the subgraph of GMCRwhose vertices and relations are particularly rel- evant for a given input context. We call such a subgraph a
“disambiguation subgraph” GD, and we build it in the fol- lowing way. For each word Wi in the input context and each synset vi ∈ Synsetsi, we perform a standard breath- first search (BFS) over GMCRstarting at node vi. Each run of the BFS calculates the minimum distance paths between viand the rest of concepts of GMCR. In particular, we are interested in the minimum distance paths between vi and the concepts associated to the rest of the words in the con- text, vj ∈ S
j6=iSynsetsj. Let mdpvi be the set of these shortest paths.
We repeat this BFS computation for every synset of every word in the input context, storing mdpvi accordingly. At the end, we obtain a set of minimum length paths each of them having a different concept as a source. We then build the disambiguation graph GDwhich is just the union of the vertices and edges of the shortest paths, GD = Sm
i=1{mdpvj/vj∈ Synsetsi}.
The disambiguation graph GD is thus a subgraph of the original GMCR graph obtained by computing the shortest paths between the synsets of the words co-occurring in the context. Thus, we can assume that it captures the most rel- evant concepts and relations in the MCR for the particular input context.
Exploiting structural properties of the disambiguation graph
Once the GD graph is built, we compute the PageR- ank (Brin and Page, 1998) algorithm over it. The intuition behind this step is that the vertices representing the correct synsets will be more relevant in GDthan the rest of the pos- sible synsets of the context words, which should have less relations on average and be more isolated.
PageRank is an iterative algorithm that ranks all the vertices of a graph according to their relative importance within the graph, following a random-walk model. In this model, a link between vertices v1and v2means that v1recommends v2. The more vertices recommend v2, the higher the rank of v2will be. Furthermore, the rank of a vertex depends not only on how many vertices point to it, but on the rank of these vertices as well.
Specifically, let G = (V, E) be a graph with the set of ver- tices V and set of edges E. For a given vertex vi, let In(vi) be the set of vertices pointing to it, and let djthe degree of
vertex vj. The rank of viis defined as:
P (vi) = (1 − α) + α X
j∈In(vi)
1 dj
P (vj)
where 0 ≤ α ≤ 1. α is called the damping factor and models the probability of a web surfer standing at a vertex to follow a link from this vertex (probability α) or to jump to a random vertex in the graph (probability 1 − α). The factor is usually set at 0.85.
The algorithm initializes the ranks of the vertex with a fixed value (usually N1 for a graph with N vertices) and iterates until convergence below a given threshold is achieved, or, more typically, until a fixed number of iterations are exe- cuted. Note that the convergence of the algorithms doesn’t depend of the initial value of the ranks.
After running the algorithm, the vertices of GDare ordered in decreasing order according to its rank. Finally, the dis- ambiguation step is performed by assigning to each word Withe associated concept in Synsetsiwhich has maximum rank (in case of ties, we assign all the concepts with maxi- mum rank).
3. Multilingual Central Repository
The Multilingual Central Repository (Atserias et al., 2004) is a lexical knowledge base built within the MEANING project1, and acts as a multilingual interface for integrating and distributing all the knowledge acquired in the project.
The MCR constitutes a natural multilingual large scale lin- guistic resource for a number of semantic processes that need large amounts of multilingual knowledge to be effec- tive tools.
The current version of the MCR contains more than 1, 500, 000 semantic relations between synsets, most of them acquired by automatic means. This represents al- most seven times more relations than the Princeton Word- Net (Fellbaum, 1998) (235, 402 unique semantic relations in WN 3.0).
The MCR follows the model proposed by the EWN project, whose architecture includes the Inter-Lingual-Index (ILI), a Domain ontology and a Top Concept ontology.
The current MCR integrates: (i) the ILI based in WN1.6, including EWN Base Concepts, the EWN Top Concept ontology, MultiWordNet Domains (MWND) and the Sug- gested Upper Merged Ontology (SUMO); (ii) Local WNs connected to the ILI, including English WN 1.5, 1.6, 1.7, 1.7.1, Basque, Catalan, Italian and Spanish wordnets; (iii)
1http://nipadio.lsi.upc.es/nlp/meaning
Large collections of semantic preferences, acquired both from SemCor and from BNC; (iv) disambiguated glosses from the eXtended WordNet (Harabagiu and Moldovan, 2000); (v) Instances, including named entities.
The MCR is mainly based on WordNet 1.6. The reason is that most of the resources integrated on it where built linked to that version of WordNet. Unfortunately new versions of WordNet have been produced in later years, and although automatic mapping technology exists (Daude et al., 2000) and porting the relations among versions is possible, some errors are introduced, as we will see.
In this work we have used the following relations from the MCR:
• WN1.6: English WordNet 1.6 synsets and relations
• WN1.7: English WordNet 1.7 synsets and relations
• REL2.0: English WordNet 2.0 relations (to be added to WN1.6)
• XNET: eXtended WordNet (gold, silver and normal)
• sPref: Selectional preference relations
• sCooc: Coocurrence relations
The latter two types of relations (sPref and sCooc) are ex- tracted from Semcor, a semantically hand-tagged corpus (Miller et al., 1993). They are thus essentially different from the other relations of the MCR; because they are ex- tracted from a hand-tagged corpus, the system benefits from a “supervised” kind of information when exploiting sPref or sCooc relations in our algorithm.
One of our main objectives is to measure the impact of the different relations in the MCR on the WSD task. There- fore, we have grouped different MCR relation types to- gether, and tried the same algorithm with each of them. Ta- ble 1 shows the relation groups and their sizes (number of synsets and relations). We basically have two main groups of graphs: those based on version 1.6 of WordNet (all of which start with M16) and those based on version 1.7 of WordNet (starting with M17). Version 1.7 is interesting be- cause it is the one used in the evaluation, cf. Section 4..
4. Experiments
We have applied our graph-based technique to the Senseval- 3 all words dataset (Snyder and Palmer, 2004), which is based on version 1.7 of Wordnet. For each sentence to be disambiguated we build a context of at least 20 words, tak- ing the sentences immediately before and after it in the case that the original sentence was too short.
We have tried different sets of relations in several runs of the algorithm. Table 2 summarizes the performance of the algorithm for the different MCR relations. The table is divided in two sections, depending whether relations ex- tracted from Semcor are used (“Semi supervised”) or not (“Unsupervised”).
We can see that adding supervised relations achieve the best overall results. In particular, the results indicate that coocurrence relations between synsets are a power- ful source of information. If we discard coocurrence rela- tions from the LKB the performance of the system drops
Relations All Noun Verb Adj. Adv.
Semi supervised
M16 57.30 62.30 49.00 62.40 92.90
M16 wout sPref 57.90 63.10 49.80 61.80 92.90 M16 wout sCooc 53.00 58.10 44.20 58.30 92.90 M16 wout Xnet 57.60 63.10 49.60 61.00 92.90 M16 wout WXnet 55.30 58.70 48.70 60.80 85.70
Unsupervised
M16 wout semcor 53.70 59.50 45.00 57.80 92.90
M17 56.20 61.60 47.30 61.80 92.90
Table 2: Results of Senseval-3 All Words as recall
more than 4 percentage points (as can be seen in row M16 wout sCooc in Table 2). On the other hand, selec- tional preference relations don’t seem to be as useful as coocurrence relations.
Somehow surprisingly, the system performs only slightly worse when using only supervised information (i.e. with- out any hierarchical relations of WordNet), as indicated by the row M16 wout WXnet. Because the size of the LKB when considering only supervised relations is one third of the LKB with all relations (see Table 1), we would expect a heavy coverage penalty on the algorithm, and thus a per- formance drop in the overall recall. However, this drop is not produced, which indicates that large parts of WordNet itself are never considered when applying the WSD algo- rithm to this data set, and that the relations extracted from SemCor contain the most important synsets that need to be considered.
Regarding the unsupervised results (row M16 wout semcor), the performance drops as expected, but the obtained results are of great merit, as no hand- tagged corpora are being used. The experiment we did with WordNet 1.7 (row M17), shows that it yields better results than WordNet 1.6. The reason, may lye on the fact that the Senseval 3 all words data set is tagged using WordNet 1.7.1 synsets, and thus the mapping step which is used with the LKBs based on WordNet 1.6 introduces considerable noise.
5. Comparison to related work
We have compared our results with similar work which also use graph techniques for performing unsupervised, knowl- edge based WSD, namely (Mihalcea, 2005; Sinha and Mi- halcea, 2007; Navigli and Lapata, 2007). Table 3 shows a comparison of their results with ours. We only depict the results of out best unsupervised experiment M17, as the rest of the systems are also of unsupervised nature. A base- line which consists on selecting the most frequent sense in Semcor of every word is also included, as well as the result of the best system of Senseval-3 all word task (GAMBL).
Note that GAMBL is a supervised system which learns from Semcor information.
The TexRank algorithm (Mihalcea, 2005) for WSD creates a complete weighted graph (e.g. a graph where every pair of distinct vertices is connected by an weighted edge) formed by the synsets of the words in the input context. The weight of the links joining two synsets is calculated by executing Lesk’s algorithm between them, i.e., by calculating an over-
lapping factor of the words in the sense’s definition glosses.
Once the complete graph is built, PageRank algorithm is executed over it and words are assigned to the most relevant synset. Their results on Senseval-3 dataset are depicted in row Mih05 on Table 3.
(Sinha and Mihalcea, 2007) extends the previous work by using a collection of semantic similarity measures when assigning a weight to the links across synsets. They also compare different graph-based centrality algorithms to rank the vertices of the complete graph. Using different simi- larity metrics for different POS types and a voting scheme among the centrality algorimth ranks yields the results on row Sin07 on Table 3. Here, the Senseval-3 corpus was used as a development data set, and we can thus see the re- sults as the upperbound of their method. Both (Mihalcea, 2005) and (Sinha and Mihalcea, 2007) use WordNet as a LKB, but neither of them specify which particular version did they use.
We can see in Table 3 that our method clearly outperforms both Mih05 and Sin07, which indicates that the initial ex- traction of a LKB’s subgraph is an important step to per- form. We can see that storing intermediate LKB nodes from the shortest paths between synsets adds useful infor- mation when performing WSD. The results of various in- house made experiments with complete graphs a-la (Mi- halcea, 2005) also confirm this observation. Note also that our method is simpler than the combination strategy used in (Sinha and Mihalcea, 2007), and that we did not perform any parameter tuning as they did.
The work presented here is very similar to (Navigli and La- pata, 2007). They also perform a two stage process for WSD, the first one consisting on the extraction of a sub- graph from the LKB which better suits the concepts in- volved in a particular input context. Then, they study dif- ferent graph-based centrality algorithms deciding the rele- vance of the nodes on the subgraph. The algorithm which performs best yields the results shown in row Nav07 on ta- ble 3.
The main difference between the method used in (Navigli and Lapata, 2007) and our method lies on the initial method for extracting the context subgraph. Whereas we rely on shortest paths between word synsets, they apply a depth- first search algorithm over the LKB graph, and restrict the deep of the subtree to a value of 3. As for the underly- ing LKB, they use WordNet 2.0 enriched with several new relations as described in (Navigli and Velardi, 2005). Un- fortunately, those new relations are not publicly available, so we can’t perform a direct comparison.
Navigli and Lapata don’t report overall results and there- fore, we can’t directly compare our results with theirs.
However, we can see that a on a POS-basis evaluation our results are rather similar for nouns and adjectives, and per- form better for verbs.
6. Conclusion and further work
In this paper we have presented the results of a graph-based method for performing knowledge-based Word Sense Dis- ambiguation (WSD). The technique exploits the structural properties of the graph underlying the chosen knowledge base. The method is general, in the sense that it is not tied to
System All Noun Verb Adj. Adv.
Mih05 52.2 - - - -
Sin07 52.4 60.45 40.57 54.14 100
Nav07 - 61.9 36.1 62.8 -
M17 56.20 61.60 47.30 61.80 92.90
MFS 60.9-62.4 - - - -
GAMBL 65.1 - - - -
Table 3: Comparison with related work on Senseval-3 All Words dataset. Sin07: Senseval-3 corpus used was develop- ment data set. Nav07: the authors report the results only by POS. M17: our best unsupervised experiment. MFS: base- line performance varies depending on treatment of multi- words and hyphenated words. GAMBL: best supervised system.
any particular knowledge base, but in this work we have ap- plied it to the Multilingual Central Repository (MCR, (At- serias et al., 2004)).
The evaluation has been performed on the Senseval-3 all- words task (Snyder and Palmer, 2004). One of the contribu- tions of this paper is that we have evaluated the separate and combined performance of each type of relation in the MCR, and thus indirectly validated the contents of the MCR and their potential for WSD. The analysis shows that all the re- lations in the MCR are valuable for performing WSD, and that the relations coming from hand-tagged corpora are the most valuable, as could be expected. The version of Word- Net is highly relevant. In fact, using the same version as of the test corpus proved to be a key issue, as it allowed to obtain better results.
The other main contribution of this paper is to show that our graph-based WSD system is competitive with the current state-of-the-art, and in fact yields the best results that can be obtained using publicly available data.
Acknowledgments
This work has been funded by the Basque Government (project SAIOTEK S-PE06UN31).
7. References
J. Atserias, L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and P. Vossen. 2004. The meaning multi- lingual central repository. In In Proceedings of GWC, Brno, Czech Republic.
S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7).
J. Daude, L. Padro, and G. Rigau. 2000. Mapping Word- Nets using structural information. In 38th Anual Meet- ing of the Association for Computational Linguistics (ACL’2000), Hong Kong.
C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
S. M. Harabagiu and D. I. Moldovan. 2000. Enriching the wordnet taxonomy with contextual knowledge acquired from text. pages 301–333.
M. Lesk. 1986. Automatic sense disambiguation using ma- chine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC ’86: Proceedings of the
5th annual international conference on Systems docu- mentation, pages 24–26, New York, NY, USA. ACM.
D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004.
Finding predominant word senses in untagged text. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 279, Morristown, NJ, USA. Association for Computational Linguistics.
R. Mihalcea. 2005. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In Proceedings of HLT05, Mor- ristown, NJ, USA.
G.A. Miller, C. Leacock, R. Tengi, and R.Bunker. 1993. A semantic concordance. In Proc. of the ARPA HLT work- shop.
R. Navigli and M. Lapata. 2007. Graph connectivity mea- sures for unsupervised word sense disambiguation. In IJCAI.
R. Navigli and P. Velardi. 2005. Structural semantic in- terconnections: A knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Anal. Mach.
Intell., 27(7):1075–1086.
R. Sinha and R. Mihalcea. 2007. Unsupervised graph- based word sense disambiguation using measures of word semantic similarity. In Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA, USA.
B. Snyder and M. Palmer. 2004. The english all-words task. In Proc. of SENSEVAL.