Cross-lingual phrase sense disambiguation with syntactic dependency trees


A Senior Project

Submitted To The Department Of Computer Engineering Of Universidad de los Andes

In Partial Fulfillment Of The Requirements For The Degree Of

Computer Engineer


In this senior project I describe the construction of a translation assistance system intended to map L1 fragments, written in an L2 context, to their correct L2 translation. My approach uses a bilingual parallel corpus, a syntactic feature extraction mechanism and a memory-based statistical classification algorithm. The system participated in the fifth task of the 2014 International Workshop on Semantic Evaluation and ranked 2nd and 3rd among the 6 participating systems that used the English-Spanish dataset.


1 Introduction
2 Theoretical Framework
3 Method Description
    3.1 Overall Operation
    3.2 Parallel Corpus Selection
    3.3 Parallel Corpus Preparation
    3.4 Test Data Preparation
    3.5 Syntactic Feature Extraction
4 Submitted System Description
    4.1 Results
    4.2 Error Analysis
5 Discussion


Introduction

An L2 writing assistant is a tool intended for language learners who need to improve their writing skills. This tool lets them write a text in L2, but fall back to their native L1 whenever they are not sure about a certain word or expression. In these cases, the assistant automatically translates the L1 fragment for them [1].

Although this is a machine translation task, it is best addressed by a cross-lingual Word Sense Disambiguation (WSD) approach, which takes context into account by means of syntactic features used in a machine learning setting. The main differences between this and previous approaches to cross-lingual WSD are the bilingual nature of the input sentences (see section 3.4) and the generalization of the translation target from single words to phrases.

The remainder of this dissertation is organized as follows. Section 2 presents the theoretical framework. Section 3 describes the proposed method. The submitted systems, the obtained results and an error analysis are discussed in section 4. In section 5 we present a brief discussion of the results, and finally in section 6 we make some concluding remarks.


Theoretical Framework

Natural Language Processing (NLP) is a multidisciplinary field which studies the symbolic manipulation of human language by means of automatic mechanisms. In many contexts it is synonymous with computational linguistics, and their differentiation is more about epistemological aspects.

Although it is a rather little-known field of computer science, it dates back almost as far as computation itself. Its post-war emergence is related to Warren Weaver’s Memorandum [2], written and spread among Weaver’s acquaintances as early as 1949. In this document, the author calls on the experts in the then-emerging field of informatics to use machines for translation between human languages. His reasoning about the feasibility of this project is based on a comparison between cryptography and human languages. Thus, according to him, a Russian text can be thought of as merely an English text encrypted according to a set of patterns known as Russian grammar; as can be seen, the incipient Cold War context is easily traceable in this text.

The 1950s were a decade of optimism around NLP progress, especially machine translation, its main and almost only application at the time. This very optimism also precipitated a huge wave of disappointment when, in 1966, the ALPAC, a federal commission of the American government, issued a report that concluded that no relevant progress had been made in this area, nor was it likely to come anytime soon [3].

This situation started to change favorably for NLP at the onset of the 1980s, when the personal computer era began and widened the perceived usefulness of NLP applications for commercial purposes in an expanding market of millions of users. Furthermore, the development of multiple computable linguistic theories enabled other applications of NLP besides machine translation, such as spell and grammar checking, question answering, information extraction and speech recognition, among others [4].

With the arrival of the World Wide Web in the 1990s, a new paradigm of NLP became increasingly popular; namely, statistical NLP. This paradigm has yielded the best results to date [5]. It consists in finding general usage patterns about a specific language in a big linguistic sample, namely a corpus. It is from within this paradigm that Statistical Machine Translation emerges. The present work is part of this paradigm.


Method Description

3.1 Overall Operation

The system we built uses machine learning techniques to find the most appropriate translation of a target phrase in a given context. It receives an input as in (1) and yields an output as in (2).

(1) No creo que ella is coming.
(2) No creo que ella esté viniendo.

It does so on the basis of a syntactic selection of context features (cf. section 3.5), a bilingual parallel corpus and a classifier built using the Tilburg Memory-Based Learner (TiMBL) [6], making use of the IB1 algorithm, an implementation of the k-nearest neighbour classifier.

The operation of the system is divided into three stages. First, in the preprocessing stage, the input sentences are tokenized using the FreeLing tokenizer [7] and sorted according to the lexicographical order of their target phrases. Second, in the processing stage, each sentence is analyzed and translated sequentially. Finally, in the synthesis stage, an output corpus is generated from the processed target phrases and their original context. The remainder of this section will focus on the processing stage.

The processing of each sentence consists of several steps. In the first step, the L1 target phrase is searched for in a phrase index, which is an index of the appearances of each phrase in all the sentences of a parallel bilingual corpus. Given an L1 phrase, a binary search algorithm iterates through the phrase index and returns an array of line offsets of the corpus. Then, a multi-threaded subroutine reads the word-aligned bilingual corpus and extracts all the referenced sentences.
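The lookup step can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index stored as two parallel lists (sorted phrase keys and their line offsets) and treats a line offset as a plain list position; the names `KEYS`, `OFFSETS`, `lookup_offsets` and `read_examples` are illustrative and not taken from the system, and the multi-threaded reading is left out.

```python
from bisect import bisect_left

# Hypothetical phrase index: phrase keys sorted lexicographically, with
# the corpus line offsets of each phrase stored at the same position.
KEYS = ["is coming", "linked", "very tired"]
OFFSETS = [[1, 4], [0, 2, 3], [5]]


def lookup_offsets(keys, offsets, phrase):
    """Binary-search `phrase` in the sorted keys; return its offsets or []."""
    pos = bisect_left(keys, phrase)
    if pos < len(keys) and keys[pos] == phrase:
        return offsets[pos]
    return []


def read_examples(corpus_lines, line_offsets):
    """Fetch the referenced sentences from an already loaded corpus."""
    return [corpus_lines[i] for i in line_offsets]


# Toy corpus: one sentence per line, positions 0..5.
corpus = ["s0", "s1", "s2", "s3", "s4", "s5"]
examples = read_examples(corpus, lookup_offsets(KEYS, OFFSETS, "linked"))
```

Keeping the keys in a separate sorted list lets the search run in logarithmic time without touching the offset lists until a match is found.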

Thus, for each input sentence, a set of example bilingual word-aligned sentences is extracted from the corpus, as shown in Table 3.1. Relevant features are extracted according to a syntactic analysis as explained in section 3.5, and written to text files in the C4.5 format. The features extracted from the example sentences, as well as the L2 translations of the target phrases in each sentence, are used as the training set for TiMBL, while the features extracted from the input sentence are used as its (singleton) test set.

Input sentence:
    No creo que las necesidades afectivas de las personas estén necesariamente linked al matrimonio.

Example sentences:
    He said Boyd already linked him to Brendan.
    Dijo que Boyd ya le había relacionado con Brendan.

    The three things are inextricably linked, and I have the formula right here.
    Las tres cosas están estrechamente vinculadas, y tengo la fórmula aquí.

    ... 681 more

Table 3.1: Input sentence and example sentences

The L2 translations of each target phrase in the example sentences are used as the classes for the training set, in order to turn a bilingual disambiguation problem into a machine learning classification problem. TiMBL learns how to classify the training feature vectors into their corresponding classes and then predicts the class for the test set feature vector, i.e. its most likely translation.
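To make the classification step concrete, here is a minimal nearest-neighbour sketch. It is not TiMBL's implementation: it assumes an unweighted overlap distance (the number of mismatching feature values) and a plain majority vote among the k closest examples, and the feature values in the toy data are invented for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions on which two vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(train, test_vector, k=1):
    """Majority vote among the k training examples closest to test_vector.

    `train` is a list of (feature_vector, class_label) pairs.
    """
    ranked = sorted(train, key=lambda ex: overlap_distance(ex[0], test_vector))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set for the target phrase "linked": the features are
# invented lemma/PoS context values, the classes are L2 translations.
train = [
    (("estar", "V", "cosa", "N"), "vinculadas"),
    (("estar", "V", "necesidad", "N"), "vinculadas"),
    (("haber", "V", "", ""), "relacionado"),
]
test_vector = ("estar", "V", "necesidad", "N")
prediction = knn_classify(train, test_vector, k=1)
```

In this toy case the second training example matches the test vector on every feature, so its class is returned as the most likely translation.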

3.2 Parallel Corpus Selection

As no training corpus was given prior to developing this system, finding and processing the most suitable corpus for this task was paramount. As the purpose of this system is to help language students, the corpus needs to account for simple yet correct everyday speech.

That is why we opted to use the 70-million-sentence OpenSubtitles.org corpus compiled by the Opus Project [8], which includes many informal everyday utterances, at the expense of a less accurate translation quality (the EPPS corpus [9] was very useful in the development stages of this system) and were evaluated by means of a manually constructed 500-sentence trial corpus extracted from Boudoin College


Although the use of this training corpus yielded over 95% recall on the trial corpus, only 80% of the trial sentences had enough (>100) training examples to produce a quality translation. To solve this issue, an ad-hoc corpus compilation mechanism was created using the Linguee.com search engine: a web crawler retrieved the results from Linguee.com for each target phrase in the test (or trial) corpus used, and compiled them into a collection of files, which were then prepared following the same procedure as the OpenSubtitles.org corpus, described in the next section.

3.3 Parallel Corpus Preparation

The corpus preparation procedure consisted of several steps. The first step involved breaking the corpus into many one-million-sentence files in order to ease parallelization, and cleaning the corpus with the Moses cleaning script [11]. All the ellipses (...) were also removed, as they are used in subtitle corpora to break long sentences. Next, the corpus was tokenized and tagged using FreeLing [7], as exemplified in Table 3.2. The FreeLing tool offers two PoS taggers: a Hidden Markov Model Tagger, which is faster, and a 2- or 3-gram Relax Tagger, which can be tuned with customized constraints. Although the Relax Tagger yielded better tagging accuracy, the PoS tagging was done using FreeLing’s HMM tagger because of its speed given the size of the corpus.
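The first preparation step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the regular expression is one possible way to match ellipses, and neither the Moses cleaning script nor FreeLing is reproduced here.

```python
import re
from itertools import islice

# Matches three or more dots, optionally separated by single spaces,
# as well as the single-character ellipsis.
ELLIPSIS = re.compile(r"(?:\.\s?){3,}|…")


def strip_ellipses(line):
    """Remove ellipses and normalize the remaining whitespace."""
    return " ".join(ELLIPSIS.sub(" ", line).split())


def chunks(lines, size):
    """Yield consecutive blocks of at most `size` lines."""
    iterator = iter(lines)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


# Toy run with a chunk size of 2 instead of one million sentences.
raw = ["I thought... you were gone", ". . . and then", "Fine."]
prepared = [[strip_ellipses(s) for s in block] for block in chunks(raw, 2)]
```

Each block could then be written to its own file and processed in parallel, as the text describes.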

          0        1        2        3        4              5           6
Forms     Las      tres     cosas    están    estrechamente  vinculadas  .
Lemmas    el       3        cosa     estar    estrechamente  vincular    .
PoS Tags  DA0FP0   Z        NCFP000  VAIP3P0  RG             VMP00PF     Fp
Syn Tags  espec:1  espec:2  subj:3   co-v:7   espec:5        att:3       ?:7

Table 3.2: Tagging of the sentence Las tres cosas están estrechamente vinculadas.

The syntactic tags are a novel feature we are introducing. They are linearizations of syntactic dependency trees. These trees were built by FreeLing’s Txala Parser [12] and were introduced as individual tags in a sentence analysis by parsing the tree and mapping its leaves to their corresponding order in the source sentence. Then, each leaf’s label and parent number was extracted. For the root, the special parent tag ‘S’ was used.
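A minimal sketch of such a linearization is given below. It assumes a hypothetical tree representation (nested dictionaries with `label`, `index` and `children` keys) rather than the Txala output format, and the toy tree is hand-built for illustration, so its tags are not meant to reproduce the numbering shown in Table 3.2.

```python
def linearize(node, parent_index=None, tags=None):
    """Collect one 'label:parent' tag per token of a dependency tree."""
    if tags is None:
        tags = {}
    # The root has no parent, so it receives the special tag 'S'.
    parent = "S" if parent_index is None else str(parent_index)
    tags[node["index"]] = f'{node["label"]}:{parent}'
    for child in node.get("children", []):
        linearize(child, node["index"], tags)
    return tags


def to_tag_sequence(root):
    """Return the tags ordered by token position in the sentence."""
    tags = linearize(root)
    return [tags[i] for i in sorted(tags)]


# Hand-built toy tree for "Las tres cosas están estrechamente vinculadas".
tree = {
    "label": "co-v", "index": 3, "children": [
        {"label": "subj", "index": 2, "children": [
            {"label": "espec", "index": 0},
            {"label": "espec", "index": 1},
        ]},
        {"label": "att", "index": 5, "children": [
            {"label": "espec", "index": 4},
        ]},
    ],
}
sequence = to_tag_sequence(tree)
```

Because every tag carries the position of its parent, the tree can be stored alongside the word forms, lemmas and PoS tags as one more per-token column.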


was then combined with the tagged version of the corpus, and all the one-million-sentence length files were transformed back into one large text file, which is the resulting corpus.

Finally, a phrase index was built following an SMT phrase extraction algorithm [14], [15]. Unlike in SMT phrase tables, each phrase was extracted along with a reference to its original location in the corpus. Then, all occurrences of the same phrase were put together to form a list of corpus locations for each given phrase.
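The idea of a location-aware phrase index can be sketched as follows. This is a simplified variant of consistency-based phrase extraction, not the exact algorithm of [14] or [15]: it keeps only source-side phrases, ignores unaligned target words, and assumes a hypothetical corpus representation in which each line is a pair of source tokens and a set of (source index, target index) alignment links.

```python
from collections import defaultdict


def extract_source_spans(src_len, alignment, max_len=3):
    """Yield (start, end) source spans that have a consistent target span.

    A span is kept when it has at least one link and no target word in
    its projected target range is linked to a source word outside it.
    """
    for start in range(src_len):
        for end in range(start, min(start + max_len, src_len)):
            targets = [t for s, t in alignment if start <= s <= end]
            if not targets:
                continue
            lo, hi = min(targets), max(targets)
            if all(start <= s <= end for s, t in alignment if lo <= t <= hi):
                yield start, end


def build_phrase_index(corpus, max_len=3):
    """Map each extracted L1 phrase to the corpus lines where it occurs."""
    index = defaultdict(list)
    for line_no, (tokens, alignment) in enumerate(corpus):
        for start, end in extract_source_spans(len(tokens), alignment, max_len):
            phrase = " ".join(tokens[start:end + 1])
            if not index[phrase] or index[phrase][-1] != line_no:
                index[phrase].append(line_no)
    return dict(index)


# Toy corpus: two source sentences with invented alignment links.
corpus = [
    (["they", "are", "linked"], {(1, 0), (2, 1)}),
    (["linked", "to", "him"], {(0, 1), (2, 0)}),
]
index = build_phrase_index(corpus)
```

Sorting the keys of the resulting dictionary would give a structure that can be searched with a binary search, as described in section 3.1.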

3.4 Test Data Preparation

The test data for this task is composed of bilingual input sentences, which poses a significant challenge to syntactic parsing. In order to overcome this issue, we developed a two-stage process in which the input sentences were first processed and classified with shallow features (word forms). The resulting translated sentences were then syntactically parsed, and this information was used for the feature selection algorithm in the second stage.

3.5 Syntactic Feature Extraction

What differentiates this Machine Learning approach to translation from a dictionary approach is the role given to the context of each target phrase. Thus, translations are not only dependent on the words within each target fragment, but also on the words that surround it. The selection of these surrounding words and their relevant features into a feature vector for machine learning purposes is called “Feature Extraction” and it can be expressed by the following question: Given a large enough set of example translations for a given target phrase in an L2 context, what is the most adequate selection of relevant features in order to allow the machine learning algorithm to select the best available translation?

The WSD literature commonly makes a distinction between local and global context features [16], which implicitly follows a heuristic that assigns surrounding words more importance than the rest of the sentence in determining the correct translation. Usually, a window of words is selected around the target from which to extract the local context features, while global context features are extracted from words of the sentence deemed relevant [17].
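As an illustration of local context features, the following sketch extracts a fixed window of words around a target phrase. The function name, the padding symbol and the use of plain word forms as features are assumptions made for the example.

```python
def window_features(tokens, start, end, size=2, pad="_"):
    """Return `size` tokens on each side of the target tokens[start:end].

    Positions that fall outside the sentence are filled with `pad` so
    that every feature vector has the same length.
    """
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]
    left = [pad] * (size - len(left)) + left
    right = right + [pad] * (size - len(right))
    return left + right


# Target phrase "is coming" occupies token positions 4 and 5.
tokens = "No creo que ella is coming .".split()
features = window_features(tokens, 4, 6)
```

Here the window contains "que" and "ella" on the left and the final period plus one padding symbol on the right.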

There is a linguistic explanation as to why surrounding words play a significant role in determining the target’s translation. Often, these words have a direct dependency relation with the target. Indeed, physical closeness is an approximation of syntactic relatedness. What we propose in this paper is that the relevance of the


Unlike [18], what we propose here is using syntax as a feature selector, rather than as a feature itself.

Instead of defining a local and a global set of relevant words, we selected a single set of relevant words according to their syntactic relation to the target phrase. This set consisted of all the children of the target words, and the parents of the main target words. The main target words are the subset of words with the highest number of (nested) children within the target phrase. Table 3.3 features the rules used for selecting the relevant words.

Situation: One of the target words is a subject.
Rule:      Include any sibling which is an auxiliary or modal verb.
Example:   Our cat quiere comerse la ensalada.

Situation: The parent of one of the main target words is a coordinative conjunction.
Rule:      Include its closest sibling.
Example:   No quería ni eat, ni dormir.

Situation: The parent of one of the main target words is a relative pronoun.
Rule:      Include its grandparent.
Example:   No creo que ella is coming.

Situation: One of the target words is an attribute.
Rule:      Include any sibling which is a subject.
Example:   Mis tías están very tired.

Table 3.3: Relevant word selection rules
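The base selection (children of the target words plus parents of the main target words) can be sketched as follows. The sketch assumes a hypothetical representation in which `parents[i]` holds the position of the parent of token i (None for the root), reads "highest number of (nested) children" as the largest number of descendants, and leaves out the additional rules of Table 3.3. The toy tree is hand-built, not parser output.

```python
def children_of(parents, i):
    """Positions of the tokens whose parent is token i."""
    return [j for j, p in enumerate(parents) if p == i]


def descendant_count(parents, i):
    """Number of nested children (descendants) of token i."""
    stack, count = children_of(parents, i), 0
    while stack:
        j = stack.pop()
        count += 1
        stack.extend(children_of(parents, j))
    return count


def relevant_words(parents, target):
    """Positions of the context words selected for the target positions."""
    target = set(target)
    counts = {i: descendant_count(parents, i) for i in target}
    best = max(counts.values())
    main = {i for i in target if counts[i] == best}  # main target words
    selected = set()
    for i in target:
        selected.update(children_of(parents, i))
    for i in main:
        if parents[i] is not None:
            selected.add(parents[i])
    return sorted(selected - target)


# Hand-built tree for "Mis tías están very tired" (target: "very tired").
tokens = ["Mis", "tías", "están", "very", "tired"]
parents = [1, 2, None, 4, 2]
context = [tokens[i] for i in relevant_words(parents, [3, 4])]
```

With this toy tree only "están" is selected; the attribute rule of Table 3.3 would additionally include the sibling subject "tías".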

This Feature Extraction method uses the dependency labels as a means of selecting only relevant examples. Take for instance the example sentences in Table 3.1. Given that the target word is an attribute, the subject is included as a relevant feature, as per the last rule in Table 3.3. Any example sentence in which there is no subject as the sibling of the target word (as is the case for the first example sentence in Table 3.1) will have an empty feature, which increases its likelihood of not being


[Figure 3.1: Dependency tree for the sentence No creo que las necesidades afectivas de las personas estén necesariamente linked al matrimonio.]


Submitted System Description

We submitted three result sets for the English-Spanish language pair. Two of them were submitted for the ‘best’ evaluation type, and the other one was submitted for the ‘out-of-five’ evaluation type. The difference between these two evaluation types is that ‘out-of-five’ expects up to five different translations for every target phrase, while ‘best’ only accepts one. The evaluation metrics include accuracy and recall, and also a special word-based type of accuracy, which takes into account partially correct translations.

Of the two runs submitted in the ‘best’ evaluation type, Run1-best (see Table 4.1) used our proposed syntactic feature extraction method, while Run2-best used a regular 2-word window around the target phrase and served as a control for our method. Run1-oof, on the other hand, combined the two methods above with different k values.

4.1 Results

The test data consisted of 500 sentences written in Spanish, with target English phrases. The official results obtained by our runs are shown in Table 4.1.

Our control run, Run2-best, yielded slightly better results than our experimental run, Run1-best. This means that our method of syntactic feature extraction did not improve translation quality.

Run        Recall  Accuracy  Word Accuracy  Rank (runs)  Rank (systems)
Run1-best  0.993   0.721     0.794          5            2
Run2-best  0.993   0.733     0.809          4            2
Run1-oof   0.993   0.823     0.88           6            3

Table 4.1: Official results

4.2 Error Analysis

By analyzing our results, we detected three kinds of recurrent errors. The first kind is related to verb morphology: a single English verbal form corresponds to many Spanish verbal forms. In these cases, our system often outputs an infinitive or a past participle instead of a finite verb.

The second kind of error we detected is incomplete translation. In these cases, a single word in English has a multiword Spanish translation, but our system often outputs a single-word translation.

Errors of the third kind are related to English words with multiple possible parts of speech, such as ‘flood’, which can be a noun but also a verb. Our system tends to output nouns instead of verbs and vice versa.


Discussion

There are two main reasons as to why the syntactic feature extraction method did not work. The first reason is related to the nature of the task; the second is related to the scope of the method.

The fact that this task involved analyzing sentences partly written in two languages made syntactic analysis extremely difficult, as dependencies span the whole bilingual sentence. The best solution we found for this was to divide the operation of the system into two stages, where the first one did not involve syntactic dependencies and provided a working translation, and the second one used this first translation to perform a syntactic analysis and then rerun the classification step. This, however, favoured error propagation. Although translation quality did improve between the two stages, there were many cases in which a bad initial translation led to a bad syntactic analysis, which in turn meant a bad final translation.

A more sophisticated version of this method was initially developed for the English-Spanish language pair and involved several language-specific rules. However, we decided to make this method language-independent, so we simplified it to its current version. This simplified version uses syntactic dependencies as feature selectors, but the features themselves are regular lemma/PoS combinations, which is not always the best feature choice.


Conclusions and Further Work

Syntactic dependency relations are an important means of analyzing the internal structure of a sentence and can successfully be used to improve the feature selection process in WSD. However, syntactic parsing is far from optimal in Spanish, a fortiori if it involves sentences written in two languages. Nonetheless, we hope that this work encourages others to carry on research on the importance of syntax for any kind of natural language processing task.

We envision four different paths of improvement and further work along this line of research. First, further research on Spanish syntax and Spanish disambiguation mechanisms is badly needed. Second, a product of Statistical Machine Translation could be very useful for this kind of cross-lingual WSD, namely, language models. A language model is a statistical tool that extracts linguistic patterns from very large monolingual corpora and helps output better results in the target language when performing a translation. It is possible that such a tool, if integrated correctly, could yield considerable improvements over the current results. Third, research in word-level alignment for bilingual corpora is one of the most important aspects of cross-lingual WSD and it is often overlooked. Indeed, a very important part of the disambiguation process happens in the alignment, so better alignment methods would be extremely useful.

Finally, a better language analysis framework is required in order to improve the accuracy of language analysis tools, which are at the heart of a good disambiguation process, and of most, if not all, NLP applications. In the field of data mining, text is often thought of as “unstructured” data. However, this approach fails to recognize syntax as a very useful structure for text mining, and instead tries to find patterns through naïve statistical means. That is a mistake that needs to be corrected.


I would especially like to thank Silvia Takahashi of Universidad de los Andes, and Julia Baquero and Sergio Jiménez of Universidad Nacional de Colombia for their continued advice and support. I would also like to thank Pedro Rodríguez for his development of the Linguee crawler, María De-Arteaga, Alejandro Riveros and David Hoyos for their useful suggestions in the conception and development of this project, and the rest of the UNAL-NLP team for their interest and encouragement.

Bibliography

[1] M. van Gompel, I. Hendrickx, A. van den Bosch, E. Lefever, and V. Hoste, “SemEval-2014 task 5: L2 writing assistant,” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), (Dublin, Ireland), Aug. 2014.

[2] W. Weaver, “Translation,” in Machine translation of languages: fourteen essays (W. N. Locke and A. D. Booth, eds.), pp. 15–23, Cambridge, Massachusetts: Technology Press of the Massachusetts Institute of Technology, 1955.

[3] W. J. Hutchins, Machine Translation: past, present, future. New York, USA: Halsted Press, 1986.

[4] I. Bolshakov and A. Gelbukh, Computational Linguistics: Models, Resources, Applications. Ciencia de la Computación, México DF, México: Fondo de Cultura Económica, 2004.

[5] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press, 1999.

[6] W. Daelemans, J. Zavrel, K. Van der Sloot, and A. Van den Bosch, “TiMBL: Tilburg memory-based learner. Reference guide,” 2010.

[7] L. Padró and E. Stanilovsky, “FreeLing 3.0: Towards wider multilinguality,” in Proceedings of the Language Resources and Evaluation Conference, (Istanbul, Turkey), pp. 2473–2479, May 2012.

[8] J. Tiedemann, “Parallel data, tools and interfaces in OPUS,” in Proceedings of the Language Resources and Evaluation Conference, (Istanbul, Turkey), pp. 2214–2218, May 2012.

[9] P. Lambert, A. Gispert, R. Banchs, and J. B. Mariño, “Guidelines for word alignment evaluation and manual alignment,” Language Resources and Evaluation, vol. 39, pp. 267–285, Dec. 2005.


[10] M. Rosamond, L. Domínguez, M. Arche, F. Myles, and E. Marsden, “SPLLOC: a new database for Spanish second language acquisition research,” EUROSLA Yearbook, vol. 8, p. 287, Jan. 2008.

[11] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, and R. Zens, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, 2007.

[12] M. Lloberes, I. Castellón, and L. Padró, “Spanish FreeLing dependency grammar,” in LREC, vol. 10, pp. 693–699, 2010.

[13] F. J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[14] M. Collins, “Phrase-based translation models,” Apr. 2013.

[15] W. Ling, T. Luís, J. Graça, L. Coheur, I. Trancoso, and I. Lisboa, “Towards a general and extensible phrase-extraction algorithm,” in IWSLT’10: International Workshop on Spoken Language Translation, pp. 313–320, 2010.

[16] D. Martinez and E. Agirre, “Decision lists for English and Basque,” in The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, pp. 115–118, Association for Computational Linguistics, 2001.

[17] M. van Gompel, “UvT-WSD1: a cross-lingual word sense disambiguation system,” in Proceedings of the 5th International Workshop on Semantic Evaluation, (Uppsala, Sweden), pp. 238–241, 2010.

[18] D. Martínez, E. Agirre, and L. Màrquez, “Syntactic features for high precision word sense disambiguation,” in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pp. 1–7, Association for Computational Linguistics, 2002.