Finally, we test how many translation pairs our SelRel algorithm is able to acquire from the entire source vocabulary while keeping the reliability of the pairs (i.e., the precision of the algorithm) paramount. If we do not possess any knowledge about a given language pair, we may use only words shared across languages as lexical clues for the construction of a seed lexicon. Such a seed lexicon often has lower precision because of false friends: pairs of words or phrases in two languages or dialects that look or sound similar, but differ significantly in meaning. Examples of false friends are pane (bread)-pane for Italian-English, or kind (child)-kind for Dutch-English. For Italian-English, we have found 431 nouns shared between the two languages, of which 350 were correct translations, leading to a precision score of 0.812. For comparison, among the first 431 translation pairs retrieved by the SelRel algorithm, 427 are correct, leading to a precision of 0.9907. Some of these pairs do not share any orthographic similarity: (uccello, bird), (tastiera, keyboard), (salute, health), (terremoto, earthquake), etc.
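As a minimal sketch of how such a shared-word seed lexicon and its precision could be computed, consider the following Python fragment. The function names, the toy vocabularies and the toy gold standard are illustrative assumptions and do not come from the experiments; only the counts 350/431 and 427/431 are taken from the text.

```python
def shared_word_lexicon(source_vocab, target_vocab):
    """Pair every word form that occurs in both vocabularies with itself."""
    shared = set(source_vocab) & set(target_vocab)
    return {word: word for word in shared}


def precision(lexicon, gold):
    """Fraction of proposed pairs whose target is an accepted translation."""
    if not lexicon:
        return 0.0
    correct = sum(1 for src, tgt in lexicon.items() if tgt in gold.get(src, set()))
    return correct / len(lexicon)


# Toy data (hypothetical, for illustration only).
italian = ["pane", "radio", "idea", "uccello"]
english = ["pane", "radio", "idea", "bird"]
gold = {
    "pane": {"bread"},      # false friend: English "pane" is not a translation
    "radio": {"radio"},
    "idea": {"idea"},
    "uccello": {"bird"},
}

lexicon = shared_word_lexicon(italian, english)
print(sorted(lexicon.items()))
print(precision(lexicon, gold))   # 2 of the 3 shared words are correct

# Precision values recomputed from the counts reported in the text.
print(round(350 / 431, 3))   # 0.812
print(round(427 / 431, 4))   # 0.9907
```

The false friend pane is the only incorrect entry in the toy lexicon, which mirrors on a small scale why a shared-word lexicon loses precision.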
Besides the words shared between two languages, following Koehn and Knight [155], we have also employed simple transformation rules for the adoption of words from one language to another. The rules specific to the Italian-English translation process that have been employed are: (Rule-1) if an Italian noun ends in -ione, but not in -zione, strip the final e to obtain the corresponding English noun; if it ends in -zione, strip the suffix -zione and append -tion; (Rule-2) if a noun ends in -ia, but not in -zia or -fia, replace the suffix -ia with -y; if a noun ends in -zia, replace the suffix with -cy, and if a noun ends in -fia, replace it with -phy. Similar rules have been introduced for Dutch-English: the suffix -tie is replaced by -tion, -sie by -sion, and -teit by -ty. Finally, we have compared the results of the following automatically constructed lexicons:
(1) A lexicon containing only words shared across languages (LEX-1).
(2) A lexicon containing shared words and translation pairs found by applying the language-specific transformation rules (LEX-2).
(3) A lexicon containing only translation pairs obtained by our SelRel algorithm applied on top of the TI+Cue similarity model, keeping the pairs that score above a threshold ∆ (∆ = 0.10, following the findings from sect. 7.5.2) (LEX-SelRel).
                    Italian-English               Dutch-English
Lexicon             Correct  Precision  F0.5      Correct  Precision  F0.5
LEX-1                   350      0.812  0.188         898      0.862  0.231
LEX-2                   766      0.894  0.347        1376      0.901  0.322
LEX-SelRel              782      0.896  0.352        1106      0.956  0.278
LEX-1+LEX-SelRel       1070      0.879  0.429        1860      0.908  0.396
LEX-R+LEX-SelRel       1141      0.924  0.455        1507      0.964  0.350
LEX-2+LEX-SelRel       1429      0.893  0.510        2261      0.922  0.451
Table 7.2: A comparison of different precision-oriented bilingual lexicons for Italian-English and Dutch-English in terms of the number of correct translation pairs, precision and F0.5 scores.
(4) A combination of the lexicons LEX-1 and LEX-SelRel (LEX-1+LEX-SelRel). Non-matching duplicates are resolved by taking the translation pair from LEX-SelRel as the correct one. Note that this lexicon is still completely language-pair independent.
(5) A lexicon combining only translation pairs found by applying the language-specific transformation rules and LEX-SelRel (LEX-R+LEX-SelRel).
(6) A combination of the lexicons LEX-2 and LEX-SelRel, where non-matching duplicates are resolved by taking the translation pair from LEX-SelRel (LEX-2+LEX-SelRel).
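The suffix transformation rules and the duplicate-resolution policy used for the combined lexicons can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function names are hypothetical, and the filter that keeps a transformed word only if it occurs in the target vocabulary is an assumption that the text does not state.

```python
def italian_to_english(noun):
    """Rule-1 and Rule-2 for Italian-English; returns None if no rule applies."""
    if noun.endswith("zione"):
        return noun[:-len("zione")] + "tion"
    if noun.endswith("ione"):
        return noun[:-1]                       # strip the final e
    if noun.endswith("zia"):
        return noun[:-len("zia")] + "cy"
    if noun.endswith("fia"):
        return noun[:-len("fia")] + "phy"
    if noun.endswith("ia"):
        return noun[:-len("ia")] + "y"
    return None


def dutch_to_english(noun):
    """Suffix rules for Dutch-English; returns None if no rule applies."""
    for src_suffix, tgt_suffix in (("tie", "tion"), ("sie", "sion"), ("teit", "ty")):
        if noun.endswith(src_suffix):
            return noun[:-len(src_suffix)] + tgt_suffix
    return None


def rule_lexicon(source_nouns, target_vocab, rule):
    """Apply a rule to every noun; keep the pair only if the result is a
    known target word (ASSUMPTION: this filter is not stated in the text)."""
    target_vocab = set(target_vocab)
    lexicon = {}
    for noun in source_nouns:
        candidate = rule(noun)
        if candidate is not None and candidate in target_vocab:
            lexicon[noun] = candidate
    return lexicon


def combine(base_lexicon, selrel_lexicon):
    """Union of two lexicons keyed by source word; when both propose a
    translation for the same source word, the SelRel pair is kept."""
    combined = dict(base_lexicon)
    combined.update(selrel_lexicon)
    return combined


# Toy example (hypothetical data).
lex_1 = {"pane": "pane"}
lex_r = rule_lexicon(["nazione", "energia", "uccello", "pane"],
                     ["nation", "energy", "bird", "pane"],
                     italian_to_english)
lex_2 = combine(lex_1, lex_r)                  # no conflicts in this toy case
lex_selrel = {"pane": "bread", "uccello": "bird"}
print(combine(lex_2, lex_selrel))
```

In the toy example, the false friend pane-pane from the shared-word lexicon is overridden by the SelRel pair pane-bread, which is the behaviour described for non-matching duplicates in LEX-1+LEX-SelRel and LEX-2+LEX-SelRel.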
According to the results from tab. 7.2, we may conclude that adding translation pairs extracted by our SelRel algorithm on top of the TI+Cue similarity model has a major positive impact on both precision and coverage (e.g., for Italian-English, the F0.5 score rises from 0.188 for LEX-1 to 0.429 for LEX-1+LEX-SelRel). Obtaining results for two different language pairs suggests that the algorithm is generic and applicable to more language pairs. The previous approach relying on work from Koehn and Knight [155] has been outperformed in terms of precision and coverage. Additionally, we have shown that the addition of simple translation rules for languages sharing the same roots might lead to even better scores (LEX-2+LEX-SelRel). However, it is not always possible to rely on such knowledge, and the usefulness of the designed SelRel algorithm should really come to the fore when the algorithm is applied to more distant language pairs which do not share many words and cognates, and for which word translation rules cannot be easily established. In such cases, without any prior knowledge about the languages involved in a translation process, one is left with the linguistically unbiased LEX-1+LEX-SelRel lexicon, which also displays promising performance.
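For reference, a precision-weighted F-score of the kind reported in the F0.5 columns can be computed as sketched below. The sketch assumes the standard F-beta definition with beta = 0.5; how recall (coverage) is measured in the experiments is not restated in this section, so the inputs here are illustrative values and the function name is hypothetical.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 0.5, a precise but low-coverage lexicon scores higher than
# an imprecise but high-coverage one (illustrative values only).
print(f_beta(0.9, 0.1))
print(f_beta(0.1, 0.9))
```

This weighting explains why a lexicon can have very high precision and still obtain a modest F0.5 score when it covers only a small part of the vocabulary.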
7.6 Conclusions and Future Work
In this chapter, we have further extended our statistical framework for modeling cross-lingual semantic similarity and bilingual lexicon extraction by presenting a novel precision-oriented algorithm called SelRel, which selects only highly confident translation pairs given the ranked lists obtained by an initial similarity model. Put simply, our aim in this chapter was to continue working on research question RQ2 while also tackling research question RQ3, that is, we wanted to test whether highly confident translation pairs may be extracted from noisy and unstructured comparable data. The precision-oriented algorithm, which can be viewed as a post-processing step applied on top of the initial model of similarity, is based on two key assumptions: (1) the symmetry assumption, and (2) the one-to-one constraint. We have empirically demonstrated the utility of these assumptions, evaluated our algorithm and investigated its properties in a series of experiments. We have shown that the SelRel algorithm is able to produce highly reliable translation pairs, which is especially important when dealing with noisy environments such as comparable corpora without any other lexical clues.
In this chapter, we have presented the effect of the SelRel algorithm applied on top of the TI+Cue similarity model. However, a similar idea, that is, an adjusted version of the same algorithm underpinned by the symmetry assumption and the one-to-one constraint, might be applied to other models of similarity, as long as these models provide ranked lists of semantically similar words.
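To illustrate how the symmetry assumption and the one-to-one constraint might be applied to the ranked lists of an arbitrary similarity model, the following Python fragment gives a deliberately simplified sketch. It is not the SelRel algorithm itself: it only keeps mutual top-ranked candidates whose score exceeds a threshold, and the function name, the data layout and all scores are hypothetical.

```python
def select_confident_pairs(src_ranked, tgt_ranked, threshold=0.10):
    """Keep (source, target) pairs that are each other's top candidate.

    src_ranked: {source_word: [(target_word, score), ...]}, best first.
    tgt_ranked: {target_word: [(source_word, score), ...]}, best first.
    """
    pairs = {}
    for src, candidates in src_ranked.items():
        if not candidates:
            continue
        tgt, score = candidates[0]
        if score <= threshold:
            continue                      # not confident enough
        back = tgt_ranked.get(tgt, [])
        if not back or back[0][0] != src:
            continue                      # symmetry check fails
        # Each target has a single top candidate, so mutual top-1 matches
        # are one-to-one by construction.
        pairs[src] = tgt
    return pairs


# Toy ranked lists (hypothetical scores).
src_ranked = {
    "uccello": [("bird", 0.42), ("wing", 0.11)],
    "pane": [("bread", 0.35), ("pane", 0.05)],
    "tasto": [("keyboard", 0.20)],
    "tastiera": [("keyboard", 0.30)],
    "cosa": [("thing", 0.04)],
}
tgt_ranked = {
    "bird": [("uccello", 0.40)],
    "bread": [("pane", 0.33)],
    "keyboard": [("tastiera", 0.31), ("tasto", 0.18)],
    "thing": [("cosa", 0.05)],
}
selected = select_confident_pairs(src_ranked, tgt_ranked)
print(selected)
```

In this toy run, tasto is rejected because keyboard prefers tastiera (symmetry), and cosa is rejected because its best score does not exceed the threshold. Any model that produces such ranked lists in both directions could be plugged in.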
7.7 Related Publications
[1] I. Vulić and M.-F. Moens. “Detecting highly confident word translations from comparable corpora without any prior knowledge,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Avignon, France, 23-27 April 2012,