Recalling the example in Section 2.3.2, one of the shortcomings of phrase-based models is their inability to learn non-contiguous phrases. [Chiang, 2005] proposed a method to learn hierarchical phrases from the word alignments. In addition to learning non-contiguous phrases, this approach learns a set of synchronous grammar translation rules that can address the reordering problem in many cases. For example, from the German-English pair “Ich habe das Haus gekauft” and “I bought the house”, apart from the normal phrases, we are able to extract the rule that pairs “habe X gekauft” with “bought X”. In addition to being a good phrase pair, this rule nicely captures the reordering of X. The decoding process for synchronous grammar rules is different from the decoding process for phrase-based models (see Section 2.5). In phrase-based models we build the target sentence from left to right; here, however, rules with gaps generate words in disconnected positions in the target sentence. Therefore, a chart parsing algorithm is used to decode the sentence with synchronous grammar rules. A full description of the decoding process is provided in [Chiang, 2007].
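To make the effect of a rule with a gap concrete, the following is a minimal sketch of applying the single rule “habe X gekauft” / “bought X” to the example sentence. It is not the chart decoder of [Chiang, 2007]: the function names, the toy word lexicon used to translate the gap, and the restriction to rules of the form “left X right” are all assumptions made purely for illustration.

```python
# Toy word-for-word lexicon (an assumption for this illustration only).
LEXICON = {"Ich": "I", "das": "the", "Haus": "house"}


def translate_words(words):
    """Translate words one by one; unknown words are copied unchanged."""
    return [LEXICON.get(w, w) for w in words]


def apply_gap_rule(source, rule_src, rule_tgt):
    """Apply one rule of the form (left, "X", right) -> target symbols.

    The gap X must cover at least one source word. Returns the target
    word list, or None if the rule does not match the sentence.
    """
    left, gap_symbol, right = rule_src
    for i, word in enumerate(source):
        if word != left:
            continue
        for j in range(i + 2, len(source)):
            if source[j] != right:
                continue
            gap = translate_words(source[i + 1:j])
            middle = []
            for symbol in rule_tgt:
                # The gap symbol is replaced by the translated gap content.
                middle.extend(gap if symbol == gap_symbol else [symbol])
            prefix = translate_words(source[:i])
            suffix = translate_words(source[j + 1:])
            return prefix + middle + suffix
    return None


source = "Ich habe das Haus gekauft".split()
target = apply_gap_rule(source, ("habe", "X", "gekauft"), ("bought", "X"))
print(" ".join(target))  # I bought the house
```

The two source words “habe” and “gekauft” are not adjacent, yet the rule produces the single target word “bought” in front of the translated gap, which is exactly the reordering that a contiguous phrase pair cannot express.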
An extension to the string-based decoder is presented in [Galley and Manning, 2010] that allows discontinuous phrases, such as those explained above, to be used alongside continuous phrases without a CKY decoder. Their decoder [Cer et al., 2010] takes advantage of the better generalisation and reordering capabilities of discontinuous phrases, which enables it to outperform both phrase-based decoders such as Moses [Koehn et al., 2007] and hierarchical decoders such as Joshua [Li et al., 2009].
Summary
This chapter defined and explored the reordering phenomenon and the approaches proposed in the literature to deal with it. Local (short-distance) and long-distance reorderings were discussed, and it was argued that n-gram language models alone are not sufficient to address the problem; several other models have therefore been presented to compensate for the lack of evidence provided by the language models. Syntax-based approaches rely on their syntactic rules to perform the reorderings and produce grammatically correct output. Phrase-based approaches, on the other hand, deal with most of the local reorderings with the help of extracted phrases and rely on additional features or preprocessing steps to tackle the rest of the reordering requirements.
We gave an overview of the lexical reordering models that are effective in phrase-based SMT decoders and also discussed the hierarchical versions of these lexicalised models. We also presented some of the main syntax-based methods of SMT, which take a completely different approach to reordering and output fluency compared to the phrase-based models. We finished the chapter with a discussion of hierarchical phrase-based models and the integration of their translation model in the non-hierarchical phrase-
CHAPTER 4
Decoding by Dynamic Chunking
4.1 Introduction
Despite the success of phrase-based statistical machine translation systems, the fluency of the output, particularly for long sentences, remains one of the main challenges in current research on machine translation. Most of the errors in the MT output are caused by word-order differences between the source and the target language. In this chapter, we propose a method to guide the decoder in performing permutations and to enable the long-distance reorderings required in many language pairs. The aim of the chapter is to outline an approach that is language independent and does not need any syntax-based, language-dependent tools. The method is called dynamic chunking and is motivated by the fact that words move together and that groups of words can be translated without reorderings longer than those that can be captured by the phrase table.
We have mentioned before that, compared to word-based statistical machine translation systems, phrase-based approaches perform very well in capturing local reorderings. However, long-distance reorderings remain a serious challenge. As Knight [1999] showed, trying all the permutations is computationally intractable, and most phrase-based MT systems restrict the search space by limiting the set of reorderings that are explored during decoding. Zens et al. [2004] examine the effect of different constraints on machine translation quality.
A commonly used constraint is the distortion limit, which restricts the distance between the next phrase and the previously translated phrase. Most approaches described in the literature report a distortion limit ranging between 4 and 12 words. This limitation of course prohibits any word reordering going beyond the set limit. This might not be a problem for language pairs with similar word order such as English-French or Dutch-German [Birch et al., 2008]. A good language model or a lexicalised reordering model [Koehn et al., 2005a] will be enough to capture the word order differences in these cases. However, when translating between languages with rather different word order, for example from an SOV (subject-object-verb) language into an SVO (subject-verb-object) language, the distortion limit can severely affect the decoder’s ability to capture those word order differences correctly. When translating from German (an SOV language) into English (an SVO language), it is not unusual that more than 20 words on the source side need to be jumped over to translate the verb in the right position. Figure 4.1 shows a German sentence translated into English. The SMT decoder cannot easily skip the distance between will and erfahren to correctly translate them into wants to know. The two German words are likely to be translated separately and hence to produce non-fluent English.
DE: Der SPD-Haushaltsexperte Johannes Kahrs will von Kanzlerin Angela Merkel Einzelheiten über die Feier im Kanzleramt anlässlich des 60. Geburtstages von Deutsche-Bank-Chef Josef Ackermann erfahren .
MT: The SPD budget expert Johannes Kahrs wishes of Chancellor Angela Merkel in the Chancellery of details of the ceremony to mark the 60th Birthday of German Bank chief Josef Ackermann learned .
REF: The SPD budget expert Johannes Kahrs wants to know from Chancellor Angela Merkel the details of the ceremony in the Chancellery to mark the 60th birthday of Deutsche Bank CEO Josef Ackermann.
Figure 4.1: A German sentence that requires a long-distance reordering to correctly translate the verb. DE is the German sentence, MT is the output of the machine translation system and REF is the human translation.
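The size of the jump in Figure 4.1 can be checked directly. The sketch below counts the source words lying between will and erfahren and compares the count with the two ends of the 4 to 12 word range quoted above. It assumes whitespace tokenisation and a word-level notion of distance; the helper name words_between is hypothetical, and real decoders measure distortion over phrases and may define it differently.

```python
# Ends of the distortion-limit range reported in the literature (see text).
DISTORTION_LIMITS = (4, 12)


def words_between(tokens, first, second):
    """Number of tokens lying strictly between the first occurrences
    of two words in a tokenised sentence."""
    i = tokens.index(first)
    j = tokens.index(second)
    return abs(j - i) - 1


# The DE sentence of Figure 4.1, tokenised on whitespace (an assumption).
sentence = (
    "Der SPD-Haushaltsexperte Johannes Kahrs will von Kanzlerin Angela "
    "Merkel Einzelheiten über die Feier im Kanzleramt anlässlich des 60. "
    "Geburtstages von Deutsche-Bank-Chef Josef Ackermann erfahren ."
).split()

gap = words_between(sentence, "will", "erfahren")
print("words to jump over:", gap)
for limit in DISTORTION_LIMITS:
    print("limit", limit, "allows the jump:", gap <= limit)
```

Under these assumptions 18 words separate the two verb parts, so the jump is ruled out even at the upper end of the usual range.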
While relaxing the distortion limit accordingly may seem a possible solution to this problem, it has two severe shortcomings. Firstly, decoding time increases rapidly with more relaxed distortion limits. Secondly, a wider distortion limit also allows any reordering within that limit, which increases the level of noise and puts a higher burden on the language model to demote wrong reorderings.
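The growth of the search space can be illustrated by brute force on a short sentence. The sketch below counts the word orders that survive a simplified, word-level distortion constraint: a word at source position pos may be translated after a word at position prev only if |pos - prev - 1| does not exceed the limit, with prev = -1 before the first word. This definition and the function names are assumptions chosen for illustration; phrase-based decoders apply the limit to phrases and handle uncovered positions in more elaborate ways.

```python
from itertools import permutations


def within_limit(order, limit):
    """Check every jump in a translation order against the limit."""
    prev = -1  # position of the last translated word; -1 before the start
    for pos in order:
        if abs(pos - prev - 1) > limit:
            return False
        prev = pos
    return True


def count_orders(n, limit):
    """Count the orderings of n source words that respect the limit."""
    return sum(within_limit(order, limit) for order in permutations(range(n)))


# Exhaustive enumeration is only feasible for very short sentences.
for limit in range(0, 7):
    print("limit", limit, "admissible orders:", count_orders(6, limit))
```

A limit of zero leaves only the monotone order, while a limit at least as large as the sentence length admits all n! permutations. Every order admitted in between has to be scored by the decoder, which is the source of both the additional decoding time and the additional noise.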
In this chapter, we propose a method that enables the decoder to consider permutations which include long-distance reorderings. By grouping words and moving them together, we try to enable the decoder to consider long-distance reorderings and to avoid unnecessary short-distance permutations. In addition, our method does not rely on language-dependent parsers or chunkers and uses the word alignment information to build the chunker. In this chapter we use the term chunk for a contiguous group of words. In phrase-based SMT models, a phrase is also a span of words; however, there are several differences between a phrase and a chunk. Firstly, the purpose of chunking a sentence is to find groups of words that can be translated monotonically, whereas phrases are extracted from the word alignment data regardless of word order. Secondly, chunks may contain several phrases and are therefore designed to be longer than phrases, so multiple phrases can be translated during the translation of one chunk. Thirdly, the method of identifying chunks, presented in Section 4.3.5, is different from the phrase extraction algorithm. Finally, chunks are only used to guide the decoder in its reordering decisions and are not used for word replacement, whereas the main use of phrases is to replace the source words with target words.
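To illustrate what “translated monotonically” means for a chunk, the sketch below splits a source sentence into maximal runs of words whose aligned target positions never decrease. This is not the chunk identification method of Section 4.3.5; it is only a hypothetical illustration of the notion of a monotone span, and both the function name and the hand-made alignment (which links habe and gekauft to the same target word bought) are assumptions.

```python
def monotone_chunks(source, target_pos):
    """Split source words into maximal contiguous runs whose aligned
    target positions are non-decreasing from left to right."""
    if not source:
        return []
    chunks = [[source[0]]]
    for i in range(1, len(source)):
        if target_pos[i] < target_pos[i - 1]:
            # The target position moves backwards: start a new chunk.
            chunks.append([source[i]])
        else:
            chunks[-1].append(source[i])
    return chunks


# "Ich habe das Haus gekauft" aligned to "I bought the house".
source = "Ich habe das Haus gekauft".split()
target_pos = [0, 1, 2, 3, 1]  # hand-made alignment, one target index per word

for chunk in monotone_chunks(source, target_pos):
    print(" ".join(chunk))
```

With this alignment the first four words form one monotone span and gekauft forms a second one, so the only reordering decision left to the decoder is where to place the second span relative to the first.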
The rest of the chapter is organised as follows: Section 4.2 provides an overview of the related work addressing the issue of word reordering in statistical machine translation and the use of chunking in particular. Section 4.3 explains the proposed method. Section 4.4 discusses the experimental settings and results comparing the chunking method to a baseline. In Section 4.5 we draw some conclusions and discuss open issues. Furthermore, Section 4.5 analyses the shortcomings of the approach proposed in this chapter and suggests a few extensions and modifications to improve the quality of this approach.