
In the following, we suggest a different way to estimate the phrase translation probability. The key elements of the new translation model are the alignment templates. An alignment template $z$ is a triple $(F_1^{J'}, E_1^{I'}, \tilde{A})$, which describes the alignment $\tilde{A}$ between a source class sequence $F_1^{J'}$ and a target class sequence $E_1^{I'}$. If each word corresponds to one class, an alignment template corresponds to a bilingual phrase together with an alignment within this phrase. Figure 6.1 shows examples of alignment templates.

The alignment $\tilde{A}$ is represented as a matrix with $J' \cdot (I' + 1)$ elements and binary values. A matrix element with value 1 means that the words at the corresponding positions are aligned and the value 0 means that the words are not aligned. If a source word is not aligned to a target word, then it is aligned to the empty word $e_0$, which shall be at the imaginary position $i = 0$. This alignment representation is a generalization of the baseline alignments described in

[Brown & Della Pietra

[Figure 6.1 content: bilingual phrase pairs (English / German), each shown with its internal alignment: well I / ja ich; well I think / ja ich denke; well I think if / ja ich denke wenn; I think if / ich denke wenn; I think if we / ich denke wenn wir; think if / denke wenn; if we / wenn wir; if we can make it / wenn wir das hinkriegen; we can make it / wir das hinkriegen; on both / an beiden; on both days / an beiden Tagen; both days / beiden Tagen; I think / ich denke; think if we / denke wenn wir; make it / das hinkriegen]

Figure 6.1: Examples of alignment templates obtained in training.
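The matrix representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `AlignmentTemplate` and `make_template` are hypothetical, links are given as 1-based (target position, source position) pairs, and row 0 of the matrix is reserved for the empty word.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AlignmentTemplate:
    """Hypothetical container for a template: two class sequences plus a binary matrix."""
    source_classes: Tuple[str, ...]
    target_classes: Tuple[str, ...]
    # matrix[i][j] == 1 iff target position i is aligned to source position j.
    # Row 0 stands for the empty word, rows 1.. for the target positions.
    matrix: Tuple[Tuple[int, ...], ...]


def make_template(source_classes, target_classes, links):
    """Build the binary matrix from 1-based (i, j) links (i: target, j: source)."""
    n_src, n_tgt = len(source_classes), len(target_classes)
    rows = [[0] * n_src for _ in range(n_tgt + 1)]
    for i, j in links:
        rows[i][j - 1] = 1
    # A source position without any link is aligned to the empty word (row 0).
    for j in range(n_src):
        if not any(rows[i][j] for i in range(1, n_tgt + 1)):
            rows[0][j] = 1
    return AlignmentTemplate(tuple(source_classes), tuple(target_classes),
                             tuple(tuple(r) for r in rows))


# Toy example with one class per word: "ich denke" / "I think", monotone alignment.
template = make_template(["ich", "denke"], ["I", "think"], [(1, 1), (2, 2)])
# Toy example in which the second source position stays unaligned.
partial = make_template(["a", "b"], ["X"], [(1, 1)])
```

In the second example the unaligned source position receives a 1 in the empty-word row, so every column of the matrix contains at least one 1.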

The classes used in $F_1^{J'}$ and $E_1^{I'}$ are automatically trained bilingual classes using the method described in Chapter 7 and constitute a partition of the vocabulary of source and target language. In the following, we use the class function $C$ to map words to their classes. The use of classes instead of the words themselves has the advantage of a better generalization. For example, if there exist classes in source and target language that contain all town names, an alignment template learned using a specific town can be generalized to all town names.
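The town-name example can be made concrete with a small sketch. The class table and the class labels (`PREP`, `TOWN`) are invented for illustration; real classes would come from the bilingual clustering of Chapter 7, and the function names are hypothetical.

```python
# Hypothetical word-to-class table for the source language (illustration only).
SOURCE_CLASS = {"nach": "PREP", "Berlin": "TOWN", "Hamburg": "TOWN", "heute": "TIME"}


def class_sequence(words, table):
    """Map every word of a phrase to its class."""
    return tuple(table[w] for w in words)


def same_class_sequence(phrase_a, phrase_b, table):
    """True if both phrases are indistinguishable at the class level."""
    return class_sequence(phrase_a, table) == class_sequence(phrase_b, table)


# A template learned from "nach Berlin" is stored as a class sequence ...
learned = class_sequence(["nach", "Berlin"], SOURCE_CLASS)
# ... and therefore also matches a phrase with a different town name.
generalizes = same_class_sequence(["nach", "Berlin"], ["nach", "Hamburg"], SOURCE_CLASS)
no_match = same_class_sequence(["nach", "Berlin"], ["nach", "heute"], SOURCE_CLASS)
```

Because the template stores classes rather than words, no additional training example is needed for the second town.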

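Formally, the alignment template, denoted by the variable $z$, is introduced as a hidden variable of the phrase translation probability $p(\tilde{f} \mid \tilde{e})$:

$$p(\tilde{f} \mid \tilde{e}) = \sum_{z} p(z \mid \tilde{e}) \cdot p(\tilde{f} \mid z, \tilde{e}) \qquad (6.10)$$

Hence, we have to estimate two probabilities: the probability $p(z \mid \tilde{e})$ to apply an alignment template and the probability $p(\tilde{f} \mid z, \tilde{e})$ to use an alignment template.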

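First, we describe the model for the probability $p(\tilde{f} \mid z, \tilde{e})$. We define that an alignment template $z = (F_1^{J'}, E_1^{I'}, \tilde{A})$ is applicable to a sequence of source words $\tilde{f}$ if the alignment template classes and the classes of the source words are equal: $C(\tilde{f}) = F_1^{J'}$. The application of the alignment template constrains the target words to correspond to the target class sequence $E_1^{I'}$.

[Eqs. 6.11 and 6.12, which contain a normalization function, are garbled beyond recovery in the source text.]

$$p(\tilde{f} \mid (F_1^{J'}, E_1^{I'}, \tilde{A}), \tilde{e}) = \delta(E_1^{I'}, C(\tilde{e})) \cdot \delta(F_1^{J'}, C(\tilde{f})) \cdot p(\tilde{f} \mid \tilde{A}, \tilde{e}) \qquad (6.13)$$

To obtain a normalized phrase-based translation model in Eq. 6.12, the normalization function has to be adjusted such that the normalization constraint holds. Avoiding this renormalization, we obtain the deficient probability distribution for $p(\tilde{f} \mid z, \tilde{e})$ in Eq. 6.13.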

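The effect of this model is to obtain a smoothed version of the ‘hard’ phrase translation model designed in Chapter 5. For $p(\tilde{f} \mid \tilde{A}, \tilde{e})$, we assume a mixture alignment between the source and target language words constrained by the alignment matrix $\tilde{A}$. A simple method for structuring the single-word probability $p(f_j \mid \tilde{A}, \tilde{e})$ is the following:

$$p(f_j \mid \tilde{A}, \tilde{e}) = \sum_{i=0}^{I'} p(i \mid j; \tilde{A}) \cdot p(f_j \mid e_i) \qquad (6.14)$$

$$p(i \mid j; \tilde{A}) = \frac{\tilde{A}(i, j)}{\sum_{i'} \tilde{A}(i', j)} \qquad (6.15)$$

A disadvantage of this model is that the word order is ignored in the translation model. The translations ‘the day after tomorrow’ or ‘after the day tomorrow’ for the German word ‘übermorgen’ receive an identical contribution. Yet, the first one should obtain a significantly higher probability. Therefore, we include a dependence on the word positions in the lexicon model:

$$p(f_j \mid \tilde{A}, \tilde{e}) = \sum_{i=0}^{I'} p(i \mid j; \tilde{A}) \cdot p\Big(f_j \,\Big|\, e_i, \sum_{i'=1}^{i-1} \tilde{A}(i', j), \sum_{j'=1}^{j-1} \tilde{A}(i, j')\Big) \qquad (6.16)$$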
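The word-order problem of the position-independent model can be illustrated with a small sketch. It assumes that the single-word probability is a sum over the target positions linked to the source word, weighted uniformly over those positions; the function names are hypothetical, the lexicon values are invented, and `"NULL"` stands in for the empty word at position 0.

```python
def alignment_weights(matrix, j):
    """Weights over target positions for source position j:
    column j of the binary matrix, normalized to sum to one."""
    column = [row[j] for row in matrix]
    total = sum(column)
    if total == 0:
        raise ValueError("source position has no link, not even to the empty word")
    return [a / total for a in column]


def word_probability(f_word, target_words, matrix, j, lexicon):
    """Mixture of lexicon probabilities over the linked target positions."""
    weights = alignment_weights(matrix, j)
    return sum(w * lexicon.get((f_word, e), 0.0)
               for w, e in zip(weights, target_words))


# Toy lexicon p(f | e); the numbers are invented for illustration only.
lexicon = {("übermorgen", "the"): 0.1, ("übermorgen", "day"): 0.2,
           ("übermorgen", "after"): 0.3, ("übermorgen", "tomorrow"): 0.4}
# One source word linked to all four target words; row 0 is the empty word.
matrix = [[0], [1], [1], [1], [1]]
good = ["NULL", "the", "day", "after", "tomorrow"]
bad = ["NULL", "after", "the", "day", "tomorrow"]
p_good = word_probability("übermorgen", good, matrix, 0, lexicon)
p_bad = word_probability("übermorgen", bad, matrix, 0, lexicon)
```

Both target orders receive the same score, because the sum is invariant under a permutation of the linked target words. This is the deficiency that motivates the position-dependent lexicon.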


Figure 6.2: Dependencies within the alignment template model.

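This model distinguishes the positions within a phrasal translation. The number of parameters of the position-dependent lexicon model is significantly higher than that of $p(f \mid e)$ alone. Hence, we might run into a data estimation problem, especially for words that rarely occur. Performing a linear interpolation of both models with an interpolation parameter $\alpha_L$, we try to avoid this problem:

$$p(f_j \mid \tilde{A}, \tilde{e}) = \sum_{i=0}^{I'} p(i \mid j; \tilde{A}) \cdot \Big[ \alpha_L \cdot p\Big(f_j \,\Big|\, e_i, \sum_{i'=1}^{i-1} \tilde{A}(i', j), \sum_{j'=1}^{j-1} \tilde{A}(i, j')\Big) \qquad (6.17)$$

$$\qquad\qquad + \, (1 - \alpha_L) \cdot p(f_j \mid e_i) \Big] \qquad (6.18)$$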
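A minimal sketch of the linear interpolation, assuming that the interpolation weight multiplies the position-dependent estimate and its complement multiplies the position-independent one. The function name and the numbers are hypothetical.

```python
def interpolated_lexicon(p_positional, p_plain, weight):
    """Linear interpolation of a position-dependent and a plain lexicon estimate."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("interpolation weight must lie in [0, 1]")
    return weight * p_positional + (1.0 - weight) * p_plain


# A rare word: the position-dependent estimate was never observed (0.0),
# but the position-independent estimate still contributes probability mass.
backed_off = interpolated_lexicon(p_positional=0.0, p_plain=0.2, weight=0.7)
# A frequent word: both estimates are available and are blended.
blended = interpolated_lexicon(p_positional=0.5, p_plain=0.2, weight=0.7)
```

The first call shows why the interpolation helps with sparse data: an unseen position-dependent event no longer yields a zero probability.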

Figure 6.2 gives an overview of the decisions taken in the alignment template model. First, the source sentence words are grouped into phrases. These phrases are reordered, and for each phrase an alignment template $z$ is chosen. Then, every phrase produces its translation. Finally, the sequence of phrases constitutes the sequence of words.

6.2 Training

This section describes the methods used to train the parameters of our translation model by using a parallel training corpus:

1. We compute for each sentence in the training corpus a word alignment matrix using one of the methods described in Section 4.4.

2. We use this word alignment matrix to estimate a lexicon probability $p(f \mid e)$ by relative frequencies:

$$p(f \mid e) = \frac{n_A(f, e)}{n(e)} \qquad (6.19)$$

Here, $n_A(f, e)$ is the frequency that the word $f$ is aligned to the word $e$, and $n(e)$ is the frequency of word $e$ in the training corpus. Similarly, we estimate a position-dependent lexicon probability.

3. We determine word classes for source and target language. A naive approach for doing this would be the use of monolingually optimized word classes in source and target language. Unfortunately, we cannot expect that there is a direct correspondence between independently optimized classes. We determine correlated bilingual classes by using the method described in Chapter 7. The basic idea of this method is to apply a maximum likelihood approach to the joint probability of the parallel training corpus. The resulting optimization criterion for the bilingual word classes is similar to the one used in monolingual maximum likelihood word clustering.

4. To train the probability to apply an alignment template $p(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \mid \tilde{e})$, we use an extended version of the method phrase-extract from Chapter 5. All bilingual phrases that are consistent with the alignment are extracted together with the alignment within this bilingual phrase. Thus, we obtain a count $N(z)$ of how often an alignment template occurred in the aligned training corpus. The probability of using an alignment template is estimated by relative frequency:

$$p(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \mid \tilde{e}) = \frac{N(z) \cdot \delta(E_1^{I'}, C(\tilde{e}))}{N(C(\tilde{e}))} \qquad (6.20)$$

To reduce the memory requirement of the alignment templates, we compute these probabilities only for phrases of a certain maximal length in the source language. Depending on the size of the corpus, this maximal length is between four and seven words in the experiments.

In addition, we remove alignment templates that have a probability lower than a certain threshold. In the experiments, we use a threshold of.

5. For the alignment probabilities, we use a model that takes into account only the distance of the two phrases. Using as additional simplification a log-linear dependence on the distance, we obtain the model of Eq. 6.21. Hence, we have only one alignment parameter, which is optimized on held-out data.

In Section 6.5, we shall show how this parameter can be trained discriminatively. In addition, this model allows the development of a tight heuristic function in Section 6.4.

6. The interpolation parameter of the lexicon model is trained using parameter tuning on held-out data.
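The relative-frequency estimates of steps 2 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, word links are 0-based (source index, target index) pairs, extracted templates are hashable triples (source classes, target classes, matrix), template counts are normalized by the count of the shared target class sequence, and the pruning threshold is an arbitrary value chosen for the example.

```python
from collections import Counter


def estimate_lexicon(aligned_corpus):
    """Relative frequency: aligned count of (f, e) divided by the count of e."""
    aligned = Counter()
    target_freq = Counter()
    for source, target, links in aligned_corpus:
        target_freq.update(target)
        for j, i in links:
            aligned[(source[j], target[i])] += 1
    return {(f, e): n / target_freq[e] for (f, e), n in aligned.items()}


def estimate_templates(extracted, threshold):
    """Relative frequency of each template among the templates that share
    its target class sequence, followed by threshold pruning."""
    template_count = Counter(extracted)
    target_count = Counter(z[1] for z in extracted)
    probs = {z: n / target_count[z[1]] for z, n in template_count.items()}
    return {z: p for z, p in probs.items() if p >= threshold}


# Toy aligned corpus: (source words, target words, links).
corpus = [
    (["ich", "denke"], ["I", "think"], [(0, 0), (1, 1)]),
    (["ich"], ["I"], [(0, 0)]),
    (["mir"], ["I"], [(0, 0)]),
]
lexicon = estimate_lexicon(corpus)

# Two toy templates with the same target class sequence but different alignments.
z_a = (("C1", "C2"), ("D1", "D2"), ((0, 0), (1, 0), (0, 1)))
z_b = (("C2", "C1"), ("D1", "D2"), ((0, 0), (0, 1), (1, 0)))
kept = estimate_templates([z_a, z_a, z_b], threshold=0.5)
```

With these toy counts, the template seen twice keeps a probability of 2/3, while the one seen once falls below the example threshold and is removed.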
