[Figure 4.2 placeholder: the source sentence 那 么 为 什 么 这 样 呢 ? and its English outputs, built from the words "why", "is", "that" and "?", under the phrase-length limits slen = tlen = 1 through slen = tlen = 5.]
Figure 4.2: The translations generated with the phrase pairs from Figure 4.1 under phrase-length limitations.
length of the used phrases is about 3 words [Koehn & Och+ 03, Haffari & Roy+ 09]. This number is small compared with an average sentence length of around 10 to 20 words. Thus, a portion of the insertion and deletion errors in translations depends on the presence of unaligned words in phrases.
4.4 Deletion of the unaligned words in source sentences
On the basis of the observations in the last section, we analyze the unaligned words further. In the automatically trained alignment, an unaligned word can be classified as follows:
correct vs. wrong: A word is correctly unaligned if it has no corresponding translation even in a manual alignment. Conversely, a word is wrongly unaligned if it has any aligned target words in a manual alignment.
4 Treatment of unaligned words in word alignment
words and content words, it can be noted that the correctly unaligned words are mostly function words, while the wrongly unaligned words are usually content words. Function words carry little lexical meaning; instead, they express grammatical relations among the words in a sentence. Content words, in contrast, usually carry the meaning of a sentence.
The role of these unaligned words is not yet clear. On the one hand, they could contribute to the generation of more phrase pairs, which might increase the ambiguity of a translation. On the other hand, they tend to glue the remaining sentence components together and so help to produce a fluent translation. In order to investigate whether these unaligned words are more useful or more harmful for translation, we have applied two strategies to the unaligned function words in the source language: hard deletion and optional deletion. These methods are described in Section 4.4.2 and Section 4.4.3. Deleting unaligned words in the source language can reduce the size of the phrase table. The unaligned words in the target language are not deleted from the training data, since they are crucial for producing a complete and accurate translation.
Our next question concerns which unaligned words should be deleted. According to the analysis of the unaligned words in the training data, not all unaligned words should be deleted. We aim to delete only the “correct” unaligned words.
4.4.1 Deletion candidates
We have used two constraints to select the words that can be deleted.
We have used relative frequencies to estimate the probability of a word being aligned:
\[ p_a(w) = \frac{N_a(w)}{N(w)} \tag{4.2} \]
The number of times a word $w$ is aligned in the training data is denoted by $N_a(w)$, and $N(w)$ is the total number of occurrences of the word $w$. The first constraint selects all words whose alignment probability does not exceed a threshold $\tau$:
\[ \mathrm{Con}_p(w) = \begin{cases} 1 & \text{if } p_a(w) \le \tau \\ 0 & \text{if } p_a(w) > \tau \end{cases} \tag{4.3} \]
This constraint can be used with different thresholds. The smaller the threshold, the stricter the constraint and the fewer words are considered for deletion. A value of $p_a(w) = 0.5$ means that the word is equally likely to be aligned or unaligned.
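As an illustration of Equations 4.2 and 4.3, the following minimal Python sketch estimates the alignment probability by relative frequency and applies the threshold constraint. The function names, the data layout (one set of (source index, target index) links per sentence) and the two toy sentences with their alignments are assumptions made for this example only; they are not taken from the experiments.

```python
from collections import Counter


def alignment_probabilities(sentences, alignments):
    """Estimate p_a(w) = N_a(w) / N(w) for every source word (Equation 4.2).

    sentences:  list of tokenized source sentences
    alignments: list of sets of (source_index, target_index) links,
                one set per sentence (hypothetical data layout)
    """
    total = Counter()    # N(w): all occurrences of w
    aligned = Counter()  # N_a(w): occurrences of w with at least one link
    for tokens, links in zip(sentences, alignments):
        aligned_positions = {src for src, _ in links}
        for i, word in enumerate(tokens):
            total[word] += 1
            if i in aligned_positions:
                aligned[word] += 1
    return {word: aligned[word] / total[word] for word in total}


def con_p(p_a_of_w, tau):
    """Threshold constraint of Equation 4.3: 1 marks a deletion candidate."""
    return 1 if p_a_of_w <= tau else 0


# Invented toy corpus: two source sentences with made-up alignment links.
sentences = [["把", "机票", "忘", "在", "家里", "了"], ["把", "书", "给", "我"]]
alignments = [{(1, 0), (2, 1), (3, 2), (4, 3)}, {(0, 0), (1, 1), (2, 2), (3, 3)}]

p_a = alignment_probabilities(sentences, alignments)
candidates = {w for w, p in p_a.items() if con_p(p, tau=0.5) == 1}
```

In this toy corpus, 把 is aligned in one of its two occurrences and 了 is never aligned, so both satisfy the constraint for τ = 0.5, while every other word is always aligned and is kept.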
The second constraint uses POS tags to identify the function words. In general, the content words include nouns, verbs, adjectives, and most adverbs. We denote the POS tag set for content words as C = {noun, verb, adj, adv}. Thus, the constraint for function words is:
\[ \mathrm{Con}_f(w) = \begin{cases} 1 & \text{if } \mathrm{POS}(w) \notin C \\ 0 & \text{otherwise} \end{cases} \tag{4.4} \]
Usually, the second constraint with POS tags should be used together with the first constraint $\mathrm{Con}_p(w)$, since content and function words are not always clearly distinguishable in linguistics.
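The combination of the two constraints can be sketched as follows. This is a minimal Python illustration, assuming that each word already comes with a coarse POS tag and an estimated alignment probability; the function names and the tag label "particle" are hypothetical.

```python
# Content-word tag set C as defined in the text.
CONTENT_TAGS = {"noun", "verb", "adj", "adv"}


def con_f(pos_tag):
    """Function-word constraint (Equation 4.4): 1 if the tag is not in C."""
    return 1 if pos_tag not in CONTENT_TAGS else 0


def is_deletion_candidate(p_a_of_w, pos_tag, tau):
    """A word qualifies only if both constraints hold at the same time."""
    con_p = 1 if p_a_of_w <= tau else 0  # threshold constraint (Equation 4.3)
    return con_p == 1 and con_f(pos_tag) == 1


# Hypothetical examples: a rarely aligned particle, a rarely aligned noun,
# and a particle that is aligned most of the time.
rarely_aligned_particle = is_deletion_candidate(0.2, "particle", tau=0.5)
rarely_aligned_noun = is_deletion_candidate(0.2, "noun", tau=0.5)
mostly_aligned_particle = is_deletion_candidate(0.9, "particle", tau=0.5)
```

Only the first example qualifies: the noun is protected by the POS constraint, and the frequently aligned particle is protected by the threshold.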
4.4.2 Hard deletion
The simplest way of deletion is to remove the words that fulfill the constraints directly from the source sentences in both the training and the test data. We call this “hard deletion”. With hard deletion, the change in the alignment affects not only the phrase pairs extracted around the deleted word, but also the probability estimates of all phrases, because the alignment takes place in different contexts. The source sentences become shorter, and the phrase table becomes smaller because fewer multiple translation pairs are extracted. However, the drawback of the method is obvious. Most words are aligned in some contexts and unaligned in others. When we set τ in the constraint (Equation 4.3) to a value greater than 0 and delete the filtered words, some of the deleted occurrences would still have required a translation, which means that they were deleted wrongly.
Hard deletion is an easy method to investigate the influence of unaligned words on translation results. It can show whether phrases with unaligned words are useful or harmful in phrase-based translation systems.
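The effect of hard deletion on a single sentence pair can be sketched as below. This is a minimal Python illustration, assuming that the alignment is stored as a set of (source index, target index) links and that the deletion candidates are given as a set of word types; the function name and the example links are invented for illustration.

```python
def hard_delete(tokens, links, candidates):
    """Remove every candidate word from a source sentence.

    Alignment points of deleted positions are dropped, and the remaining
    source indices are renumbered to match the shortened sentence.
    """
    kept = [i for i, word in enumerate(tokens) if word not in candidates]
    new_index = {old: new for new, old in enumerate(kept)}
    new_tokens = [tokens[i] for i in kept]
    new_links = {(new_index[src], tgt) for src, tgt in links if src in new_index}
    return new_tokens, new_links


# Invented example: the last word carries one (possibly wrong) link.
tokens = ["把", "机票", "忘", "在", "家里", "了"]
links = {(1, 0), (2, 1), (3, 2), (4, 3), (5, 3)}
short_tokens, short_links = hard_delete(tokens, links, candidates={"把", "了"})
```

Because the deletion is decided per word type, an occurrence that does carry a link, like the last word here, is removed together with its alignment point, which is the drawback discussed above.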
4.4.3 Optional deletion
Optional deletion seems a better way to deal with the unaligned words. The training data is not changed in this method. For the test data, the source sentence with some words deleted is passed to the decoder as an additional input alongside the original source sentence. Thus, we do not make a firm decision to delete words. Instead, we preserve ambiguity and defer the decision until later stages.
In order to represent alternative inputs, we use a confusion network (CN). The use of confusion networks in machine translation has already been reported [Bertoldi & Zens+ 07, Hoang & Birch+ 07]. A confusion network is a directed acyclic graph in which each path goes through all the nodes, from the start node to the end node. Its edges are labeled with words. An example of a confusion network for optional deletion is shown in Figure 4.3.
The special empty word ε represents a word deletion. Additionally, the word alignment probability, calculated by Equation 4.2, is attached to each edge. When a given word is a content word, its alignment probability is 1.0. The score attached to ε is the probability that the word in the same column should not be aligned, which equals $1 - p_a(w)$.
word:  把 (.38)  机票 (1.0)  忘 (1.0)  在 (1.0)  家里 (1.0)  了 (.21)
ε:     ε (.62)                                               ε (.79)
gloss: -         ticket      forget    at        home        -
Figure 4.3: A confusion network example of optional deletion.
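The confusion network of Figure 4.3 can be rebuilt with a few lines of Python. This is a minimal sketch, assuming that a network is stored as a list of columns and that each column is a list of (label, score) edges; the function names and this representation are illustrative choices, not the input format of any particular decoder. The scores .38 and .21 are the ones shown in the figure.

```python
from itertools import product

EPS = "ε"  # the special empty word that marks a deletion


def build_confusion_network(tokens, p_a, candidates):
    """One column per source word; candidates get an extra ε edge.

    Non-candidates keep a single edge with score 1.0. Each candidate w
    gets the edge (w, p_a(w)) and the edge (ε, 1 - p_a(w)).
    """
    network = []
    for word in tokens:
        if word in candidates:
            p = p_a[word]
            network.append([(word, p), (EPS, 1.0 - p)])
        else:
            network.append([(word, 1.0)])
    return network


def enumerate_paths(network):
    """Yield every source sentence encoded by the network, with ε dropped."""
    for edges in product(*network):
        yield [label for label, _ in edges if label != EPS]


tokens = ["把", "机票", "忘", "在", "家里", "了"]
p_a = {"把": 0.38, "了": 0.21}  # scores as shown in Figure 4.3
network = build_confusion_network(tokens, p_a, candidates={"把", "了"})
paths = list(enumerate_paths(network))
```

With two optional columns the network encodes four alternative inputs, ranging from the original sentence to the sentence with both candidates deleted, so the decision is left to the decoder.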
Input source sentences are represented by confusion networks. As in hard deletion, the alignments are modified by removing all deletion candidates and the corresponding points in the alignment matrix. However, in order to match the possible non-deletion of the unaligned words, the original alignment is also needed. Therefore, we combine the two alignments by merging the phrase counts and recomputing the phrase probabilities.
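The merging step can be illustrated with a small Python sketch. It assumes that phrase pairs have already been extracted separately from the original and from the modified alignment, that their counts are stored as dictionaries keyed by (source phrase, target phrase), and that the phrase probability is the relative frequency p(t | s); the function name and the counts below are invented for illustration.

```python
from collections import Counter


def merge_and_normalize(counts_original, counts_deleted):
    """Add the counts of both extractions and recompute p(target | source)."""
    merged = Counter(counts_original) + Counter(counts_deleted)
    source_totals = Counter()
    for (source, _target), count in merged.items():
        source_totals[source] += count
    return {
        (source, target): count / source_totals[source]
        for (source, target), count in merged.items()
    }


# Invented counts: phrase pairs from the original alignment and from the
# alignment in which the deletion candidates were removed.
counts_original = {("忘 在", "forget at"): 2, ("把 机票", "ticket"): 1}
counts_deleted = {("忘 在", "forget at"): 1, ("忘 在", "left at"): 1, ("机票", "ticket"): 1}

phrase_probs = merge_and_normalize(counts_original, counts_deleted)
```

The merged table contains phrase pairs with and without the deletion candidates, so it can match either path through the confusion network.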