3. Chapter: Governance Practices
3.4 Institutionalised Forms
While we have previously summarised rescoring system performance trends in each language pair,10 we have not yet contrasted trends across the language pairs. In this chapter on exploring suitable oracle reranking algorithms, we rescored n-best lists using two rescoring methods (Section 3.4.2) as follows:
• RESCsum: The feature weights estimated via MERT (Minimum Error Rate Training, Och (2003)) are recomputed using the difference between mean feature values of oracle and 1-best sentences as defined in Equation (3.4);
• RESCprod: The feature weights estimated via MERT are recomputed using the ratio of mean feature values of oracle and 1-best sentences as defined in Equation (3.5).
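A minimal sketch of the two updates follows. Equations (3.4) and (3.5) are not reproduced in this section, so the exact form is an assumption: RESCsum is taken to add the difference of the mean feature values to each MERT weight, and RESCprod to multiply each weight by their ratio. The function names and the guard against a zero mean are illustrative only.

```python
from typing import List, Sequence

def mean_features(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average each feature over a set of sentences (one vector per sentence)."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def resc_sum(mert_weights, oracle_feats, one_best_feats):
    """ASSUMED additive form: w' = w + (mean_oracle - mean_1best)."""
    mo = mean_features(oracle_feats)
    mb = mean_features(one_best_feats)
    return [w + (o - b) for w, o, b in zip(mert_weights, mo, mb)]

def resc_prod(mert_weights, oracle_feats, one_best_feats):
    """ASSUMED multiplicative form: w' = w * (mean_oracle / mean_1best)."""
    mo = mean_features(oracle_feats)
    mb = mean_features(one_best_feats)
    # Assumption: a weight is left unchanged when the 1-best mean is zero.
    return [w * (o / b) if b != 0 else w
            for w, o, b in zip(mert_weights, mo, mb)]

# Toy data: two sentences, two features (values are made up).
weights = [0.5, 0.2]
oracle = [[-2.0, 4.0], [-4.0, 6.0]]    # mean: [-3.0, 5.0]
one_best = [[-4.0, 4.0], [-8.0, 6.0]]  # mean: [-6.0, 5.0]
new_sum = resc_sum(weights, oracle, one_best)
new_prod = resc_prod(weights, oracle, one_best)
```

In both variants a feature whose mean is the same for oracles and 1-best hypotheses keeps its MERT weight, so only features that separate oracles from 1-best output are adjusted.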
We identified the oracle translations using two metrics, namely sentence-level BLEU and sentence-level METEOR. Thus each of the two rescoring methods can be classified into two subtypes. This gives rise to four rescored systems in addition to a fifth, the baseline system (i.e. a translation system with no rescoring), as follows:
• BASELINE [B]: System using weights computed using MERT with no rescoring
• RESCOREDBPROD [bP]: System in which the MERT weights are recomputed on the RESCprod strategy based on oracles with respect to sentence-level BLEU score
• RESCOREDMPROD [mP]: System in which the MERT weights are recomputed on the RESCprod strategy based on oracles with respect to sentence-level METEOR score
• RESCOREDBSUM [bS]: System in which the MERT weights are recomputed on the RESCsum strategy based on oracles with respect to sentence-level BLEU score
• RESCOREDMSUM [mS]: System in which the MERT weights are recomputed on the RESCsum strategy based on oracles with respect to sentence-level METEOR score
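The four rescored systems listed above combine a rescoring method with an oracle metric, and the oracle for a sentence is the hypothesis in its n-best list that scores highest against the reference. The sketch below illustrates this under stated assumptions: `RESCORED_SYSTEMS`, `pick_oracle`, and `jaccard_overlap` are hypothetical names, and `jaccard_overlap` is a toy stand-in for a sentence-level metric, not an implementation of BLEU or METEOR.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lookup: system label -> (rescoring method, oracle metric).
RESCORED_SYSTEMS: Dict[str, Tuple[str, str]] = {
    "bP": ("RESCprod", "BLEU"),
    "mP": ("RESCprod", "METEOR"),
    "bS": ("RESCsum", "BLEU"),
    "mS": ("RESCsum", "METEOR"),
}

def jaccard_overlap(hypothesis: str, reference: str) -> float:
    """Toy sentence-level score: word-set overlap (NOT BLEU or METEOR)."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 0.0

def pick_oracle(nbest: List[str], reference: str,
                sentence_metric: Callable[[str, str], float]) -> str:
    """Return the n-best hypothesis with the highest sentence-level score.

    Ties go to the hypothesis that appears first in the list, i.e. the one
    the decoder ranked higher.
    """
    return max(nbest, key=lambda hyp: sentence_metric(hyp, reference))

# Made-up 3-best list, ordered by decoder score.
nbest = ["the cat sat", "a cat sat on the mat", "the cat sat on the mat"]
reference = "the cat sat on the mat"
oracle = pick_oracle(nbest, reference, jaccard_overlap)
```

Here the oracle is the third-ranked hypothesis, so rescoring would have to move it up two positions for the system output to match it.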
We conducted experiments on the French→English language direction to maintain continuity with the experiments in Chapter 2 on treebank-based phrase extraction.11 In order to test the language independence of our rescoring methods, we experimented on two additional languages, German and Spanish. We also experimented in the reverse direction, English→French. Thus four language directions were explored as follows:
• FR→EN: Translation system translating from French into English
• DE→EN: Translation system translating from German into English
• ES→EN: Translation system translating from Spanish into English
• EN→FR: Translation system translating from English into French
This helps us conduct contrastive analysis in two ways: (a) comparison of rescoring n-best lists when translating from English versus translating into English, and (b) comparison of rescoring n-best lists when translating from different languages (French, German, Spanish) into the same language (English) (Tables 3.46 and 3.47).
11 Note that we have scaled up from 100,000 sentence pairs to approximately 1 million sentence pairs
There is no consensus in the reranking literature on what n-best list size should be used for the translation hypotheses. To find the best n-best list size for our rescoring methods (Table 3.48), each of the five translation systems (baseline and four rescored systems) in all four language directions was rescored on seven n-best list sizes, as follows:
• 100-BEST: Rescoring MT systems where each sentence has at most 100 alternate translations
• 250-BEST: Rescoring MT systems where each sentence has at most 250 alternate translations
• 500-BEST: Rescoring MT systems where each sentence has at most 500 alternate translations
• 750-BEST: Rescoring MT systems where each sentence has at most 750 alternate translations
• 1000-BEST: Rescoring MT systems where each sentence has at most 1000 alternate translations
• 2500-BEST: Rescoring MT systems where each sentence has at most 2500 alternate translations
• 5000-BEST: Rescoring MT systems where each sentence has at most 5000 alternate translations
Note that all our experiments were performed on two translation datasets: (a) devset and (b) testset. The parameters (feature weights) for rescoring the n-best lists were trained on the devset and tested on the testset. This implies that in the course of our extensive multi-dimensional experiments, we created a total of 140 different MT systems. For each of the 4 language pairs, we created 35 translation systems (5 types of MT systems [baseline and 4 rescored systems] for each of the 7 n-best list sizes). Additionally, we evaluated system performance on 7 different evaluation metrics: BLEU, NIST, METEOR, WER, PER, OBLEU, and OMET (previously described in Section 3.7). We computed oracles using two different metrics: sentence-level BLEU and sentence-level METEOR. Hence, we have focussed on the contrastive analysis of translations as per these two metrics. This also helped us compare the two metrics across language pairs. The purpose of this section on contrastive analyses is to try to draw out some discernible patterns across the 140 MT systems.
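The size of the experimental grid can be checked by enumerating it. The following sketch simply takes the Cartesian product of the language directions, system types, and n-best list sizes named above; the variable names and the ASCII direction labels are illustrative.

```python
from itertools import product

LANGUAGE_DIRECTIONS = ["FR-EN", "DE-EN", "ES-EN", "EN-FR"]
SYSTEMS = ["B", "bP", "mP", "bS", "mS"]  # baseline plus four rescored systems
NBEST_SIZES = [100, 250, 500, 750, 1000, 2500, 5000]

# One MT system per (direction, system type, n-best list size) combination.
grid = list(product(LANGUAGE_DIRECTIONS, SYSTEMS, NBEST_SIZES))

systems_per_direction = len(SYSTEMS) * len(NBEST_SIZES)  # 5 * 7 = 35
total_systems = len(grid)                                # 4 * 35 = 140
```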
LANG PAIR   100   250      500         750     1000     2500   5000
(a) devset
EN→FR       B     B        B           B       B        B      B
FR→EN       B     B        B           B       B        B      B
DE→EN       B     B        B           B       B        B      B
ES→EN       B     B        B           B       B        B      B
(b) testset
EN→FR       mP    bP, mP   mP          bP      bP, mP   mP     B, mP
FR→EN       B     B        B           B, mP   bP       B      B
DE→EN       mP    bP       mP          mP      mP       mP     mP
ES→EN       mS    bP       B, bP, mP   B       B        B      B
Table 3.46: Summary of the best-performing translation systems across all n-best lists and all language directions as per the BLEU evaluation metric: (a) devset and (b) testset
Table 3.46 summarises the best-performing systems across all language directions (rows: English→French, French→English, German→English, and Spanish→English) in each of the seven n-best list sizes (columns: 100-best, 250-best, 500-best, 750-best, 1000-best, 2500-best, and 5000-best) for the BLEU evaluation metric. The table is divided into two sections: (a) devset and (b) testset. The abbreviations used for each of the five systems are as follows: B (BASELINE), bP (RESCOREDBPROD), mP (RESCOREDMPROD), bS (RESCOREDBSUM), and mS (RESCOREDMSUM). We have made the following observations.
• The BASELINE system is the best-performing system across all n-best list sizes on the devset as per the BLEU evaluation metric, because every rescored system either underperformed the baseline or performed similarly to it (including statistically insignificant differences). We hoped to see a consistent pattern across all four language directions, and we do: although none of the rescored systems outperformed the baseline, all four directions behave alike.
• On the testset, in contrast to the above observation, one or more of our rescored systems is the best-performing system 19 out of 28 times (68%). Note that in any evaluation campaign it is on the testset, not the devset, that competing systems are compared. Where a rescored MT system performs similarly to the BASELINE system, or improves on it by a statistically insignificant margin, multiple systems are reported in the table.
• The RESCprod method is the dominant rescoring strategy across all language pairs and n-best list sizes: 18 out of 28 times (64%). A possible reason is that the RESCsum rescoring method resembles a perceptron update and most likely requires multiple iterations to stabilise, whereas all our rescoring methods were computed in a single iteration after MERT. Note that this observation pertains to the BLEU evaluation metric alone and may not follow the pattern shown by other metrics, especially METEOR (addressed below in Table 3.47).
• There is a distinct mismatch in performance between the devset and testset, as reported in the first two observations. As the same set of feature weights was used to rescore both datasets, this may simply be down to the variable nature of the data itself and to deficiencies in the BLEU metric regarding non-n-gram-based matching between the system translation and the reference translation (Ye et al., 2007).
• The recommendation for both the EN→FR and DE→EN language directions is to always use the RESCOREDMPROD MT system, as it proved the most effective. The five competing systems differ only in their feature weights, which lead to a different ranking of the n-best lists and hence to a different set of translations and a different evaluation score. A closer inspection of these parameters revealed that the language model feature weight was significantly lower for RESCsum systems. This is the most likely reason for the distinctly lower performance of the RESCOREDMSUM and RESCOREDBSUM systems. While there were variations in the remaining 13 features12 as well, none varied as much as the language model feature.
• There are anomalous cases in both FR→EN and ES→EN where the BASELINE system starts outperforming the rescored systems as the n-best list size increases. We were not able to find a definite cause for this and further experimentation is required, but it may simply be that smaller n-best list sizes suit these language directions better. We return to this in the discussion of Table 3.48 below.
LANG PAIR   100   250   500         750   1000   2500   5000
(a) devset
EN→FR       bS    B     bS          mS    B      B      B
FR→EN       B     B     B           mS    mS     mS     B
DE→EN       bS    bS    bS          bS    bS     bS     bS
ES→EN       bS    bS    bS          bS    bS     bS     bS, mS
(b) testset
EN→FR       mP    mP    B, bP, mP   B     B      B      B
FR→EN       B     B     B           mS    mS     mS     B
DE→EN       bS    bS    bS          bS    bS     bS     bS
ES→EN       bS    bS    bS          bS    bS     bS     bS
Table 3.47: Summary of the best-performing translation systems across all n-best lists and all language directions as per the METEOR evaluation metric: (a) devset and (b) testset
Table 3.47 summarises the best-performing systems across all language pairs (English→French, French→English, German→English, and Spanish→English) for the METEOR evaluation metric. We include it because we observed in the individual language pairs that BLEU and METEOR do not agree with each other, possibly because n-gram-based BLEU lacks a recall component, whereas METEOR considers both precision and recall as well as language-dependent processing.
We have made the following observations.
• Across the devset and testset combined, one or more of our rescored systems is the best-performing system 40 out of 56 times (71%). This is similar to the proportion under the BLEU metric (68%), although that figure covers the testset only.
• The three into-English directions (FR→EN, DE→EN, and ES→EN) each present a similar pattern on the devset and the testset, as expected. The EN→FR direction is an anomaly: on the testset, the RESCOREDMPROD system outperforms all other systems barring the BASELINE system. This is perhaps because the translation is into French and METEOR scores are language-dependent.
• The RESCsum method is the dominant rescoring strategy across all language pairs and n-best list sizes: 37 out of 56 times (66%). This contrasts with the BLEU results above, and we speculate that the technical differences between METEOR and BLEU lead each metric to favour one type of rescoring over the other. Analysing any such bias will require further experimentation and comparison with oracles generated by metrics other than sentence-level BLEU and sentence-level METEOR.
• The recommendation for the ES→EN direction is to always use the RESCOREDBSUM MT system, as it proved the most effective. As before, the five competing systems differ only in their feature weights, which lead to a different ranking of the n-best lists and hence to a different set of translations and a different evaluation score. We speculate that a combination of the five translation model features is the most likely cause, as these had the greatest impact when we inspected the MERT weights.
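The proportions quoted in the observations on Tables 3.46 and 3.47 can be re-derived by tallying the table cells. The sketch below transcribes the testset half of Table 3.46 and both halves of Table 3.47 (tied systems are comma-separated within a cell) and counts the cells in which at least one system from a given group appears. The helper name `tally` and the ASCII direction labels are illustrative.

```python
from typing import Dict, List, Set, Tuple

Table = Dict[str, List[str]]  # direction -> one cell per n-best list size

BLEU_TEST: Table = {  # Table 3.46 (b)
    "EN-FR": ["mP", "bP,mP", "mP", "bP", "bP,mP", "mP", "B,mP"],
    "FR-EN": ["B", "B", "B", "B,mP", "bP", "B", "B"],
    "DE-EN": ["mP", "bP", "mP", "mP", "mP", "mP", "mP"],
    "ES-EN": ["mS", "bP", "B,bP,mP", "B", "B", "B", "B"],
}
METEOR_DEV: Table = {  # Table 3.47 (a)
    "EN-FR": ["bS", "B", "bS", "mS", "B", "B", "B"],
    "FR-EN": ["B", "B", "B", "mS", "mS", "mS", "B"],
    "DE-EN": ["bS"] * 7,
    "ES-EN": ["bS"] * 6 + ["bS,mS"],
}
METEOR_TEST: Table = {  # Table 3.47 (b)
    "EN-FR": ["mP", "mP", "B,bP,mP", "B", "B", "B", "B"],
    "FR-EN": ["B", "B", "B", "mS", "mS", "mS", "B"],
    "DE-EN": ["bS"] * 7,
    "ES-EN": ["bS"] * 7,
}

RESCORED = {"bP", "mP", "bS", "mS"}
PROD = {"bP", "mP"}
SUM = {"bS", "mS"}

def tally(tables: List[Table], wanted: Set[str]) -> Tuple[int, int]:
    """Count cells containing at least one system from `wanted`."""
    hits = total = 0
    for table in tables:
        for cells in table.values():
            for cell in cells:
                total += 1
                if set(cell.split(",")) & wanted:
                    hits += 1
    return hits, total

bleu_rescored = tally([BLEU_TEST], RESCORED)                   # (19, 28)
bleu_prod = tally([BLEU_TEST], PROD)                           # (18, 28)
meteor_rescored = tally([METEOR_DEV, METEOR_TEST], RESCORED)   # (40, 56)
meteor_sum = tally([METEOR_DEV, METEOR_TEST], SUM)             # (37, 56)
```

Note that a cell counts towards a group even when the group is tied with BASELINE, which is how the quoted figures of 19, 18, 40, and 37 are obtained.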
Table 3.48 shows which n-best list size gives the top-performing system in each language pair across all the evaluation metrics.
LANG PAIR   BLEU   NIST   METEOR   WER    PER    OBLEU   OMET
(a) devset
EN→FR       n/a    n/a    100      n/a    n/a    750     100
FR→EN       n/a    5000   2500     5000   5000   250     500
DE→EN       n/a    5000   1000     5000   5000   100     100
ES→EN       n/a    n/a    5000     n/a    750    5000    2500
(b) testset
EN→FR       250    n/a    100      250    100    100     2500
FR→EN       n/a    5000   1000     5000   5000   5000    100
DE→EN       5000   5000   500      5000   5000   100     100
ES→EN       100    n/a    500      250    n/a    n/a     500
Table 3.48: Summary of the best-performing n-best list size across all language pairs and all the evaluation metrics: (a) devset and (b) testset. An n/a entry means that none of the rescoring methods outperformed the BASELINE system, so no n-best list size applies.
We have made the following observations.
• 5000-best lists lead to the best-performing system most often across all language directions and all metrics.
• Despite the aforementioned observation, there are cases where a smaller n-best list size suffices. Discernible patterns are visible when considering a particular metric (any one column) in isolation. For example, the METEOR metric on the testset especially favours a smaller n-best list. This is most likely because, with increasing n-best list sizes, the search space grows in terms of the number of positions an oracle has to move up.
• Our recommendation for the EN→FR system in particular is to use lists smaller than 500-best, because they gave the best performance. It appears that translating into English requires a larger n-best list size than translating from English.
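The first two observations can be checked directly against Table 3.48. The sketch below transcribes the table (with `None` standing for n/a), counts how often each n-best list size appears, and extracts the METEOR column of the testset; the variable names are illustrative.

```python
from collections import Counter

METRICS = ["BLEU", "NIST", "METEOR", "WER", "PER", "OBLEU", "OMET"]
NA = None  # no rescored system outperformed the baseline

DEVSET = {
    "EN-FR": [NA, NA, 100, NA, NA, 750, 100],
    "FR-EN": [NA, 5000, 2500, 5000, 5000, 250, 500],
    "DE-EN": [NA, 5000, 1000, 5000, 5000, 100, 100],
    "ES-EN": [NA, NA, 5000, NA, 750, 5000, 2500],
}
TESTSET = {
    "EN-FR": [250, NA, 100, 250, 100, 100, 2500],
    "FR-EN": [NA, 5000, 1000, 5000, 5000, 5000, 100],
    "DE-EN": [5000, 5000, 500, 5000, 5000, 100, 100],
    "ES-EN": [100, NA, 500, 250, NA, NA, 500],
}

# How often each n-best list size is the best one, over both datasets.
size_counts = Counter(
    size
    for dataset in (DEVSET, TESTSET)
    for row in dataset.values()
    for size in row
    if size is not NA
)
most_common_size, occurrences = size_counts.most_common(1)[0]

# Best n-best list size per direction under METEOR on the testset.
meteor_col = METRICS.index("METEOR")
meteor_test_sizes = [row[meteor_col] for row in TESTSET.values()]
```

On this transcription the 5000-best list is the most frequent winner (16 cells, against 11 for the 100-best list), while no METEOR testset entry exceeds 1000-best.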