To answer this question, we will need to observe the performance of our tools when judging the quality of word order, and compare it with their performance when judging overall sentence quality. We thus build both questions into our survey in Chapter 6, before reporting our results in Section 7.3.4.
3.3 Important restrictions
While our goal is to perform as comprehensive an investigation as possible into the relevance of structure to machine translation, we are nonetheless practically unable to consider every possible facet of that relationship. In practice, our experiments are restricted by three factors: the language we use, the aspect of language which we use for comparison, and the scope of our test dataset.
3.3.1 Use of English
The first and most significant restriction to our experiments is their absolute focus on English. Because we work in an entirely English-speaking environment, without straightforward access to large numbers of native speakers of any other language, we have chosen to investigate only translation into English, although we place no restrictions on source language. Consequently, any conclusions we draw about word order will be applicable only to English.
We believe this choice to be significant, both as a benefit and a limitation. Conveniently in our case, a large number of high-quality resources exist in this language, of which some are described in Section 3.4. As such, we will be able to manipulate the text we receive using third-party tools more easily, more flexibly, and with more confidence in quality than would be possible using most other languages. This allows us to work with text which has been preprocessed in any way we feel is most useful for the algorithms we devise.
The downside to the use of only one language in our experiments is that we will be unable to judge the scope of any conclusions we draw. It is well known that word order is an important factor for comprehension of English, but it is less so in some other languages. For example, in many morphologically rich languages the information which in English is encoded in the ordering is instead indicated through cases or other word modifiers. To speakers of such languages, the concept of ‘quality of ordering’ may be different from that of native English speakers, and may be unimportant or even meaningless.
Despite this, we believe data relating only to English is far from worthless. Beyond the obvious fact that conclusions related to English are of practical use due to the ubiquity of the language across the world, we believe that the conclusions we draw will be applicable both to other languages to greater or lesser extents, and to our understanding of structure in a language-independent sense.
3.3.2 Adequacy over fluency
As briefly mentioned in Section 2.2, evaluation of machine translation is often split into the two evaluation criteria of Fluency and Adequacy. The former refers to the extent to which a sentence uses language correctly and idiomatically, as a native speaker would, while the latter indicates how much of the meaning of the source sentence can be understood from the translation.
When producing both our evaluation metrics and the human judgments with which we will evaluate those metrics, we have chosen to prioritise just one of these two criteria: adequacy. The reasons for this are several, though the lack of consideration for both factors once again limits the conclusions we can draw from our experiments.
The most significant reason for applying this limit is a practical one: we do not consider that we have the resources to investigate both fluency and adequacy separately. Doing so would arguably require separate metrics with different design decisions, dramatically increasing the time required to produce them and also the complexity of any analysis. It would also add complexity to the evaluation we passed to human participants (Chapter 6), reducing the number of sentences which could feasibly be evaluated given the resources available.
Our second reason for measuring only one of the two most popular evaluation criteria is that omitting the other does not mean that no information about it is available. The two types of assessment are inter-related, as ‘annotators have difficulty drawing any meaning from highly disfluent translations, leading them to provide low adequacy scores. Similarly, for a translation to fully express the meaning of a reference, it must also be fully, or near fully fluent’ [Denkowski and Lavie, 2010]. Thus, any conclusions we draw about adequacy alone may also suggest information about the fluency of the translation set, although the strength of such information is unknown.
We have specifically chosen adequacy over fluency because we consider it the more relevant feature. Given the arguable primary purpose of machine translation as a method of communicating a message to individuals who do not understand the source language, we consider that errors in adequacy are more harmful to this goal than those of fluency. Consider sentence pair 6 in Table 3.1, in which two words have been swapped between the two translations. While the sentence remains entirely fluent, its meaning has been dramatically changed. We intend for our metrics to detect and penalise such errors, reflecting their impact on adequacy rather than fluency.
3.3.3 Sentence-pair scores only
The final restriction we place on our metrics is to design them to produce scores only for individual sentence pairs. This is a much weaker limitation than those described above, as it does not in itself prevent us from producing meaningful evaluations which fully address the questions we have put forward in Section 3.2. It does, however, prevent us from following two common trends in machine translation evaluation: first, we do not take into account multiple reference translations for a single hypothesis; and second, we do not produce system-level scores.
A large body of research exists which suggests that evaluation using multiple reference translations can produce better results [Papineni et al., 2002; Fomicheva and Specia, 2016]. The intuition behind this is that most sentences can be translated in a variety of ways, so comparison metrics like ours run the risk of penalising certain hypotheses simply because, for example, they prioritise different aspects of the source, even if both approaches are valid.
For example, consider the Polish phrase “Zakasał rękawy”, which literally translates to “He rolled up his sleeves”. While the literal translation is a valid idiom in English, depending on context an entirely legitimate translation could equally be “He got ready to work” – which would however be considered by many automatic metrics to be a dramatically different (and thus incorrect) sentence. By providing multiple reference translations, we reduce the likelihood of such a mismatch occurring.
While sentences with multiple reference translations can provide reliability by increasing the information available when scoring an individual sentence, another approach to ensuring reasonable judgments is simply to produce scores relating to an entire system at once.
Such scores are based on the assumption that virtually any automated algorithm will produce an unintuitive or unreasonable score in some situations, as a simple consequence of the enormous complexity and diversity of natural language. Further, it is natural to assume that some sentences’ qualities will be overestimated, while others will be underestimated.
If the scores for all sentences translated by a given system are aggregated using a simple technique such as the arithmetic mean, it is hoped that such inaccuracies will be ‘smoothed out’. As a result, the ‘system-level’ score is expected to be a more reliable measure of that system’s translation quality than a measurement based on any individual sentence.
This expectation is strongly vindicated in practice: when automatic judgments are compared with humans’ using various techniques, system-level correlations of 0.7 or more are commonly reported while segment-level correlations are often close to 0.3 [Lavie and Denkowski, 2009; Fishel et al., 2012b; Stanojević and Sima’an, 2014a]. While this may in part simply be due to the sparsity of data involved in many system-level correlations, it is nevertheless a powerful trend.
An additional benefit of the use of system-level scores is their simplicity when training a system. A common procedure in machine translation is to run a metric such as BLEU, alter the translation system in some way, and then run the same metric again on the output of the updated system to determine whether the change resulted in an improvement [Och, 2003]. While scores describing the entire system are easy to use in such a situation, scores for individual sentences are not directly relevant.
Given these benefits of system-level and multiple-reference scores, why then have we limited ourselves to simple sentence pairs? The reasons are to do with the practicalities of our evaluation environment, or more specifically a consequence of the fact that we are attempting to produce judgments which are not a standard of the machine translation community. As discussed in more detail in Chapters 6 and 7, we have chosen to produce our own judgments on word ordering specifically.
Without the resources of larger evaluation environments such as the Workshops on Machine Translation [Bojar et al., 2014, 2015, 2016a], our database of judgments is limited in scope, with a total of 1783 sentences scored. While adequate for our purposes when considering sentence-level scores, the separation of such scores into individual systems, based either on the real translation tools used or on more synthetic divisions produced by, for example, bootstrap resampling [Stine, 1989], would result in ‘systems’ containing too few datapoints from which to draw reliable conclusions.
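To make the idea of synthetic divisions concrete, the following is a minimal sketch of how bootstrap resampling could form artificial ‘systems’ from a pool of sentence-level scores. The function name `bootstrap_systems`, its parameters, the system size of 178 and the placeholder scores are all assumptions made for illustration; they are not the procedure or data used in this work.

```python
import random
from statistics import fmean


def bootstrap_systems(sentence_scores, n_systems, system_size, seed=0):
    """Build synthetic 'systems' by sampling sentence scores with replacement.

    Returns one mean score per synthetic system (hypothetical helper).
    """
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        fmean(rng.choices(sentence_scores, k=system_size))
        for _ in range(n_systems)
    ]


# Placeholder values only: these are NOT the judgments collected in this work.
_rng = random.Random(1)
placeholder_scores = [_rng.random() for _ in range(1783)]

# Ten synthetic systems of 178 sentences each (both numbers are assumptions).
system_means = bootstrap_systems(placeholder_scores, n_systems=10, system_size=178)
```

Under these assumed settings, any system-level correlation would rest on only ten datapoints, which illustrates why such a division is of limited use for a dataset of this size.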
We do not use multiple reference translations for two reasons. The first is due to the source of our sentences, discussed in Section 6.1.1: the shared tasks of the Workshops on Machine Translation, while providing diverse and plentiful translations, do not incorporate the use of multiple reference translations. The second reason is an assumption that incorporating multiple reference translations in the survey we used to collect our human judgments (Chapter 6) would have overly complicated it from the point of view of our non-expert participants.
While the limitation to sentence-pair scores prevents our tools from being as broadly applicable as they might otherwise be, we do not consider it to pose a severe problem. This is primarily because support for system-level and multiple-reference scoring would, in our view, be part of the process of fine-tuning and perfecting an approach. In practice, given the significant lack of existing structural tools in word order evaluation, our own are intended primarily as proofs-of-concept, demonstrating the validity of their approach rather than aiming to be the last word in the area.
Additionally, note that the above reasons for not incorporating system-level or multi-reference scores relate to our experimental setup, not to our algorithms themselves. Indeed, should system-level scores be desired in the future, adding them is straightforward. Various techniques have been proposed in the literature, such as the logarithm-based geometric mean of BLEU (see Section 3.5.3), but one of the most common and simplest is the arithmetic mean of all sentence-level scores.
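As a rough illustration of these two aggregation styles, the sketch below computes an arithmetic mean and a logarithm-based geometric mean over a list of sentence-level scores. The function names and the `floor` parameter are hypothetical, and applying a geometric mean directly to sentence scores is our illustrative assumption rather than a description of how BLEU combines its components.

```python
import math
from statistics import fmean


def arithmetic_system_score(sentence_scores):
    """System-level score as the arithmetic mean of sentence-level scores."""
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    return fmean(sentence_scores)


def geometric_system_score(sentence_scores, floor=1e-9):
    """System-level score as a geometric mean, computed in log space.

    `floor` is an assumed safeguard: it replaces zero scores so that
    log(0) is never evaluated.
    """
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    logs = [math.log(max(score, floor)) for score in sentence_scores]
    return math.exp(fmean(logs))


example_scores = [0.25, 1.0]  # made-up values for illustration
arithmetic = arithmetic_system_score(example_scores)
geometric = geometric_system_score(example_scores)
```

The geometric mean penalises low-scoring sentences more heavily than the arithmetic mean, so the choice between them depends on how much weight isolated poor translations should carry.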
Similarly, support for multiple reference translations can be added without significantly altering the core algorithms we present. This could be done in the same manner as Meteor and others [Giménez and Màrquez, 2007], where “If more than one reference translation is available, the translation is scored against each reference independently, and the best scoring pair is used” [Lavie and Agarwal, 2007].
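The best-scoring-pair strategy quoted above can be sketched as a thin wrapper around any sentence-pair metric. In the minimal example below, `unigram_f1` is a deliberately simple stand-in metric and `best_reference_score` is a hypothetical helper name; neither is one of the metrics presented in this work.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Toy sentence-pair metric: F1 over lower-cased unigram overlap."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(hypothesis, references, metric=unigram_f1):
    """Score the hypothesis against each reference and keep the best score."""
    if not references:
        raise ValueError("need at least one reference translation")
    return max(metric(hypothesis, reference) for reference in references)


hypothesis = "He got ready to work"
single = best_reference_score(hypothesis, ["He rolled up his sleeves"])
multi = best_reference_score(
    hypothesis, ["He rolled up his sleeves", "He got ready to work"]
)
```

Because the wrapper only calls the metric once per reference, the underlying sentence-pair algorithm needs no modification; with this toy metric, adding the second reference raises the score from 0.2 to 1.0.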