Human judgements of machine translation quality have been used since the first experiments with MT. Human evaluation consists of having human participants, either monolingual or bilingual, judge the output of an MT system according to several different features. The type of evaluation depends on what is to be measured; the profile of the participants may therefore vary accordingly.
Some of the most frequently used manual metrics are ratings of fluency and adequacy. Both are generally measured on a Likert scale, where the evaluator is asked to assign a score to the translated segment. In White and O’Connell (1994, p.136), fluency evaluation assesses “intuitive native speaker senses about the well-formedness of the English output on a sentence by sentence basis”, while adequacy, compared against expert translations, measures the extent to which the meaning of the reference translations is present in the MT output. To evaluate fluency, the evaluator needs to be a fluent speaker of the target language. There is no need for the evaluator to know the source language, since fluency does not require the machine-translated sentence to be an accurate translation of the source. To judge adequacy, however, the annotator must be proficient in both the source and target languages in order to judge whether the information is preserved across translation. In adequacy evaluation setups where the MT output is compared against high-quality human translations of the source sentence, the annotator may need to be fluent only in the target language.
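The way such ratings are typically summarised can be sketched as follows. This is a minimal illustration, not the procedure of any of the studies cited: the system names, the ratings, and the helper `mean_scores` are all hypothetical, and averaging ordinal Likert scores is itself a simplifying assumption.

```python
from statistics import mean

# Hypothetical ratings: (system, segment_id, fluency, adequacy), each on a 1-5 scale.
ratings = [
    ("system_a", 1, 4, 5),
    ("system_a", 2, 3, 4),
    ("system_b", 1, 2, 3),
    ("system_b", 2, 3, 3),
]


def mean_scores(rows):
    """Return {system: (mean fluency, mean adequacy)} for 1-5 Likert ratings."""
    by_system = {}
    for system, _segment, fluency, adequacy in rows:
        # Reject anything that is not a valid point on the assumed 1-5 scale.
        for score in (fluency, adequacy):
            if not 1 <= score <= 5:
                raise ValueError(f"score {score} is outside the 1-5 scale")
        by_system.setdefault(system, []).append((fluency, adequacy))
    return {
        system: (mean(f for f, _ in pairs), mean(a for _, a in pairs))
        for system, pairs in by_system.items()
    }


scores = mean_scores(ratings)
```

Keeping fluency and adequacy as separate averages reflects the distinction drawn above: a system can produce well-formed output that does not preserve the meaning of the source, and vice versa.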
Error analysis is another common practice for evaluating MT output. It consists of the identification and classification of individual errors found in the MT output: it is “a means to assess machine translation output in qualitative terms, which can be used as a basis for the generation of error profiles for different systems” (Stymne and Ahrenberg 2012, p.1785). This type of evaluation makes it possible to identify particular strengths and problem areas of MT systems, to diagnose what went wrong, and to decide which research direction to take (Flanagan 1994; Correa 2003; Vilar et al. 2006; Llitjós 2005; Stymne et al. 2012). Error analysis has also been used to identify problematic passages in the MT output which can be fixed after the post-editing process, as well as passages that remain problematic even after post-editing (PE) is implemented (Daems, Macken and Vandepitte 2014). This approach provides rich data which can also be used to improve post-editor training.
Other frequent methods used to assess MT quality through human judgement include ranking translations, in which participants order the sentences translated by MT systems from best to worst (Callison-Burch et al. 2007) or, in some cases, assign a score to each translated sentence or segment on a pre-determined scale (LDC 2002). Reading comprehension, or comprehension tasks more generally, using the system output (Fuji 1999; Jones et al. 2005) is another method (see Section 2.6.4 on usability evaluation for a detailed description). It is important to note, however, that reading comprehension tasks are rather rare in the MT evaluation field. Additionally, measuring the amount of work required to post-edit the system output (see Section 2.2.3), such as time (Sousa, Aziz and Specia 2011) and keystrokes, has also been explored.
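One possible way of turning best-to-worst rankings into a per-system score is sketched below. It is an assumption made for illustration, not the scoring used in the cited work: each ranking is expanded into pairwise comparisons, and a system's score is the share of comparisons it wins. The data and the function name `win_rates` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical judgements: one dict per segment, mapping system -> rank (1 = best).
judgements = [
    {"system_a": 1, "system_b": 2, "system_c": 3},
    {"system_a": 2, "system_b": 1, "system_c": 3},
    {"system_a": 1, "system_b": 3, "system_c": 2},
]


def win_rates(rankings):
    """Return {system: fraction of pairwise comparisons won}; ties count for neither side."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for ranking in rankings:
        # Every pair of systems ranked for the same segment yields one comparison.
        for left, right in combinations(ranking, 2):
            comparisons[left] += 1
            comparisons[right] += 1
            if ranking[left] < ranking[right]:
                wins[left] += 1
            elif ranking[right] < ranking[left]:
                wins[right] += 1
    return {system: wins[system] / comparisons[system] for system in comparisons}


rates = win_rates(judgements)
best_system = max(rates, key=rates.get)
```

A relative score of this kind says only how systems compare with one another on the judged segments; unlike a scale-based rating, it carries no information about absolute quality.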
As mentioned previously, one of the first major projects that aimed to define human evaluation metrics⁸ was DARPA's project on MTE (White, O’Connell and O’Mara 1994). Evaluators were asked to assess automatically translated sentences according to the concepts of fluency and adequacy, assigning a score from 1 to 5. Adequacy assessment was performed by comparing the MT output against the source text (White, O’Connell and Carlson 1993) and against professionally produced human reference translations. A reading comprehension task, in which the evaluators were asked to answer questions about the text, was also part of the MTE methodology.
The Workshop on Machine Translation adopted human judgements as a primary methodology for assessing translation quality in its 2007 edition (Callison-Burch et al. 2007; 2008), whereas the first two years of the workshop had focused on automatic metrics. The evaluation process was based on the concepts of fluency and adequacy (on a scale from 1 to 5), and the methodology was premised on ranking translations relative to each other. In 2009, Callison-Burch et al. introduced the evaluation of post-edited sentences. The authors do not clarify whether qualified translators were involved in the post-editing process. The annotators were asked to post-edit the sentence to be “as fluent as possible” without seeing the reference. Following this, they were asked to judge the post-edited translations by indicating, with a yes/no annotation, whether the sentences were fluent considering the reference sentence.
Recently, crowdsourcing has become popular in the field of translation (crowdsourcing translation)⁹ and it has also been applied to the human evaluation of machine translation. Callison-Burch (2009) proposes several ways to evaluate MT output by making use of Amazon’s Mechanical Turk¹⁰, a task-based platform for crowdsourcing content. The author experiments with crowdsourcing for ranking translations from best to worst; creating multiple references by translating the source text; detecting machine-translated sentences by selecting the sentences that ‘look like’ they came from an MT system; post-editing of machine translation; judging post-edited translations by ranking those that are close to the reference translation; and reading comprehension tests consisting of (i) reading the text and creating questions about it and (ii) reading the text and answering questions about it.

⁸ The ALPAC (Automatic Language Processing Advisory Committee) report had, however, already used human ratings of intelligibility in 1966.
⁹ Crowdsourcing translation has often been used synonymously with community translation, user-generated translation and collaborative translation (O’Hagan 2011).
Crowdsourcing has also been explored in discussion forum contexts (Mitchell 2015; Mitchell, O’Brien and Roturier 2014). Mitchell, O’Brien and Roturier (2014) report three quality evaluation methods for community post-edited content. First, the authors asked members of the German Norton Community¹¹ to post-edit twelve texts taken from the English-speaking community. Afterwards, the post-edited content was evaluated for fluency and fidelity by domain specialists, an error annotation of the MT output was performed by a trained linguist, and fluency was rated by community members. The results show that the community evaluation and the evaluation performed by domain specialists produce similar results.
Crowd assessments of MT may allow evaluations on a large scale while being cost-effective; however, as Zaidan and Callison-Burch (2011, p.1221) outline, “soliciting translations from anonymous non-professionals carries a significant risk of poor translation quality”. Even though there may be professional translators in crowd communities (O’Hagan 2011), the quality of work in crowdsourcing is generally not guaranteed, since crowd workers may spend as little time as possible on the tasks or even employ someone else to do the test for them (Graham et al. 2013). In order to tackle these problems, several methodologies have been developed to filter the evaluations. One method is to compare crowdsourced evaluations with expert evaluations (Callison-Burch 2009; Goto, Lin and Ishida 2014); however, it is not clear what some authors mean by “expert”: some may refer to trained translators, whereas others may refer to computational linguists who develop machine translation systems, as is the case in Callison-Burch’s methodology (2009). Another methodology for tackling the crowdsourcing problem is that of Graham et al. (2015), who propose collecting judgements on a continuous rating scale, whereby crowd workers develop their own individual assessment strategies by assessing each translation in isolation (i.e. not comparing different translations at the same time, as is done, for example, in ranking translation tasks). According to the authors, this methodology has the advantage that agreement with the expert is no longer required, and more meaningful statistics can be computed.
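One reason continuous scores lend themselves to further statistics is that individual scoring habits can be normalised before systems are compared. The sketch below assumes a 0-100 scale and per-annotator standardisation (z-scores); both are assumptions made for illustration rather than details taken from the text above, and the data and the function name `standardized_system_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical raw scores on an assumed 0-100 continuous scale: (annotator, system, score).
raw_scores = [
    ("annotator_1", "system_a", 80), ("annotator_1", "system_b", 60),
    ("annotator_1", "system_a", 70), ("annotator_1", "system_b", 50),
    ("annotator_2", "system_a", 40), ("annotator_2", "system_b", 20),
    ("annotator_2", "system_a", 30), ("annotator_2", "system_b", 10),
]


def standardized_system_scores(rows):
    """Average per-annotator z-scores for each system."""
    by_annotator = defaultdict(list)
    for annotator, _system, score in rows:
        by_annotator[annotator].append(score)

    by_system = defaultdict(list)
    for annotator, system, score in rows:
        own_scores = by_annotator[annotator]
        spread = pstdev(own_scores)
        if spread == 0:
            continue  # an annotator who gives one constant score cannot be standardised
        by_system[system].append((score - mean(own_scores)) / spread)
    return {system: mean(z_scores) for system, z_scores in by_system.items()}


system_z = standardized_system_scores(raw_scores)
```

In this toy data the second annotator scores everything 40 points lower than the first, yet after standardisation both contribute identical evidence about the two systems.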
As seen from the above, there are several ways to apply human evaluation to MT. The main advantage of human evaluation is that it can assess deep linguistic information that provides reliable insight for error analysis, which, in turn, can help to understand the actual linguistic strengths and weaknesses of an MT system (Gaspari et al. 2014). However, human evaluation can be very subjective, suffering from disagreement when annotators are not well trained for the task (Callison-Burch et al. 2011; Bojar et al. 2011); it may also be time-consuming (depending on the scale of the task) and expensive.
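Disagreement between annotators can be quantified. A common choice, assumed here for illustration rather than taken from the studies cited above, is Cohen's kappa, which corrects the observed agreement between two annotators for the agreement expected by chance. The labels below are hypothetical 1-5 ratings of the same six segments.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings from two annotators for the same six segments.
annotator_1 = [5, 4, 4, 2, 3, 5]
annotator_2 = [5, 4, 3, 2, 3, 4]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa treats the ratings as unordered categories, so a disagreement of one scale point counts as much as a disagreement of four; a weighted variant would be needed to take the ordering of the scale into account.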