
Evaluation in the CAT scenario adds a number of aspects to (machine) translation evaluation, which is mostly confined to measuring quality by string- or semantic-equivalence metrics.

After translation quality, the most important quantity to measure in the translation process is time, since it directly or indirectly affects the cost of translation in the professional context. In terms of quality, a further aspect comes into play: since reference translations are naturally unavailable in a realistic setting, automatic quality estimation techniques [Specia et al., 2009; O’Brien, 2005] are needed if one does not want to carry out a manual evaluation. However, in arranged user studies, a ground truth is often attainable, either through independently generated reference translations and/or through (bilingual) human evaluators who assess the correctness of the produced translations.

There is also a more translator-oriented perspective on evaluation in CAT: trying to measure the technical and cognitive effort the translator has to invest in the translation process [Krings, 1997]. While technical effort can be readily measured by counting the number of observed user actions, such as keystrokes and mouse movements, measuring cognitive effort is more complex. It can be done passively, e.g. by making use of eye-tracking devices [Sekino, 2015; Sharmin et al., 2008], or by using the so-called read-aloud protocol [Krings, 1997], in which the translator explains their thoughts on the translation, to gain insights into the cognitive process that takes place while translating. Time is, however, a good indicator of cognitive effort [Koponen et al., 2012] (but not necessarily of technical effort), as are the number and durations of pauses [Lacruz et al., 2014, 2012]. See [Lacruz et al., 2014] for an overview of pause-related measures for assessing cognitive effort.
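As a rough illustration of pause-based measures, the following sketch counts pauses in a log of user-action timestamps and sums their durations. The function name `pause_stats`, the input format (ascending timestamps in seconds), and the default threshold of one second are assumptions made for this example; they are not taken from the cited work, and a suitable threshold would have to be chosen per study.

```python
from typing import List, Tuple


def pause_stats(timestamps: List[float], threshold: float = 1.0) -> Tuple[int, float]:
    """Return (number of pauses, total pause duration in seconds).

    timestamps: times of recorded user actions in seconds, in ascending order.
    threshold:  minimum gap between two consecutive actions that is counted
                as a pause (an assumed value, not a recommended one).
    """
    # Gaps between each action and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    # Only gaps at or above the threshold are treated as pauses.
    pauses = [gap for gap in gaps if gap >= threshold]
    return len(pauses), sum(pauses)
```

For instance, `pause_stats([0.0, 0.5, 3.0, 3.5, 6.0])` finds two gaps of 2.5 seconds each, which exceed the assumed threshold, and ignores the two shorter gaps.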

For use in our own work, we first discuss automatic evaluation of quality in CAT, then discuss how efficiency and speed can be measured.

5.5.1 Measuring Speed

Time can be easily measured in CAT by including a timer which is active during translation. Different approaches can be followed for normalization, since raw time is not a comparable quantity. Another aspect is how to handle pauses that occur in the translation process.

¹⁵ For suffix-array-based rule extraction for phrase-based models [Germann, 2015], efficient implementations have been described [Bertoldi et al., 2017].

Normalization is preferably done using the length of the final translation or post-edit, since source lengths can differ considerably depending on the language. Normalizing by the characters or words actually produced (or confirmed) is a realistic approach. In our work we normalize time by target characters (excluding spaces).

The total post-editing time can also be divided into a number of distinct phases, i.e. assessment, editing and reading time. See [Pinnis et al., 2016] for a more fine-grained analysis.

Another common approach to measure speed is throughput, which can be measured as words per hour, minute or working day.
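The two speed measures described above, time normalized by target characters and throughput, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, "excluding spaces" is interpreted as excluding all whitespace characters, and words are obtained by whitespace tokenization.

```python
def target_length(text: str) -> int:
    """Number of characters in the target, ignoring all whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def seconds_per_char(seconds: float, target: str) -> float:
    """Translation or post-editing time normalized by target characters."""
    n_chars = target_length(target)
    if n_chars == 0:
        raise ValueError("target contains no non-whitespace characters")
    return seconds / n_chars


def words_per_hour(seconds: float, target: str) -> float:
    """Throughput in whitespace-separated target words per hour."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return len(target.split()) * 3600.0 / seconds
```

With a segment that took 12 seconds and the final translation "Guten Tag", `seconds_per_char` gives 1.5 seconds per character (8 characters) and `words_per_hour` gives 600 words per hour.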

Intuitively, post-editing time is positively correlated with source segment length [Popovic et al., 2014; Zaretskaya et al., 2016; Koponen, 2012].

5.5.2 Measuring Effort

In CAT, effort refers to any work that is done during the creation of a translation. It is an ambiguous term, since it can refer either to technical effort, which is the actual amount (or an estimate) of the practical work that is done to create the final translation, or to cognitive effort, which can be described as the amount of mental processing that goes into the translation process. Cognitive effort can, however, be approximated by time measurements, see e.g. [Popovic et al., 2014].

Most commonly, technical effort is measured as a string edit distance, e.g. in post-editing between the initial translation hypothesis and the final post-edit. TER or human-targeted TER (HTER) as presented by [Snover et al., 2006] is a standard approach. However, since the original procedure is costly to carry out¹⁶, the most common method is to simply calculate TER of the MT output against a single post-edit which was created from that same MT output. The same approach can be taken with the BLEU score, or sentence-wise BLEU for per-sentence measurements.

The previously described methods only measure technical effort indirectly, since they only consider the minimal amount of edits needed to arrive at the final translation.
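To make the edit-distance view concrete, the sketch below computes a word-level Levenshtein distance between an MT output and a single post-edit, normalized by the post-edit length. It is a simplified stand-in and not an implementation of TER: TER additionally allows block shifts, which are omitted here, so this value can be higher than the TER score. The function names and the whitespace tokenization are assumptions for illustration.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Minimal number of word insertions, deletions and substitutions
    needed to turn hyp into ref (no shift operation, unlike TER)."""
    # prev[j] holds the distance between the hypothesis prefix processed
    # so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output: str, post_edit: str) -> float:
    """Edit distance of the MT output against its post-edit, divided by
    the number of words in the post-edit."""
    hyp, ref = mt_output.split(), post_edit.split()
    if not ref:
        raise ValueError("post-edit must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)
```

For the MT output "the cat sat on mat" and the post-edit "the cat sat on the mat", a single insertion suffices, so the score is 1/6. As noted above, this reflects only the minimal edits, not the actions the translator actually performed.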

In a user study (or by simulation with certain assumptions) a direct approach can be taken by recording user actions, e.g. keystrokes and mouse actions. Barrachina et al. [2009] propose normalization by the number of characters in the post-edit, resulting in three metrics: keystroke ratio (KSR¹⁷), mouse-action ratio (MAR), and keystroke and mouse-action ratio (KSMR), which is just KSR + MAR.
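Given logged action counts for a segment, the three ratios can be computed as in the following sketch. The function name and the decision to count every character of the post-edit, including spaces, are assumptions of this example; the exact counting conventions should be taken from Barrachina et al. [2009].

```python
from typing import Dict


def interaction_ratios(keystrokes: int, mouse_actions: int, post_edit: str) -> Dict[str, float]:
    """Keystroke ratio, mouse-action ratio and their sum, each normalized
    by the number of characters in the post-edit."""
    n_chars = len(post_edit)  # assumption: all characters count, spaces included
    if n_chars == 0:
        raise ValueError("post-edit must not be empty")
    ksr = keystrokes / n_chars
    mar = mouse_actions / n_chars
    return {"KSR": ksr, "MAR": mar, "KSMR": ksr + mar}
```

For example, 10 keystrokes and 5 mouse actions on a 20-character post-edit yield KSR = 0.5, MAR = 0.25 and KSMR = 0.75.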

¹⁶ It involves an independent reference translation, as well as a number of different, independently generated post-edits.

¹⁷ Koehn and Germann [2014] present another method to estimate technical effort: character or word provenance, which can measure which characters or words actually had to be typed and which ones were automatically proposed.

Finally, O’Brien [2011] proposes to measure cognitive effort through observation of translators using eye tracking. Measurements of fixation time and counts can give an estimate of the effort involved.

5.5.3 Measuring Quality

Final translation quality in CAT can be measured straightforwardly if independently created and validated reference translations are available. If they are not, one has to resort to manual human judgments or to automatic quality estimation techniques.
