
The definition of evaluation methods and evaluation metrics for translation environment tools has been the focus of several papers and works, which serve as a basis for the methodology of this thesis.14

2.3.1 Modeling the evaluation

The tests performed within the scope of this study were conceived using the black box testing approach: the M(A)T system is seen as a black box whose operation is treated purely in terms of its input-output behaviour, without regard for its internal operation, (Trujillo, 1999, 256). Black box testing is suitable for end-users, see (Quah, 2006, 138), and is applied in most publicly available evaluations, e.g. (Seewald-Heeg, 2007, 562). Although the evaluation is conducted from a translator's point of view, the results can also be of interest for developers of translation environment tools.

In particular for match values, if the settings governing them were not accessible/visible, the tests were constructed to reveal the strategies and rules used by these systems, (Way, 2010a, 555), which is typical of reverse engineering. In software engineering, reverse engineering involves examining and analyzing software systems in order to recover information, particularly functional specifications (adapted from (Wills, 1996, 7)).

The tests are task-oriented and examine whether and the extent to which a piece of software offers functions to perform specific tasks, (Höge, 2002, 138). In accordance with the peculiarities of task-oriented testing, some tasks were repeated when problems were encountered or different settings had to be tested.

Additionally, taking a closer look at some of the features of the TM systems is typical of feature inspection, defined as describing the technical features of a system in detail. The purpose is to allow an end-user to compare the system with other systems of similar kind, (Quah, 2006, 144).

As these tests compare several TM systems, the use of different methods (task-oriented testing and feature inspection) was considered appropriate, see (White, 2003, 235).

14 Examples include: Commission of the EU (1996), Reinke (1999), Whyman and Somers (1999), Rico (2001), Höge (2002), Gow (2003), Reinke (2004), Massion (2005), Seewald-Heeg (2005) and Lagoudaki (2008).

2.3.2 Test procedures

The test procedures differ depending on the main objective for the placeable and localizable element group: recognition, see 2.2.1, or retrieval, see 2.2.2. This section is therefore organized in two corresponding subsections (2.3.2.1 and 2.3.2.2, respectively). The corollary objectives, see 2.2.3, are evaluated in both cases along the way.

2.3.2.1 Recognition

A segment containing one or more relevant placeable and localizable elements was opened for editing in the editor of the TM system. With some TM systems, the recognized elements were highlighted and recognition could be assessed directly. For other TM systems, a workaround was sometimes necessary, see 3.2.2 and 4.2.2 for further details. It was not necessary to provide any translation because recognition assessment only needs the source language.

2.3.2.2 Retrieval and automatic adaptations

To assess retrieval and automatic adaptations, a first segment was translated and saved in the translation memory. The subsequent segments that differed from the first one were opened in order to ascertain the proposed match value, but not translated. Therefore, the match value was calculated and the automatic adaptation applied with respect to the first segment.

2.3.3 Metrics

No global evaluation metric or scoring system was developed because the test objectives concentrated on the identification of the problems related to placeable and localizable elements irrespective of the TM system. If a global ranking were required, it would be necessary to assign scores to all tested features and to develop a weighting system. In fact, it would be first necessary to check whether and how these features can be measured, see Höge (2002) and Gow (2003). Throughout this work, several features are discussed that are difficult to express as one single value, e.g. the display of fields in chapter 10. Which display suits the user's way of working best is largely a matter of preference. A global metric would be of little help and the weighting of the different components (recognition, retrieval, display, etc.) would be extremely arbitrary. Specific metrics and criteria, however, were applied in order to better assess recognition and retrieval performance and can be integrated into existing frameworks for future evaluations.15

15 Evaluation frameworks are defined as general guidelines or procedures designed [...] as the basis for

2.3.3.1 Recognition assessment

In chapters 3, 4 and 5, specific metrics were necessary to interpret recognition results more easily.16 In order to better understand those metrics, the basic measures of precision and recall are introduced. Precision and recall are typical evaluation parameters in information retrieval.

Precision is defined as the ratio of relevant items retrieved over all the items retrieved, (Trujillo, 1999, 63).

\[
\text{precision} = \frac{tp}{tp + fp} \tag{2.1}
\]

tp stands for true positives, fp for false positives.

On the other hand, recall is the proportion of the target items that the system selected, (Manning and Schütze, 1999, 269). (Trujillo, 1999, 63) adapts the definition of recall to TM systems as follows: ratio of the relevant items retrieved over all relevant items in TM.

\[
\text{recall} = \frac{tp}{tp + fn} \tag{2.2}
\]

tp stands for true positives, fn for false negatives.

When recognizing e.g. numbers, TM systems select tokens17 that are deemed to be numbers. True positives are correctly recognized numbers, whilst false positives are recognized elements that are not numbers. Precision assesses whether incorrect recognition is common: a score of 1 means no errors. However, TM systems can miss numbers and produce false negatives, which are not accounted for in the precision result. This is the reason why recall is used: again, a score of 1 means no errors.
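As an illustration of definitions 2.1 and 2.2, the following minimal Python sketch counts true positives, false positives and false negatives for number recognition in a single segment. The segment, the token sets and the function names are hypothetical and are not taken from the test suite; the sketch also assumes that each token occurs only once in the segment, so that plain sets are sufficient.

```python
def recognition_counts(gold, recognized):
    """Return (tp, fp, fn) for one segment.

    gold       -- tokens that really are numbers (reference annotation)
    recognized -- tokens the TM system marked as numbers
    Assumption: tokens are unique within the segment, so sets suffice.
    """
    gold, recognized = set(gold), set(recognized)
    tp = len(gold & recognized)   # correctly recognized numbers
    fp = len(recognized - gold)   # recognized, but not numbers
    fn = len(gold - recognized)   # numbers the system missed
    return tp, fp, fn


def precision(tp, fp):
    """tp / (tp + fp); returns 0.0 when nothing was recognized (a convention assumed here)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """tp / (tp + fn); returns 0.0 when there was nothing to recognize (a convention assumed here)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical segment: "Install 3 modules on port COM2 within 45 seconds"
gold = {"3", "45"}            # illustrative reference annotation
recognized = {"3", "COM2"}    # illustrative system output

tp, fp, fn = recognition_counts(gold, recognized)
print(tp, fp, fn)                           # 1 1 1
print(precision(tp, fp), recall(tp, fn))    # 0.5 0.5
```

In this invented case the system makes one recognition error ("COM2") and misses one number ("45"), so neither measure reaches 1.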

Generally speaking, precision and recall tend to be inversely correlated (as later findings will confirm), so that a trade-off is inevitable if the values of both measures have to be approximately equal, see (Manning and Schütze, 1999, 269). In order to obtain a measure of the overall performance, and assuming equal weighting18 of precision and recall, the F measure is used:

\[
F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{2.3}
\]
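The equally weighted F measure of equation 2.3 can be sketched in a few lines of Python. The function name and the input values below are illustrative assumptions only; returning 0.0 when both measures are 0 is a convention chosen here to avoid a division by zero.

```python
def f_measure(precision, recall):
    """Equally weighted F measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here; the ratio is undefined in this case
    return 2 * precision * recall / (precision + recall)


# Illustrative values, not results from this study
print(f_measure(0.5, 0.5))             # 0.5
print(round(f_measure(1.0, 0.5), 3))   # 0.667
```

Because the F measure is a harmonic mean, it stays closer to the lower of the two values than an arithmetic mean would: with a precision of 1.0 and a recall of 0.5, the result is about 0.667 rather than 0.75.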

Can these measures be applied to test results? For the purposes of this study, a test suite was built (see 2.4.1.4) so that, from a methodological point of view, screening of the input had already taken place. Consequently, precision could not be reliably assessed, as false positives were underrepresented in the selected examples. More telling is the recall measure, as the examples of the test suite were intentionally selected in order to generate true positives.

16 Similar metrics could have been used in chapters 6 and 7, but this was not necessary because the results are self-explanatory.

17 A token is an instance of a sequence of characters [...] that are grouped together as a useful semantic unit for processing, (Manning et al., 2009, 22).

The evaluation of the results suggests that a graduated score is necessary in order to accurately assess recognition. The details of this score are described in 3.2.2, 4.2.2, and 5.2.2, respectively. A discussion of the specific ranking derived from these metrics is provided in 13.2.4.

2.3.3.2 Retrieval assessment

Chapters 8 to 12 concentrate on retrieval performance. In order to assess and compare the results provided by TM systems, two criteria were adopted: the number of retrieval errors and the number of automatic adaptations. Relative figures are presented in 13.3.3 and 13.3.4, which also detail the rules used to quantify them, while 13.3.5 provides the resulting specific ranking.
