The three types of models we experimented with in Chapters 3-4 capture different elements of coherence. While we expect their scores to be complementary, some MT systems may well do better at some aspects and not so well at others, and so will score differently under the separate models. We illustrate how the different systems from the WMT14 submissions (Bojar et al., 2014) score under the different types of models, including scores for the Entity-graph (the best performing entity model), our IBM1-Syntax model (the best performing syntax model, albeit not actually measuring intent, as originally foreseen), and Dis-Score (our crosslingual discourse metric). The differences are illustrated in Figure 7.1, where we visualize the scores per model of each system, summing over the 175 documents in the WMT14 submissions test set, with all scores scaled to fall between 0 and 1 (inclusive).
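The aggregation and scaling step can be sketched as follows. This is a minimal illustration that assumes min-max scaling across systems; the text only states that scores are summed over documents and scaled to fall between 0 and 1, so the exact normalisation is an assumption, as are the function names `min_max_scale` and `aggregate_and_scale`. The example values are the Entity column of Table 7.1, treated here as already-aggregated system-level scores.

```python
def min_max_scale(scores):
    """Map scores linearly onto [0, 1]; the lowest becomes 0, the highest 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All systems tie: there is no spread to scale, so return zeros by convention.
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def aggregate_and_scale(per_document_scores):
    """Sum each system's per-document scores, then scale the totals across systems.

    per_document_scores: dict mapping a system name to a list of document scores.
    """
    systems = list(per_document_scores)
    totals = [sum(per_document_scores[name]) for name in systems]
    return dict(zip(systems, min_max_scale(totals)))


# Entity column of Table 7.1, assumed (for illustration only) to be the
# per-system totals that feed the scaling step.
entity_totals = {
    "uedin": 238.97,
    "stanford": 231.05,
    "kit": 236.07,
    "online-b": 236.57,
    "online-a": 234.66,
    "rbmt1": 259.65,
    "rbmt4": 261.40,
    "online-c": 264.41,
}
scaled_entity = dict(
    zip(entity_totals, min_max_scale(list(entity_totals.values())))
)
```

Under this assumed scaling, the lowest-scoring system maps to 0 and the highest to 1, which matches the "inclusive" range stated above; other normalisations (for example dividing by the maximum) would give a different picture for negative-valued scores such as the IBM1-Syntax column.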

As the raw scores from the coherence models show (Table 7.1), the models clearly measure different aspects and produce different rankings of the submitted system outputs. Some systems will handle lexical coherence better than others, while some may better capture and transfer discourse connectives.

From the evaluation metric submissions for WMT14 (Macháček and Bojar, 2014), we also display the system-level scores of DiscoTK-party and REDSys by way of comparison (Table 7.1), as the two metrics with the best correlation to human rankings for that language pair (fr-en), and therefore judged the best performing. There is variation over the score rankings, particularly for the Entity-graph and Dis-Score, but less so for our IBM1-Syntax model. The most obvious observation from these results is that the rule-based systems (rbmt1 and rbmt4) score higher under our entity and crosslingual discourse relational models (moving from the lower half to the upper half of the scoreboard), which may well be because these systems work at a higher level than the other systems. We see this as an indication that some of the strengths of these more linguistic models may have been overlooked by current methods of evaluation. We further illustrate the differences by visualizing the scores under the different models, shown in the kernel density plots in Figures 7.2-7.4. In these plots we see how the distribution over scores varies from model to model for the eight fr-en WMT system submissions. While there is variation, and the leading systems can be identified, the profiles are remarkably similar.

Figure 7.1: Comparative scores for the WMT system submissions under our different coherence models.

Correlation with human judgements and current metrics. Our models only measure aspects of coherence, and are insufficient as a standalone metric, given that other issues need to be evaluated to judge the accuracy, fluency and grammatical correctness of a translated document. Moreover, the human judgements at WMT are not themselves at document level, and so are not directly comparable. Human evaluation at WMT is on a window of a couple of source sentences, with no target translation context, and therefore does not give credit to models which overall may have a more consistent or coherent output at document level (we continue this point in Section 7.5).

Figure 7.2: Distribution of scores for the WMT system submissions under our Entity Graph metric.

Figure 7.3: Distribution of scores for the WMT system submissions under our Dis-Score metric.

Figure 7.4: Distribution of scores for the WMT system submissions under our IBM1-Syntax model.

We report the correlations with human judgements in Table 7.2. As mentioned previously, we sum over the 175 documents in the WMT14 submissions test set, and scale all scores to fall between 0 and 1 (inclusive). We cannot expect a high correlation between our models alone and human rankings, because we are only capturing certain aspects of coherence, and not other measures of adequacy and correctness. Moreover, the human assessors have not been asked to directly account for coherence in their sentence-level rankings. The results from the IBM1-Syntax model correlate very well with human rankings (0.941), whereas those of the Entity-graph do so very poorly (-0.933). This may well be related to the fact that the human judgements are sentence-level, whereas the Entity-graph considers the pattern of entities in the document as a whole. Dis-Score correlations were discussed in Chapter 4.
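A system-level correlation of this kind can be sketched as below. The sketch assumes Pearson's r, which this section does not itself specify, and uses the negated human ranks from Table 7.1 as a stand-in for the human scores (so that a larger value means a better system). Because the underlying human scores are not given here, the sketch is not expected to reproduce the figures in Table 7.2; it only shows the shape of the computation. The helper name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Columns of Table 7.1, in the row order uedin, stanford, kit, online-b,
# online-a, rbmt1, rbmt4, online-c.
syntax = [-1623.66, -1598.66, -1609.59, -1610.66,
          -1609.05, -1640.10, -1662.24, -1677.99]
entity = [238.97, 231.05, 236.07, 236.57, 234.66, 259.65, 261.40, 264.41]
human_rank = [1, 2, 2, 2, 3, 4, 5, 6]

# Assumption: negate the ranks so that "higher is better", as a proxy for
# the human scores, which are not listed in this section.
human_proxy = [-r for r in human_rank]

r_syntax = pearson(syntax, human_proxy)
r_entity = pearson(entity, human_proxy)
```

With this proxy, the sign of each coefficient agrees with the direction reported above: positive for IBM1-Syntax and negative for the Entity-graph.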

To see whether our metrics are productive, judged in terms of whether they are complementary to other metrics, we combine them linearly with the DiscoTK-party and tBleu metrics. The DiscoTK metric variants are based on discourse structure, where DiscoTK-party also includes other phenomena (for an extended description refer to Chapter 4). tBleu, on the other hand, is an advanced BLEU metric and is therefore based on n-gram matches, with the t signifying that it is more tolerant, which results in higher correlation with human judgement (Libovický and Pecina, 2014). As such it is interesting to see whether our metric is complementary to an n-gram matching one.

| System   | Dis-Score | Syntax       | Entity     | Human | Disco  | REDSys |
|----------|-----------|--------------|------------|-------|--------|--------|
| uedin    | 0.437 (3) | -1623.66 (5) | 238.97 (4) | 1     | 0.829  | 0.0174 |
| stanford | 0.414 (7) | -1598.66 (1) | 231.05 (8) | 2     | 0.768  | 0.0171 |
| kit      | 0.414 (7) | -1609.59 (3) | 236.07 (6) | 2     | 0.756  | 0.0171 |
| online-b | 0.417 (6) | -1610.66 (4) | 236.57 (5) | 2     | 0.738  | 0.0172 |
| online-a | 0.448 (2) | -1609.05 (2) | 234.66 (7) | 3     | 0.651  | 0.0169 |
| rbmt1    | 0.430 (4) | -1640.10 (6) | 259.65 (3) | 4     | 0.200  | 0.0153 |
| rbmt4    | 0.459 (1) | -1662.24 (7) | 261.40 (2) | 5     | 0.013  | 0.0147 |
| online-c | 0.421 (5) | -1677.99 (8) | 264.41 (1) | 6     | -0.063 | 0.0144 |

Table 7.1: Human ranking of 2014 WMT MT system submissions compared to raw scores from coherence models and top WMT14 metric rankings. The rank of each system under each coherence model is given in parentheses. Disco here is DiscoTK-party-tuned.

As already discovered in Chapter 4, combining our Dis-Score with the DiscoTK-party metric increases the correlation directly (see DiscoTK-party+DisScore). Combining IBM1-Syntax with DiscoTK-party increases the correlation from 0.970 to 0.973, even though IBM1-Syntax does not directly measure intentional structure to any significant degree. Clearly our models are of benefit in that they capture useful information which can complement even the metrics which already include some discourse information. Particularly interesting is the increased correlation when combining Dis-Score with tBleu, an increase from 0.952 to 0.963.
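The linear combination of two metrics can be sketched as follows. This is a minimal illustration under stated assumptions: both metrics are min-max scaled to [0, 1] before mixing, and a single interpolation weight is used. The text says only that the metrics are combined linearly, so the scaling, the weight value and the function names `min_max_scale` and `combine` are all hypothetical. The inputs are the Disco and Dis-Score columns of Table 7.1.

```python
def min_max_scale(scores):
    """Map scores linearly onto [0, 1]; the lowest becomes 0, the highest 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine(metric_a, metric_b, weight=0.5):
    """Linearly interpolate two system-level metrics after scaling each to [0, 1].

    weight is the share given to metric_a; (1 - weight) goes to metric_b.
    """
    if len(metric_a) != len(metric_b):
        raise ValueError("both metrics must score the same systems")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    scaled_a = min_max_scale(metric_a)
    scaled_b = min_max_scale(metric_b)
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(scaled_a, scaled_b)]


# Columns of Table 7.1, in the row order uedin, stanford, kit, online-b,
# online-a, rbmt1, rbmt4, online-c.
disco = [0.829, 0.768, 0.756, 0.738, 0.651, 0.200, 0.013, -0.063]
dis_score = [0.437, 0.414, 0.414, 0.417, 0.448, 0.430, 0.459, 0.421]

# An equal-weight mix, chosen arbitrarily for illustration.
combined = combine(disco, dis_score, weight=0.5)
```

The combined scores would then be correlated with the human judgements in the same way as each individual metric. In practice the weight would need to be chosen or tuned, and the choice affects how much the second metric can shift the ranking.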