In the semantic textual similarity (STS) task,² we use our model to compute the distance between a pair of sentences (distances are equivalent to cosine similarities and therefore lie in the [0, 1] real interval). Gold standard scores for all test sets are given in the [0, 5] interval, where 0 means complete dissimilarity and 5 complete similarity. We simply take the cosine similarity computed by our model, scale it by 5, and compare it directly to the gold standard scores.
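As a minimal sketch of the scoring step just described, the snippet below scales a cosine similarity to the [0, 5] range and compares predictions with gold scores through a Pearson correlation. The function names and the toy two-dimensional vectors are illustrative assumptions, not the actual encoder outputs, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predicted_sts_score(u, v):
    """Scale a cosine similarity by 5 to match the gold-standard range."""
    return 5.0 * cosine_similarity(u, v)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical sentence-pair embeddings (stand-ins for encoder outputs).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 1.0], [1.0, 0.0]),  # partially similar
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [5.0, 3.6, 0.0]  # made-up gold scores, for illustration only

predictions = [predicted_sts_score(u, v) for u, v in pairs]
print(predictions, pearson(predictions, gold))
```

Because the correlation is invariant to positive rescaling, multiplying by 5 does not change the reported correlation; it only puts the predictions on the same scale as the gold scores.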
We note that in our STS experiments, we use the same models, trained on the M30kC, that we applied to the image–sentence ranking task (Section 5.3.1). We report results for all semantic similarity tasks for which test sets are publicly available (Agirre et al., 2012, 2013, 2014, 2015, 2016). These test sets include excerpts from the news domain, machine translation evaluation, forum answers, and video descriptions, among others. Some of these test sets are highly out-of-domain when compared to the images and their descriptions used to train our MLMME models. Moreover, since there is no SemEval data set including the German language, we only use the English SemEval test sets. As an illustration, in Table 5.2 we show examples of entries from the different test sets.
Specifically, we embed both sentences in each of the test sets with our English encoder, trained as part of the MLMME, and also with the VSE English encoder as our main baseline. We note that the vocabularies of the MLMME and VSE models are derived from the M30kC training data, and any out-of-vocabulary words in the test sets are replaced by a special UNK symbol.
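The out-of-vocabulary handling can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenisation, no frequency cut-off when building the vocabulary, and the literal string "UNK" as the special symbol. None of these details is specified above, and the helper names are hypothetical.

```python
UNK = "UNK"  # assumed spelling of the special symbol


def build_vocabulary(training_sentences):
    """Collect every whitespace-separated token seen in the training data."""
    return {token for sentence in training_sentences for token in sentence.split()}


def replace_oov(sentence, vocabulary, unk=UNK):
    """Map each token outside the training vocabulary to the UNK symbol."""
    return [token if token in vocabulary else unk for token in sentence.split()]


# Toy training data standing in for the M30kC descriptions.
vocabulary = build_vocabulary(["a cat on a tree", "a dog in the park"])
print(replace_oov("a cat in the shuttle", vocabulary))
```

On out-of-domain test sets such as news headlines, a large share of tokens may end up mapped to UNK under this scheme, since the vocabulary comes only from image descriptions.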
Amongst all test sets, there are two in-domain similarity tasks (image description similarity for years 2014 and 2015), and all the other tasks can be considered general- or out-of-domain.
SemEval 2012 (Agirre et al., 2012) – MSRpar
  Sent. 1  The problem likely will mean corrective changes before the shuttle fleet starts flying again .
  Sent. 2  He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again .
  Score    4.4

SemEval 2013 (Agirre et al., 2013) – OnWN
  Sent. 1  measure the depth of a body of water
  Sent. 2  any large deep body of water .
  Score    0.8

SemEval 2014 (Agirre et al., 2014) – Tweet-news
  Sent. 1  Hollywood Accepts Chinese Censorship ( Will Movies Get Any Better ? )
  Sent. 2  In Hollywood Movies for China , Bureaucrats Want a Say
  Score    2.4

SemEval 2014 (Agirre et al., 2014) – Image descriptions
  Sent. 1  A cat standing on tree branches .
  Sent. 2  A black and white cat is high up on tree branches .
  Score    3.6

SemEval 2015 (Agirre et al., 2015) – Headlines
  Sent. 1  The foundations of South Africa are built on Nelson Mandela ’s memory
  Sent. 2  Australian politicians lament over Nelson Mandela ’s death
  Score    1.3

SemEval 2015 (Agirre et al., 2015) – Image descriptions
  Sent. 1  The couple is sitting near the water in lawn chairs .
  Sent. 2  The boy hops from one picnic table to the other in the park .
  Score    0.0

SemEval 2016 (Agirre et al., 2016) – Plagiarism
  Sent. 1  There are two main approaches for dynamic programming .
  Sent. 2  There are four steps in Dynamic Programming : 1 .
  Score    1.0

Table 5.2: Example entries for different SemEval test sets (Agirre et al., 2012, 2013, 2014, 2015, 2016).

In Table 5.3, the entries for each year’s best SemEval model are the ones reported by the official shared task at the time the official results were released.³ We note that our multilingual model consistently improves on the monolingual baseline of Kiros et al. (2014) in the two in-domain similarity tasks, staying competitive even compared to the best performing model in the SemEval shared task (entries marked with a † in Table 5.3). In fact, the only time the MLMME model outperforms the best comparable SemEval model is in the image description similarity tasks: in 2014, our best model achieves 0.826 Pearson rank correlation, whereas the best result in SemEval 2014 is 0.821; in 2015, our best model achieves 0.886 Pearson rank correlation, versus 0.864 for the best result in SemEval 2015.

³ These best SemEval models are the ones which ranked first overall considering all test sets in

Test set            Kiros   Our model                       SemEval
                            β=1     β=.75   β=.5    β=.25   best model

SemEval 2012 (Agirre et al., 2012)
MSRpar              .083    .017    .043    .031    .013    .630
MSRvid              .799†   .780†   .792†   .809†   .805†   .873
SMT Europarl        .420    .414    .426    .446†   .401    .528
OnWN                .539    .462    .473    .519    .496    .664
SMT news            .376    .346    .337    .340    .333    .493

SemEval 2013 (Agirre et al., 2013)
FNWN                .092    .036    .014    .033    .079    .581
headlines           .442    .409    .391    .407    .388    .764
OnWN                .389    .544    .575    .585    .571    .752

SemEval 2014 (Agirre et al., 2014)
deft-forum          .339    .239    .188    .230    .244    .482
deft-news           .524    .351    .401    .347    .390    .765
headlines           .442    .349    .350    .379    .391    .764
images              .791†   .797†   .819†   .826†   .817†   .821
OnWN                .520    .560    .556    .579    .624    .858
Tweet-news          .402    .345    .344    .404    .376    .763

SemEval 2015 (Agirre et al., 2015)
answers–forums      .248    .231    .234    .284    .244    .739
answers–students    .584    .424    .444    .425    .459    .772
belief              .488    .460    .439    .455    .479    .749
headlines           .424    .409    .407    .447    .442    .825
images              .834†   .880†   .882†   .885†   .886†   .864

SemEval 2016 (Agirre et al., 2016)
answer–answer       .399    .212    .253    .288    .362    .692
headlines           .314    .316    .282    .309    .303    .827
plagiarism          .573    .473    .502    .534    .515    .841
postediting         .710    .701    .685    .699    .680    .835
question–question   .336    .353    .332    .212    .252    .687

Table 5.3: Pearson rank correlation scores for semantic textual similarity in different SemEval test sets (Agirre et al., 2012, 2013, 2014, 2015, 2016). Best overall scores (ours vs. baseline) in bold. We underline a score in case it improves on the monolingual baseline of Kiros et al. (2014) and mark it with † in case its difference from the best SemEval result is less than 10%.
One interesting point to note is that the only two evaluation sets where the β parameter is monotonically aligned to the correlations with the human judgements are the two in-domain tasks (image description similarity in 2014 and 2015). In these two tasks, the monolingual baseline of Kiros et al. (2014) is the worst performing model, and the correlations with human judgements increase as we decrease β from 1.0 to 0.25 (with one exception in Table 5.3: on the 2014 images set, the score drops slightly from .826 at β = 0.5 to .817 at β = 0.25). In all other tasks, there is no monotonic relation between the value of β and the human judgements.
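The monotonicity check discussed here can be sketched as below, using two rows copied from Table 5.3. The column order (Kiros baseline, then β = 1, .75, .5, .25) follows the table; the helper name and the choice of a strict comparison are assumptions made for illustration.

```python
# Scores per test set, ordered as in Table 5.3:
# Kiros baseline, then beta = 1, .75, .5, .25.
TABLE_5_3_ROWS = {
    "SemEval 2015 images": [0.834, 0.880, 0.882, 0.885, 0.886],
    "SemEval 2015 headlines": [0.424, 0.409, 0.407, 0.447, 0.442],
}


def is_strictly_increasing(scores):
    """True if every score is larger than the one before it."""
    return all(a < b for a, b in zip(scores, scores[1:]))


for name, scores in TABLE_5_3_ROWS.items():
    print(name, is_strictly_increasing(scores))
```

The in-domain row increases at every step, whereas the general-domain headlines row moves up and down as β changes.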
In general, results on general-domain similarity tasks (e.g. answers or headlines) are mixed, and both MME and MLMME show weak correlation with human judgements. It is noteworthy that all models, baseline and multilingual, perform far worse than the best corresponding SemEval model in virtually all general-domain tasks (see entries marked with † in Table 5.3). Only once did a configuration of one of our models remain competitive with the state of the art, and that was our multilingual model with β = 0.5 in the Europarl SMT task (differences < 10% compared to the best performing model). When we consider only the general-domain similarity tasks, the monolingual baseline of Kiros et al. (2014) has a higher Pearson rank correlation about 54% of the time, i.e. our model performs better about 46% of the time.
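One way to compute such a win rate is sketched below on the three SemEval 2013 rows of Table 5.3. The counting convention is an assumption, since the text does not spell it out: the baseline "wins" a test set when its score exceeds the best score among our four β settings. The helper name is hypothetical, and this small subset is not meant to reproduce the 54% figure.

```python
# (Kiros baseline, [beta = 1, .75, .5, .25]) for the SemEval 2013 test sets,
# copied from Table 5.3.
SEMEVAL_2013_ROWS = {
    "FNWN": (0.092, [0.036, 0.014, 0.033, 0.079]),
    "headlines": (0.442, [0.409, 0.391, 0.407, 0.388]),
    "OnWN": (0.389, [0.544, 0.575, 0.585, 0.571]),
}


def baseline_win_rate(rows):
    """Fraction of test sets where the baseline beats our best beta setting.

    Assumed convention: ties are not counted as baseline wins.
    """
    wins = sum(1 for baseline, ours in rows.values() if baseline > max(ours))
    return wins / len(rows)


rate = baseline_win_rate(SEMEVAL_2013_ROWS)
print(f"baseline better on {rate:.0%} of these test sets")
```

Other conventions, such as counting each (test set, β) pair as a separate comparison, would give different percentages on the same table.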