2. Theoretical Foundations and State of the Art
2.2 State of the Art
We use the above-described annotation task on Amazon Mechanical Turk to create several labeled datasets for training and evaluating the classification model that will be described in Section 3.2. These datasets are described below and summarized in Table 12. Each dataset consists of tuples of the form ⟨(w1, t1), (w2, t2)⟩ where wi is a lexical expression–either a
Random Sample of Pairs in PPDB
Our first dataset, PpdbSample, contains a random sample of pairs appearing in the Paraphrase Database. To build PpdbSample, we take a stratified random sample across the six sizes of PPDB (see Section 2.2.5), so as to bias the sample toward good and interesting paraphrases, rather than noisy paraphrases, which would likely dominate if drawn uniformly at random from PPDB-XXXL. We assign a syntactic category to each pair by mapping the syntactic category associated with the pair in PPDB coarsely onto ‘noun’, ‘verb’, ‘adjective/adverb’, or ‘other’. Our sample consists of 22,817 paraphrase pairs: 10,800 lexical paraphrases, 10,126 “one-to-many” paraphrases in which one phrase in the pair is lexical and the other is phrasal, and a small sample of 1,932 phrasal paraphrases. Table 9 shows a sample of pairs from the dataset.
achieve/get, active/formal, appeal/appeal board, boards/executive boards, constitute/fill, cover/cure, danger/grave danger, defence property/military goods, energetic/serious, enforcement/running, floor/your word, objectivity/subject, proper operation of the internal market/smooth functioning of the internal market, radioactive materials/radioactivity, redo/restore, refuse/revoke, remote/short, space/outer space, steam/this trend, week/last week
Table 9: Random sample of noun pairs in the PpdbSample dataset.
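The stratified sampling used to build PpdbSample can be illustrated with a minimal sketch. The text does not state how many pairs were drawn from each PPDB size, so the equal per-stratum quota below is an assumption made purely for illustration, and the function name, the `strata` mapping, and the seed are hypothetical.

```python
import random


def stratified_sample(strata, per_stratum, seed=0):
    """Draw up to `per_stratum` pairs from each stratum, without replacement.

    `strata` maps a stratum name (e.g. a PPDB size such as "S" or "XXXL")
    to a list of paraphrase pairs. The equal quota per stratum is an
    assumption for illustration, not the procedure reported in the text.
    """
    rng = random.Random(seed)
    sample = []
    for name in sorted(strata):
        pool = strata[name]
        k = min(per_stratum, len(pool))
        # rng.sample draws k distinct elements from the pool.
        sample.extend((name, pair) for pair in rng.sample(pool, k))
    return sample
```

Compared with drawing uniformly from the largest size, a per-stratum quota gives the smaller sizes more weight in the sample, which is the kind of bias the paragraph above describes.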
Each pair in PpdbSample was annotated by 3 workers. We take the true label of a pair to be the majority label across workers, breaking ties randomly.
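The aggregation rule above (majority label, ties broken randomly) can be sketched as follows. This is a minimal illustration rather than the authors' code; the function name and the label strings in the usage note are hypothetical.

```python
import random
from collections import Counter


def majority_label(labels, rng=None):
    """Return the most frequent label, breaking ties uniformly at random."""
    if not labels:
        raise ValueError("at least one label is required")
    rng = rng or random.Random()
    counts = Counter(labels)
    top = max(counts.values())
    # Sort the tied labels so that a seeded rng gives reproducible choices.
    tied = sorted(label for label, n in counts.items() if n == top)
    return rng.choice(tied)
```

With three workers, a call such as `majority_label(["unrelated", "unrelated", "equivalent"])` returns the label chosen by two of them; when all three workers disagree, one of the three labels is picked at random.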
Exhaustive Set of Pairs from RTE Benchmark Data
We design another two datasets which consist solely of paraphrase pairs in PPDB which also appear in established benchmark datasets for the RTE task. The intent of these sets is to test our model on paraphrase pairs that are likely to be “relevant” for RTE systems. Specifically, we intersect PPDB separately with the vocabulary of two benchmark RTE datasets, SICK and RTE2, and refer to the resulting datasets as PpdbSick and PpdbRte2, respectively.
See Section 2.1.3 for a description of these benchmark datasets.
Both of our benchmark datasets consist of pairs of sentences, each comprising a premise p and a hypothesis h. We tokenize, POS tag, and parse all of the sentences in each dataset using the Stanford CoreNLP pipeline (Manning et al. (2014)). Given a set of parsed p/h pairs, we select all tuples ⟨(w1, t1), (w2, t2)⟩ such that:
1. Both w1 and w2 contain three words or fewer.
2. There is some p/h pair such that w1 appears in p and w2 appears in h.
3. ⟨w1, w2⟩ appears in PPDB-XXXL.
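The three selection criteria can be sketched as a simple filter. This sketch assumes that the tagged phrases of each premise and hypothesis have already been extracted from the parses, and that PPDB-XXXL is available as an in-memory set of ordered phrase pairs; the function name and both data layouts are hypothetical.

```python
def select_pairs(ph_pairs, ppdb_pairs, max_words=3):
    """Select tuples ((w1, t1), (w2, t2)) meeting the three criteria.

    ph_pairs: iterable of (premise_phrases, hypothesis_phrases), where each
        element is a list of (phrase, tag) tuples taken from the parse.
    ppdb_pairs: set of (w1, w2) phrase pairs (assumed layout for PPDB-XXXL).
    """
    selected = set()
    for premise_phrases, hypothesis_phrases in ph_pairs:
        for w1, t1 in premise_phrases:
            if len(w1.split()) > max_words:  # criterion 1 for w1
                continue
            for w2, t2 in hypothesis_phrases:
                if len(w2.split()) > max_words:  # criterion 1 for w2
                    continue
                # Criterion 2 holds by construction: w1 comes from p and
                # w2 from h. Criterion 3 is the PPDB membership test.
                if (w1, w2) in ppdb_pairs:
                    selected.add(((w1, t1), (w2, t2)))
    return selected
```

Because the tags are part of each tuple, the same phrase pair can be selected more than once with different POS tags, one entry per distinct tagging observed in context.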
Tables 10 and 11 show sample pairs from PpdbSick and PpdbRte2, respectively.
a group/camera, aircraft/an airplane, baby/the little, ball/snowball, band/boy, clear water/water, come/racing, cross/trunk, current/water, edge/sand, event/person, full/milk, group/restaurant, man/talk, person/tail, playing/ride, race/the track, reading/sing, side/stand, surfboard/wall
Table 10: Random sample of noun pairs in the PpdbSick dataset.
bill/day, business/talk, community/live, completing/give, construction/propose, court/federal, declaration/be, division/member, early/israel, economy/more, force/promotion, health/people, in prison/jail, iran/tehran, israeli/leader, issue/estate, meeting/representative, organization/response, senator/speak, terrorist/terrorist attack
Table 11: Random sample of noun pairs in the PpdbRte2 dataset.
The POS tag ti associated with wi is the tag or tag sequence assigned by the parser to wi in the context of the full sentence in which it appeared (either p or h). This means that for these datasets, the same phrase pair might appear multiple times with different POS tags. We allow ⟨w1, w2⟩ to appear with any syntactic category in PPDB; we do not require that it match the category with which it appears in the sentence. Each pair was annotated by 5 workers and we take the true label to be the majority label.
Label Distributions and Annotator Agreement
Table 12 shows the distribution of labels obtained for the pairs in each of the described datasets. Together, the Independent classes (∼ and ≁) constitute the majority of pairs in all three datasets. The PpdbSample dataset contains proportionally fewer Unrelated (≁) pairs (24%) than do the RTE-filtered datasets. In all three datasets, the Exclusion class (⊣) is infrequent, in total constituting about 7% of the pairs in PpdbSample and in PpdbSick, and only 3% in PpdbRte2.
                         ≡       ⊏    ⊣alt    ⊣opp       ≁       ∼      NA    Total
PpdbSample             15%     25%      5%      2%     24%     25%      5%   22,817
                     3,414   5,695   1,189     397   5,401   5,711   1,051
PpdbSick   Train        8%     26%      3%      5%     39%     19%     <1%    4,790
                       394   1,240     136     220   1,871     920       9
           Test         9%     26%      4%      3%     39%     19%     <1%    5,084
                       443   1,321     228     147   1,976     956      13
PpdbRte2   Dev          7%     21%      2%      1%     51%     17%      1%    9,299
                       651   1,945     163      98   4,783   1,548     111
           Test         7%     20%      2%      1%     54%     18%      1%    8,835
                       636   1,776     151      81   4,783   1,603      78
Table 12: Distribution of basic entailment relations appearing in our annotated datasets. These datasets are used for training and evaluating our lexical entailment classifier.
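The percentages in Table 12 follow directly from the raw counts. As a small worked example, the sketch below recomputes the PpdbSick Train row from its counts, given in the column order of the table; the function name is hypothetical.

```python
def label_distribution(counts):
    """Convert per-label counts into percentages of the row total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return [100.0 * n / total for n in counts]


# PpdbSick Train counts, in the column order of Table 12.
sick_train = [394, 1240, 136, 220, 1871, 920, 9]
percentages = label_distribution(sick_train)
```

Rounded to whole numbers, these values match the percentage row reported for PpdbSick Train, with the final NA column falling below 1%.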
On inspection, we see that annotators commonly assign pairs to Unrelated (≁) that ideally would be labeled as Alternatives (⊣alt). Table 13 provides several examples. Based on the examples shown, it appears that humans struggle to conceptualize two words as alternatives under a common category when the category is too abstract or too far removed from the words under consideration: e.g., humans do not consider “dog” and “dirt” to be alternatives under the category “thing”. In practice, this error does not seem to translate into errors in the downstream RTE task (Section 3.4), as systems (like humans) are rarely asked to make inferences which hinge on recognizing, for example, that “Fido is a dog” is incompatible with “Fido is dirt”. The relatively frequent presence of these errors, however, is interesting and may be relevant to future work on lexical entailment in general and on taxonomies in particular.
bank/country, bird/boy, blade/man, conference/police, clothing/hand, dirt/dog, football/table, gun/kid, man/sky, people/time, water/wood
Table 13: Examples of pairs labeled as Unrelated (≁) which would have been better labeled as Alternatives (⊣alt).
The inter-annotator agreement for each dataset, measured using Fleiss’s κ (Fleiss et al. (2013)), is shown in Table 14. Note that κ is lower when label distributions are skewed, since the computation assumes that the probability of randomly choosing a label is equal to that label’s frequency in the dataset. The observed agreement measures support the intuition that lexical entailment annotation is more straightforward when the vocabulary is more concrete. Agreement is highest on the PpdbSick dataset, which is based on image captions and covers a vocabulary of mostly common nouns and simple adjectives (Table 10), and is lower for PpdbSample, which contains paraphrases extracted from a variety of corpora and contains a greater proportion of abstract phrases (Table 9).
                 κ    # Pairs   # Annotators
PpdbSample    0.20     22,817            599
PpdbSick      0.36      9,874            648
PpdbRte2      0.31     18,134            697
Table 14: Inter-annotator agreement for each of the labelled datasets.
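For reference, Fleiss's κ can be computed as in the following minimal sketch. It uses the standard textbook formulation and assumes that every pair received the same number of annotations; it is not the authors' evaluation script, and the function name and input layout are hypothetical.

```python
from collections import Counter


def fleiss_kappa(item_labels):
    """Fleiss's kappa for a list of items, each a list of worker labels.

    Assumes every item has the same number of labels (at least two).
    Returns NaN when chance agreement is 1, i.e. a single label overall.
    """
    n_items = len(item_labels)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = len(item_labels[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in item_labels):
        raise ValueError("each item needs the same number (>= 2) of labels")

    label_totals = Counter()
    agreement_sum = 0.0
    for labels in item_labels:
        counts = Counter(labels)
        label_totals.update(counts)
        # Fraction of rater pairs on this item that agree with each other.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Chance agreement: each label's probability is its overall frequency.
    n_ratings = n_items * n_raters
    expected = sum((t / n_ratings) ** 2 for t in label_totals.values())
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

The `expected` term is where the skew effect noted above enters: when one label dominates, chance agreement approaches 1, so even high observed agreement yields a low κ.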