In this section we describe several datasets that have been employed to evaluate automatic measures of semantic relatedness. Datasets for semantic analysis are produced by selecting pairs of words and asking assessors to rate, on a given scale, how similar or related they perceive each pair to be. A dataset thus comprises a set of pairs along with the average score that assessors assigned to each pair to indicate its degree of relatedness. One important challenge is the complexity of the process, both in constructing the set of pairs and in collecting the assessments [Budanitsky and Hirst, 2006]. Moreover, datasets have to be reliable and contain assessments that persist over time. In practice, these requirements have not proved problematic; for instance, Rubenstein and Goodenough [1965] reported a Pearson correlation of ρ = 0.99 between assessors for the same experiment conducted at two different times, while Finkelstein et al. [2002] reported a correlation between assessors of ρ = 0.95 on the subset of Miller and Charles [1991] that was included in WordNet-353.
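As an illustration of how such a dataset and its reliability figure can be computed, the following is a minimal sketch and not the procedure of any of the cited studies. The function names and all ratings are hypothetical; the ratings are invented values on a 0 to 4 scale, and only the Pearson formula itself is standard.

```python
from math import sqrt
from statistics import mean


def average_scores(ratings):
    """Map each word pair to the mean of the ratings given by its assessors."""
    return {pair: mean(values) for pair, values in ratings.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented ratings for two sessions of the same experiment (assumption).
session_1 = {
    ("car", "automobile"): [4, 4, 3],
    ("coast", "shore"): [3, 4, 3],
    ("noon", "string"): [0, 0, 1],
}
session_2 = {
    ("car", "automobile"): [4, 3, 4],
    ("coast", "shore"): [3, 3, 4],
    ("noon", "string"): [0, 1, 0],
}

first = average_scores(session_1)
second = average_scores(session_2)
pairs = sorted(first)  # fixed pair order so both score lists are aligned
print(pearson([first[p] for p in pairs], [second[p] for p in pairs]))
```

A value close to 1 indicates that the average scores of the two sessions rank and space the pairs in almost the same way, which is the sense in which the assessments are said to persist over time.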
Using a dataset to validate automatic measures of semantic relatedness is just one of three approaches suggested by Budanitsky and Hirst [2006]. The other two approaches are: (a) describing and comparing the mathematical soundness and principles of the proposed measure against others; and (b) evaluating the proposed measure on a particular NLP task, such as word sense disambiguation [Budanitsky and Hirst, 2006], detection of malapropisms [Hirst and St-Onge, 1998], or coreference resolution [Ponzetto and Strube, 2007b]. At the start of this section we summarised the mathematical foundations of existing measures; however, that summary alone does not allow us to discern the adequacy of these measures for our task. Before the task-based evaluation, we compare automatic measures of semantic relatedness under a word similarity setting (see Chapters 5 and 6). For this, we describe below four testbeds that have been constructed for assessing semantic similarity and relatedness between words: Rubenstein and Goodenough [1965], Miller and Charles [1991], WordNet-353 [Finkelstein et al., 2002] and Klebanov and Shamir [2006]. However, as we discuss in Chapter 5, none of these datasets considers the possibility of words sharing a common domain, which leads us to construct our own datasets for investigating this effect (see Chapters 5 and 6).
2.4.2.1 Datasets for Analysing Semantic Similarity
We describe here two testbeds constructed for the study of similarity between words: Rubenstein and Goodenough [1965] and Miller and Charles [1991]. While other testbeds have been employed in the literature (like the 80-TOEFL [Landauer and Dumais, 1997], 50-ESL [Turney, 2001] and the 300-Reader’s Digest Word Power Game, as indicated by Jarmasz and Szpakowicz [2003]), they are not covered in this thesis. This is based on two observations made by Ponzetto and Strube [2007b]: first, they do not explicitly consider relatedness of words but only their similarity (i.e. based only on hierarchical relations); and second, these datasets contain verbs. Throughout this thesis, we make use of WordNet and Wikipedia; the challenge of analysing verbs or words with other grammatical roles is that these are either represented in classifications other than hierarchical ones, or simply disregarded (e.g. Wikipedia hardly features verb-related articles). The testbeds that are relevant to this thesis are described below.
Rubenstein and Goodenough’s dataset. The authors produced a dataset focused on evaluating “similarity of meaning” between words. In this experiment, 51 individuals rated 65 pairs of words according to their similarity. All words contained in the dataset were uni-grams. Participants in the experiment were asked to perform two tasks: first, ordering a deck of cards where each pair is represented by a card (as in a ranking); and second, assigning each pair a value on a discrete scale from 0 to 4, where 0 meant that the words were “dissimilar” and 4 that they were “similar”. Given that this experiment was the first of its kind, one important outcome was the finding that similarity judgements are maintained through time. This dataset has been employed in experiments to evaluate measures of semantic similarity and relatedness [Jarmasz and Szpakowicz, 2003; Gurevych, 2005; Budanitsky and Hirst, 2006; Strube and Ponzetto, 2006; Milne and Witten, 2008; Wubben and van den Bosch, 2009; Cramer et al., 2012].
Miller and Charles sub-dataset. From the dataset proposed by Rubenstein and Goodenough [1965], Miller and Charles [1991] selected 30 pairs according to the average values assigned by assessors. Specifically, 10 pairs were selected for each level of similarity: the higher level (i.e. those pairs that scored an average between 3 and 4), the intermediate level (1 to 3), and the lower level (0 to 1). Judgements from 38 subjects were then collected on a similar 5-point scale. This dataset is sometimes preferred to that of Rubenstein and Goodenough [1965] as it features clearly distinct groups. It has been employed to evaluate measures [Resnik, 1995; Jiang and Conrath, 1997; Gracia and Mena, 2008], as well as in studies of similarity and relatedness [Jarmasz and Szpakowicz, 2003; Gurevych, 2005; Budanitsky and Hirst, 2006; Ponzetto and Strube, 2007a; Milne and Witten, 2008; Wubben and van den Bosch, 2009].
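The selection by similarity level can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the 10 pairs within each level were chosen, nor how the shared boundaries at 1 and 3 were treated. Here we assume half-open bands (low below 1, intermediate from 1 up to but excluding 3, high from 3 upwards) and simply keep the highest-scoring pairs of each band. The function names and the scores are hypothetical.

```python
def similarity_level(score):
    """Assign an average score on the 0-4 scale to a level (assumed boundaries)."""
    if score >= 3:
        return "high"
    if score >= 1:
        return "intermediate"
    return "low"


def select_by_level(scores, per_level):
    """Keep at most `per_level` pairs for each level, highest scores first."""
    selected = {"high": [], "intermediate": [], "low": []}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    for pair, score in ranked:
        band = selected[similarity_level(score)]
        if len(band) < per_level:
            band.append(pair)
    return selected


# Invented average scores (assumption); the real dataset has 65 pairs.
example_scores = {
    ("w1", "w2"): 3.9,
    ("w3", "w4"): 3.5,
    ("w5", "w6"): 3.1,
    ("w7", "w8"): 2.0,
    ("w9", "w10"): 0.4,
}
print(select_by_level(example_scores, per_level=2))
```

With `per_level=10` and the full set of 65 averages, the same routine would yield a 30-pair subset with three clearly separated groups, which is the property that makes the sub-dataset attractive.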
2.4.2.2 Datasets for Analysing Semantic Relatedness
One important drawback of the datasets described above is the lack of pairs featuring relations other than hierarchical ones, as noted in an extensive study of WordNet-based similarity measures [Budanitsky and Hirst, 2006]. In their study, Budanitsky and Hirst [2006] found that the majority of pairs contained in these datasets have either a synonymy relationship or a direct parent-child hierarchical relation in WordNet. For many of the available pairs, this rules out the study of other relationships between words. For these reasons, two more recent datasets have been constructed under the premise of covering semantically related pairs: WordNet-353 [Finkelstein et al., 2002] and Klebanov and Shamir [2006].
WordNet-353. In order to develop a search engine that constructed a context from reference texts, Finkelstein et al. [2002] constructed the WordNet-353 dataset. This dataset contains 353 pairs of words, including the pairs from the dataset of Miller and Charles [1991]. Despite the name of this dataset, 82 of these pairs contain at least one term not available in WordNet 1.6; this has been corrected in recent versions, where only 8 pairs cannot be assessed using WordNet.
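A coverage check of this kind can be sketched as below. The vocabulary is a hypothetical toy set standing in for the lexicon of a resource such as WordNet, and the function name is our own; the sketch does not query any real lexical database.

```python
def uncovered_pairs(pairs, vocabulary):
    """Return the pairs with at least one word missing from the vocabulary."""
    return [
        pair for pair in pairs
        if any(word not in vocabulary for word in pair)
    ]


# Toy data (assumption): "w4" plays the role of a term absent from the resource.
toy_pairs = [("w1", "w2"), ("w3", "w4"), ("w2", "w3")]
toy_vocabulary = {"w1", "w2", "w3"}

missing = uncovered_pairs(toy_pairs, toy_vocabulary)
print(len(missing), "of", len(toy_pairs), "pairs cannot be assessed")
```

Pairs returned by this check cannot be scored by a measure that relies solely on the resource, so they are either excluded from the evaluation or handled separately.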
As in previous datasets, an assessment of relatedness was conducted, in this case by 16 subjects on a 10-point scale. Agirre et al. [2009] divided this dataset into two subsets; one partition is used to test relatedness, and the other to evaluate similarity. Commonly, this dataset is preferred when the focus of a study is semantic relatedness between words [Strube and Ponzetto, 2006; Gabrilovich and Markovitch, 2007; Yazdani and Popescu-Belis, 2012]. However, as Jarmasz and Szpakowicz [2003] pointed out, the dataset presents several deficiencies, in particular: (a) it contains culturally-biased pairs of words (e.g. Arafat-terror); (b) it also features collocated terms as pairs, like hundred-percent; and (c) assessors were presented with a 10-point scale, which can be considered more difficult to use than the 5-point scale of previous experiments.
Klebanov and Shamir. This dataset was generated from an experiment performed by Klebanov and Shamir [2006] on lexical cohesion of terms in texts. The researchers provided 22 subjects with a set of 10 texts; after reading each text, subjects were presented with a list of its unique words in order of appearance, and were asked to annotate, for each word, the preceding words that were related to it (see Table 2.2).
Original text: “Mother died today. Or maybe yesterday; I can’t be sure. The telegram from the Home says YOUR MOTHER PASSED AWAY FUNERAL TOMORROW”

List of words with annotations:

Word        Annotations      Word        Annotations
mother                       the
died                         telegram    → died
today                        from        #
or                           home        → mother
maybe                        says
yesterday   → today          your        #
I                            passed
can’t                        away
be                           funeral     → passed away, died
sure        → maybe          tomorrow    → yesterday, today
Table 2.2: An example of the annotations performed by a participant for the experiment of Klebanov and Shamir [2006]. Words in bold follow the text structure without repetitions, while words after the right arrows act as associations deemed by a subject. Words with hash characters represent stop words and are removed from the study.
The authors constructed a dataset from the pairs assembled by participants, and scored each pair with the number of participants who detected it. For example, the pair lamb-dolly was marked by 14 participants, so the score of this pair is 14. This dataset was constructed with relatedness between words in mind, and some studies have used it [Ponzetto and Strube, 2007b]. However, this dataset has two main drawbacks: (a) it includes not only instances and concepts, but also other words such as verbs, adjectives and foreign words; and (b) the construction of the dataset itself: it features a very large set of pairs (2,682 pairs of nouns), but pairs that were not detected by any participant are simply disregarded; therefore it is biased towards the associations that assessors detected in texts.
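The scoring scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data structure (one dictionary per participant, mapping a word to the earlier words that the participant linked to it) and the second participant's annotations are assumptions made for the example.

```python
from collections import Counter


def score_pairs(annotations):
    """Score each (earlier word, current word) pair by the number of
    participants who marked it; each participant counts at most once per pair."""
    counts = Counter()
    for participant in annotations:
        marked = set()
        for word, anchors in participant.items():
            for anchor in anchors:
                marked.add((anchor, word))
        counts.update(marked)
    return counts


# Participant A follows part of Table 2.2; participant B is invented.
participant_a = {"telegram": {"died"}, "home": {"mother"}, "yesterday": {"today"}}
participant_b = {"telegram": {"died"}, "tomorrow": {"today"}}

scores = score_pairs([participant_a, participant_b])
print(scores[("died", "telegram")])      # marked by both participants
print(("mother", "telegram") in scores)  # never marked, so absent from the dataset
```

The last line makes drawback (b) concrete: a pair that no participant marked does not receive a score of zero but is missing altogether, so the resulting dataset contains only the associations that assessors happened to detect.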