
Our challenge is to show that, according to human judgements, the gap between Same-domain Pairs and Cross-domain Pairs is consistent, to a statistically significant level, across all pairs that receive very similar scores. To do this, we focus on the overlap of pairs that have very similar relatedness scores.

Figure 6.1 shows that the overall relatedness scores assigned by the automatic measure jcn are higher for same-domain than for cross-domain pairs. The horizontal gray line marks the number of pairs at which significance can be reflected in our results. The number of cross-domain pairs decreases for scores above 0.4 and does not increase again further along the scale; for this reason, we do not evaluate our hypothesis for high automatic scores of semantic relatedness. This is expected in general: concepts from different domains are overall likely to be less related than concepts from the same domain.

To test our specific hypothesis, however, we isolate intervals that contain clusters of pairs to which a metric assigns very similar scores, and show that, even when a metric (in this case, jcn) scores all of these pairs almost identically, the human judges consistently rate the same-domain pairs as more related. For example, the pair Tax-Investment from the domain Economy has a score of 0.133 under this measure, while Gamble-Picnic, from the domains Economy and Food respectively, has a nearly identical relatedness value of 0.135. However, human judges consistently score Tax-Investment as more related than Gamble-Picnic. We will show that this is a statistically significant pattern for all the automatic metrics considered. In other words, within clusters of nearly identical relatedness scores, the human judges' scores separate the pairs to a statistically significant degree, whereas the automated metrics do not. We show this by focusing on a subset of measures that obtained a high correlation with human judgements in the previous chapter. First, we describe the two sources of evaluators used for this experiment: the Web interface used in Chapter 5, and the crowd-sourcing platform CrowdFlower.[6]

… associated in WordNet to fewer synsets.

Figure 6.1: Distribution of same- and cross-domain pairs using the jcn measure and concepts from exclusive domains. (Horizontal axis: automatic relatedness score; vertical axis: number of pairs, log10; series: same-domain pairs, cross-domain pairs, significance level.)
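The interval-based comparison described above can be illustrated with a minimal Python sketch. It groups pairs into intervals of the automatic score and, for each interval that contains both kinds of pair, reports the mean human score per kind. The function name, the interval width of 0.05 and the human scores in the usage example are assumptions made for illustration; only the two jcn scores (0.133 and 0.135) come from the text.

```python
from collections import defaultdict
from statistics import mean


def compare_within_intervals(pairs, width=0.05):
    """Group pairs by automatic-score interval and compare mean human scores.

    `pairs` is an iterable of (auto_score, human_score, kind) tuples, where
    kind is "same" or "cross". The interval width is an illustrative choice,
    not a value taken from the study.
    """
    bins = defaultdict(lambda: {"same": [], "cross": []})
    for auto_score, human_score, kind in pairs:
        index = int(auto_score // width)  # interval that holds this score
        bins[index][kind].append(human_score)

    summary = {}
    for index, groups in sorted(bins.items()):
        # Only intervals holding both kinds of pair allow a comparison.
        if groups["same"] and groups["cross"]:
            summary[index] = (mean(groups["same"]), mean(groups["cross"]))
    return summary


# Hypothetical human scores; the automatic scores 0.133 and 0.135 are from the text.
example = [
    (0.133, 3.2, "same"),   # Tax-Investment
    (0.135, 0.9, "cross"),  # Gamble-Picnic
    (0.310, 2.0, "same"),   # interval with no cross-domain pair, so skipped
]
result = compare_within_intervals(example)
```

In a full analysis, the per-interval means would be replaced by a significance test over the human scores in each interval.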

6.4.1 Obtaining Judgements using the Web interface

To validate our hypothesis, we used the same Web interface described in Chapter 5 to collect human assessments of relatedness for the pairs in the dataset described above. Using this survey, we collected judgements from 80 judges, each of whom assessed 40 pairs per survey. In addition, we crowd-sourced judgements via CrowdFlower, as described next.

6.4.2 Obtaining Judgements using CrowdFlower

CrowdFlower is a Web platform, similar to Amazon Mechanical Turk, that allows tasks to be uploaded and performed by a community of human users. It lets requesters (i.e. the group requiring tasks to be performed) post their jobs to a community of workers (i.e. the people who will perform the tasks). Tasks, or Human Intelligence Tasks (HITs), may include activities such as labelling or classifying that require human judgement. This model for collecting judgements has grown in popularity for small tasks [Schnoebelen and Kuperman, 2010; Oleson et al., 2011; Nikolova et al., 2012]. In our setting, each task consists of assessing a collection of pairs for semantic relatedness. Workers perform HITs for a small amount of money per task, which is paid once workers demonstrate their reliability via an acceptable level of trust.

[6] Both methods of recruiting assessors were approved by the RMIT College of Science, Engineering and Health Human Ethics Advisory Network with the identifier A&BSEHAPP93-10. Due to the remuneration mechanism in crowd-sourcing platforms, this application was amended to consider their use.

To assess the trust in a worker’s output, requesters may include, along with their HITs, a set of tasks with a known and true answer, called gold items [Oleson et al., 2011]. Should a gold item be assessed incorrectly, the trust that the platform has in that worker is compromised. Depending on the number of assessments provided by a worker, the final trust value determines whether that worker's assessments can be trusted. Trust is calculated as the ratio of gold items answered correctly; if this value exceeds a threshold (75%), the worker is paid.
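The trust calculation can be sketched in a few lines of Python. This is not the platform's implementation: the function names, the assumed 0 to 4 scale, the sets of accepted answers and the placeholder pair names are all assumptions made for illustration. Only the 75% threshold and the definition of trust as the ratio of correctly answered gold items come from the text.

```python
TRUST_THRESHOLD = 0.75  # the 75% threshold stated in the text

# Hypothetical gold items: each maps a pair to the set of scores accepted as
# correct on an assumed 0-4 scale. The accepted sets are illustrative guesses.
GOLD_ITEMS = {
    "Human-Human": {4},               # identical terms
    "Sklbmkd-Ejigrnwe": {0},          # meaningless strings
    "ConceptA-ConceptB": {2, 3, 4},   # stand-in for a high-agreement pair
    "ConceptC-ConceptD": {2, 3, 4},   # stand-in for a high-agreement pair
}


def trust_score(answers, gold_items=GOLD_ITEMS):
    """Return the ratio of judged gold items that were answered correctly."""
    judged = [pair for pair in gold_items if pair in answers]
    if not judged:
        return 0.0  # no gold item seen yet, so no evidence of reliability
    correct = sum(1 for pair in judged if answers[pair] in gold_items[pair])
    return correct / len(judged)


def is_trusted(answers, gold_items=GOLD_ITEMS, threshold=TRUST_THRESHOLD):
    """A worker is trusted only when the trust score exceeds the threshold."""
    return trust_score(answers, gold_items) > threshold


careful = {"Human-Human": 4, "Sklbmkd-Ejigrnwe": 0,
           "ConceptA-ConceptB": 3, "ConceptC-ConceptD": 2}
careless = dict(careful, **{"Sklbmkd-Ejigrnwe": 3})  # one gold item missed
```

Here `careful` answers every gold item correctly and is trusted, while `careless` scores exactly 0.75, which does not exceed the threshold under the strict reading of "exceeds" used in this sketch.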

To migrate our experimental setting to CrowdFlower, we used the tools provided by the API of this platform. We uploaded four jobs, one for each survey from the Web interface used in Chapter 5 (totalling 120 pairs), and a final job containing the remaining 202 pairs. For each job, we included a set of gold questions constructed in two ways:

• additional pairs where simple cases of relatedness were displayed: (a) pairs containing exactly the same term on both sides, for instance Human-Human; and (b) pairs displaying meaningless strings, for example Sklbmkd-Ejigrnwe.

• existing pairs from the Subset with Cross-domain Pairs where user agreement from the Web interface experiment was high (see Section 6.4.2); more specifically, where relatedness was deemed between the middle (2) and the highest value (4) for all assessors of that pair (i.e. the region of positive perceived relatedness).

We collected judgements from 159 CrowdFlower workers, who could judge the perceived relatedness of pairs for as long as their trust score permitted. In total, across both interfaces, 12,336 assessments were collected for the pairs in the dataset, distributed in such a way that at least 25 judgements were collected for each pair. Unsure votes accounted for 0.6% of all the judgements received during the experiment; these votes were excluded when computing the average scores.
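As a small illustration of how unsure votes can be excluded before averaging, the following Python sketch computes a per-pair average. Representing an unsure vote as `None`, the function name and the example judgements are assumptions made for illustration, not details taken from the study.

```python
from statistics import mean

UNSURE = None  # assumed representation of an "unsure" vote


def average_scores(judgements):
    """Average the scores of each pair, ignoring unsure votes.

    `judgements` is an iterable of (pair, score) tuples, where score is a
    number or UNSURE.
    """
    by_pair = {}
    for pair, score in judgements:
        if score is UNSURE:
            continue  # unsure votes do not contribute to the average
        by_pair.setdefault(pair, []).append(score)
    return {pair: mean(scores) for pair, scores in by_pair.items()}


# Invented judgements, for illustration only.
votes = [
    ("Tax-Investment", 4),
    ("Tax-Investment", UNSURE),
    ("Tax-Investment", 2),
    ("Gamble-Picnic", 1),
]
averages = average_scores(votes)
```

A pair that receives only unsure votes would simply be absent from the result in this sketch.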

6.5 Results

Recall that the hypothesis of this experiment is that, for a set of Same-domain Pairs and Cross-domain Pairs, even where an automated measure of semantic relatedness assigns them very similar scores, human judges assess the former as significantly more related than the latter. In this way, we show that domain information has some effect on the measurement of semantic relatedness, by “boosting” this measurement for pairs from within the same domain or “penalising” pairs from different domains. The results obtained in this experiment are described in the following sections. We analyse semantic relatedness between pairs with respect to a domain via the following classification: Same-domain Pairs (SDP), Cross-domain Pairs (XDP), and a special case termed Wikilink Cross-domain Pairs (WXDP), which are Cross-domain Pairs connected by a wikilink.

6.5.1 Analysis by Categories

To analyse the results obtained after the dataset was labelled, we separated pairs according to their placement with respect to a domain, that is, into two subsets: SDP and XDP. We considered the WXDP subset separately. As an initial investigation, we confirm the expectation that Same-domain Pairs are overall more related than Cross-domain Pairs.

6.5.1.1 Analysis of Same- and Cross-domain Pairs

We obtained the average scores assigned over all pairs in the dataset, and used box plots to show their distribution (Figure 6.2). We tested whether the difference between these subsets is significant using a Wilcoxon rank-sum test. Recall that this test determines whether the scores of two different populations are similarly distributed. The test reported a significant difference (p < 0.01) between the subsets of Same-domain Pairs and Cross-domain Pairs, with average values of 2.30 (±0.01) and 0.98 respectively. This result is unsurprising: as expected, concepts from different domains are intuitively less related than concepts from the same domain.
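For readers unfamiliar with the test, the following Python sketch computes a two-sided Wilcoxon rank-sum test using the normal approximation. It is a simplified illustration, not the procedure used to produce the results above: the function names and the sample scores are assumptions, and the variance is not corrected for ties, which a careful analysis of heavily tied data would include.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2         # mean of w under the null
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    return z, p


# Invented average scores, for illustration only.
same_domain = [2.1, 2.5, 3.0, 2.8, 3.4, 2.2, 2.9, 3.1]
cross_domain = [0.5, 1.0, 0.7, 1.2, 0.9, 0.3, 1.1, 0.8]
z, p = rank_sum_test(same_domain, cross_domain)
```

With these invented samples every same-domain score outranks every cross-domain score, so the statistic is positive and the p-value falls below 0.01.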

6.5.1.2 Analysis by Pair Classification

We repeated the analysis performed above, this time subdividing the subset of Cross-domain Pairs in two according to the existence of wikilinks between concept pairs. The distribution of the scores over these subsets is shown in Figure 6.3. We first compared the three subsets to determine whether their scores as a group are significantly different using a Kruskal-Wallis rank-sum test, as this test allows us to detect differences between more than two groups measured on the same scale. This test reported that humans deem these three subsets to be different on average (p < 0.01), with mean scores of SDP = 2.22 (±0.1), XDP = 0.70 (±0.05) and WXDP = 2.00.

Figure 6.2: Distribution boxes of average scores deemed by humans for same-domain and cross-domain pairs. (Horizontal axis: type of pair; vertical axis: human relatedness score, 0 to 4.)

test between each pair of groups. These tests reported significant differences between the subsets SDP and XDP, as well as between XDP and WXDP (p < 0.01). However, for Same-domain Pairs and Wikilink Cross-domain Pairs, assessors did not consider them to be significantly different (p = 0.45). This means that pairs of concepts from different domains tend to be scored lower than pairs from within the same domain, unless they are connected via a wikilink relation. When this happens, the difference between these pairs and pairs from the same domain is not statistically significant.

A similar effect with wikilinks was detected in the exploratory study conducted in the previous chapter. There, we found that pairs of concepts sharing a domain that are connected via wikilinks are perceived by humans to be as related as pairs of terms referring to the same concept. While wikilinks have been relevant components of Wikipedia-based measures of semantic relatedness (e.g. wlm [Milne and Witten, 2008] and raco [Grieser et al., 2011]), their influence on semantic relatedness had not, to our knowledge, been demonstrated prior to this study.

From the plot shown in Figure 6.3 and by comparing the distribution boxes in Figure 6.4, we note that the scores assigned by automatic relatedness measures are not distributed similarly to human assessments. The only exception to this is the Concept-based Normalised Web Relatedness measure using Wikipedia (nwrc); this reinforces the correlation results obtained in the exploration conducted in the previous chapter, where this measure achieved the highest correlation with human judgements.

Figure 6.3: Box plots of average scores deemed by judges for same-domain, cross-domain and wikilink cross-domain pairs. (Horizontal axis: type of pair; vertical axis: human relatedness score, 0 to 4.)
