3. RESULTS AND DISCUSSION
3.1 Fluid dynamics, comparison between simulation and experiment
The expressions derived in the previous subsection help us understand the behavior of crowdsourcing systems. One can define an ordering principle for the quality of crowds in terms of their distributed inference performance. This is a valuable concept since it provides a tool to evaluate a given crowd. Such a valuation could be used by the task manager to pick the appropriate crowd for the task based on the performance requirements. For example, if the task manager is interested in constraining the misclassification probability of the task to a prescribed level while simultaneously minimizing the required crowd size, the above expressions can be used to choose the appropriate crowd.
Theorem 7.3.4 (Ordering of Crowds). Consider crowdsourcing systems involving crowd C(µ) of
workers with i.i.d. reliabilities with mean µ. Crowd C(µ) performs better than crowd C(µ0) for
classification if and only if µ > µ0.
Proof. As can be observed from Props. 7.3.1 and 7.3.3, the average misclassification probabilities
depend on the crowd only through the mean of its reliabilities, and they decrease as that mean increases.
Therefore, crowd C(µ) of workers with i.i.d. reliabilities with mean µ performs better for classification
than crowd C(µ0) of workers with i.i.d. reliabilities with mean µ0 exactly when µ > µ0.
Since the performance criterion is average misclassification probability, this can be regarded as a weak criterion of crowd-ordering in the mean sense. Thus, with this crowd-ordering, better crowds yield better performance in terms of average misclassification probability. Indeed, misclassification probability decreases with better quality crowds. In this chapter, the term reliability has been used to describe the individual worker's reliability while the term quality is a description of the total reliability of a given crowd (a function of mean µ of worker reliabilities). For example, for the spammer-hammer model, quality of the crowd is a function of the number of hammers in the crowd, while the individual crowd workers have different reliabilities depending on whether the worker is a spammer or a hammer.
Proposition 7.3.5. Average misclassification probability reduces with increasing quality of the crowd.
Proof. Observe from Props. 7.3.1 and 7.3.3 for coding- and majority-based approaches, respectively, that the average misclassification probability is a monotonically decreasing function of the mean of reliabilities of the crowd (µ). This value µ serves as a quality parameter of the crowd and, therefore, average misclassification probability reduces with increasing quality of the crowd.
To get more insight, a crowdsourcing system with coding is simulated as follows: N = 10 workers take part in a classification task with M = 4 equiprobable classes. A good code matrix A is found by simulated annealing [166]:
A = [5, 12, 3, 10, 12, 9, 9, 10, 9, 12]. (7.15)
Here and in the sequel, code matrices are represented as a vector of M-bit integers. Each integer $r_j$ represents a column of the code matrix A and can be expressed as $r_j = \sum_{l=0}^{M-1} a_{lj} \times 2^{l}$. For example, the integer 5 in column 1 of A represents $a_{01} = 1$, $a_{11} = 0$, $a_{21} = 1$ and $a_{31} = 0$.
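The integer representation of code matrix columns can be unpacked with a few lines of Python. The following is a minimal sketch; the helper name `decode_code_matrix` and the list-of-rows layout are illustrative choices, not part of the original text.

```python
def decode_code_matrix(columns, num_classes):
    """Expand integer-encoded columns into an M x N binary matrix.

    Bit l of columns[j] is the entry a_lj, so that
    columns[j] == sum(a_lj * 2**l for l in range(M)).
    """
    return [[(r >> l) & 1 for r in columns] for l in range(num_classes)]


# Code matrix (7.15): N = 10 columns, M = 4 classes.
A_715 = [5, 12, 3, 10, 12, 9, 9, 10, 9, 12]
A = decode_code_matrix(A_715, num_classes=4)

# Column 1 is the integer 5, i.e. a_01 = 1, a_11 = 0, a_21 = 1, a_31 = 0.
first_column = [row[0] for row in A]
print(first_column)  # [1, 0, 1, 0]

# Re-encoding each column recovers the original integers.
reencoded = [sum(A[l][j] << l for l in range(4)) for j in range(len(A_715))]
```

Each row of `A` then lists the binary answers that a perfectly reliable crowd would return for the corresponding class.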
Consider the setting where all the workers have the same reliability p_j = p. Fig. 7.2 shows the probability of misclassification as a function of p. As is apparent, the probability of misclassification decreases with reliability and approaches 0 as p → 1, as expected.
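A rough Monte Carlo sketch can reproduce the qualitative trend of misclassification probability against p. It rests on assumptions that this section does not state: each worker makes a local M-ary decision that is correct with probability p and otherwise uniform over the remaining classes, answers with the matching bit of its column, and the fusion rule picks the class whose code-matrix row is closest in Hamming distance, with ties broken at random. The function names are hypothetical.

```python
import random


def decode_code_matrix(columns, num_classes):
    # Bit l of columns[j] is the code matrix entry a_lj.
    return [[(r >> l) & 1 for r in columns] for l in range(num_classes)]


def simulate_error_rate(columns, num_classes, p, trials=2000, seed=0):
    """Estimate misclassification probability for identical reliability p.

    Assumed worker model: correct local class with probability p, otherwise
    uniform over the other classes. Assumed fusion rule: minimum Hamming
    distance to the rows of the code matrix, ties broken at random.
    """
    rng = random.Random(seed)
    A = decode_code_matrix(columns, num_classes)
    errors = 0
    for _ in range(trials):
        truth = rng.randrange(num_classes)  # equiprobable classes
        answers = []
        for j in range(len(columns)):
            if rng.random() < p:
                local = truth
            else:
                local = rng.choice([c for c in range(num_classes) if c != truth])
            answers.append(A[local][j])  # worker j reports one bit
        dists = [sum(a != b for a, b in zip(row, answers)) for row in A]
        best = min(dists)
        guess = rng.choice([c for c, d in enumerate(dists) if d == best])
        errors += guess != truth
    return errors / trials


A_715 = [5, 12, 3, 10, 12, 9, 9, 10, 9, 12]  # code matrix (7.15)
for p in (0.6, 0.8, 1.0):
    print(p, simulate_error_rate(A_715, num_classes=4, p=p))
```

Under these assumptions the estimate falls to zero at p = 1, because the four rows of (7.15) are distinct and every worker then reports the bit of the true class.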
Now the performance of the coding-based approach is compared to the majority-based approach. Fig. 7.3 shows misclassification probability as a function of crowd quality for N = 10 workers taking part in an (M = 4)-ary classification task. The spammer-hammer model, where spammers have reliability p = 1/M and hammers have reliability p = 1, is used. The figure shows a slight improvement in performance over majority vote when code matrix (7.15) is used.
Now consider a larger system with increased M and N. A good code matrix A for N = 15 and M = 8 is found by cyclic column replacement:
[Plot: x-axis, reliability; y-axis, misclassification probability.]
Fig. 7.2: Coding-based crowdsourcing system misclassification probability as a function of worker reliability.
[Plot: x-axis, quality of crowd; y-axis, misclassification probability (log scale); curves: coding approach, majority approach.]
Fig. 7.3: Misclassification probability as a function of crowd quality using coding- and majority-based approaches with the spammer-hammer model, (M = 4, N = 10).
The code matrix for the system with N = 90 and M = 8 is formed sub-optimally by concatenating the columns of (7.16) six times, because optimizing a code matrix of this size with either simulated annealing or cyclic column replacement is computationally very expensive. This can be interpreted as a crowdsourcing system of 90 crowd workers consisting of 6 sub-systems of 15 workers each, which are given the same task and whose data are fused together. In the extreme case, if each of these sub-systems were of size one, it would correspond to a majority vote where all the workers are posed the same question. Fig. 7.4 shows the performance when M = 8 and N takes the two values N = 15 and N = 90. These figures suggest that the gap in performance generally increases with system size. Similar observations hold for the beta model of crowds; see Figs. 7.5 and 7.6. Good codes perform better than majority vote because they diversify the binary questions that are asked of the workers.
From extensive simulations, we found that the coding-based approach is not very sensitive to the choice of code matrix A as long as every column has an approximately equal number of ones and zeroes. However, if a code is chosen at random, performance may degrade substantially, especially when the quality of the crowd is high. For example, consider a system consisting of N = 15 workers performing an (M = 8)-ary classification task. Their reliabilities are drawn from a spammer-hammer model, and Fig. 7.7 compares the coding-based approach using the optimal code matrix, the majority-based approach, and the coding-based approach using a random code matrix with an equal number of ones and zeroes in every column. The performance of the coding-based approach with a random code matrix deteriorates for higher quality crowds.
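The two constructions mentioned above, a random code matrix with balanced columns and a larger code formed by column concatenation, can be sketched as follows. The function names are hypothetical. The 15-column base code here is a random stand-in rather than the actual matrix (7.16), and the way random codes were drawn for Fig. 7.7 is assumed, not taken from the text.

```python
import random


def random_balanced_code(num_classes, num_workers, seed=0):
    """Integer-encoded columns, each with exactly M/2 ones (M assumed even)."""
    rng = random.Random(seed)
    columns = []
    for _ in range(num_workers):
        ones = rng.sample(range(num_classes), num_classes // 2)
        columns.append(sum(1 << l for l in ones))  # set bit l for each chosen row
    return columns


def concatenate_code(columns, copies):
    """Repeat the columns of a code matrix `copies` times, side by side."""
    return list(columns) * copies


# Stand-in for a 15-column code with M = 8 (NOT the matrix (7.16)).
base = random_balanced_code(num_classes=8, num_workers=15)

# N = 90 system: six sub-systems of 15 workers that reuse the same columns.
large = concatenate_code(base, copies=6)
print(len(base), len(large))  # 15 90
```

Concatenation keeps every column balanced, so the larger code inherits that property from the base code without any further optimization.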
Experimental Results for Real Datasets
In this section, the proposed coding-based approach is tested on six publicly available Amazon Mechanical Turk data sets, quantized versions of the data sets in [130]: the anger, disgust, fear, joy, sadness and surprise data sets of the affective text task. Each data set consists of 100 tasks with N = 10 workers taking part in each. Each worker reports a value between 0 and 100,
[Plot: x-axis, quality of crowd; y-axis, misclassification probability (log scale); curves: coding approach (N = 15), majority approach (N = 15), coding approach (N = 90), majority approach (N = 90).]
Fig. 7.4: Misclassification probability as a function of crowd quality using coding- and majority-based approaches with the spammer-hammer model, (M = 8).
[Plot: x-axis, β; y-axis, misclassification probability (log scale); curves: coding approach, majority approach.]
Fig. 7.5: Misclassification probability as a function of β using coding- and majority-based approaches with the Beta(α = 0.5, β) model, (M = 4, N = 10).
[Plot: x-axis, β; y-axis, misclassification probability (log scale); curves: coding approach (N = 15), majority approach (N = 15), coding approach (N = 90), majority approach (N = 90).]
Fig. 7.6: Misclassification probability as a function of β using coding- and majority-based approaches with the Beta(α = 0.5, β) model, (M = 8).
[Plot: x-axis, quality of crowd; y-axis, misclassification probability (log scale); curves: coding approach with optimal codes, majority approach, coding approach with random codes.]
Fig. 7.7: Misclassification probability as a function of crowd quality using optimal code matrix, random code matrix for coding-based approach and majority approach with the spammer-hammer model, (M = 8, N = 15).
Table 7.1: Fraction of errors using coding- and majority-based approaches
Dataset     Coding-based approach    Majority-based approach
Anger       0.31                     0.31
Disgust     0.26                     0.20
Fear        0.32                     0.30
Joy         0.45                     0.47
Sadness     0.37                     0.39
Surprise    0.59                     0.63
and there is a gold-standard value for each task. For the analysis, the values are quantized by dividing the range into M = 8 equal intervals. The majority-based approach is compared with the proposed coding-based approach. A good code matrix for N = 10 and M = 8 is designed by simulated annealing [166]:
A = [113, 139, 226, 77, 172, 74, 216, 30, 122]. (7.17)
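The quantization step described above, mapping a reported value in the range 0 to 100 to one of M = 8 equal intervals, might look like the sketch below. The function name and the boundary handling are assumptions: intervals are taken as half-open, and the top value 100 is assigned to the last interval.

```python
def quantize(value, num_levels=8, low=0.0, high=100.0):
    """Map a value in [low, high] to an interval index in 0..num_levels-1.

    Assumed convention: intervals are half-open [k*w, (k+1)*w) with
    w = (high - low) / num_levels, and `high` itself falls in the last one.
    """
    width = (high - low) / num_levels
    index = int((value - low) // width)
    return min(max(index, 0), num_levels - 1)  # clamp out-of-range inputs


reports = [0, 12.4, 12.5, 50, 99.9, 100]
print([quantize(v) for v in reports])  # [0, 0, 1, 4, 7, 7]
```

The same function would be applied to both the worker reports and the gold-standard values, so that decisions and ground truth are compared on the same 8-level scale.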
Table 7.1 compares the performance of the coding- and majority-based approaches. The values in Table 7.1 are the fraction of wrong decisions made, as compared with the gold-standard value. As indicated, the coding-based approach performs at least as well as the majority-based approach in 4 of the 6 cases considered (anger, joy, sadness and surprise). The gap in performance is expected to increase as problem size M and crowd size N increase. Also, while the coding-based approach is only slightly better than the majority approach in these cases, this comparison shows the benefit of the proposed coding-based approach only in terms of the fusion scheme. The data sets contain data for tasks where the workers reported continuous values and, therefore, they do not capture the benefit of asking binary questions. This aspect is a major benefit of the proposed coding-based approach, and its empirical testing is yet to be carried out.
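The count of 4 of 6 cases can be checked directly from the entries of Table 7.1. The short sketch below only re-reads the tabulated error fractions; the variable names are illustrative.

```python
# (coding-based, majority-based) fraction of errors, copied from Table 7.1.
table_7_1 = {
    "Anger": (0.31, 0.31),
    "Disgust": (0.26, 0.20),
    "Fear": (0.32, 0.30),
    "Joy": (0.45, 0.47),
    "Sadness": (0.37, 0.39),
    "Surprise": (0.59, 0.63),
}

# Cases where coding makes no more errors than majority voting.
at_least_as_good = [
    name for name, (coding, majority) in table_7_1.items() if coding <= majority
]
print(len(at_least_as_good), "of", len(table_7_1), at_least_as_good)
```

Anger is a tie, so the coding-based approach is strictly better in three of the six data sets and strictly worse in two (disgust and fear).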