While our results show that any individual contributor is imperfect (as should be expected), extensive work in crowdsourcing has focused on how groups can collectively outperform individuals. In this section, we outline methods for improving overall precision and recall given multiple crowd workers, first by aggregating workers’ facts and then by clustering facts that share similar content. We show the increase in performance along both of these measures using the workers’ responses collected above.
3.7.1 Improving Recall with Aggregation
Since individual performance in terms of recall is just over 40% on average, we explore whether combining workers into groups would improve performance, because having more contributors increases the chance to bring in more relevant information.
Category          TimeSelect   FocusedSelect
A                          7               0
B                         13               0
C                          5               6
D                         27              15
E                         15               6
F                          3               0
Total facts               99              66
True positives            29              39
Precision               0.39            0.63
Recall                  0.30            0.45
Table 3.3: Worker-generated facts for Topic 1 in the initial “TimeSelect” condition and the “FocusedSelect” condition. By providing focused queries to the workers, we are able to capture more true positives with fewer overall facts, as well as reduce other categories of worker errors.
We measure the effect of aggregation on worker summaries by calculating average precision and recall across all possible combinations of each group size from 2 to 10. This provides a more robust measure of group performance that is less tied to the specific members’ performance. The results show that recall increases steadily as more workers are added to each group. Figure 3.2 shows that, on average, we need 5 workers to exceed 90% recall. This implies that each additional worker brings in new information that allows for improved recall, meaning that the diversity of different responses is high.
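The averaging over all group combinations described above can be sketched as follows. This is a minimal illustration, not the code used in our study: it assumes each worker's facts have already been mapped to canonical fact identifiers, and the names `group_scores`, `worker_facts`, and `ground_truth` are hypothetical.

```python
from itertools import combinations
from statistics import mean

def group_scores(worker_facts, ground_truth, size):
    """Average precision and recall over every group of `size` workers.

    worker_facts: dict mapping worker id -> set of fact ids (assumed format)
    ground_truth: non-empty set of fact ids considered relevant
    """
    precisions, recalls = [], []
    for group in combinations(worker_facts, size):
        # Aggregate indiscriminately: pool every fact from every group member.
        pooled = set().union(*(worker_facts[w] for w in group))
        hits = pooled & ground_truth
        precisions.append(len(hits) / len(pooled) if pooled else 0.0)
        recalls.append(len(hits) / len(ground_truth))
    return mean(precisions), mean(recalls)
```

Calling `group_scores` for each size from 2 to 10 gives one averaged point per group size. Because facts are pooled by set union, adding a worker can never lower a group's recall, while precision can move in either direction.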
However, as can be expected, precision does not increase when additional workers are added, since adding more people also increases the chance of adding noise to the system. To address this issue, we developed two clustering methods to investigate the impacts of workers’ agreements on fact precision, which will be discussed below.
3.7.2 Improving Precision Through Voting
When combining workers into different group sizes, rather than aggregating facts indiscriminately across all input, we use an agreement or voting scheme based on the idea that the facts more workers agree on are more likely to be true positives.
Agreement-based Clustering
If there are facts that contain similar content, we can cluster them and find “representative” facts for each cluster. We devise two methods for this agreement-based clustering:
1. Clustering based on worker summaries (Word clustering):
In this method, we consider two facts to “share similar content” if at least half of their words are similar. We define two words as similar if the Levenshtein distance between them is less than or equal to 2.
2. Clustering based on selected lines (Line clustering):
In this method, we cluster two facts together if the workers select the exact same lines from the dialog when creating those facts. This clustering method uses the raw dialog lines and does not use the worker generated summaries.
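The two clustering methods above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: the half-of-the-words test is applied to the shorter of the two facts, pairwise similarity is turned into clusters by taking connected components, and all function names are hypothetical.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similar_facts(fact_a, fact_b, max_dist=2):
    """Assumed reading: at least half of the shorter fact's words have a
    similar word (edit distance <= max_dist) in the other fact."""
    wa, wb = fact_a.lower().split(), fact_b.lower().split()
    short, long_ = (wa, wb) if len(wa) <= len(wb) else (wb, wa)
    if not short:
        return False
    matched = sum(any(levenshtein(w, v) <= max_dist for v in long_)
                  for w in short)
    return 2 * matched >= len(short)

def word_clusters(facts):
    """Word clustering: connected components of pairwise similarity
    (the component step is an assumption)."""
    parent = list(range(len(facts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(facts)):
        for j in range(i + 1, len(facts)):
            if similar_facts(facts[i], facts[j]):
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(facts)):
        groups[find(i)].append(i)
    return sorted(groups.values())

def line_clusters(selected_lines):
    """Line clustering: facts built from exactly the same dialog lines."""
    groups = defaultdict(list)
    for i, lines in enumerate(selected_lines):
        groups[frozenset(lines)].append(i)
    return sorted(groups.values())
```

Both functions return lists of fact indices, one list per cluster. With a distance threshold of 2, very short words (for example "a", "is", and "to") all count as similar to one another, so a practical version might drop stop words before comparing.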
Facts assigned to the same cluster are considered as a single fact in measurement and represented by the most frequent label (true/false positive) from members of the cluster. We then calculate precision and recall based on three levels of workers’ agreement: 1) any agreement, in which at least 2 workers agree on a fact; 2) majority agreement, in which at least half of the workers in the group agree on a fact; and, 3) unanimous agreement, in which all workers agree on the fact.
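The three agreement levels can be sketched as a filter over clusters. This is an illustrative sketch only: it assumes each cluster is a list of hypothetical `(worker_id, label)` pairs, where `label` is `True` for a true positive, and it counts distinct workers per cluster.

```python
from collections import Counter

def agreed_labels(clusters, group_size, level):
    """Return the representative label of each cluster that meets `level`.

    clusters: list of clusters, each a list of (worker_id, label) pairs
              (assumed format; label is True for a true positive).
    level: "any" (>= 2 workers), "majority" (>= half), "unanimous" (all).
    """
    thresholds = {
        "any": 2,
        "majority": -(-group_size // 2),  # ceil(group_size / 2)
        "unanimous": group_size,
    }
    needed = thresholds[level]
    kept = []
    for cluster in clusters:
        workers = {worker for worker, _ in cluster}
        if len(workers) >= needed:
            # Represent the cluster by its most frequent label; ties fall
            # back to the label seen first.
            counts = Counter(label for _, label in cluster)
            kept.append(counts.most_common(1)[0][0])
    return kept

def precision(kept_labels):
    """Share of retained clusters whose representative label is a true positive."""
    return sum(kept_labels) / len(kept_labels) if kept_labels else 0.0
```

For example, with a group of three workers, a cluster supported by only two of them is kept under "any" and "majority" agreement but dropped under "unanimous" agreement.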
Clustering Results
Figure 3.4 shows the precision, recall, and F1 score for both clustering methods. Unlike in the no clustering case from before (in Figure 3.2), when we add one more worker and take into account similarity, precision drastically improves: in both the Word and Line cluster graphs, we see precision increase past 80% compared to the 44% individual worker baseline. However, there is a drastic drop in recall in both cluster conditions, as it drops from 42% to below 10%.
Furthermore, we see that, as we increase worker group size, precision for the “any” agreement condition decreases as expected (suggesting that workers agree on both relevant and irrelevant facts), whereas recall increases (there is greater potential for any two workers to agree on a fact that is a true positive), although recall remains lower than the without-clustering recall seen in Figure 3.2. However, this trend changes when we look at the “majority” and “unanimous” agreement levels.
Figure 3.4: Precision and recall by the number of workers, based on clustering the words in the worker summaries (left) and clustering the raw dialog lines (right). We find that “any” agreement performs the best for recall, but precision suffers; on the other hand, unanimous agreement leads to 100% precision, but to low recall. This implies that additional workers bring in new information with respect to recall, and tend to agree with other workers with respect to precision. We also find that worker summaries are self-consistent enough with each other to offer better clustering performance.
Both Word and Line clustering methods show similar trends in precision and recall, where precision increases but recall decreases. For instance, for “unanimous” agreement, we achieve an expected precision of 100% with teams of five workers. We see that “majority” agreement levels trend upward towards the “unanimous” agreement case as worker group size rises past 4. This implies that the facts more workers agree on are more likely to be found in the ground truth (i.e., to be relevant); on the other hand, clustering decreases the likelihood that all relevant facts are covered with agreement from workers.
Finally, we see that Word clustering performs better than Line clustering (see F1 score); this implies that worker summaries are consistent enough that they contain more information than the original dialog lines do. Despite having a small dataset compared to those typically used in Natural Language Processing applications, we are able to make meaningful comparisons between worker-generated summaries and achieve better precision results.