La Soberanía Alimentaria en los Planes Directores de Cooperación


To examine how the effects of features on the credibility of information vary by topic, each dataset (Yahoo! Answers and Yelp) was separated into two groups: one with general topics and another with specific topics. A feature ablation study was conducted on each group. The Friedman test and post hoc tests were performed as in the previous section. All the experiments in this section used the features (refer to Table 37 and Table 41) selected by the stepwise feature selection method performed in the previous section.

5.4.1 Yahoo! Answers

The performance of the best model (refer to Table 37) chosen by stepwise selection based on AIC is summarized in Table 44. I trained and evaluated models using 10-fold cross-validation. The performance measure reported here is average precision (AP) averaged across the 10 folds. In general, predicting the credibility of health-related answers about a general topic seems easier than predicting the credibility of health-related answers about a specific topic: average precision on the general topic was higher than on the specific topic. Interestingly, assessors had lower agreement on the general topic than on the specific topic. This contradicts the earlier results, in which we found a correlation between the classifier's performance and the assessors' agreement. Potential reasons for this finding are covered in the discussion section.

Table 44. Performance of Credibility Classifiers over Topic (Yahoo! Answers)

Topic      AP      Fleiss’ Kappa
Specific   0.615   0.361
General    0.694   0.323
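The AP values in Table 44 are averages over cross-validation folds. A minimal sketch of that averaging follows, assuming binary credibility labels (1 = credible) and one list of classifier scores per fold. The function names and toy data are hypothetical and are not taken from the original experiments.

```python
def average_precision(labels, scores):
    """AP for one ranked list: mean of precision@k over the ranks k
    at which a positive (label == 1) item appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # Convention assumed here: a fold with no positive items scores 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_ap_over_folds(folds):
    """Average the per-fold AP values; `folds` is a list of (labels, scores) pairs."""
    per_fold = [average_precision(labels, scores) for labels, scores in folds]
    return sum(per_fold) / len(per_fold)


# Hypothetical held-out predictions for two folds (a real run would have ten).
toy_folds = [
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ([0, 1], [0.6, 0.4]),
]
toy_mean_ap = mean_ap_over_folds(toy_folds)
```

Library implementations such as scikit-learn's `average_precision_score` may handle tied scores differently from this sketch.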

Statistical comparisons of classifiers were performed using the Friedman test. The results showed statistically significant differences among classifiers on both topics (Table 45). Thus, post hoc comparisons of classifiers were conducted using the paired Bootstrap-Shift test to examine which differences were significant. The Wilcoxon signed-rank test was not used because it could ignore differences in absolute values, even though relatively large differences were found in multiple folds in the previous experiments. Both tests showed the same result regarding statistically significant feature categories. Holm’s modification of the significance level was applied to the post hoc tests.

Table 45. Results of Friedman Test over Topic (Yahoo! Answers)

Topic                  Specific   General
Friedman chi-squared   16.586     25.763
p-value                0.011*     0.000*
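The chi-squared statistic in Table 45 comes from ranking the classifiers within each cross-validation fold. A minimal sketch of the textbook Friedman statistic follows, assuming one row of per-fold scores (e.g., AP) per fold and one column per classifier. It omits the tie correction that some statistical packages apply, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def friedman_chi_squared(blocks):
    """blocks[i][j] is the score of classifier j on fold i.
    Returns the Friedman chi-squared statistic (no tie correction)."""
    n = len(blocks)      # number of folds (blocks)
    k = len(blocks[0])   # number of classifiers (treatments)
    rank_sums = [0.0] * k
    for block in blocks:
        for j, rank in enumerate(average_ranks(block)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

The statistic is compared against a chi-squared distribution with k - 1 degrees of freedom to obtain the p-value. In practice a library routine such as `scipy.stats.friedmanchisquare` returns both the statistic and the p-value.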

Results from the post hoc tests are presented as p-values for average precision (Table 46). The rows labeled “-X” show the p-value of the Bootstrap-Shift test between the control model using all features and a treatment model using all features except those in feature category “X”. The letter in parentheses indicates whether the feature category “X” is a top-level category (t) or a sub-level category (s) in the feature hierarchy. The superscripted number at the upper left of each p-value is the rank of that p-value among the models for that topic. Ranks follow the ascending order of the original p-values for average precision within each topic. The p-values reported are those adjusted by Holm’s procedure.
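The post hoc procedure described above can be sketched as follows. This is one common formulation of a two-sided paired bootstrap shift test together with a Holm adjustment, not necessarily the exact implementation used for these experiments. The function names, the number of resamples, and the seed are assumptions, and the inputs are per-fold scores for the control and treatment models.

```python
import random


def bootstrap_shift_p_value(control, treatment, n_resamples=10000, seed=0):
    """Two-sided paired bootstrap shift test on paired per-fold scores."""
    diffs = [c - t for c, t in zip(control, treatment)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences so their mean is zero, which imposes the null hypothesis.
    shifted = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resample = rng.choices(shifted, k=n)  # sample with replacement
        if abs(sum(resample) / n) >= abs(observed):
            extreme += 1
    return extreme / n_resamples


def holm_adjust(p_values):
    """Multiply the i-th smallest p-value (i = 1..m) by (m - i + 1), capped at 1.
    Assumption: the running-maximum (monotonicity) step used by some
    implementations is omitted in this sketch."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order):
        adjusted[i] = min(1.0, p_values[i] * (m - rank))
    return adjusted
```

Each ablated model would be tested against the control model, and the resulting p-values for one topic would then be passed together to `holm_adjust`.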

Table 46. Bootstrapped Feature Ablation Study Results over Topic (Yahoo! Answers)

                                Specific                           General
Feature group                   AP      Percent change   p-value   AP      Percent change   p-value
All                             0.615                              0.694
- Comprehensiveness (s)         0.574   -6.67%           ¹0.000*   0.591   -14.84%          ¹0.000*
- Content Informativeness (t)   0.579   -5.85%           ²0.002*   0.575   -17.15%          ¹0.000*
- Expertise (s)                 0.608   -1.14%           ⁴0.904    0.685   -1.3%            ⁵0.500
- Sentiment (t)                 0.614   -0.16%           ⁶0.47692  0.721   +3.89%           ⁴0.451
- Relevance (s)                 0.619   +0.65%           ⁵0.615    0.703   +1.3%            ³0.195
- Readability (s)               0.635   +3.25%           ³0.959    0.687   -1.01%           ⁶0.2847
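The "Percent change" columns in Table 46 are consistent with the relative difference between the AP of an ablated model and the AP of the model using all features. A small check under that assumption (the function name is hypothetical):

```python
def percent_change(ablated_ap, full_ap):
    """Relative change in AP, in percent, when a feature category is removed."""
    return (ablated_ap - full_ap) / full_ap * 100.0


# Values taken from Table 46.
specific_comprehensiveness = percent_change(0.574, 0.615)  # reported as -6.67%
general_informativeness = percent_change(0.575, 0.694)     # reported as -17.15%
```

A negative value means the model performed worse without the category, so the removed features were helping. A positive value means performance improved once they were removed.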

The differences in performance among classifiers appear to be larger in the general topic than in the specific topic: p-values at the same rank are generally smaller in the general topic. The content informativeness and comprehensiveness feature categories both showed statistically significant discriminative power in predicting credibility. Other feature categories such as sentiment, presentation, and plausibility did not make a difference in predicting credibility.

5.4.2 Yelp

The performance of the best model (refer to Table 41) chosen by the stepwise selection method based on AIC is summarized in Table 47. I trained and evaluated models using 10-fold cross-validation. The performance measure reported here is average precision (AP) averaged across the 10 folds. In general, predicting the credibility of health-related reviews about a specific topic seems much easier than predicting the credibility of health-related reviews about a general topic: average precision on the specific topic was much higher than on the general topic. Assessors also had higher agreement on the specific topic than on the general topic.

Table 47. Performance of Credibility Classifiers over Topic (Yelp)

Topic      AP      Fleiss’ Kappa
Specific   0.501   0.228
General    0.277   0.121
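Tables 44 and 47 report assessor agreement as Fleiss' kappa. A minimal sketch of the standard formula follows, assuming every item was rated by the same number of assessors. The count-matrix layout and the function name are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of assessors who put item i in category j.
    Every row must sum to the same number of assessors (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement: per-item proportion of agreeing assessor pairs, averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    # Undefined (division by zero) if every rating falls in a single category.
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For example, three assessors who agree fully on two items with different labels, `fleiss_kappa([[3, 0], [0, 3]])`, yield a kappa of 1. Values near 0 indicate agreement no better than chance.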

Statistical comparisons of classifiers were performed using the Friedman test. The results showed statistically significant differences among classifiers only on the specific topic (Table 48). Thus, post hoc comparisons of classifiers using the paired Bootstrap-Shift test were conducted only for the specific topic, to examine which differences were significant. The Wilcoxon signed-rank test was not used for the same reasons as in Yahoo! Answers. Holm’s modification of the significance level was applied to the post hoc tests. Results from the post hoc tests are presented in Table 49.

Table 48. Results of Friedman Test over Topic (Yelp)

Topic                  Specific   General
Friedman chi-squared   13.195     10.523
p-value                0.04*      0.104

Table 49. Bootstrapped Feature Ablation Study Results over Topic (Yelp)

                                Specific
Feature group                   AP      Percent change   p-value
All                             0.501
- Content Informativeness (t)   0.373   -25.55%          ²0.345
- Sentiment (t)                 0.422   -15.77%          ¹0.192
- Comprehensiveness (s)         0.475   -5.19%           ⁴1.000
- Specificity (s)               0.480   -4.19%           ⁵1.000
- Expertise (s)                 0.498   -0.6%            ³0.84
- Readability (s)               0.563   +12.38%          ⁵1.000

It is difficult to compare the differences in performance among classifiers across topics because post hoc tests were not conducted on the general topic group. None of the feature categories was discriminative in predicting credibility. Despite relatively large percentage drops when the content informativeness and sentiment feature categories were removed, no statistically significant effect was found. Surprisingly, readability features appeared to have a negative impact on the credibility model: when the features related to readability were removed, average precision improved by 12.38%.

5.4.3 Discussion

The type of topic for which credibility was easier to predict differed by the type of social media. Whereas the general topic was easier to predict on Yahoo! Answers, the specific topic was easier to predict on Yelp. If specificity is imagined as a spectrum from specific to general, Yahoo! Answers content is closer to the “specific” end of the spectrum and Yelp content is closer to the “general” end. Specific answers on Yahoo! Answers tend to be much more specific than specific reviews on Yelp. Specific reviews on Yelp are closer to general answers on Yahoo! Answers, and general reviews on Yelp are much more general than any answers on Yahoo! Answers. To aid understanding, the degree of specificity of health information on Yahoo! Answers and Yelp is illustrated in Figure 6. Health-related questions may require a very detailed explanation depending on the topic, whereas reviews of a medical facility are limited in how specific they can be. For specific reviews, it is easier for someone to “imagine” the information needs the review might serve, and therefore it is easier to judge credibility.

Figure 6. A Spectrum of Topical Specificity of Datasets

However, another phenomenon was found regarding the level of agreement among credibility assessors in the Yahoo! Answers dataset. Generally, the higher the agreement among assessors, the higher the performance of the classifier. This was not true for Yahoo! Answers: assessor agreement was high in the specific topic and low in the general topic, but average precision showed the opposite pattern. In other words, agreement between assessors rose as the topic became more specific, and the performance of classifiers showed a similar tendency. However, for the most specific topic, which is the specific topic of Yahoo! Answers, predicting credibility became harder. Perhaps the specificity of the topic itself does not directly affect the performance of the classifier. Plausibility was the most influential platform-specific factor for Yahoo! Answers in the F-tests but was not selected in the feature selection. This credibility factor is likely to be related to topic specificity.

Because being specific on Yahoo! Answers might mean being very specific, the features in this dissertation may not capture everything they were meant to. Among the content informativeness features, only the comprehensiveness features (e.g., word count) had a statistically significant effect in Yahoo! Answers, and this effect appeared in both the specific and the general topic. However, in the case of the general topic, the features might be good enough to capture other aspects (e.g., plausibility, relevance, and specificity) of content informativeness: the drop in AP without the content informativeness features was larger in the general topic (-17.15%) than in the specific topic (-5.85%). These results suggest potential limitations in the quality of the features that operationalize the other factors of content informativeness. Thus, a closer examination of these features is necessary.

In the feature ablation study using the entire Yelp dataset, the sentiment and content informativeness feature categories were discriminative in predicting credibility. However, in this experiment, which divided the data by topic, no feature category was discriminative in either topic. This means that the sentiment and content informativeness features show a particular pattern of influence when the data are examined as a whole, but the effect of this pattern may be weakened or dispersed when the data are separated. A larger dataset is needed to observe whether the same result holds. The smaller the amount of data, the greater the impact of instances with a random pattern. The Yelp dataset might be more sensitive to the effects of such random patterns because its distribution of credibility labels is more skewed than that of the Yahoo! Answers dataset. It should be noted that features related to content informativeness were found to be influential in both Yahoo! Answers and Yelp.
