Like the Yahoo! Answers dataset, the Yelp dataset has an unbalanced distribution of credibility classes (credible: 904, not_credible: 79). Average precision (AP) was used to measure classifier performance, with the negative class (not_credible) as the target, as for the Yahoo! Answers dataset. Table 40 summarizes the performance of the full model, which uses all features, and of the reduced models produced by various feature selection methods. All models were trained and evaluated using 10-fold cross-validation.
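To make the measure concrete, the following is a minimal sketch of average precision with not_credible as the target class. It assumes the classifier outputs a score for each item that is higher when the item is more likely to be not_credible. The function name, the toy labels, and the scores are illustrative and do not come from the study.

```python
def average_precision(labels, scores, target="not_credible"):
    """Mean of precision@k over the ranks k at which a target-class item appears."""
    # Rank items from most to least likely to belong to the target class.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == target:
            hits += 1
            precisions.append(hits / k)
    if not precisions:
        raise ValueError("no target-class items in labels")
    return sum(precisions) / len(precisions)


# Toy example with made-up scores (predicted likelihood of "not_credible").
labels = ["credible", "not_credible", "credible", "not_credible", "credible"]
scores = [0.10, 0.90, 0.80, 0.60, 0.20]
ap = average_precision(labels, scores)

# Share of the target class in the Yelp data, from the class counts above.
# A ranking with no skill is expected to score roughly this value.
prevalence = 79 / (904 + 79)
```

Because a ranking with no skill is expected to score roughly the target-class prevalence (about 0.08 here), AP values on this dataset are best read against that baseline rather than against 0.5.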
Table 40. Performance of Credibility Classifiers (Yelp)

Selection Method               Criterion            AP
Full Model                     -                    0.289
Stepwise                       P-value              0.509
Recursive Feature Elimination  Accuracy             0.408
Learning Vector Quantization   Variable Importance  0.285
Stepwise selection using the p-value showed the best performance. Considering that only 8% of the data belong to the negative class, the best classifier (stepwise, AP 0.509) performed reasonably well in terms of average precision. However, for consistency with the feature ablation study for Yahoo! Answers, the features selected by stepwise selection with the AIC criterion were used instead. The comparatively high AP obtained with p-value-based stepwise selection also suggested possible overfitting. The features selected by AIC-based stepwise selection are listed in Table 41, and the model built on those features served as the control model.
Table 41. Final Features Selected for Feature Ablation Study (Yelp)

Top-Category             Sub-Category       Features
Content Informativeness  Comprehensiveness  Word count, lexical diversity, and document entropy
Content Informativeness  Specificity        Count of UMLS concepts, average counts of named entities and UMLS concepts, and count of named entities
Presentation             Readability        ARI, FleschKincaidGradeLevel, LIX, and RIX
Sentiment                -                  Positive_emotion, sadness, polarity, and general_dislike
Source                   Expertise          Count of UMLS concepts related to jargon
As there were five sub-level feature categories and one top-level category selected, six treatment models were created and compared to the control model. The statistical comparison of the classifiers by the Friedman test is summarized in Table 42. The results indicated statistically significant differences among the classifiers in terms of AP. Post hoc comparisons were therefore conducted with the paired Wilcoxon signed-rank test and the Bootstrap-Shift test to examine which differences were significant. Holm's adjustment of the significance level was applied to both post hoc tests. The results of the feature ablation study are presented in Table 43. The rows, columns, and symbols have the same meaning as in Table 39.
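Holm's procedure can be expressed equivalently as a step-down adjustment of the p-values themselves. The sketch below shows that standard formulation. The source does not say how the adjustment was implemented, so the function name and the raw p-values here are illustrative assumptions only.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the hypotheses, from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for four hypothetical comparisons.
raw = [0.010, 0.040, 0.030, 0.005]
adj = holm_adjust(raw)
```

Under this formulation, adjusted p-values are capped at 1, so values of exactly 1.000 can arise for comparisons whose raw p-values are not small.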
Table 42. Friedman Test Result (Yelp)

Statistic             Value
Friedman chi-squared  27.471
p-value               0.000*
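For reference, the Friedman statistic is conventionally computed from ranks assigned within each block. The sketch below assumes that each cross-validation fold is a block and each classifier is a treatment. It omits the tie correction and uses made-up scores, and the source does not state which implementation produced the value in Table 42.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi_squared(blocks):
    """blocks[i][j] is the score of classifier j on fold i (no tie correction)."""
    n = len(blocks)      # number of folds (blocks)
    k = len(blocks[0])   # number of classifiers (treatments)
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Made-up AP scores: 4 folds (rows) by 3 classifiers (columns).
toy_scores = [
    [0.40, 0.30, 0.20],
    [0.45, 0.35, 0.25],
    [0.38, 0.30, 0.22],
    [0.41, 0.42, 0.20],
]
stat = friedman_chi_squared(toy_scores)
```

With k classifiers, the statistic is compared against a chi-squared distribution with k - 1 degrees of freedom.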
Table 43. Feature Ablation Study Results (Yelp)

Feature group                  AP     Percent change  P-value (Wilcoxon)  P-value (Bootstrap)
All                            0.393
- Sentiment (t)                0.231  -41.22%         0.049*              0.002*
- Content Informativeness (t)  0.253  -35.62%         0.006*              0.017*
- Comprehensiveness (s)        0.348  -11.45%         1.000               0.724
- Specificity (s)              0.371  -5.60%          1.000               0.744
- Expertise (s)                0.392  -0.25%          1.000               1.000
- Readability (s)              0.414  +5.34%          1.000               1.000
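The percent-change column in Table 43 is consistent with the relative difference between each treatment model's AP and the control model's AP. A minimal check follows, with the AP values copied from the table and the variable names chosen for illustration:

```python
control_ap = 0.393  # AP of the control model ("All" row in Table 43)

# AP of each treatment model, i.e. the control model minus one feature group.
treatment_ap = {
    "Sentiment": 0.231,
    "Content Informativeness": 0.253,
    "Comprehensiveness": 0.348,
    "Specificity": 0.371,
    "Expertise": 0.392,
    "Readability": 0.414,
}


def percent_change(treatment, control):
    """Relative change of the treatment AP with respect to the control AP, in percent."""
    return (treatment - control) / control * 100.0


changes = {
    name: percent_change(ap, control_ap) for name, ap in treatment_ap.items()
}
```

A negative value means that removing the feature group lowered AP, so the group was contributing to the control model. The positive value for Readability means AP rose slightly when that group was removed.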
The Wilcoxon signed-rank test and the Bootstrap-Shift test indicated only two statistically significant differences in mean average precision across the six sets of paired samples. The most discriminative feature group for each level of the feature category (top- and sub-level) is marked in bold. Among the top-level feature categories, the Bootstrap-Shift test ranked the sentiment features as more influential than the content informativeness features (p = 0.002 vs. 0.017), whereas the Wilcoxon signed-rank test indicated the opposite (p = 0.049 vs. 0.006). No sub-level feature category had a statistically significant effect on predicting credibility.
5.3.3 Discussion
The post hoc tests revealed several trends. First, the feature categories appear to have weaker discriminative power for Yelp than for Yahoo! Answers. For both the Wilcoxon and Bootstrap-Shift tests, the p-values of the discriminative feature categories were smaller in Yahoo! Answers than in Yelp, so the Yahoo! Answers feature categories appear to contribute more to predicting credibility. It should be noted that the Yelp dataset has a more skewed distribution of credibility classes than the Yahoo! Answers dataset, and that assessors agreed less on Yelp (Kf: 0.18) than on Yahoo! Answers (Kf: 0.342). Because reviews are subjective by nature, agreement among credibility assessors is low, which likely makes credibility harder to predict for Yelp than for Yahoo! Answers.
Second, the most discriminative feature category differed between Yahoo! Answers and Yelp. In Yahoo! Answers, the content informativeness category and its sub-category, comprehensiveness, were the most discriminative. In Yelp, the content informativeness category was discriminative but the comprehensiveness sub-category was not; instead, the sentiment category was the most discriminative. In other words, in Yelp the sentiment features significantly affect the proportion of negative-class predictions that are truly negative.
Third, the two post hoc tests (the Wilcoxon signed-rank test and the Bootstrap-Shift test) gave similar results, but there were also discrepancies. In both datasets, features tended to show lower p-values in the Bootstrap-Shift test than in the Wilcoxon signed-rank test. In Yelp, the sentiment features were more influential than the content informativeness features according to the Bootstrap-Shift test, whereas the opposite held for the Wilcoxon signed-rank test. Because the Wilcoxon test works on the ranks of the absolute differences rather than on their magnitudes (larger differences receive higher ranks, but their size is otherwise ignored), the discrepancy may reflect the distribution of the data and the characteristics of each test. No single test should be trusted blindly; rather, the results of several tests should be combined when interpreting the outcome.
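The difference between a rank-based and a magnitude-based test can be illustrated with a small sketch. It uses an exact Wilcoxon signed-rank test by enumeration and one common formulation of a paired bootstrap shift test; the exact procedures used in the study may differ. The per-fold differences, function names, resample count, and seed are all hypothetical.

```python
import random
from itertools import product


def signed_rank_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value by enumerating sign patterns.

    Assumes no zero differences and no tied magnitudes.
    """
    magnitudes = sorted(abs(d) for d in diffs)
    rank_of = {m: r for r, m in enumerate(magnitudes, start=1)}
    ranks = [rank_of[abs(d)] for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected rank sum under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - centre) >= observed:
            extreme += 1
    return extreme / total


def bootstrap_shift_p(diffs, n_resamples=10000, seed=0):
    """Two-sided paired bootstrap shift test on the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences so their mean is zero, which imposes the null.
    shifted = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(shifted) for _ in range(n)) / n
        if abs(sample_mean) >= abs(observed):
            extreme += 1
    return extreme / n_resamples


# Hypothetical per-fold AP differences (control minus treatment), 10 folds.
even = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, -0.005, 0.09]
# Same signs and same rank order, but the largest difference is ten times bigger.
skewed = even[:-1] + [0.90]

p_wilcoxon_even = signed_rank_p(even)
p_wilcoxon_skewed = signed_rank_p(skewed)
p_bootstrap_even = bootstrap_shift_p(even)
p_bootstrap_skewed = bootstrap_shift_p(skewed)
```

Both difference vectors have identical signs and ranks, so the signed-rank p-value is the same for each. The bootstrap shift p-value changes because it operates on the magnitudes of the differences. This is one way the two tests can order feature groups differently on the same folds.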