There is no direct measure of agreement between annotators, given the subjective nature of the annotation task. This section presents a study of the annotators’ agreement level using the images in the Validation set. Two methods are used to measure annotation consistency.
Since emotion perception is subjective, different perceivers can recognise different emotions in the same image. For example, in both Figure 3.24.a and 3.24.b, the person in the bounding box seems to feel Affection, Happiness and Pleasure, and the annotators have consistently annotated these categories, although not every annotator selected all of them. The annotators also do not agree on Excitement and Engagement, even though these categories are reasonable in this situation. Another example is Roger Federer hitting a tennis ball in Figure 3.24.c. He is anticipating the ball (Anticipation) and clearly looks Engaged in the activity. He also seems Confident of getting the ball. Although the annotation process is subjective and not all annotators agree on every annotation, their responses are of good quality and capture subtle distinctions.
Figure 3.24 (a), (b), (c): Five different annotators for a given person in context
3.2. ANALYSIS 57
As in the statistical analysis (Section 3.2.1), several quantitative analyses of annotation agreement were conducted. The first focus was the agreement level in the category annotation. Given a category assigned to a person in an image, the number of annotators who selected that category is taken as the agreement measure. Accordingly, the agreement amongst the annotators was calculated for each category and each annotation in the validation set, and the values were sorted across categories. Figure 3.25 shows the distribution of the percentage of annotators agreeing on an annotated category across the validation set.
A criterion is needed to compare annotator agreement across the discrete categories. For each category, the number of people annotated by 5, 4, 3, 2 and 1 annotators is normalized by the total number of people annotated with that category, and these proportions are empirically weighted and combined into a rank (an annotation agreed upon by 5 annotators has the highest importance and is given the highest weight). For the EMOTIC dataset, this rank lies in [1.04, 2.87]. Practical values of this rank have the limits [0, N], where N is the number of annotators for each annotation. The categories are sorted by this rank and plotted in decreasing order of annotator agreement in Fig. 3.25. Engagement has the highest annotator agreement: of the instances where Engagement is annotated, 3 or more annotators (out of 5) agree 62% of the time. Similarly, of all the instances where Pain is annotated, 2 or more annotators agree 26% of the time.
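The agreement count and the rank described above can be sketched as follows. This is a minimal illustration, assuming that the weight given to an annotation agreed on by k annotators is simply k, so that the rank equals the average number of agreeing annotators (as the caption of Figure 3.25 suggests). The function names and the toy data are hypothetical and are not taken from EMOTIC.

```python
from collections import Counter, defaultdict


def agreement_histograms(people):
    """For each category, count how many annotated people had it chosen by k annotators.

    `people` is a list with one entry per annotated person; each entry is a
    list of category sets, one set per annotator.
    """
    hist = defaultdict(Counter)
    for annotator_sets in people:
        # votes[cat] = number of annotators who selected cat for this person
        votes = Counter(cat for chosen in annotator_sets for cat in set(chosen))
        for cat, k in votes.items():
            hist[cat][k] += 1
    return hist


def agreement_rank(counts):
    """Weighted, normalized agreement; weight k for k agreeing annotators (assumption)."""
    total = sum(counts.values())
    return sum(k * n for k, n in counts.items()) / total


def share_with_at_least(counts, k_min):
    """Fraction of annotations of a category on which at least k_min annotators agree."""
    total = sum(counts.values())
    return sum(n for k, n in counts.items() if k >= k_min) / total


# Toy data (made up): three people, five annotators each.
people = [
    [{"Engagement", "Happiness"}, {"Engagement"}, {"Engagement", "Pleasure"},
     {"Happiness"}, {"Engagement"}],
    [{"Engagement"}, {"Pain"}, {"Engagement"}, {"Engagement"}, {"Sadness"}],
    [{"Pain"}, {"Pain", "Sadness"}, {"Engagement"}, {"Sadness"}, {"Sadness"}],
]

hist = agreement_histograms(people)
ranks = {cat: agreement_rank(counts) for cat, counts in hist.items()}

# Categories in decreasing order of agreement, as in Figure 3.25.
for cat in sorted(ranks, key=ranks.get, reverse=True):
    print(cat, round(ranks[cat], 2), round(share_with_at_least(hist[cat], 3), 2))
```

With this weighting the rank of a category that is ever annotated cannot fall below 1, since each counted annotation has at least one annotator behind it.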
The agreement between all the annotators for a given person was also computed using Fleiss’ Kappa (κ), a common measure of the agreement level among a fixed number of annotators assigning categories to data. In general, for the validation set, if an annotator selects an emotion category, the probability that at least one of the four other annotators also selects this category is 50%. In EMOTIC, each annotator assigns a person a subset of the 26 categories. With N annotators per image, each of the 26 categories can be selected by n annotators, where 0 ≤ n ≤ N. Given an image, Fleiss’ Kappa is first computed for each emotion category, and the general agreement level on the image is then the average of these values across the emotion categories. More than 50% of the images have κ > 0.30. Figure 3.26.a shows the distribution of kappa values for all the annotated people in the validation set, sorted in decreasing order.
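For reference, the textbook form of Fleiss’ Kappa can be sketched as below. The exact per-category procedure used for EMOTIC is not spelled out here, so this sketch makes its own assumption: for one person, each category is treated as an item rated "selected" or "not selected" by the N annotators, and a single kappa is computed over those items. The helper `person_table`, the shortened category list and the toy annotations are hypothetical, and the resulting values need not match the figures reported in this section.

```python
def fleiss_kappa(table):
    """Standard Fleiss' kappa.

    table[i][j] = number of raters who assigned item i to class j.
    Every row must sum to the same number of raters.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items

    # Chance agreement from the overall class proportions.
    n_classes = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_classes)]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_exp == 1.0:
        return float("nan")  # all ratings fall in one class: kappa is undefined
    return (p_obs - p_exp) / (1.0 - p_exp)


def person_table(annotator_sets, categories):
    """Build [selected, not selected] counts per category for one person (assumed layout)."""
    n = len(annotator_sets)
    rows = []
    for cat in categories:
        k = sum(cat in chosen for chosen in annotator_sets)
        rows.append([k, n - k])
    return rows


# Toy example (made up): 5 annotators, 6 of the 26 categories for brevity.
categories = ["Affection", "Anticipation", "Confidence",
              "Engagement", "Happiness", "Pleasure"]
annotators = [
    {"Affection", "Happiness"},
    {"Affection", "Pleasure"},
    {"Affection", "Happiness", "Pleasure"},
    {"Affection", "Engagement"},
    {"Happiness"},
]

kappa = fleiss_kappa(person_table(annotators, categories))
print(round(kappa, 3))
```

Under the standard definition, κ = 1 corresponds to perfect agreement and values near 0 to agreement at chance level.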
Keeping the annotation parameters constant, we also estimated the agreement obtained from random annotations. Over 1000 iterations, this random agreement value for EMOTIC is κ ≈ 0.15. Notice that total disagreement gives κ = 0. The random kappa value (κ ≈ 0.15), in comparison to the actual value (κ > 0.30), indicates that there is a significant agreement level even though the task of emotion recognition is subjective.
58 CHAPTER 3. EMOTIC DATASET
Figure 3.25: Representation of agreement between multiple annotators. Categories are sorted in decreasing order according to the average number of annotators that agreed for the category.
(a) Distribution of Kappa Values across Validation set (sorted)
(b) Std across Validation set (sorted)
Figure 3.26: (a) Kappa values and (b) standard deviation (Std) for each annotated person in the validation set
Regarding the continuous dimensions, agreement is measured by the standard deviation (SD) of the different annotations. The average SD across the Validation set is 1.04, 1.57 and 1.84 for Valence, Arousal and Dominance respectively, indicating that Dominance has higher (±1.84) dispersion than the other dimensions. Annotators thus disagree more often on Dominance than on the other dimensions, which is understandable since Dominance is more difficult to interpret than Valence or Arousal (Mehrabian [1995]). As a summary, Figure 3.26.b shows the standard deviations of all the images in the validation set for the 3 dimensions, sorted in decreasing order.
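The SD-based agreement measure for the continuous dimensions can be sketched as follows. This is an illustration only: it assumes the sample standard deviation (`statistics.stdev`), whereas the text does not state whether the sample or the population form was used, and the scores below are made up.

```python
from statistics import mean, stdev


def per_person_sd(ratings):
    """ratings: dict mapping a dimension name to the annotators' scores for one person."""
    return {dim: stdev(scores) for dim, scores in ratings.items()}


def average_sd(people):
    """Average the per-person SD of each dimension over all annotated people."""
    dims = list(people[0].keys())
    sds = [per_person_sd(p) for p in people]
    return {dim: mean(sd[dim] for sd in sds) for dim in dims}


# Toy scores on the 1-10 scale (made up), five annotators per person.
people = [
    {"Valence": [6, 7, 6, 7, 6], "Arousal": [4, 6, 5, 7, 3], "Dominance": [2, 8, 5, 9, 4]},
    {"Valence": [3, 3, 4, 2, 3], "Arousal": [5, 7, 6, 4, 6], "Dominance": [6, 3, 8, 5, 9]},
]

avg = average_sd(people)
print({dim: round(value, 2) for dim, value in avg.items()})

# Per-person SDs of one dimension sorted in decreasing order, as plotted in Figure 3.26.b.
dominance_sorted = sorted((per_person_sd(p)["Dominance"] for p in people), reverse=True)
```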
An important aspect of agreement analysis is the tool or method used. For example, the agreement between annotators decreases as the scale used to capture the responses grows (Whitehill et al. [2014]). In general, random agreement between annotators is higher for a binary scale (1 or 0) than when there are n options to choose from (n > 2). We did a similar agreement analysis for the continuous dimensions of EMOTIC. Reducing the scale from [1-10] to [1-5], we re-calculated the average SD across the Validation set and found that it decreases, suggesting higher agreement. The new and previous average SDs across the Validation set, as (new, previous) pairs, are (0.54, 1.04), (0.82, 1.57) and (0.94, 1.84) for Valence, Arousal and Dominance respectively. The new values can be interpreted in the same way; the important point is that the SD decreases when the scale is reduced. A lower SD therefore indicates better agreement only relative to the scale used.
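The effect of the scale on the SD can be illustrated with a small sketch. The mapping from [1-10] to [1-5] is not specified in the text; the version below assumes that adjacent levels are merged (1-2 become 1, 3-4 become 2, and so on), and the scores are made up.

```python
import math
from statistics import stdev


def to_five_point(score):
    """Map a 1-10 score to 1-5 by merging adjacent levels (assumed mapping)."""
    return math.ceil(score / 2)


# Toy scores from five annotators for one person (made up).
original = [3, 5, 6, 8, 4]
reduced = [to_five_point(s) for s in original]

sd_original = stdev(original)
sd_reduced = stdev(reduced)
print(reduced, round(sd_original, 2), round(sd_reduced, 2))
```

Because the reduced scale compresses the range of possible scores, its SD is smaller for the same underlying responses, which is why SD values are only comparable between analyses that use the same scale.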
The average value of each dimension for a given category is also a good characterization of annotation agreement. For example, Affection has (V, A, D) = (6.8, 5.3, 6.6), suggesting high positiveness, medium activeness and high control. This interpretation makes sense when we see Affection in Figure 3.3(2). Similarly, Suffering has (V, A, D) = (3.7, 4.7, 4.3): low positiveness (or high negativity), medium-low activeness and low control. Again, when we observe a person who is Suffering (example: Figure 3.3(26)), we see that he is feeling negative emotions, is not too aroused and is not in control. Such comparisons are consistent across categories, indicating good agreement amongst annotators.
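The per-category averages of the continuous dimensions can be computed along the following lines. This is a hypothetical sketch: the data layout (a set of categories plus one (V, A, D) triple per annotated person) and the toy values are assumptions, not EMOTIC data.

```python
from collections import defaultdict
from statistics import mean


def mean_vad_by_category(samples):
    """samples: iterable of (categories, (v, a, d)) pairs, one per annotated person."""
    grouped = defaultdict(list)
    for categories, vad in samples:
        for cat in categories:
            grouped[cat].append(vad)
    # zip(*vads) regroups the triples into one sequence per dimension.
    return {
        cat: tuple(round(mean(dim), 2) for dim in zip(*vads))
        for cat, vads in grouped.items()
    }


# Toy samples (made up).
samples = [
    ({"Affection", "Happiness"}, (7, 5, 7)),
    ({"Affection"}, (6, 6, 6)),
    ({"Suffering"}, (3, 5, 4)),
    ({"Suffering", "Sadness"}, (4, 4, 5)),
]

vad = mean_vad_by_category(samples)
for cat in sorted(vad):
    print(cat, vad[cat])
```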