Errors in attribute recognition affect the accuracy of assessments made from a ranking. This experiment evaluates that impact by incorporating the recognition errors into the attributes in four different ways. First, the actual errors for all 15 influencing attributes are included, using the performances measured in Table 4.1. The remaining scenarios use averaged performance over the attributes, with and without the body shape attributes.
Lookup model: The accuracy of automatic comparison when testing with strong-10, all-10, weak-5 and rand-10 is displayed in Figure 4.3. The averaged errors are set at 7%, 10% and 23%. The 7% error is obtained by averaging the performance over all attributes except body shape, whereas the 23% error uses the average that includes the body shape estimate. A value of 10% is chosen to reflect approximate performance based on the 7% error. When recognition errors are included, the two configurations in a tested pair may be mapped to the same configuration. When their scores are then compared in the ranking, both configurations fall at the same position. Because the goal is to automatically compare two images from different configurations, such a result is counted as incorrect when calculating accuracy.
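The evaluation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the representation of a configuration as a tuple of integer labels, the way an error is injected (each label is replaced by a different label with probability `error_rate`), the toy scores, and the names `perturb` and `pair_accuracy` are all assumptions made for illustration. The only rule taken directly from the text is that a pair whose two configurations collapse to the same configuration is counted as incorrect.

```python
import random
from itertools import combinations, product


def perturb(config, n_values, error_rate, rng):
    """Replace each label by a different one with probability error_rate.

    Assumes every attribute has at least two possible labels.
    """
    noisy = []
    for label, n in zip(config, n_values):
        if rng.random() < error_rate:
            label = rng.choice([v for v in range(n) if v != label])
        noisy.append(label)
    return tuple(noisy)


def pair_accuracy(pairs, score, n_values, error_rate, seed=0):
    """Fraction of pairs whose order is preserved after injecting errors."""
    rng = random.Random(seed)
    correct = 0
    for a, b in pairs:
        noisy_a = perturb(a, n_values, error_rate, rng)
        noisy_b = perturb(b, n_values, error_rate, rng)
        if noisy_a == noisy_b:
            # Both images map to the same configuration: counted as incorrect.
            continue
        true_diff = score[a] - score[b]
        noisy_diff = score[noisy_a] - score[noisy_b]
        if true_diff * noisy_diff > 0:  # same order before and after noise
            correct += 1
    return correct / len(pairs)


# Hypothetical toy setup: three binary attributes and a made-up lookup score.
n_values = [2, 2, 2]
configs = list(product(range(2), repeat=3))
score = {c: 4 * c[0] + 2 * c[1] + c[2] for c in configs}
pairs = list(combinations(configs, 2))

for rate in (0.0, 0.07, 0.10, 0.23):
    print(rate, pair_accuracy(pairs, score, n_values, rate))
```

With `error_rate` set to 0 every pair keeps its order, so the accuracy is 1.0. The values at the other rates depend on the toy scores and the random seed and are not comparable with the figures reported in this chapter.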
[Figure 4.3 plot. Horizontal axis: ranking (strong-10, all-10, weak-5, rand-10). Vertical axis: correctly ranked pairs (0 to 1). Series: actual error, error at 7%, error at 10%, error at 23%.]
Figure 4.3: Correctly ranked pairs between 0 and 1 when using ranking with the recognition-based error estimates. The actual error estimates are shown along with the errors when an averaged performance is considered.
Overall, the highest scores are seen when the error is set at 7% (highest value of 0.96 for strong-5). This is expected because the error included in the ranking for the lookup is lowest in this case. The highest scores are also achieved by strong-5 rather than by all-10, weak-5 or rand-10, because this group of annotators (strong-5) applied similar criteria when tested against the expert. The all-10 ranking produces reasonably high scores of 0.69, 0.85, 0.83 and 0.78 for the actual error and the error estimates of 7%, 10% and 23%, compared with scores of 0.61, 0.72, 0.71 and 0.67 for weak-5. This indicates that, with a large enough number of pairwise comparisons, some of the noise from the weaker annotations can be reduced.
The weaker annotations show a lower score than strong-5 and all-10, as expected. This is most likely due to these annotators applying different criteria when making fashion judgements. For the random ranking rand-10, a value of 0.54 is seen, which is the lowest value in this experiment. This is because this ranking is not correlated with any of the annotations.
In summary, the annotated rankings provide a significant gain in performance over the baseline random annotations. The validated annotation strong-5 also gives better results than weak-5 and all-10.
The accuracy at 10% error will serve as the baseline in the following chapters because this error represents an averaged performance. At this error, accuracies of 0.94, 0.83, 0.71 and 0.54 are obtained for strong-5, all-10, weak-5 and rand-10 respectively.
4.4 Conclusions
In this chapter, a method for detecting the attributes automatically was presented, and a lookup model based on the estimates from this recognition was evaluated. The recognition demonstrated high performance for the 11 clothing attributes, with the best results obtained when using the label of the maximum prediction estimate. However, performance declined for the body shape attributes, with the apple body shape showing better results than the other body shapes. This was because the shape differences are largest for the apple body shape. The evaluation was performed using recall and precision with an associated 95% confidence interval. High values of recall and precision, with an average of 0.93 ± 0.05, were observed for the 11 clothing attributes, whereas the averaged value for body shape dropped to 0.30 ± 0.08. Because the classification error is low for all attributes except body shape, prediction estimates from the attributes without body shape will be utilized when evaluating the ranking approaches proposed in the following chapters.
Testing the lookup model demonstrated the effectiveness of the approach: the results indicated that, with a sufficiently large number of comparisons, noisy assessments made by non-expert annotators could be filtered out. This model also provides an effective baseline for testing pairwise comparisons with reference to accuracy. In particular, the accuracy values of 0.94, 0.83, 0.71 and 0.54 obtained for strong-5, all-10, weak-5 and rand-10 in Section 4.3.2 at the 10% error estimate will be used.
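As a rough consistency check, the averaged error levels used for the lookup model can be related to the recall and precision averages quoted above. The sketch below assumes that the 15 influencing attributes consist of the 11 clothing attributes plus 4 body shape attributes, and that the pooled error is one minus an unweighted mean over attributes. Neither assumption is stated explicitly in the text, and the function name `pooled_error` is hypothetical.

```python
def pooled_error(n_clothing, acc_clothing, n_body, acc_body):
    """Error as 1 minus the mean performance over all attributes (assumed)."""
    total = n_clothing + n_body
    mean_acc = (n_clothing * acc_clothing + n_body * acc_body) / total
    return 1.0 - mean_acc


# Averages quoted in this chapter; the 11/4 attribute split is an assumption.
err_without_body = 1.0 - 0.93
err_with_body = pooled_error(11, 0.93, 4, 0.30)

print(f"without body shape: {err_without_body:.3f}")
print(f"with body shape:    {err_with_body:.3f}")
```

Under these assumptions the error without body shape is 0.07, matching the 7% level, and the pooled error is about 0.24, close to the 23% level. The small difference may come from rounding of the quoted averages.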
Chapter 5
Ranking images using matching
In this chapter, rankings obtained from crowdsourced annotators are utilized to generate rankings using the nearest neighbour search approach. An overview of how the various matching techniques are incorporated is presented in Section 5.1. Matching is performed using the Bag of Visual Words (BoV) and Local Descriptor Matching (LDM) approaches. The procedure for obtaining a global ranking from the resulting matches is also discussed. Next, the dataset, parameters and evaluation measures are explained in Section 5.2. Finally, the matching performance and how it determines the global rankings are presented in Section 5.3.