
In the previous section it was shown that Prior Networks trained on MNIST can successfully detect out-of-distribution images. However, MNIST is a very easy task, and it is necessary to investigate whether Prior Networks scale to more complex tasks, such as CIFAR-10. The experiments described in this section evaluate the ability of models trained on CIFAR-10 to

5.4 Out-of-Distribution sample Detection 107

discriminate between in-domain images from the CIFAR-10 test set and out-of-distribution images from the test sets of SVHN, LSUN and TinyImageNet. This is a more challenging task, as there are far more possible factors of variation in 32×32 color images of real objects than in black-and-white images of handwritten characters. The results of these experiments are presented in table 5.9.

OOD Data  Model    Total Uncertainty   Knowledge Uncertainty
                   Conf.     Ent.      M.I.      EPKL      D.Ent.
SVHN      DNN      90.3±1.9  91.0±2.1  -         -         -
          MCDP     90.2±1.9  91.0±2.1  88.9±1.8  88.9±1.8  -
          ENS      92.0±NA   92.9±NA   89.5±NA   89.0±NA   -
          PN-RKL   98.1±1.2  98.2±1.1  98.3±0.9  98.3±0.9  98.2±1.1
LSUN      DNN      89.8±0.9  91.4±0.9  -         -         -
          MCDP     89.7±0.9  91.2±0.9  89.9±1.2  89.9±1.2  -
          ENS      92.5±NA   94.3±NA   93.2±NA   92.5±NA   -
          PN-RKL   95.6±0.9  95.7±0.9  95.8±0.8  95.8±0.8  95.8±0.8
TIM       DNN      87.6±2.2  88.8±2.3  -         -         -
          MCDP     87.4±2.2  88.6±2.3  87.2±1.5  87.2±1.5  -
          ENS      90.0±NA   91.5±NA   90.3±NA   89.7±NA   -
          PN-RKL   95.6±0.7  95.7±0.7  95.8±0.7  95.8±0.7  95.8±0.7

Table 5.9 CIFAR-10 out-of-domain detection results in terms of mean % AUROC ± 2σ across 10 random initializations (except ENS). The CIFAR-10 test set is used as in-domain data and the test sets of SVHN (10,000 image subset), LSUN and TinyImageNet (TIM) as out-of-domain data.
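As an illustration of how figures of the kind reported in table 5.9 can be obtained, the following is a minimal sketch, not the code used for these experiments. It treats out-of-domain inputs as the positive class and an uncertainty measure as the detection score. The function names `ood_auroc` and `mean_and_two_sigma` are hypothetical, and the use of the sample standard deviation (`ddof=1`) for σ is an assumption, as the text does not specify the estimator.

```python
import numpy as np


def ood_auroc(in_scores, ood_scores):
    """AUROC for detecting out-of-domain inputs from uncertainty scores.

    Higher scores are assumed to mean "more likely out-of-domain".
    Uses the rank identity: AUROC is the probability that a random
    out-of-domain score exceeds a random in-domain score (ties count 1/2).
    """
    in_sorted = np.sort(np.asarray(in_scores, dtype=float))
    ood = np.asarray(ood_scores, dtype=float)
    # Number of in-domain scores strictly below each out-of-domain score.
    below = np.searchsorted(in_sorted, ood, side="left")
    # Number of in-domain scores below or equal to each out-of-domain score.
    at_or_below = np.searchsorted(in_sorted, ood, side="right")
    wins = below.sum() + 0.5 * (at_or_below - below).sum()
    return wins / (len(in_sorted) * len(ood))


def mean_and_two_sigma(aurocs):
    """Summarise per-seed AUROCs as (mean %, 2 * std %).

    Assumption: sigma is the sample standard deviation (ddof=1).
    """
    aurocs = 100.0 * np.asarray(aurocs, dtype=float)
    return aurocs.mean(), 2.0 * aurocs.std(ddof=1)
```

In this sketch, `in_scores` would hold an uncertainty measure (for example entropy) evaluated on the CIFAR-10 test set and `ood_scores` the same measure on one of the out-of-domain test sets; one AUROC per random initialization is then passed to `mean_and_two_sigma`.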

There are several trends in the results. Firstly, Prior Networks consistently outperform the baseline approaches on all tasks and using all measures of uncertainty. The best baseline approach, as in previous experiments, is an explicit ensemble of DNNs. At the same time, ensembles generated via Monte-Carlo dropout yield the same level of performance as standard DNNs trained with maximum likelihood. Based on the performance of the baseline approaches, the most challenging out-of-distribution dataset is TinyImageNet. This is expected, as it contains images of a wide variety of real object categories and is the most similar to CIFAR-10. As before, measures of knowledge uncertainty derived from a Prior Network, specifically the expected pairwise KL-divergence and differential entropy, yield the best results. However, the advantage over the other measures is small, as CIFAR-10, like MNIST, is a low data uncertainty dataset. Curiously, the best measure of uncertainty derived from ensembles is the entropy of the predictive posterior, a measure of total uncertainty. It is not clear why this is the case, but

108 Experimental Evaluation of Prior Networks

(a) ENS (b) PN-RKL

Figure 5.8 Histogram of mutual information for in-domain (CIFAR-10 test set) and out-of-domain (TinyImageNet test set) images derived from an explicit ensemble (ENS) and a Prior Network (PN-RKL). Predictions of 10 PN-RKL models trained from different random initializations are concatenated together.

supports the assertion that it is difficult to control the behaviour of ensembles such that they are diverse out-of-domain.
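For reference, the ensemble-derived measures compared above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `probs` is assumed to hold the softmax outputs of each ensemble member with shape (members, inputs, classes), the function name `ensemble_uncertainties` is hypothetical, and a small constant `EPS` is added inside the logarithms for numerical stability. The expected pairwise KL-divergence is averaged over all ordered pairs of members, including each member with itself.

```python
import numpy as np

EPS = 1e-12  # assumed stabiliser inside logarithms


def ensemble_uncertainties(probs):
    """Uncertainty measures from ensemble predictions.

    probs: array of shape (M, N, K) with M members, N inputs, K classes.
    Returns a dict of arrays, each of shape (N,).
    """
    probs = np.asarray(probs, dtype=float)
    log_probs = np.log(probs + EPS)
    mean = probs.mean(axis=0)  # predictive posterior, shape (N, K)

    # Total uncertainty: confidence and entropy of the predictive posterior.
    confidence = mean.max(axis=-1)
    entropy = -(mean * np.log(mean + EPS)).sum(axis=-1)

    # Expected entropy of individual members (data uncertainty).
    expected_entropy = -(probs * log_probs).sum(axis=-1).mean(axis=0)

    # Knowledge uncertainty: mutual information and expected pairwise KL.
    mutual_information = entropy - expected_entropy
    cross_term = -(mean * log_probs.mean(axis=0)).sum(axis=-1)
    epkl = cross_term - expected_entropy

    return {
        "confidence": confidence,
        "entropy": entropy,
        "mutual_information": mutual_information,
        "epkl": epkl,
    }
```

If all members agree, both mutual information and the expected pairwise KL-divergence are close to zero even when the entropy is high, which is why a lack of diversity out-of-domain limits how useful these two measures are for an ensemble.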

It is interesting to analyze the difference in the behaviour of ensembles and Prior Networks in greater detail. Consider figure 5.8, which depicts the histograms of estimates of mutual information derived from an ensemble and Prior Networks for CIFAR-10 and TinyImageNet images. Figure 5.8 shows that the ensemble yields a much tighter distribution of mutual information and is less ‘diverse’ than a Prior Network. There is also a greater region of overlap between in-domain images and out-of-domain images. At the same time, the Prior Network clearly separates out the in-domain and out-of-distribution images. However, there is a small number of out-of-distribution images for which the Prior Network yields low uncertainty and a set of in-domain images for which the Prior Network yields maximum uncertainty.

These images are depicted in figure 5.9. Figure 5.9a shows that the high-uncertainty in-domain images are ‘odd’ images with obstructions, strange backgrounds, etc., making it difficult to actually determine what they represent. As discussed in section 5.3.2, Prior Networks make more low-confidence predictions than ensembles, which is likely a result of out-of-distribution data, in this case CIFAR-100, being close to the in-domain data in certain regions. Improving the performance of the classifier and making a better choice of out-of-distribution training data could possibly decrease the number of high-uncertainty in-domain images which are not misclassified. At the same time, figure 5.9b shows that the low-uncertainty out-of-distribution images are images of dogs, cats and cars, which are classes present in the CIFAR-10 dataset. Thus, there is a partial overlap between TinyImageNet and CIFAR-10 classes.
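Selecting the images for this kind of inspection is straightforward once per-image mutual information is available. The sketch below is only an illustration; the function name `extreme_examples` and the default of 16 images are assumptions, not details taken from the experiments.

```python
import numpy as np


def extreme_examples(in_mi, ood_mi, k=16):
    """Indices of the k most uncertain in-domain images and the
    k least uncertain out-of-domain images, by mutual information."""
    in_mi = np.asarray(in_mi, dtype=float)
    ood_mi = np.asarray(ood_mi, dtype=float)
    highest_in = np.argsort(in_mi)[::-1][:k]  # descending mutual information
    lowest_ood = np.argsort(ood_mi)[:k]       # ascending mutual information
    return highest_in, lowest_ood
```

The returned indices can then be used to look up the corresponding images in the CIFAR-10 and TinyImageNet test sets.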

5.5 Chapter Summary

In this chapter the construction of Prior Networks on the MNIST, SVHN and CIFAR-10 image classification datasets was investigated. Measures of uncertainty derived from Prior Networks trained on these datasets were evaluated on the tasks of misclassification detection and out-of-distribution sample detection. Measures of uncertainty derived from standard DNNs, a Monte-Carlo dropout ensemble and an explicit ensemble of models were used as baselines.

In section 5.3 it was shown that Prior Networks do not provide a performance gain over baseline approaches on the task of misclassification detection. Furthermore, the misclassification detection performance of Prior Networks can be degraded if the out-of-distribution training data is too close to the in-domain region. At the same time, the out-of-distribution training data can also act as a regularizer and improve the classification performance of Prior Networks. The results also show that an explicit ensemble of models yields both the lowest classification error rates and the best misclassification detection performance. It was determined that the confidence in the predictions (the probability of the mode of the predictive distribution) is consistently the best measure of uncertainty for misclassification detection.

(a) High Uncertainty CIFAR-10 images (b) Low Uncertainty TinyImageNet images

Figure 5.9 Highest mutual information in-domain (CIFAR-10 test set) images and lowest mutual information out-of-domain TinyImageNet images.

Results in section 5.4 show that Prior Networks are able to outperform baseline approaches on the task of out-of-distribution input detection on both the MNIST and CIFAR-10 datasets. Results on the SVHN dataset, available in appendix C, also show that a Prior Network is capable of outperforming baseline approaches by a large margin. The best baseline approach to uncertainty estimation throughout all experiments was an explicit ensemble of DNNs. The results in this section also show that measures of knowledge uncertainty, such as mutual information, expected pairwise KL-divergence and differential entropy, yield better out-of-distribution detection results, which illustrates the benefit of being able to decompose total uncertainty into data and knowledge uncertainty.
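To make the decomposition concrete, the sketch below computes the same family of measures in closed form for a single model, assuming that the Prior Network parameterises a Dirichlet distribution over categorical output distributions with concentration parameters `alphas`. Both this parameterisation and the function name `dirichlet_uncertainties` are assumptions made for illustration. The formulas are the standard expressions for the expected entropy and the differential entropy of a Dirichlet, and the code relies on `scipy.special.digamma` and `scipy.special.gammaln`.

```python
import numpy as np
from scipy.special import digamma, gammaln


def dirichlet_uncertainties(alphas):
    """Uncertainty measures for Dirichlet concentration parameters.

    alphas: array of shape (N, K) with strictly positive entries.
    Returns a dict of arrays, each of shape (N,).
    """
    alphas = np.asarray(alphas, dtype=float)
    alpha0 = alphas.sum(axis=-1, keepdims=True)
    probs = alphas / alpha0  # expected categorical distribution

    # Total uncertainty: entropy of the expected categorical.
    total = -(probs * np.log(probs)).sum(axis=-1)

    # Data uncertainty: expected entropy of categoricals drawn from
    # the Dirichlet.
    data = -(probs * (digamma(alphas + 1.0) - digamma(alpha0 + 1.0))).sum(axis=-1)

    # Knowledge uncertainty: mutual information is the difference.
    mutual_information = total - data

    # Differential entropy of the Dirichlet itself.
    differential_entropy = (
        gammaln(alphas).sum(axis=-1)
        - gammaln(alpha0[..., 0])
        - ((alphas - 1.0) * (digamma(alphas) - digamma(alpha0))).sum(axis=-1)
    )

    return {
        "total": total,
        "data": data,
        "mutual_information": mutual_information,
        "differential_entropy": differential_entropy,
    }
```

Under these assumptions, a flat Dirichlet (all concentration parameters equal to one) gives high mutual information and high differential entropy, whereas a Dirichlet sharply concentrated on one class gives low values for both, which is the behaviour that makes these measures suitable for out-of-distribution detection.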

The superior out-of-distribution detection performance of Prior Networks supports the assertion that it is easier to control the out-of-distribution behaviour of a distribution over distributions via choice of training data, rather than by appropriate choice of prior and approximate inference method. However, the choice of out-of-distribution training data is non-trivial and needs to be further explored in detail. Furthermore, the low classification error rates and superior misclassification detection performance of explicit ensembles of models indicate that appropriate choice of ‘out-of-distribution’ data which represents other forms of dataset shift also needs to be explored. As a Prior Network represents a distribution over distributions and is theoretically capable of emulating an ensemble, a possible avenue of future work would be to distill an ensemble into a Prior Network, combining the advantages of both. Additionally, future work should investigate the application of Prior Networks to more complex datasets, such as ImageNet [27], which have more than 10 classes, and to structured data, such as language and speech. Active learning applications of Prior Networks should also be investigated in order to further evaluate the practical benefit of separating out the sources of uncertainty. Finally, Prior Networks should also be evaluated on a range of regression tasks.

This chapter concludes the first part of this thesis, which focused on uncertainty estimation. The next three chapters will consider the application of uncertainty estimation to the area of spoken language proficiency assessment.

Chapter 6

Spoken Language Proficiency Assessment

The first part of this thesis explored the area of predictive uncertainty estimation, discussed single model and ensemble based approaches in chapter 3, proposed a new class of models called Prior Networks in chapter 4 and compared them to previous approaches in chapter 5. This chapter begins the second part of this thesis, which investigates the application of deep learning and uncertainty estimation to the area of automatic spoken language assessment, specifically automatic grading of spoken proficiency examinations and automatic assessment of relevance of spoken responses to open-ended exam prompts.

The increasing demand for language learning and for practice tests available at any time makes the development of automatic assessment systems an attractive proposition [110]. The goal of an automatic grader is to assess the language competence of an examination candidate with an accuracy matching that of a human grader, but faster, with greater consistency and at a fraction of the cost. As mentioned in the introduction of this thesis, automatic assessment, especially high-stakes automatic assessment, requires that models provide measures of uncertainty in their predictions in order to avoid making mistakes which can adversely affect the course of people's lives, as was the case with the Irish veterinarian who failed an automatically assessed spoken English exam in Australia due to her accent [1]. This motivates the application of the approaches to uncertainty estimation, discussed in the first part of this thesis, to the tasks of automatic grading and automatic relevance assessment, which is considered in the second part of this thesis.

The current chapter introduces the area of automatic assessment of non-native spoken language proficiency and discusses the attributes and challenges associated with this task. This chapter is structured as follows: section 6.1 describes the overall task of spoken language assessment, grade levels and attributes of the BULATS and LinguaSkill exams, provided by Cambridge English Language Assessment. Section 6.2 describes the automatic assessment pipeline used in this work as well as the tasks of automatic grading and automatic prompt-response relevance assessment.
