loss changes given a (infinitesimally) small increase in $\theta_i$. The update rule is then,
$$\theta_{i,t+1} = \theta_{i,t} - \tau \frac{\partial L}{\partial \theta_{i,t}}, \tag{2.8}$$
at learning step $t$, for some learning rate, $\tau$, controlling the size of the update. Rather than computing the gradient over the whole dataset, in practice it is often computed over some subset (‘mini-batch’) of the data. This introduces some noise to the update step, hence the learning process is termed stochastic gradient descent (SGD). In order to compute the term $\partial L / \partial \theta_i$ for all parameters in the NN, backpropagation is used, which applies the chain rule successively backwards through the NN. In order to calculate the derivatives, the loss function and NN operations must be differentiable.
Eq. 2.8 gives the vanilla gradient descent update rule. Modern NNs often use variants such as Adam (Kingma and Ba, 2015), which maintains moving averages of the gradient and of its square (estimates of its first and second moments) and uses them to scale each weight update, which often allows more effective updates to be made.
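As an illustration of the update rule in Eq. 2.8, the following is a minimal sketch of mini-batch SGD. It makes several assumptions that are not taken from the text: a two-parameter linear model stands in for a NN, the loss is the mean squared error, and the function name `sgd_linear` and its default settings are hypothetical. Gradients are written by hand here, where a NN would obtain them by backpropagation.

```python
import random


def sgd_linear(xs, ys, tau=0.05, batch_size=4, steps=2000, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # the parameters theta, initialised at zero
    n = len(xs)
    for _ in range(steps):
        # Draw a random mini-batch of indices instead of using all the data.
        batch = rng.sample(range(n), batch_size)
        # Gradients of L = mean((w * x + b - y)^2) over the mini-batch.
        grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / batch_size
        # Eq. 2.8: theta_{t+1} = theta_t - tau * dL/dtheta_t
        w -= tau * grad_w
        b -= tau * grad_b
    return w, b


xs = [i / 10.0 for i in range(20)]  # inputs in [0, 1.9]
ys = [3.0 * x + 1.0 for x in xs]    # noiseless targets generated with w=3, b=1
w, b = sgd_linear(xs, ys)
```

Because the targets here are noiseless, the only randomness comes from the choice of mini-batch, and the fitted `w` and `b` approach the generating values.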
2.3 Modelling Uncertainty with Neural Networks
Having introduced uncertainty in section 2.1 and NNs in section 2.2, this section details the ability of standard NNs to model uncertainty. They are found lacking in certain areas, and the section proceeds by summarising existing methods and modifications that can improve their ability to model uncertainty.
A new method is contributed in chapter 5, and improvements to one of the most promising approaches are made in chapters 3 and 4.
2.3.1 Can Standard Neural Networks Model Uncertainty?
Can standard NNs (defined as in section 2.2) used for regression and classification tasks communicate an appropriate level of epistemic and aleatoric uncertainty relating to a particular prediction? Table 2.1 summarises the answer.
Table 2.1 Ability of standard NNs to quantify uncertainty types. For regression, it is assumed that the prediction is of a single scalar value.

| Uncertainty type | Classification NN | Regression NN (one output) | Regression NN (two outputs) |
|---|---|---|---|
| Aleatoric (Homoskedastic) | ✓ | ✓ | ✓ |
| Aleatoric (Heteroskedastic) | ✓ | ✗ | ✓ |
| Epistemic | ✗ | ✗ | ✗ |

By using a NN to parameterise a probability distribution, aleatoric uncertainty can be captured. For a classification NN, outputs can be squashed through a softmax, creating a final output that can then be interpreted as a multinomial distribution. A regression NN with one output could represent the mean of a normal distribution, with constant variance, capable of capturing only homoskedastic noise. A second output could be added to represent an input-dependent variance, which would capture heteroskedastic noise. (Note that the loss function must be matched to the architecture.) These setups are further detailed in section 2.4.1.
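The two output parameterisations described above can be sketched as follows. This is an illustration under stated assumptions, not the formulation of section 2.4.1: the function names are hypothetical, and having the second regression output represent the log-variance (so that the variance stays positive) is one common choice, assumed here.

```python
import math


def softmax(logits):
    """Squash raw classification outputs into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var)).

    `mean` and `log_var` play the role of the two outputs of a regression NN.
    Using the log-variance is an assumption made for this sketch.
    """
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / math.exp(log_var))


probs = softmax([2.0, 1.0, 0.1])

# For a large error (y = 3, mean = 0), predicting a larger variance is
# penalised less than predicting a small one.
nll_wide = gaussian_nll(3.0, 0.0, math.log(9.0))
nll_narrow = gaussian_nll(3.0, 0.0, 0.0)
```

Matching the loss to the architecture means, in this sketch, training the two-output network on `gaussian_nll`, whereas a single-output network with an assumed constant variance reduces to a squared-error loss up to constants.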
When asked to predict on a data point unlike the training data, the NN should increase its uncertainty. There is no mechanism built into standard NNs to do this. Even in the case of a two-output regression NN, there is no reason for the estimated variance to increase for a new data point. As such, standard NNs cannot estimate epistemic uncertainty. This is unfortunate since epistemic uncertainty is often the more useful component of the two; it is required for many use cases described in chapter 1, such as active learning, exploration in RL, and knowing when an unusual input is received.
2.3.2 Overview of Methods
This section briefly introduces some common methods used to improve uncertainty estimates of NNs.
Quantile Regression
Eq. 2.7 details a loss function that trains a NN to estimate the mean of a scalar target. Using instead $\sum_i |y_i - \hat{y}_i|$ encourages output of the median. Further modifying to $\tau \sum_{i:\, y_i \geq \hat{y}_i} (y_i - \hat{y}_i) + (\tau - 1) \sum_{i:\, y_i < \hat{y}_i} (y_i - \hat{y}_i)$ encourages return of some quantile (the $\tau$ quantile, with $0 < \tau < 1$) of the predictive distribution (Koenker and Hallock, 2001; Taylor, 2000). Note that no distributional form has been assumed. By estimating, say, the 0.025 and 0.975 quantiles, a 95% PI could be constructed. Quantile regression is only able to capture aleatoric uncertainty.
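The quantile loss can be sketched in its standard 'pinball' form, as below. To keep the example self-contained, the NN is replaced by a single constant prediction chosen from a grid, which is an assumption of this sketch; the helper `best_constant` and the toy targets are hypothetical.

```python
def pinball_loss(ys, y_hats, tau):
    """Mean quantile (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(ys, y_hats):
        diff = y - y_hat
        # Under-prediction (diff >= 0) costs tau * diff; over-prediction
        # (diff < 0) costs (tau - 1) * diff, which is also non-negative.
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(ys)


def best_constant(ys, tau, candidates):
    """Pick the constant prediction with the lowest pinball loss."""
    return min(candidates, key=lambda c: pinball_loss(ys, [c] * len(ys), tau))


ys = [float(i) for i in range(1, 101)]         # toy targets 1, 2, ..., 100
candidates = [i / 2.0 for i in range(0, 203)]  # grid of constants 0.0, 0.5, ..., 101.0
lower = best_constant(ys, 0.025, candidates)   # estimate of the 0.025 quantile
upper = best_constant(ys, 0.975, candidates)   # estimate of the 0.975 quantile
```

On this toy data the minimisers are 3.0 and 98.0, so the interval between them contains 96 of the 100 targets, close to the nominal 95%.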
Conformal Prediction
This method uses previously seen data to determine prediction regions (e.g. a prediction interval) that provide coverage with some pre-set probability (Shafer and Vovk, 2008). The size of the region indicates the uncertainty. Usefully, it is compatible with any predictive model, including NNs. It is difficult to determine the link with aleatoric and epistemic uncertainty (Hüllermeier and Waegeman, 2019).
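One common variant, split conformal prediction with absolute residuals as the conformity score, can be sketched as follows. This variant is chosen for illustration and is not necessarily the formulation used in the cited work; the function names and the calibration residuals are hypothetical. The residuals are assumed to come from held-out data that the predictive model did not see during training.

```python
import math


def conformal_radius(cal_residuals, alpha):
    """Half-width q so that [y_hat - q, y_hat + q] covers with probability >= 1 - alpha.

    The guarantee relies on the calibration and test points being exchangeable.
    """
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))  # rank of the required score
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def predict_interval(y_hat, radius):
    """Symmetric prediction interval around a point prediction."""
    return (y_hat - radius, y_hat + radius)


# Hypothetical residuals (y - y_hat) of some model on 20 held-out points.
cal_residuals = [(-1) ** i * i / 10.0 for i in range(1, 21)]
radius = conformal_radius(cal_residuals, alpha=0.1)
lo, hi = predict_interval(4.0, radius)
```

The same radius is applied to every new prediction in this variant, so the interval width reflects the overall error level of the model and does not change with the input.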
Ensembling
A simple approach to estimating uncertainty is to train a small number of NNs (an ensemble), beginning from different initialisations and sometimes on noisy versions of the training data. When asked to make predictions on new data, there will be some diversity in the ensemble’s predictions, which can be interpreted as the uncertainty. The intuition is simple: if the new data point is similar to the training data, the NNs should all produce similar estimates, but if it is very different, there should be higher variance in the predictions.
This is generally presented as a non-Bayesian alternative to epistemic uncertainty estimation (Lakshminarayanan et al., 2017). Chapter 4 discusses this claim more thoroughly, and presents a modification that does lead to strong connections to Bayesian methodology.
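The ensembling intuition can be illustrated with a small sketch. To keep it short and dependency-free, each ensemble member is a closed-form least-squares line, not a NN, so diversity comes only from training on noisy copies of the targets and not from different initialisations. All names and settings below are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form least-squares fit; returns (intercept, slope)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def train_ensemble(xs, ys, n_members=5, noise_std=0.1, seed=0):
    """Fit each member on a copy of the targets perturbed with Gaussian noise."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        noisy_ys = [y + rng.gauss(0.0, noise_std) for y in ys]
        members.append(fit_line(xs, noisy_ys))
    return members


def ensemble_predict(members, x):
    """Return the mean prediction and the variance across members."""
    preds = [a + b * x for a, b in members]
    return statistics.fmean(preds), statistics.pvariance(preds)


xs = [i / 19.0 for i in range(20)]  # training inputs in [0, 1]
ys = [2.0 * x for x in xs]          # training targets
members = train_ensemble(xs, ys)
mean_in, var_in = ensemble_predict(members, 0.5)      # inside the training range
mean_out, var_out = ensemble_predict(members, 100.0)  # far outside it
```

The members agree closely inside the training range and disagree far from it, so the variance across members acts as the uncertainty signal described above.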
Prior Networks
This recently proposed approach explicitly trains a NN to be uncertain on OOD data (Malinin and Gales, 2018). These OOD samples are generated either through adversarial techniques, or by using alternative datasets (e.g. train on CIFAR and use SVHN as OOD examples). This is most useful when these samples are representative of OOD data that will be encountered in the real world. The result is a single NN with good resistance to OOD examples. On the other hand, it may not be straightforward to extend the method to use cases of BNNs such as exploration and active learning, where the training distribution is incrementally growing. Prior networks can capture both aleatoric and epistemic uncertainty.
Bayesian Neural Networks
The Bayesian framework provides a principled approach to handling uncertainty, and has been combined with numerous models and applications. At its simplest, it specifies how to update beliefs (e.g. about functions) in light of evidence (e.g. data points), in an optimal fashion. It was first applied to NNs in 1992 (MacKay, 1992b), motivated as a principled way to do hyperparameter selection as well as for its ability to capture epistemic uncertainty. A large body of work has followed, more recently under the name ‘Bayesian deep learning’. It is arguably the most prominent approach to handling uncertainty in NNs (as an example, the Bayesian Deep Learning workshop has attracted around 100 abstracts in recent years at the NeurIPS conference, bayesiandeeplearning.org). The contributions of this thesis are primarily in this area; as such, the following section fully introduces these models.