2.4.1 Distributions Over Parameters
Earlier we introduced the weights of a NN as scalar numerical values that represent the strength of a connection between two neurons and can be learnt using gradient descent.
A Bayesian NN requires a conceptually different set up. Rather than assigning each parameter a specific numerical value, we will instead model them as random variables with a probability distribution. This could either be a parametric probability distribution (such as normal or uniform), or an empirical probability distribution (by storing a set of parameter samples). Figure 2.10 illustrates this. Note that probability distributions are shown as though parameters are independent of each other, though this need not be the case.
Probability distributions in general are used to describe the uncertainty over the value of some variable, and this is no exception: we use them to represent the uncertainty in parameters.
[Figure 2.10 panels: Numerical values | Parametric distribution | Empirical distribution]
Figure 2.10 A neuron with point estimates over parameters produces a point estimate at its output. If parameters are modelled with distributions (parametric or empirical), the output will also be a distribution.
2.4 Making Neural Networks Bayesian 23
Although relatively straightforward conceptually, this new setup has implications both for computing the feedforward output (how should the output of the BNN now be computed?) and, more challengingly, for the learning process (how can gradient descent be applied to distributions?).
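To make the feedforward question concrete, the following is a minimal sketch of a single neuron whose parameters are either point estimates or samples from distributions. The tanh activation, the Gaussian parameter distributions and the names `neuron` and `sample_outputs` are illustrative assumptions, not part of the method described here.

```python
import math
import random


def neuron(x, w, b):
    """Single neuron; the tanh activation is an assumption for illustration."""
    return math.tanh(w * x + b)


def sample_outputs(x, n_samples=5000, seed=0):
    """Push parameter samples through the neuron to obtain an
    empirical distribution over its output."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        w = rng.gauss(1.0, 0.5)  # hypothetical distribution over the weight
        b = rng.gauss(0.0, 0.1)  # hypothetical distribution over the bias
        outputs.append(neuron(x, w, b))
    return outputs


# Point estimates over parameters give a point estimate at the output.
point_output = neuron(0.8, 1.0, 0.0)

# Distributions over parameters give a distribution at the output.
output_samples = sample_outputs(0.8)
mean = sum(output_samples) / len(output_samples)
std = math.sqrt(sum((o - mean) ** 2 for o in output_samples) / len(output_samples))

print(f"point output: {point_output:.3f}")
print(f"output distribution: mean={mean:.3f}, std={std:.3f}")
```

Here the output distribution is represented empirically, by a set of samples, which mirrors the empirical option for parameter distributions mentioned above.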
Bayesian Posterior Distribution
Assigning arbitrary distributions to parameters does not make a NN Bayesian. A Bayesian NN is a NN with a specific distribution over its parameters, known as the ‘Bayesian posterior’ and denoted $P(\boldsymbol{\theta}|D)$. Here $D$ stands for some dataset, which in supervised learning is a set of input/output pairs, $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. It is simple to write an equation for this posterior distribution in terms of several other distributions:

$$P(\boldsymbol{\theta}|D) = \frac{P(D|\boldsymbol{\theta})\, P(\boldsymbol{\theta})}{P(D)} \tag{2.9}$$

Note that we have slightly abused notation here. More precisely, $P(D|\boldsymbol{\theta}) := P(\{y_i\}_{i=1}^{n} \,|\, \boldsymbol{\theta}, \{x_i\}_{i=1}^{n})$ and $P(D) := P(\{y_i\}_{i=1}^{n} \,|\, \{x_i\}_{i=1}^{n})$.
Likelihood
The first term on the RHS of eq. 2.9, $P(D|\boldsymbol{\theta})$, is called the ‘likelihood’ and represents how likely the data is given the parameters. It returns a single scalar: the probability (or probability density, for continuous outputs) of drawing that dataset given the model parameters.
To formulate the likelihood function, some probability distribution is specified that is parameterised by the output of a NN. For a regression task, a NN with a single output might specify the mean of a normal distribution with constant variance. If data points are treated as independent and identically distributed (iid), the likelihood is then given by

$$P(D|\boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, f(x_i, \boldsymbol{\theta}), \sigma_\epsilon^2), \tag{2.10}$$

where $N$ denotes the number of data points (written $n$ in the definition of $D$), the NN prediction has been written $f(x, \boldsymbol{\theta})$ to make its dependence on the parameters explicit, and the data is assumed to carry noise of variance $\sigma_\epsilon^2$. This is appropriate if the data noise is homoskedastic (section 2.1.1).
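As a rough illustration of eq. 2.10, the sketch below evaluates the log of the Gaussian likelihood, which turns the product into a sum and is usually preferred numerically. The linear function `f` is only a hypothetical stand-in for the NN prediction, and the toy data and noise variance are made up.

```python
import math


def f(x, theta):
    """Hypothetical stand-in for the NN prediction f(x, theta): a linear model."""
    w, b = theta
    return w * x + b


def gaussian_log_likelihood(data, theta, noise_var):
    """log P(D|theta) = sum_i log N(y_i | f(x_i, theta), noise_var)."""
    total = 0.0
    for x, y in data:
        resid = y - f(x, theta)
        total += -0.5 * math.log(2 * math.pi * noise_var) - resid ** 2 / (2 * noise_var)
    return total


data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]  # toy (x, y) pairs

ll_good = gaussian_log_likelihood(data, (1.0, 0.0), noise_var=0.1)
ll_bad = gaussian_log_likelihood(data, (-1.0, 0.0), noise_var=0.1)

# Parameters that explain the data well receive a higher log-likelihood.
print(ll_good > ll_bad)
```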
If there is reason to believe the data noise is heteroskedastic (section 2.1.1), one might use a NN with two outputs, one specifying the mean and one the variance (or the log variance, which ensures the variance is positive). The likelihood is then

$$P(D|\boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, f_1(x_i, \boldsymbol{\theta}), f_2(x_i, \boldsymbol{\theta})). \tag{2.11}$$
For classification tasks with $C$ classes and one-hot-encoded labels $y_{i,j} \in \{0, 1\}$, it is usual to have a NN with $C$ outputs, which are passed through a softmax function to convert them into normalised probabilities (each output lies between 0 and 1, and together they sum to 1). Denoting $f_j(x_i, \boldsymbol{\theta})$ as output $j$ after the softmax, the likelihood is given by a multinomial distribution,

$$P(D|\boldsymbol{\theta}) \propto \prod_{i=1}^{N} \prod_{j=1}^{C} f_j(x_i, \boldsymbol{\theta})^{y_{i,j}}. \tag{2.12}$$
This covers the likelihood functions most commonly implemented with BNNs.
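The heteroskedastic and classification likelihoods can be sketched in the same log form. The functions below take the NN outputs as plain lists; the names and the choice to have the network output a log variance are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Convert raw outputs into normalised probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hetero_gaussian_log_lik(means, log_vars, ys):
    """Log of eq. 2.11, assuming the second NN output is a log variance."""
    total = 0.0
    for mu, log_var, y in zip(means, log_vars, ys):
        var = math.exp(log_var)  # exponentiating guarantees a positive variance
        total += -0.5 * (math.log(2 * math.pi) + log_var) - (y - mu) ** 2 / (2 * var)
    return total


def categorical_log_lik(logits_per_point, one_hot_labels):
    """Log of eq. 2.12: sum_i sum_j y_ij * log f_j(x_i, theta)."""
    total = 0.0
    for logits, label in zip(logits_per_point, one_hot_labels):
        probs = softmax(logits)
        total += sum(y_ij * math.log(p) for y_ij, p in zip(label, probs))
    return total


# Two data points, three classes; the logits are made-up NN outputs.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [[1, 0, 0], [0, 0, 1]]
print(categorical_log_lik(logits, labels))
print(hetero_gaussian_log_lik([0.0, 1.0], [0.0, -1.0], [0.2, 0.9]))
```

With one-hot labels, the inner sum simply picks out the log probability assigned to the correct class, which is the familiar cross-entropy loss up to sign.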
Prior
Eq. 2.9 also contains $P(\boldsymbol{\theta})$, which is termed the ‘prior’. Broadly speaking, a modeller sets this distribution based on prior knowledge available about the task.
Textbook introductions to Bayes might give examples about coin flipping, where a parameter represents the bias of a coin (its probability of landing heads). The prior encodes an initial belief about this parameter, which is updated by observed data. NNs introduce a complication: it is not obvious what a prior distribution over a NN parameter implies about the resulting predictions, which is what a modeller might have knowledge about.
It is common (though by no means necessary, e.g. (Nalisnick, 2018)) for BNNs to have iid Gaussian priors, of constant mean and variance across a layer, which leads to interesting connections with GPs. This connection forms an important foundation for later contributions, and is described in section 2.6. This will allow an interpretation of parameter priors as functional priors.
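One informal way to see what a parameter prior implies about predictions is to draw parameters from the prior and inspect the resulting functions. The sketch below does this for a small one-hidden-layer NN with iid zero-mean Gaussian priors per layer. The architecture, the tanh activation, the prior scales (including the $1/\sqrt{H}$ scaling of the output layer) and the name `sample_prior_function` are all assumptions chosen for illustration.

```python
import math
import random


def sample_prior_function(n_hidden, rng, w1_std=1.0, b1_std=1.0, w2_std=None):
    """Draw one function from the prior of a one-hidden-layer tanh NN.

    Each layer has an iid zero-mean Gaussian prior with constant variance.
    """
    if w2_std is None:
        # Assumed scaling so the output variance does not grow with width.
        w2_std = 1.0 / math.sqrt(n_hidden)
    w1 = [rng.gauss(0.0, w1_std) for _ in range(n_hidden)]
    b1 = [rng.gauss(0.0, b1_std) for _ in range(n_hidden)]
    w2 = [rng.gauss(0.0, w2_std) for _ in range(n_hidden)]

    def f(x):
        return sum(v * math.tanh(w * x + b) for w, b, v in zip(w1, b1, w2))

    return f


rng = random.Random(0)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Each draw of the parameters gives a different function of x.
for _ in range(3):
    f = sample_prior_function(50, rng)
    print([round(f(x), 2) for x in xs])
```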
A prior is generally set before observing evidence. However, empirical Bayes describes a paradigm whereby hyperparameters controlling the prior are set to their most likely values according to the data. For example, in a GP this could be the kernel length scale, or in BNNs, the variance of the weight priors for a particular layer. Note that this philosophy faces some criticism (Blundell et al., 2015; Gelman, 2008).
Marginal Likelihood
The final term on the RHS of eq. 2.9 is $P(D)$, known as the ‘marginal likelihood’, ‘evidence’ or ‘prior predictive distribution’, calculated by

$$P(D) = \int P(D|\boldsymbol{\theta})\, P(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \tag{2.13}$$

A simple interpretation is that it is a normalising constant: dividing $P(D|\boldsymbol{\theta})P(\boldsymbol{\theta})$ by it ensures that the posterior integrates to one over the domain of $\boldsymbol{\theta}$, which is a requirement of a probability distribution.
It is also useful as a measure of how well a model fits the data (Rasmussen and Williams, 2006). This can be valuable in empirical Bayes procedures where hyperparameters of the prior are to be estimated. Examples of this are provided for a GP in figure 2.14.
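For a model with a single parameter, the integral in eq. 2.13 can be approximated on a grid, which makes the role of the evidence as a normalising constant easy to see. The sketch below assumes a toy model $y = \theta x + \text{noise}$ with a Gaussian prior; the data, variances and grid limits are made up, and this brute-force approach does not scale to the many parameters of a NN.

```python
import math


def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]  # toy (x, y) pairs
noise_var = 0.25   # assumed data noise variance
prior_var = 1.0    # assumed prior variance, prior mean is zero


def likelihood(theta):
    """P(D|theta) for the toy model y = theta * x + noise."""
    p = 1.0
    for x, y in data:
        p *= normal_pdf(y, theta * x, noise_var)
    return p


# Evaluate likelihood * prior on a grid over theta.
step = 0.001
grid = [-5.0 + i * step for i in range(10001)]
unnorm = [likelihood(t) * normal_pdf(t, 0.0, prior_var) for t in grid]

# Eq. 2.13 as a Riemann sum: the evidence is the area under likelihood * prior.
evidence = sum(unnorm) * step

# Dividing by the evidence gives a posterior density that integrates to one.
posterior = [u / evidence for u in unnorm]
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step

print(f"evidence: {evidence:.6g}")
print(f"posterior integrates to: {sum(posterior) * step:.6f}")
print(f"posterior mean of theta: {post_mean:.4f}")
```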
2.4.2 Posterior Predictive Distribution
In models where parameters themselves are of interest (as in the coin flipping example), generating their posterior probability distribution, $P(\boldsymbol{\theta}|D)$, may be the goal of Bayesian inference. By contrast, in NNs we are less interested in the parameter distributions themselves and more interested in using the posterior $P(\boldsymbol{\theta}|D)$ to make predictions for some new data point, $x^*$. This leads to the posterior predictive distribution, $P(y^*|x^*, D)$:

$$P(y^*|x^*, D) = \int P(y^*|x^*, \boldsymbol{\theta})\, P(\boldsymbol{\theta}|D)\, d\boldsymbol{\theta} \tag{2.14}$$

One insight used in later contributions (chapter 4) is that, provided samples can be generated from the posterior, it may not be necessary to fully capture the parameter posterior itself.
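The integral in eq. 2.14 can be approximated by averaging over posterior samples, which is why samples alone can suffice. The sketch below assumes a Gaussian likelihood with constant noise variance, so the predictive mean and variance follow from the law of total variance. The function `predict` is a hypothetical stand-in for the NN, and the "posterior samples" are simply drawn from a made-up Gaussian.

```python
import random


def predict(x, theta):
    """Hypothetical stand-in for the NN mean prediction f(x, theta)."""
    return theta * x


def posterior_predictive(x_star, theta_samples, noise_var):
    """Monte Carlo approximation of the predictive mean and variance."""
    means = [predict(x_star, th) for th in theta_samples]
    mean = sum(means) / len(means)
    # Spread of predictions across parameter samples (model uncertainty).
    model_var = sum((m - mean) ** 2 for m in means) / len(means)
    # Total predictive variance adds the assumed data noise.
    return mean, model_var + noise_var


rng = random.Random(0)
# Made-up stand-in for samples from P(theta | D).
theta_samples = [rng.gauss(1.0, 0.1) for _ in range(2000)]

for x_star in (0.5, 5.0):
    mean, var = posterior_predictive(x_star, theta_samples, noise_var=0.25)
    print(f"x*={x_star}: predictive mean={mean:.3f}, variance={var:.3f}")
```

In this toy model the predictive variance grows with the magnitude of the input, because disagreement between parameter samples is amplified further from the origin.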
2.4.3 Learning in BNNs
In section 2.4.3, we described a learning mechanism, SGD, which is widely used for NNs with point estimates of the parameters and a defined loss function. One of the major challenges of BNNs is that we can no longer use this mechanism directly. The objective of BNNs is to find the posterior distribution; but there is no obvious loss function that can be minimised, nor is it clear how the parameter update rule in eq. 2.8 would apply to distributions.
Computing the posterior analytically is possible only in special models where the prior and likelihood distributions are conjugate. This occurs, for example, with a beta prior and
Bernoulli likelihood, which apply in the coin flipping example. For most models, including BNNs, eq. 2.9 is not tractable.
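For the coin flipping case, conjugacy means the posterior can be written down directly: a Beta($\alpha$, $\beta$) prior combined with Bernoulli observations gives a Beta($\alpha + \text{heads}$, $\beta + \text{tails}$) posterior. A minimal sketch, with a made-up sequence of flips and a uniform Beta(1, 1) prior chosen for illustration:

```python
def beta_bernoulli_posterior(alpha, beta, flips):
    """Conjugate update: Beta(alpha, beta) prior with Bernoulli observations.

    `flips` is a sequence of 1 (heads) and 0 (tails).
    """
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


flips = [1, 1, 0, 1, 0, 1, 1, 1]  # made-up data: 6 heads, 2 tails
a, b = beta_bernoulli_posterior(1.0, 1.0, flips)
posterior_mean = a / (a + b)

print(f"posterior: Beta({a}, {b}), mean = {posterior_mean:.2f}")
```

No integration or optimisation is needed here, which is exactly what is lost when moving to the likelihoods used with NNs.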
We now briefly introduce the main classes of techniques used to perform Bayesian inference in NNs. Note that this is a vast, active area of research and we attempt to summarise only the main approaches.
One disadvantage of many of these methods is that they fall outside the common frameworks that the majority of the community uses for training NNs. Indeed, one method that does not, MC Dropout, is (arguably) the most widely used because of this. Our proposal in chapter 4 provides a similarly easy-to-implement method.
Sampling Methods - MCMC, HMC
Markov chain Monte Carlo (MCMC) methods provide a procedure for sampling from a probability distribution. Stochastic chains explore the parameter space. They are designed so that the stationary distribution of the chain matches the target probability distribution. Hence one can obtain a set of parameter samples forming an empirical distribution.
MCMC methods can be implemented with a variety of algorithms. Metropolis–Hastings works well in small-scale settings but can be slow to converge in more complex scenarios. Hamiltonian Monte Carlo (HMC) uses gradient information to move around the parameter space more efficiently. It was shown to be better suited to use in BNNs (Neal, 1997), and more recently has been made more practical (Chen et al., 2014).
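As an illustration of the sampling idea, the sketch below implements random-walk Metropolis–Hastings for a one-parameter toy model ($y = \theta x + \text{noise}$, Gaussian prior). The data, variances, step size and burn-in length are assumptions, and this is the simple variant rather than HMC. Only the unnormalised log posterior is needed, since the evidence cancels in the acceptance ratio.

```python
import math
import random

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]  # toy (x, y) pairs
noise_var = 0.25   # assumed data noise variance
prior_var = 1.0    # assumed zero-mean Gaussian prior variance


def log_unnorm_posterior(theta):
    """log P(D|theta) + log P(theta), dropping constants that cancel."""
    log_prior = -theta ** 2 / (2 * prior_var)
    log_lik = sum(-(y - theta * x) ** 2 / (2 * noise_var) for x, y in data)
    return log_prior + log_lik


def metropolis_hastings(log_target, init, n_steps, step_std, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = init
    logp = log_target(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_std)
        logp_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            theta, logp = proposal, logp_prop
        samples.append(theta)  # a rejected step repeats the current value
    return samples


chain = metropolis_hastings(log_unnorm_posterior, init=0.0, n_steps=20000, step_std=0.3)
kept = chain[2000:]  # discard burn-in before the chain reaches stationarity
mcmc_mean = sum(kept) / len(kept)
mcmc_std = math.sqrt(sum((s - mcmc_mean) ** 2 for s in kept) / len(kept))

print(f"posterior mean ~ {mcmc_mean:.3f}, std ~ {mcmc_std:.3f}")
```

The retained samples form the empirical distribution over the parameter described above.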
Variational Inference
Variational inference (VI) provides an alternative paradigm to sampling methods. A parametric approximating distribution, $q_\nu(\boldsymbol{\theta})$, is defined, with the goal of finding the distribution parameters, $\nu$, that minimise the Kullback–Leibler divergence between the approximating distribution and the true posterior, $\arg\min_\nu D_{KL}(q_\nu(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta}|D))$. This optimisation is done by maximising a surrogate quantity called the ELBO (evidence lower bound), given by

$$\text{ELBO} = \int q_\nu(\boldsymbol{\theta}) \log p(D|\boldsymbol{\theta})\, d\boldsymbol{\theta} - D_{KL}(q_\nu(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})) \tag{2.15}$$
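The ELBO can be estimated numerically as the expected log-likelihood under $q_\nu$ minus the KL divergence from $q_\nu$ to the prior, which is the standard form of the bound. The sketch below does this for a one-parameter toy model with a Gaussian $q_\nu$ and a zero-mean Gaussian prior, using Monte Carlo samples for the expectation and the closed-form KL divergence between two Gaussians. The model, data and variances are assumptions for illustration.

```python
import math
import random

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]  # toy (x, y) pairs
noise_var = 0.25   # assumed data noise variance
prior_var = 1.0    # assumed zero-mean Gaussian prior variance


def gaussian_kl(m_q, var_q, m_p, var_p):
    """Closed-form KL( N(m_q, var_q) || N(m_p, var_p) )."""
    return 0.5 * (var_q / var_p + (m_q - m_p) ** 2 / var_p - 1.0
                  + math.log(var_p / var_q))


def log_likelihood(theta):
    """log P(D|theta) for the toy model y = theta * x + noise."""
    return sum(-0.5 * math.log(2 * math.pi * noise_var)
               - (y - theta * x) ** 2 / (2 * noise_var) for x, y in data)


def elbo(m, s, n_samples=2000, seed=0):
    """Monte Carlo ELBO estimate for q = N(m, s^2)."""
    rng = random.Random(seed)
    expected_ll = sum(log_likelihood(rng.gauss(m, s))
                      for _ in range(n_samples)) / n_samples
    return expected_ll - gaussian_kl(m, s ** 2, 0.0, prior_var)


elbo_poor = elbo(m=0.0, s=1.0)    # q equal to the prior, ignores the data
elbo_good = elbo(m=1.0, s=0.15)   # q concentrated near the slope the data suggests

print(f"ELBO (q = prior): {elbo_poor:.2f}")
print(f"ELBO (q near data-implied slope): {elbo_good:.2f}")
```

A VI procedure would adjust `m` and `s` by gradient ascent on this quantity; the sketch only evaluates it at two hand-picked settings.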
The major attraction of the method is precisely that it can be framed as an optimisation problem, which opens the door to widely used tools already available. A disadvantage is that the quality of approximation of the posterior is limited by both how restrictive the