2.4.1 Distributions Over Parameters

Earlier we introduced the weights of a NN as scalar numerical values that represent the strength of a connection between two neurons and can be learnt using gradient descent.

A Bayesian NN requires a conceptually different setup. Rather than assigning each parameter a specific numerical value, we instead model the parameters as random variables with a probability distribution. This could be either a parametric probability distribution (such as normal or uniform) or an empirical probability distribution (stored as a set of parameter samples). Figure 2.10 illustrates this. Note that probability distributions are shown as though parameters are independent of each other, though this need not be the case.

Probability distributions in general are used to describe the uncertainty over the value of some variable, and this is no exception: we use them to represent the uncertainty in parameters.

[Figure 2.10, panels: Numerical values | Parametric distribution | Empirical distribution]

Figure 2.10 A neuron with point estimates over parameters produces a point estimate at its output. If parameters are modelled with distributions (parametric or empirical), the output will also be a distribution.
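The three cases in figure 2.10 can be sketched numerically. The following is a minimal illustration, assuming a single tanh neuron with one weight and one bias; the activation, the parameter values and the distribution widths are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def neuron(x, w, b):
    """Single tanh neuron; w and b may be scalars or arrays of samples."""
    return np.tanh(w * x + b)

x = 0.5

# Point estimates: one value per parameter gives one output value.
y_point = neuron(x, w=1.2, b=-0.3)

# Parametric distributions: independent normals (an illustrative assumption).
w_samples = rng.normal(loc=1.2, scale=0.4, size=10_000)
b_samples = rng.normal(loc=-0.3, scale=0.2, size=10_000)
y_parametric = neuron(x, w_samples, b_samples)

# Empirical distribution: a stored set of (w, b) samples, used directly.
stored = np.array([[1.0, -0.2], [1.5, -0.5], [0.9, -0.1], [1.3, -0.4]])
y_empirical = neuron(x, stored[:, 0], stored[:, 1])

print(float(y_point), y_parametric.mean(), y_parametric.std(), y_empirical)
```

In the last two cases the output is itself a collection of samples, that is, an empirical representation of the output distribution.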

2.4 Making Neural Networks Bayesian 23

Although relatively straightforward conceptually, this new setup has implications both for computing the feedforward output (how should the output of the BNN now be computed?) and, more challengingly, for the learning process (how can gradient descent be applied to distributions?).

Bayesian Posterior Distribution

Assigning arbitrary distributions to parameters doesn’t make a NN Bayesian. A Bayesian NN is a NN with a specific distribution over the parameters. This distribution is known as the ‘Bayesian posterior’, denoted $P(\boldsymbol{\theta} \mid \mathcal{D})$. Here, $\mathcal{D}$ stands for some dataset, which in supervised learning is a set of input/output pairs, $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. It is simple to write an equation to calculate this posterior distribution in terms of several other distributions.

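$$P(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})}{P(\mathcal{D})} \tag{2.9}$$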

Note that we have slightly abused notation here. More precisely, $P(\mathcal{D} \mid \boldsymbol{\theta}) := P(\{y_i\}_{i=1}^{n} \mid \boldsymbol{\theta}, \{x_i\}_{i=1}^{n})$ and $P(\mathcal{D}) := P(\{y_i\}_{i=1}^{n} \mid \{x_i\}_{i=1}^{n})$.

Likelihood

On the RHS of eq. 2.9 is $P(\mathcal{D} \mid \boldsymbol{\theta})$, called the ‘likelihood’, which represents how likely the data is given the parameters. It returns a single scalar representing the probability of drawing that dataset, given the model parameters.

In order to formulate the likelihood function, some probability distribution is specified that is parameterised by the output of a NN. For a regression task, a NN with a single output might specify the mean of a normal distribution with constant variance. If data points are treated as independent and identically distributed (iid), the likelihood is then given by,

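$$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid f(x_i, \boldsymbol{\theta}),\, \sigma_\epsilon^2\big), \tag{2.10}$$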

where the NN prediction has been written, $f(x, \boldsymbol{\theta})$, to depend explicitly on the parameters, and some amount of noise has been assumed in the data, of variance $\sigma_\epsilon^2$. This is appropriate if the data noise is homoskedastic (section 2.1.1).
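Eq. 2.10 is usually evaluated in log space, where the product becomes a sum. The sketch below is one way to do this, assuming a small one-hidden-layer tanh network as $f(x, \boldsymbol{\theta})$; the helper names `mlp` and `gaussian_log_likelihood`, the network size and the toy data are illustrative assumptions.

```python
import numpy as np

def mlp(x, theta):
    """Tiny one-hidden-layer network f(x, theta) with a single output."""
    w1, b1, w2, b2 = theta
    hidden = np.tanh(np.outer(x, w1) + b1)  # shape (n, hidden_units)
    return hidden @ w2 + b2                 # shape (n,)

def gaussian_log_likelihood(y, mean, sigma_eps):
    """log of prod_i N(y_i | mean_i, sigma_eps^2), computed as a sum of logs."""
    var = sigma_eps ** 2
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.shape)  # toy data

# One arbitrary parameter setting theta = (w1, b1, w2, b2).
theta = (rng.normal(size=5), rng.normal(size=5), rng.normal(size=5), 0.0)

log_lik = gaussian_log_likelihood(y, mlp(x, theta), sigma_eps=0.1)
print(log_lik)
```

Working with the log-likelihood avoids numerical underflow when many small densities are multiplied together.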

If there is reason to believe the data noise is heteroskedastic (section 2.1.1), one might use a NN with two outputs, one specifying the mean and one the variance (or the log variance, to ensure the variance is positive). The likelihood is then,

$$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid f_1(x_i, \boldsymbol{\theta}),\, f_2(x_i, \boldsymbol{\theta})\big). \tag{2.11}$$
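As a rough sketch of eq. 2.11 in log space, the function below takes the two NN outputs as arrays, with the second output interpreted as a log variance. The function name and the example values are hypothetical.

```python
import numpy as np

def heteroskedastic_log_likelihood(y, mean, log_var):
    """log of prod_i N(y_i | mean_i, exp(log_var_i))."""
    return np.sum(
        -0.5 * np.log(2.0 * np.pi)
        - 0.5 * log_var
        - 0.5 * (y - mean) ** 2 / np.exp(log_var)
    )

y = np.array([0.1, -0.4, 2.0])
mean = np.array([0.0, -0.5, 1.0])       # first NN output, f1
log_var = np.array([-2.0, -2.0, 1.0])   # second NN output, read as log f2
log_lik = heteroskedastic_log_likelihood(y, mean, log_var)
print(log_lik)
```

Exponentiating the second output guarantees a positive variance whatever value the network produces.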

For classification tasks with $C$ classes and one-hot-encoded labels $y_{i,j} \in \{0, 1\}$, it’s usual to have a NN with $C$ outputs, which are passed through a softmax function in order to convert them into normalised probabilities (each output lies between $0$ and $1$, and together they sum to $1$). Denoting $f_j(x_i, \boldsymbol{\theta})$ as output $j$ after the softmax, the likelihood is given by a multinomial distribution,

$$P(\mathcal{D} \mid \boldsymbol{\theta}) \propto \prod_{i=1}^{n} \prod_{j=1}^{C} f_j(x_i, \boldsymbol{\theta})^{y_{i,j}}. \tag{2.12}$$

This covers the likelihood functions most commonly implemented with BNNs.
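The classification likelihood of eq. 2.12 can be sketched in the same way. The example below assumes the pre-softmax NN outputs (logits) are already available as an array; the function names and values are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax, applied row-wise."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def classification_log_likelihood(y_onehot, logits):
    """log of prod_i prod_j f_j(x_i, theta)^(y_ij), up to an additive constant."""
    return np.sum(y_onehot * log_softmax(logits))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])   # hypothetical pre-softmax NN outputs
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])
log_lik = classification_log_likelihood(y_onehot, logits)
print(log_lik)
```

Because the labels are one-hot, only the log-probability of the true class contributes for each data point, so the negative of this quantity is the familiar cross-entropy loss.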

Prior

Also in eq. 2.9 we have $P(\boldsymbol{\theta})$, which is termed the ‘prior’. Broadly speaking, a modeller sets this distribution based on prior knowledge available about the task.

Textbook introductions to Bayes might give examples about coin flipping, where a parameter represents the probability of a coin being biased. The prior encodes an initial belief about this parameter, which is updated by observed data. NNs introduce a complication; it is not obvious what a prior distribution over a NN parameter means about the implied prediction, which is what a modeller might have knowledge about.

It is common (though by no means necessary; see e.g. Nalisnick, 2018) for BNNs to have iid Gaussian priors, of constant mean and variance across a layer, which leads to interesting connections with GPs. This connection forms an important foundation for later contributions, and is described in section 2.6. It will allow an interpretation of parameter priors as functional priors.
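One rough way to see what an iid Gaussian prior over parameters implies about predictions is to draw parameters from the prior and evaluate the resulting functions. The sketch below assumes a one-hidden-layer tanh network; the helper name `sample_prior_function`, the layer width, the prior standard deviation and the scaling of the output-layer prior by the width are all illustrative assumptions.

```python
import numpy as np

def sample_prior_function(x, hidden, prior_std, rng):
    """Draw one parameter set from iid zero-mean Gaussian priors and evaluate the NN."""
    w1 = rng.normal(0.0, prior_std, size=hidden)
    b1 = rng.normal(0.0, prior_std, size=hidden)
    # Assumption: the output-layer prior std shrinks with the width,
    # so the output scale does not grow as more hidden units are added.
    w2 = rng.normal(0.0, prior_std / np.sqrt(hidden), size=hidden)
    hidden_act = np.tanh(np.outer(x, w1) + b1)  # shape (len(x), hidden)
    return hidden_act @ w2

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)

# Each row is one function drawn from the prior, evaluated on the grid x.
prior_draws = np.stack([sample_prior_function(x, 100, 2.0, rng) for _ in range(5)])
print(prior_draws.shape)
```

Plotting the rows of `prior_draws` against `x` gives a visual impression of the functional prior implied by the chosen parameter prior.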

A prior is generally set before observing evidence. Empirical Bayes, however, describes a paradigm whereby hyperparameters controlling the prior are set to their most likely values according to the data. For example, in a GP this could be the kernel length scale, or in BNNs, the variance of the weight priors for a particular layer. Note that this philosophy faces some criticism (Blundell et al., 2015; Gelman, 2008).


Marginal Likelihood

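The final term on the RHS of eq. 2.9 is $P(\mathcal{D})$, known as the ‘marginal likelihood’, ‘evidence’ or ‘prior predictive distribution’, calculated by,

$$P(\mathcal{D}) = \int P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \tag{2.13}$$

A simple interpretation is that it is a normalising constant, ensuring that the posterior, $P(\mathcal{D} \mid \boldsymbol{\theta}) P(\boldsymbol{\theta}) / P(\mathcal{D})$, integrates to one over the domain of $\boldsymbol{\theta}$, which is a requirement of a probability distribution.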

It is also useful as a measure of how well a model fits the data (Rasmussen and Williams, 2006). This can be valuable in empirical Bayes procedures where hyperparameters of the prior are to be estimated. Examples of this are provided for a GP in figure 2.14.
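For a model with a single parameter, the integral in eq. 2.13 can be approximated on a grid. The sketch below uses the coin flipping example, assuming the parameter is the probability of heads, a uniform prior, and a hypothetical dataset of 7 heads and 3 tails. This is only feasible because the parameter space is one-dimensional; it does not scale to NNs.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal rule for integrating values over grid."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

theta = np.linspace(0.0, 1.0, 10_001)   # grid over the coin's probability of heads
heads, tails = 7, 3                     # hypothetical observed flips

likelihood = theta ** heads * (1.0 - theta) ** tails   # P(D | theta)
prior = np.ones_like(theta)                            # uniform prior P(theta)

evidence = trapezoid(likelihood * prior, theta)        # P(D), eq. 2.13
posterior = likelihood * prior / evidence              # eq. 2.9 on the grid

print(evidence, trapezoid(posterior, theta))
```

Dividing by `evidence` is what makes the posterior integrate to one, which is the normalising role described above.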

2.4.2 Posterior Predictive Distribution

In models where parameters themselves are of interest (as in the coin flipping example), it might be that generating their posterior probability distribution, $P(\boldsymbol{\theta} \mid \mathcal{D})$, is the goal of Bayesian inference. By contrast, in NNs we are less interested in the parameter distributions themselves, and more interested in making predictions using the posterior distribution $P(\boldsymbol{\theta} \mid \mathcal{D})$ for some new data point, $x^*$. This leads to the posterior predictive distribution, $P(y^* \mid x^*, \mathcal{D})$.

$$P(y^* \mid x^*, \mathcal{D}) = \int P(y^* \mid x^*, \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} \tag{2.14}$$

One insight used in later contributions (chapter 4) is that, provided samples can be generated from the posterior, it may not be necessary to fully capture the parameter posterior itself.
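When posterior samples $\boldsymbol{\theta}_s$ are available, the integral in eq. 2.14 can be approximated by a Monte Carlo average over them. The sketch below assumes a Gaussian likelihood with constant noise variance, in which case the predictive mean is the average prediction and the predictive variance is the spread of the predictions plus the noise variance. The linear stand-in model and the two hard-coded "posterior samples" are purely hypothetical.

```python
import numpy as np

def linear_model(x, theta):
    """Stand-in for f(x, theta): slope theta[0], intercept theta[1]."""
    return theta[0] * x + theta[1]

def posterior_predictive_moments(x_star, theta_samples, sigma_eps):
    """Monte Carlo mean and variance of P(y* | x*, D) from parameter samples."""
    preds = np.array([linear_model(x_star, th) for th in theta_samples])
    mean = preds.mean(axis=0)
    # Spread across parameter samples plus the assumed data noise variance.
    var = preds.var(axis=0) + sigma_eps ** 2
    return mean, var

theta_samples = np.array([[1.0, 0.0], [3.0, 0.0]])  # hypothetical posterior samples
mean, var = posterior_predictive_moments(2.0, theta_samples, sigma_eps=0.5)
print(mean, var)
```

Only samples from the posterior are needed here, not a closed-form expression for it.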

2.4.3 Learning in BNNs

Earlier we described a learning mechanism, SGD, which is widely used for NNs with point estimates of the parameters and a defined loss function. One of the major challenges of BNNs is that we can no longer use this mechanism directly. The objective of BNNs is to find the posterior distribution, but there is no obvious loss function that can be minimised, nor is it clear how the parameter update rule in eq. 2.8 would apply to distributions.

Computing the posterior analytically is possible only in special models where the prior and likelihood distributions are conjugate. This occurs, for example, with a beta prior and Bernoulli likelihood, which apply in the coin flipping example. For most models, including BNNs, eq. 2.9 is not tractable.
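The conjugate case can be made concrete with the coin flipping example. With a $\mathrm{Beta}(\alpha, \beta)$ prior and a Bernoulli likelihood, the posterior is again a beta distribution whose parameters are the prior parameters plus the observed counts. The function name and the sequence of flips below are illustrative.

```python
def beta_bernoulli_posterior(alpha, beta, flips):
    """Closed-form posterior Beta(alpha + heads, beta + tails) for 0/1 flips."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails

flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data: 7 heads, 3 tails
post_alpha, post_beta = beta_bernoulli_posterior(1.0, 1.0, flips)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, posterior_mean)
```

No integration or optimisation is needed, which is exactly what is lost once the model is a NN.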

We now briefly introduce the main classes of techniques used to perform Bayesian inference in NNs. Note that this is a vast, active area of research and we attempt to summarise only the main approaches.

One of the disadvantages of many of these methods is that they fall outside the common frameworks that the majority of the community uses for training NNs. Indeed, one method that doesn’t, MC Dropout, is (arguably) the most widely used because of this. Our proposal in chapter 4 provides a similarly easy-to-implement method.

Sampling Methods - MCMC, HMC

Markov chain Monte Carlo (MCMC) methods provide a procedure for sampling from a probability distribution. Stochastic chains explore the parameter space. They are designed so that the stationary distribution of the chain matches the target probability distribution. Hence one can obtain a set of parameter samples forming an empirical distribution.

MCMC methods can be implemented with a variety of algorithms. Metropolis–Hastings works well in small-scale settings but can be slow to converge in more complex scenarios. Hamiltonian Monte Carlo (HMC) uses gradient information to move around the parameter space more efficiently. It was shown to be better suited to use in BNNs (Neal, 1997), and more recently has been made more practical (Chen et al., 2014).
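To illustrate the sampling idea in a small-scale setting, the following is a minimal random-walk Metropolis–Hastings sketch. It targets the one-parameter coin posterior (uniform prior, hypothetical data of 7 heads and 3 tails) rather than a BNN; the step size, chain length and burn-in are arbitrary illustrative choices.

```python
import numpy as np

def log_target(theta, heads=7, tails=3):
    """Unnormalised log posterior of a coin's heads probability, uniform prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return heads * np.log(theta) + tails * np.log(1.0 - theta)

def metropolis_hastings(log_p, init, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    samples = np.empty(n_steps)
    current, current_lp = init, log_p(init)
    for t in range(n_steps):
        proposal = current + step_size * rng.normal()
        proposal_lp = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[t] = current
    return samples

rng = np.random.default_rng(0)
samples = metropolis_hastings(log_target, init=0.5, n_steps=20_000,
                              step_size=0.2, rng=rng)
kept = samples[2_000:]  # discard an initial burn-in portion of the chain
print(kept.mean())
```

The retained samples form an empirical distribution over the parameter. Only an unnormalised target is required, so the marginal likelihood never has to be computed.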

Variational Inference

Variational inference (VI) provides an alternative paradigm to sampling methods. A parametric approximating distribution, $q_\nu(\boldsymbol{\theta})$, is defined, with the goal of finding distribution parameters, $\nu$, that minimise the Kullback–Leibler divergence between the approximating distribution and the true posterior, $\arg\min_\nu D_{KL}\big(q_\nu(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D})\big)$. This optimisation is done by maximising a surrogate quantity called the ELBO (evidence lower bound), given by,

$$\text{ELBO} = \int q_\nu(\boldsymbol{\theta}) \log p(\mathcal{D} \mid \boldsymbol{\theta})\, d\boldsymbol{\theta} - D_{KL}\big(q_\nu(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})\big) \tag{2.15}$$
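The lower-bound property can be checked in a toy conjugate model where everything is available in closed form. The sketch below uses the standard form of the ELBO, the expected log-likelihood under $q_\nu$ minus the KL divergence from $q_\nu$ to the prior. It assumes a model with a single unknown mean, $y_i \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, s_0^2)$, and a Gaussian $q_\nu = \mathcal{N}(m, v)$. The model, the function names and the data are illustrative assumptions, not a BNN.

```python
import numpy as np

def elbo_gaussian_mean(y, m, v, sigma2, prior_var):
    """ELBO for y_i ~ N(theta, sigma2), theta ~ N(0, prior_var), q = N(m, v)."""
    n = y.size
    # E_q[log p(D | theta)], using E_q[(y_i - theta)^2] = (y_i - m)^2 + v.
    expected_log_lik = (-0.5 * n * np.log(2.0 * np.pi * sigma2)
                        - 0.5 * (np.sum((y - m) ** 2) + n * v) / sigma2)
    # KL(N(m, v) || N(0, prior_var)) between two univariate Gaussians.
    kl = 0.5 * (np.log(prior_var / v) + (v + m ** 2) / prior_var - 1.0)
    return expected_log_lik - kl

def log_evidence_gaussian_mean(y, sigma2, prior_var):
    """Exact log p(D): marginally y ~ N(0, sigma2 * I + prior_var * 11^T)."""
    n = y.size
    cov = sigma2 * np.eye(n) + prior_var * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

y = np.array([0.8, 1.1, 1.4, 0.9])   # hypothetical observations
sigma2, prior_var = 0.25, 4.0

# Exact posterior N(post_mean, post_var) for this conjugate model.
post_var = 1.0 / (1.0 / prior_var + y.size / sigma2)
post_mean = post_var * y.sum() / sigma2

log_evidence = log_evidence_gaussian_mean(y, sigma2, prior_var)
elbo_at_posterior = elbo_gaussian_mean(y, post_mean, post_var, sigma2, prior_var)
elbo_elsewhere = elbo_gaussian_mean(y, 0.0, 1.0, sigma2, prior_var)
print(log_evidence, elbo_at_posterior, elbo_elsewhere)
```

The ELBO equals the log evidence when $q_\nu$ is the exact posterior and is strictly smaller otherwise, which is why maximising it drives $q_\nu$ towards the posterior.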

The major attraction of the method is precisely that it can be framed as an optimisation problem, which opens the door to widely used tools already available. A disadvantage is that the quality of approximation of the posterior is limited by both how restrictive the
