2016; Schoenholz et al., 2016). Figure 3.7 illustrates this in regression and classification for increasingly deep finite-width NNs, alongside the GP kernel corresponding to an infinitely deep, infinitely wide BNN. This degeneration is slowed, but not avoided, with residual connections.
3.3 Specifying Priors for BNNs via Proxy GPs
Having used GPs as a lens to describe the meaning of BNN architecture and parameter priors, a method exploiting the connection for practical gain is now proposed.
GP kernels generally have a small number of hyperparameters, which can be conveniently tuned by maximising the marginal likelihood (section 2.4.1), for example via gradient descent. Modern GP libraries such as GPflow (De et al., 2017) handle this automatically.
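As a rough illustration of tuning kernel hyperparameters by maximising the log marginal likelihood, the sketch below uses NumPy only. It assumes the ERF (arcsine) kernel in the form given by Williams (1996), which may be parameterised differently from eq. 2.39, a fixed noise variance, and a simple finite-difference gradient ascent with backtracking in place of a library optimiser. The function names and default values are illustrative assumptions, not those of any particular library.

```python
import numpy as np


def erf_kernel(X1, X2, var_w1, var_b1, var_w2):
    """ERF (arcsine) kernel, Williams (1996) form (assumed), inputs of shape (n, d)."""
    # x~^T Sigma x~' with x~ = (1, x) and Sigma = diag(var_b1, var_w1, ..., var_w1)
    s12 = var_b1 + var_w1 * (X1 @ X2.T)
    s11 = var_b1 + var_w1 * np.sum(X1 ** 2, axis=1)
    s22 = var_b1 + var_w1 * np.sum(X2 ** 2, axis=1)
    ratio = 2.0 * s12 / np.sqrt(np.outer(1.0 + 2.0 * s11, 1.0 + 2.0 * s22))
    return var_w2 * (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))


def log_marginal_likelihood(X, y, var_w1, var_b1, var_w2, noise_var):
    """log p(y | X) for a zero-mean GP with Gaussian observation noise."""
    n = len(X)
    K = erf_kernel(X, X, var_w1, var_b1, var_w2) + noise_var * np.eye(n)
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return -np.inf  # treat numerically invalid settings as very poor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)


def tune_hyperparameters(X, y, init, noise_var=0.01, n_steps=100, max_step=0.2, eps=1e-4):
    """Gradient ascent on log-hyperparameters; a step is accepted only if it improves the MLL."""
    theta = np.log(np.asarray(init, dtype=float))

    def objective(t):
        return log_marginal_likelihood(X, y, *np.exp(t), noise_var)

    best = objective(theta)
    for _ in range(n_steps):
        # Central finite differences stand in for analytic or automatic gradients.
        grad = np.array([(objective(theta + eps * e) - objective(theta - eps * e)) / (2.0 * eps)
                         for e in np.eye(theta.size)])
        direction = grad / (np.linalg.norm(grad) + 1e-12)
        step = max_step
        while step > 1e-6:
            candidate = objective(theta + step * direction)
            if candidate > best:
                theta, best = theta + step * direction, candidate
                break
            step *= 0.5
        else:
            break  # no improving step found, so stop early
    return np.exp(theta), best
```

In practice a GP library would compute exact gradients and typically tune the noise variance as well; the backtracking rule here is only a simple way to keep the sketch stable.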
Conversely, priors for BNNs are generally tuned along with other model hyperparameters through a relatively inefficient cross-validation process (e.g. Blundell et al., 2015; Khan et al., 2018), or else ignored and fixed to some default value (e.g. Chua et al., 2018; Gal and Ghahramani, 2015). Certain variational methods allow direct optimisation of the prior and posterior distributions simultaneously (Hernández-Lobato and Adams, 2015), which some authors report to be effective (Wu et al., 2019) and others challenging (Blundell et al., 2015).
Given that GPs with certain kernels and BNNs with certain architectures are equivalent in the infinite-width limit, we propose to optimise the BNN prior hyperparameters using the equivalent (‘proxy’) GP, which is efficient for smaller datasets. The optimised GP hyperparameters can then be transferred to build a BNN with a suitable prior.
This scheme is well suited to scenarios where data grows incrementally, such as in model-based RL or active learning tasks. Here, a GP kernel can be quickly optimised on the initial small dataset. Training then proceeds using the equivalent BNN, initialised with the tuned prior hyperparameters, so that learning can continue in a scalable manner. Figure 3.8 shows a toy example of this on a 1D regression problem with ERF activations and kernel.
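The transfer step can be sanity-checked numerically. The sketch below draws functions from the prior of a single-hidden-layer BNN with ERF activations, using the tuned values shown in Figure 3.8 and dividing the output-weight variance by the hidden width H, then compares the empirical covariance of the draws with the ERF kernel. It assumes the Williams (1996) form of the kernel, 1D inputs, and no output bias; the function names, the width of 50, and the number of draws are illustrative choices.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)  # element-wise erf without requiring SciPy


def erf_kernel_1d(x1, x2, var_w1, var_b1, var_w2):
    """ERF (arcsine) kernel for scalar inputs, Williams (1996) form (assumed)."""
    s12 = var_b1 + var_w1 * x1 * x2
    s11 = var_b1 + var_w1 * x1 * x1
    s22 = var_b1 + var_w1 * x2 * x2
    return var_w2 * (2.0 / np.pi) * np.arcsin(
        2.0 * s12 / np.sqrt((1.0 + 2.0 * s11) * (1.0 + 2.0 * s22)))


def sample_bnn_prior(x, n_hidden, var_w1, var_b1, var_w2, n_draws, rng):
    """Prior function draws, shape (n_draws, len(x)), from a one-hidden-layer ERF BNN."""
    w1 = rng.normal(0.0, np.sqrt(var_w1), size=(n_draws, n_hidden, 1))
    b1 = rng.normal(0.0, np.sqrt(var_b1), size=(n_draws, n_hidden, 1))
    # The GP output variance is divided by the width H when building the BNN prior.
    w2 = rng.normal(0.0, np.sqrt(var_w2 / n_hidden), size=(n_draws, n_hidden, 1))
    hidden = _erf(w1 * x[None, None, :] + b1)  # shape (n_draws, n_hidden, len(x))
    return np.sum(w2 * hidden, axis=1)


# Tuned GP hyperparameters as reported in Figure 3.8.
var_w1, var_b1, var_w2 = 4.96, 4.96, 2.89
x = np.array([-1.0, 0.5])
rng = np.random.default_rng(0)
draws = sample_bnn_prior(x, n_hidden=50, var_w1=var_w1, var_b1=var_b1,
                         var_w2=var_w2, n_draws=4000, rng=rng)

empirical_cov = draws.T @ draws / len(draws)  # the prior is zero-mean
kernel_cov = erf_kernel_1d(x[:, None], x[None, :], var_w1, var_b1, var_w2)
```

Under these assumptions the two covariance matrices should agree up to Monte Carlo error for any width, while the draws themselves only approach a GP as H grows.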
3.3.1 Illustrative Experiment: Model-Based RL
This experiment demonstrates the value of optimising BNN priors via proxy GPs in a model-based RL task with an incrementally growing dataset. The cartpole environment was used, where rewards, r_t, are maximised by holding a pole (starting from a hanging position) at a
[Figure 3.8 panels. Collect initial dataset: fixed GP prior, σ²_w1 = σ²_b1 = 100, σ²_w2 = 1, MLL = 15.3. Step 1: tune GP prior, σ²_w1 = σ²_b1 = 4.96, σ²_w2 = 2.89, MLL = 23.9. Step 2: create a BNN with the GP prior hyperparameters, σ²_w1 = σ²_b1 = 4.96, σ²_w2 = 2.89/H. Step 3: continue BNN learning on the growing dataset, σ²_w1 = σ²_b1 = 4.96, σ²_w2 = 2.89/H.]
Figure 3.8 Using a proxy GP to specify a prior for a single-layer BNN with ERF activations. In step 1, the parameters of the GP ERF kernel (eq. 2.39) are optimised. Note that the tuned variance in the GP model is normalised by H when building the BNN.
target location (central and upright). The action space is continuous: the action, a_t ∈ [−1, 1], applies a force to the cart. Episodes ran for 200 time steps, t = 0, 1, ..., 199. The state vector, s_t, consists of four variables: cart position, cart velocity, pole angle, and pole angular velocity. The reward function is assumed given.
Figure 3.9 The cartpole control task is a classic RL problem. Rewards are maximised by balancing the pole upright. The cart is moved left and right to achieve this, forming the action space for the environment.
In model-based RL, the agent learns a model of the dynamics of the environment, i.e. p(s_{t+1} | s_t, a_t). Given this model, a planning algorithm is used to select actions. Here random shooting was used with model-based predictive control (MPC) (Nagabandi et al., 2017), with a horizon length of 60 timesteps and 1000 random trajectories.
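A minimal sketch of random-shooting MPC is given below, using the horizon of 60 and the 1000 candidate trajectories stated above. The learned dynamics model is replaced by a hypothetical deterministic stand-in (`toy_dynamics`, a 1D point mass) and the reward by `toy_reward`; neither is the cartpole model used in the experiment, and a BNN dynamics model would additionally propagate predictive uncertainty.

```python
import numpy as np


def random_shooting_mpc(state, dynamics_fn, reward_fn, horizon=60,
                        n_trajectories=1000, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly from the action range [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_trajectories, horizon))
    states = np.tile(state, (n_trajectories, 1))
    returns = np.zeros(n_trajectories)
    for t in range(horizon):
        # Roll every candidate trajectory forward one step through the model.
        states = dynamics_fn(states, actions[:, t])
        returns += reward_fn(states)
    best = np.argmax(returns)
    # Only the first action is executed; planning is repeated at the next time step.
    return float(actions[best, 0])


def toy_dynamics(states, actions, dt=0.1):
    """Hypothetical stand-in for a learned model: point mass with state (position, velocity)."""
    pos, vel = states[:, 0], states[:, 1]
    new_vel = vel + dt * actions
    new_pos = pos + dt * new_vel
    return np.stack([new_pos, new_vel], axis=1)


def toy_reward(states):
    """Hypothetical reward: penalise squared distance of the position from the origin."""
    return -states[:, 0] ** 2


action = random_shooting_mpc(np.array([1.0, 0.0]), toy_dynamics, toy_reward,
                             rng=np.random.default_rng(0))
```

Random shooting needs only forward simulation of the model, at the cost of exploring the action space inefficiently as the horizon grows.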
It has been shown that accounting for uncertainty is an important property of the dynamics prediction models (Chua et al., 2018), making this a good use case for BNNs. Four BNNs were used, one for the prediction of each state variable. This allowed full tailoring of each BNN prior to its prediction task. Bayesian inference was performed using anchored ensembling (section 4) with 5 NNs per ensemble. BNNs had one hidden layer of width
To generate an initial dataset, a random policy was rolled out for three episodes. This resulted in a small dataset, on which a GP with ERF kernel (eq. 2.39) was optimised. The tuned hyperparameters were then used to set the prior of the equivalent BNN. After each further episode, the BNN was retrained over the full dataset.
Figure 3.10 compares BNNs whose prior was tuned via proxy GP (red) against BNNs whose prior was left at a default value (blue). The tuned prior provides a clear performance boost in earlier trials, though as the dataset grows, the likelihood begins to dominate the posterior, and the prior has a decreasing effect.
[Figure 3.10 plot. x-axis: Timesteps (0 to 3000); y-axis: Reward per episode (0 to 140); series: Tuned BNN prior, Default BNN prior.]
Figure 3.10 Performance on a model-based RL task, cartpole, comparing a BNN with prior tuned using the proposed method (via proxy GP) vs. a fixed default prior. Mean ± two standard errors, repeated for three runs.
[Figure 3.11 panels. Basic BNNs: ReLU, RBF, ERF, Periodic. Combinations of basic BNNs: additive (Leaky ReLU + Periodic, ReLU + ERF), multiplicative (ReLU × Periodic, ReLU × ReLU), complex ((ERF + ERF) × ReLU, (ReLU × ReLU) + Periodic).]
Figure 3.11 BNN architecture determines our prior belief about a function’s properties. In general, BNNs provide little flexibility in this regard, with only the activation function and length scale being modified (‘Basic BNNs’). This chapter explores how to design BNNs to produce more expressive prior functions (‘Combinations of Basic BNNs’). Two prior draws are shown for each BNN architecture.
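To give a feel for how such combinations might be formed, the sketch below draws prior functions from two basic single-hidden-layer BNNs and combines them. It assumes that an additive BNN sums the outputs of independent sub-networks and a multiplicative BNN multiplies them, by analogy with sums and products of GP kernels; this construction, the function names, and the unit prior variances are assumptions made for illustration. Only ReLU and ERF activations are included.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)  # element-wise erf without requiring SciPy
ACTIVATIONS = {"relu": lambda z: np.maximum(z, 0.0), "erf": _erf}


def basic_bnn_draws(x, activation, n_draws, rng, n_hidden=50,
                    var_w1=1.0, var_b1=1.0, var_w2=1.0):
    """Prior draws, shape (n_draws, len(x)), from a one-hidden-layer BNN on 1D inputs."""
    w1 = rng.normal(0.0, np.sqrt(var_w1), size=(n_draws, n_hidden, 1))
    b1 = rng.normal(0.0, np.sqrt(var_b1), size=(n_draws, n_hidden, 1))
    w2 = rng.normal(0.0, np.sqrt(var_w2 / n_hidden), size=(n_draws, n_hidden, 1))
    hidden = ACTIVATIONS[activation](w1 * x[None, None, :] + b1)
    return np.sum(w2 * hidden, axis=1)


def additive_bnn_draws(x, act_a, act_b, n_draws, rng):
    """Assumed additive combination: sum of two independent sub-network outputs."""
    return (basic_bnn_draws(x, act_a, n_draws, rng)
            + basic_bnn_draws(x, act_b, n_draws, rng))


def multiplicative_bnn_draws(x, act_a, act_b, n_draws, rng):
    """Assumed multiplicative combination: product of two independent sub-network outputs."""
    return (basic_bnn_draws(x, act_a, n_draws, rng)
            * basic_bnn_draws(x, act_b, n_draws, rng))


rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 5)
relu_plus_erf = additive_bnn_draws(x, "relu", "erf", n_draws=4000, rng=rng)
relu_times_relu = multiplicative_bnn_draws(x, "relu", "relu", n_draws=4000, rng=rng)
```

Because the sub-networks are independent and zero-mean, the prior variance of the sum at each input is the sum of the component variances, mirroring the behaviour of an additive kernel.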