
For completeness, we briefly mention several other popular approaches to approximating Gaussian processes that we do not use in this thesis.

Sparse spectral approximations approximate the kernel in the Fourier domain as opposed to the real domain. This class of approximations only applies to stationary, i.e. shift-invariant, covariance functions, as it makes use of Bochner's theorem (Stein, 1999).

Most practical stationary covariance functions, such as the squared exponential, can be expressed as a Fourier transform of a corresponding spectral density $s_f^2 \, p_S(s)$:

$$k(\tau = x - x') = \int s_f^2 \, p_S(s) \, e^{2\pi i \tau \cdot s} \, \mathrm{d}s. \tag{2.30}$$

Spectral approximations approximate the continuous density $p_S(s)$ in the integral of Equation (2.30) by a sum of δ-functions, thereby selecting a finite set of frequencies in a Monte Carlo fashion.

Lázaro-Gredilla et al. (2010) introduce Sparse Spectral Gaussian Processes (SSGP), in which they optimise these frequencies using the marginal likelihood. They show that this model is equivalent to Bayesian basis function regression with trigonometric basis functions, and that it can lead to good performance compared to FITC. However, this approach is also prone to overfitting by being overconfident, especially when the number of spectral points is large compared to the number of data points (Lázaro-Gredilla et al., 2010). They propose to diagnose overfitting by checking whether the predictive distributions vary widely across different initial conditions.

Random Fourier feature methods choose the frequencies randomly and have been referred to as Random Kitchen Sinks (RKS) in the kernel machines literature (Rahimi and Recht, 2008, 2009). For a more general discussion of approximating $p_S$ in Equation (2.30) by a mixture of a continuous and a discrete density, refer to Samo and Roberts (2015). Recently, several extensions to RKS have been proposed that also address GP regression and further speed up RKS using structured matrices for fast matrix-vector multiplication. These include Fastfood (Le et al., 2013), à la carte (Yang et al., 2015), and Extended and Unscented Kitchen Sinks (Bonilla et al., 2016).
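The Monte Carlo reading of Equation (2.30) can be illustrated with a minimal NumPy sketch. It assumes a 1D squared exponential kernel, for which $p_S$ is a zero-mean Gaussian with standard deviation $1/(2\pi\ell)$ under the $2\pi$ convention used above. The function names, hyperparameter values, and the number of frequencies `m` are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def se_kernel(x1, x2, sf2=1.0, ell=1.0):
    """Exact 1D squared exponential kernel matrix."""
    tau = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * tau**2 / ell**2)

def fourier_features(x, freqs, sf2=1.0):
    """Trigonometric features whose inner products approximate the kernel."""
    phase = 2.0 * np.pi * x[:, None] * freqs[None, :]   # shape (N, m)
    feats = np.hstack([np.cos(phase), np.sin(phase)])   # shape (N, 2m)
    return np.sqrt(sf2 / freqs.size) * feats

# Illustrative (assumed) hyperparameters and number of random frequencies.
rng = np.random.default_rng(0)
ell, sf2, m = 0.7, 1.5, 5000

# Sample frequencies from the spectral density of the SE kernel:
# a zero-mean Gaussian with standard deviation 1 / (2 * pi * ell).
freqs = rng.normal(scale=1.0 / (2.0 * np.pi * ell), size=m)

x = np.linspace(-2.0, 2.0, 25)
K_exact = se_kernel(x, x, sf2, ell)
Phi = fourier_features(x, freqs, sf2)
K_approx = Phi @ Phi.T                                   # rank at most 2m
max_err = np.max(np.abs(K_exact - K_approx))
print("largest absolute kernel error:", max_err)
```

With fixed random frequencies this corresponds to the random-feature setting; SSGP would instead treat `freqs` as parameters to optimise, which this sketch does not attempt.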

Structured matrix approximations utilise structure in the covariance matrix that allows for fast computation of matrix inverses. While inversion and eigendecomposition of a general $N \times N$ matrix scale cubically with $N$, faster algorithms exist if the matrix has structure. For example, when the covariance function factorises over dimensions or groups of dimensions, the kernel matrix has Kronecker structure over these groups, and the eigendecomposition of each factor can be computed separately (Saatçi, 2011; Wilson et al., 2014). Similarly, when evaluating a stationary covariance function on a regular 1D grid, the resulting covariance matrix has Toeplitz structure, which allows for $O(N^2)$ inversion (Storkey, 1999; Yunong Zhang et al., 2005; Cunningham et al., 2008). Wilson and Nickisch (2015) unify both of these approaches in the KISS-GP approximation, in which they use the sparse approximations introduced above but place the inducing inputs on a regular grid to obtain structured matrices. Wilson et al. (2016b,a) use these approximations to present deep kernel learning, an approach to learning flexible kernels.
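To make the Kronecker case concrete, the following is a small NumPy sketch of solving $(K_1 \otimes K_2 + \sigma_n^2 I)\,\alpha = y$ using only the eigendecompositions of the two factors. The helper names, grid sizes, and noise level are assumptions made for illustration. The dense solve at the end exists only to check the result and would be avoided in practice.

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0):
    """1D squared exponential kernel matrix with unit signal variance."""
    tau = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * tau**2 / ell**2)

def kron_solve(K1, K2, noise_var, y):
    """Solve (kron(K1, K2) + noise_var * I) alpha = y via per-factor eigh."""
    w1, V1 = np.linalg.eigh(K1)
    w2, V2 = np.linalg.eigh(K2)
    n1, n2 = K1.shape[0], K2.shape[0]
    Y = y.reshape(n1, n2)                 # row-major, matching np.kron ordering
    Yt = V1.T @ Y @ V2                    # rotate into the joint eigenbasis
    Yt = Yt / (np.outer(w1, w2) + noise_var)  # eigenvalues of kron are products
    return (V1 @ Yt @ V2.T).reshape(-1)   # rotate back

# Illustrative (assumed) grid: 6 x 5 inputs, so the full matrix is 30 x 30.
rng = np.random.default_rng(0)
g1 = np.linspace(0.0, 1.0, 6)
g2 = np.linspace(0.0, 2.0, 5)
K1 = se_kernel(g1, g1, ell=0.5)
K2 = se_kernel(g2, g2, ell=1.0)
y = rng.standard_normal(g1.size * g2.size)
noise_var = 0.1

alpha_fast = kron_solve(K1, K2, noise_var, y)
# Dense reference solve, used here only to verify the structured solve.
alpha_dense = np.linalg.solve(np.kron(K1, K2) + noise_var * np.eye(y.size), y)
print("max difference:", np.max(np.abs(alpha_fast - alpha_dense)))
```

The structured solve only ever decomposes the 6 × 6 and 5 × 5 factors, which is where the savings come from as the grid grows.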

3 Understanding Probabilistic Sparse Gaussian Process Approximations

In this chapter, we aim to thoroughly investigate and characterise the difference in behaviour of two popular sparse probabilistic Gaussian process approximations: FITC (Section 2.3) and VFE (Section 2.4). We investigate the biases of their objectives when learning hyperparameters, how and where each method allocates its modelling capacity, and their optimisation behaviour. We discuss the theoretical and practical properties of the two approaches. Our aim is to understand the approximations in detail, in order to know under which conditions each method is likely to succeed or fail in practice. We highlight issues that may arise in practical situations, and how to diagnose and mitigate them. Some of the properties of the methods have been previously reported in the literature; our aim here is a more complete and comparative treatment.

This chapter is based on the conference paper ‘Understanding Probabilistic Sparse Gaussian Process Approximations’ (Bauer et al., 2016); it is joint work with Mark van der Wilk and Carl E. Rasmussen. My main contributions were to jointly develop the idea and independently devise and perform the mathematical analysis and experiments.

Throughout this chapter, we use the 1D toy dataset by Snelson and Ghahramani (2006) as a running example to illustrate our findings. We focus on models with Gaussian likelihood and choose the squared exponential automatic relevance determination (ARD) covariance function (Neal, 1994; MacKay, 1992), which is widely used in practice:

$$k_{\mathrm{ARD}}(x, x') = s_f^2 \exp\!\left(-\tfrac{1}{2} (x - x')^{\mathsf T} \Lambda^{-1} (x - x')\right), \tag{3.1}$$

where $\Lambda = \operatorname{diag}(\ell_1^2, \ldots, \ell_d^2)$ is a diagonal $d \times d$ matrix of squared lengthscales. If a dimension $i$ is non-informative, its associated lengthscale $\ell_i$ can be set to a large value, such that this dimension no longer contributes to the covariance. In 1D the ARD covariance function coincides with the standard squared exponential covariance function.
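As a hedged illustration of Equation (3.1), the sketch below evaluates the ARD kernel with NumPy and checks that a very large lengthscale effectively switches a dimension off. The function name `ard_kernel`, the argument names, and the numerical values are assumptions made for this example.

```python
import numpy as np

def ard_kernel(X1, X2, sf2, lengthscales):
    """Squared exponential ARD kernel, Equation (3.1).

    X1: (N, d), X2: (M, d), lengthscales: (d,) holding ell_1, ..., ell_d.
    """
    X1s = X1 / lengthscales               # rescale each input dimension
    X2s = X2 / lengthscales
    sq_dist = (np.sum(X1s**2, axis=1)[:, None]
               + np.sum(X2s**2, axis=1)[None, :]
               - 2.0 * X1s @ X2s.T)
    sq_dist = np.maximum(sq_dist, 0.0)    # guard against round-off below zero
    return sf2 * np.exp(-0.5 * sq_dist)

# Illustrative (assumed) inputs in two dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
sf2 = 2.0

# A huge lengthscale on the second dimension makes it non-informative ...
K_large = ard_kernel(X, X, sf2, np.array([1.0, 1e6]))
# ... so the result matches a kernel that only sees the first dimension.
K_1d = ard_kernel(X[:, :1], X[:, :1], sf2, np.array([1.0]))
print("max difference:", np.max(np.abs(K_large - K_1d)))
```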

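3.1 Objective function for probabilistic sparse GP approximations

Because of the similarity between the VFE and FITC objectives (see Equations (2.10) and (2.19)), we introduce a common notation for their respective negative log marginal likelihood (NLML), which we minimise to train the methods:

$$\mathcal{F} = \frac{N}{2}\log(2\pi) + \underbrace{\frac{1}{2}\log|Q_{ff} + G|}_{\text{complexity penalty}} + \underbrace{\frac{1}{2}\, y^{\mathsf T} (Q_{ff} + G)^{-1} y}_{\text{data fit}} + \underbrace{\frac{1}{2\sigma_n^2}\operatorname{tr}(T)}_{\text{trace term}}, \tag{3.2}$$

where

$$G_{\mathrm{FITC}} = \operatorname{diag}[K_{ff} - Q_{ff}] + \sigma_n^2 I, \qquad G_{\mathrm{VFE}} = \sigma_n^2 I, \tag{3.3}$$

$$T_{\mathrm{FITC}} = 0, \qquad T_{\mathrm{VFE}} = K_{ff} - Q_{ff}. \tag{3.4}$$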

The common objective function has three terms: (i) a data fit term, (ii) a complexity penalty, and (iii) a trace term. Of these, only the data fit and complexity penalty have direct analogues in the log marginal likelihood of a full GP model (refer to Equation (1.14)).
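The common objective of Equations (3.2) to (3.4) can be written down directly. The NumPy sketch below is a naive $O(N^3)$ illustration, not an efficient implementation. It assumes the usual Nyström form $Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}$ from the earlier sections, a 1D squared exponential kernel, a small jitter on $K_{uu}$, and synthetic data. All names and values are hypothetical.

```python
import numpy as np

def se_kernel(x1, x2, sf2=1.0, ell=1.0):
    """1D squared exponential kernel matrix."""
    tau = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * tau**2 / ell**2)

def gaussian_nlml(A, y):
    """Constant + complexity penalty + data fit for covariance A."""
    _, logdet = np.linalg.slogdet(A)
    return (0.5 * y.size * np.log(2.0 * np.pi)
            + 0.5 * logdet
            + 0.5 * y @ np.linalg.solve(A, y))

def sparse_objective(x, y, z, noise_var, method, jitter=1e-8):
    """Objective F of Equation (3.2) for method in {'FITC', 'VFE'}."""
    n = x.size
    Kff = se_kernel(x, x)
    Kuu = se_kernel(z, z) + jitter * np.eye(z.size)   # jitter is an assumption
    Kuf = se_kernel(z, x)
    Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)           # assumed Nystrom form
    if method == "FITC":
        G = np.diag(np.diag(Kff - Qff)) + noise_var * np.eye(n)
        T = np.zeros((n, n))
    elif method == "VFE":
        G = noise_var * np.eye(n)
        T = Kff - Qff
    else:
        raise ValueError("method must be 'FITC' or 'VFE'")
    return gaussian_nlml(Qff + G, y) + 0.5 * np.trace(T) / noise_var

# Synthetic (assumed) data, not the Snelson dataset.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 6.0, size=40))
y = np.sin(x) + 0.3 * rng.standard_normal(40)
z = np.linspace(0.0, 6.0, 6)                          # inducing inputs
noise_var = 0.09

f_vfe = sparse_objective(x, y, z, noise_var, "VFE")
f_fitc = sparse_objective(x, y, z, noise_var, "FITC")
f_full = gaussian_nlml(se_kernel(x, x) + noise_var * np.eye(x.size), y)
print("VFE:", f_vfe, "FITC:", f_fitc, "full GP NLML:", f_full)
```

Because VFE lower-bounds the log marginal likelihood, its objective should never fall below the full GP NLML; FITC carries no such guarantee.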

Term         Expression                                     Present in VFE   Present in FITC
Data fit     $\frac{1}{2}\, y^{\mathsf T} (Q_{ff} + G)^{-1} y$     ✓                ✓
Complexity   $\frac{1}{2} \log|Q_{ff} + G|$                 ✓                ✓
Trace        $\frac{1}{2\sigma_n^2} \operatorname{tr}(T)$   ✓                ✗

Figure 3.1: Sketch of configurations preferred by the individual terms of the objective function, Equation (3.2).

The data fit term penalises the data lying outside the covariance ellipse $Q_{ff} + G$, see Figure 3.1, top row.

The complexity penalty is the integral of the data fit term over all possible observations $y$. It characterises the volume of possible datasets that are compatible with the data fit term. This can be seen as an instance of Occam's razor (see the discussion in Section 1.2), by penalising the methods for being able to predict too many datasets, see Figure 3.1, middle row.
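As a brief worked step, using only the symbols of Equation (3.2) and assuming $Q_{ff} + G$ is positive definite, integrating the exponentiated data fit term over all observations gives the Gaussian normalising constant:

```latex
\int \exp\!\Big(-\tfrac{1}{2}\, y^{\mathsf T} (Q_{ff} + G)^{-1} y\Big)\, \mathrm{d}y
  = (2\pi)^{N/2}\, |Q_{ff} + G|^{1/2},
\qquad\text{so}\qquad
\log \int \exp\!\Big(-\tfrac{1}{2}\, y^{\mathsf T} (Q_{ff} + G)^{-1} y\Big)\, \mathrm{d}y
  = \tfrac{N}{2}\log(2\pi) + \tfrac{1}{2}\log|Q_{ff} + G| .
```

The second term on the right is exactly the complexity penalty: a covariance that assigns non-negligible density to a larger volume of datasets has a larger determinant and is penalised accordingly.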


The trace term in VFE ensures that the negative of the objective function, $-\mathcal{F}$, is a lower bound on the log marginal likelihood of the full GP. Without this term, VFE is identical to the earlier DTC approximation (Seeger et al., 2003), which can grossly over-estimate the marginal likelihood. The trace term penalises the sum of the conditional variances at the training inputs, conditioned on the inducing inputs (Titsias, 2009b), see Figure 3.1, bottom row. Intuitively, it ensures that VFE not only models this specific dataset $y$ well, but also approximates the covariance structure of the full GP, $K_{ff}$.
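The interpretation of the trace term as a sum of conditional variances can be checked numerically. The sketch below assumes a 1D squared exponential kernel, the Nyström form $Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}$, and nested sets of inducing inputs. All names and values are illustrative.

```python
import numpy as np

def se_kernel(x1, x2, sf2=1.0, ell=1.0):
    """1D squared exponential kernel matrix."""
    tau = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * tau**2 / ell**2)

def trace_term(x, z, noise_var, sf2=1.0, ell=1.0, jitter=1e-8):
    """tr(Kff - Qff) / (2 * noise_var), computed from the diagonal only."""
    Kuu = se_kernel(z, z, sf2, ell) + jitter * np.eye(z.size)
    Kuf = se_kernel(z, x, sf2, ell)
    # diag(Qff) without forming the full N x N matrix
    q_diag = np.sum(Kuf * np.linalg.solve(Kuu, Kuf), axis=0)
    # Conditional variance of f at each training input given the inducing
    # outputs; diag(Kff) equals sf2 for the squared exponential kernel.
    cond_var = sf2 - q_diag
    return 0.5 * np.sum(cond_var) / noise_var

x = np.linspace(0.0, 6.0, 50)          # training inputs (assumed)
z_dense = np.linspace(0.0, 6.0, 9)     # 9 inducing inputs
z_sparse = z_dense[::2]                # nested subset of 5 inducing inputs
noise_var = 0.09

t_sparse = trace_term(x, z_sparse, noise_var)
t_dense = trace_term(x, z_dense, noise_var)
print("trace term with 5 inducing inputs:", t_sparse)
print("trace term with 9 inducing inputs:", t_dense)
```

Adding inducing inputs to an existing set can only reduce the conditional variances, so the trace term shrinks as the inducing inputs explain more of $K_{ff}$.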

3.2 FITC can severely underestimate the noise variance, VFE overestimates it
