
We demonstrate our approach in a series of experiments on variants of the MNIST dataset. While MNIST has been accurately solved by other methods, we intend to show that a model like an RBF GP (Radial Basis Function or squared exponential kernel), for which MNIST is challenging, can be significantly improved by learning adequate invariances and insensitivities.

4.5.1 Augmentation distributions

For binary classification tasks, we use the Pólya-Gamma approximation for the logistic likelihood, while for multi-class classification, we are currently forced to use the Gaussian likelihood.

Here, we consider two classes of transformations for which we automatically learn parameters: (i) global affine transformations (scale, rotation, shear, and translation), and (ii) local deformations as manually employed in the infiniteMNIST dataset (Loosli et al., 2007). Arguably, these are two of the simplest differentiable and parametrisable transformations that can give rise to augmentation sets. We address the advantages and disadvantages of this choice as well as possible alternatives at the end of this chapter in Section 4.6. Note that we must be able to backpropagate through the transformations in order to learn their parameters.

Affine transformations

A 2D affine transformation is determined by six degrees of freedom, which specify scaling, global rotation, shear, and translation. There are different ways to parameterise these degrees of freedom. A general affine transformation is defined through a transformation matrix $A_\phi$ with six affine parameters $\phi$, which maps a point $(x, y)$ in the 2D plane to a new point $(x', y')$ through:

\[
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}
=
\begin{pmatrix} \phi_1 & \phi_2 & \phi_3 \\ \phi_4 & \phi_5 & \phi_6 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}.
\tag{4.44}
\]

Note that here (and only here) $x$ and $y$ do not denote input and output but coordinates of a 2D coordinate system. We can think of $A_\phi$ as the product of several simpler transformation matrices, where each matrix encodes only a rotation, a scaling, a shear, or a translation. For example, a more interpretable parameterisation would be a rotation by an angle $\alpha$ and a scaling with factors $s_x$ and $s_y$:

\[
A_\phi =
\begin{pmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\tag{4.45}
\]
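The composition of a scaling and a rotation into a single affine matrix can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard counter-clockwise rotation convention (with $-\sin\alpha$ in the upper-right entry); the helper names `scaling`, `rotation`, `matmul3` and `apply_affine` are hypothetical and chosen only for illustration.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scaling(sx, sy):
    """Homogeneous 3x3 scaling matrix (hypothetical helper)."""
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]


def rotation(alpha):
    """Homogeneous 3x3 rotation matrix; counter-clockwise convention assumed."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_affine(A, point):
    """Map a 2D point through A using homogeneous coordinates (x, y, 1)."""
    x, y = point
    homogeneous = (x, y, 1.0)
    out = [sum(A[i][k] * homogeneous[k] for k in range(3)) for i in range(3)]
    return out[0], out[1]


# Rotation by 90 degrees followed by an anisotropic scaling: A = S R.
A = matmul3(scaling(2.0, 0.5), rotation(math.pi / 2))
x_new, y_new = apply_affine(A, (1.0, 0.0))
```

Here the point (1, 0) is first rotated onto the y-axis and then scaled by $s_y = 0.5$. The top two rows of `A` play the role of the six parameters $\phi_1, \dots, \phi_6$ of the general parameterisation.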

To define the augmentation distribution $p(x_a \mid x)$, we first define a distribution on these affine parameters, $p_\psi(\phi)$, which itself is parametrised through parameters $\psi$. Then, to sample from $p(x_a \mid x)$, we first draw $\phi \sim p_\psi(\phi)$ and then apply the corresponding transformation to the input $x$ to obtain the augmented sample $x_a = \mathrm{Aff}_\phi(x)$. Because affine transformations are differentiable with respect to the affine parameters $\phi$, we can backpropagate any loss into the parameters $\psi$ of $p_\psi(\phi)$ using the reparameterisation trick, as long as $p_\psi(\phi)$ is reparameterisable with respect to $\psi$.
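To make the reparameterisation step concrete, here is a minimal sketch for one reparameterisable choice of $p_\psi(\phi)$: independent uniform factors with learnable lower and upper bounds. Each sample is written as a deterministic function of the bounds and of parameter-free noise $u \sim \mathrm{Unif}(0, 1)$, so the derivatives with respect to the bounds are well defined. The function name, the bound values, and the hand-written derivatives are assumptions for illustration; in practice an automatic differentiation framework would compute them.

```python
import random


def sample_uniform_reparam(lower, upper, rng):
    """Draw phi ~ Unif(lower, upper) as a deterministic function of u ~ Unif(0, 1).

    Returns the sample together with its partial derivatives with respect
    to the two bounds, which is what backpropagation would propagate.
    """
    u = rng.random()                      # noise that does not depend on the bounds
    phi = lower + (upper - lower) * u     # reparameterised sample
    dphi_dlower = 1.0 - u                 # d phi / d lower
    dphi_dupper = u                       # d phi / d upper
    return phi, dphi_dlower, dphi_dupper


rng = random.Random(0)
# Hypothetical bounds for six affine parameters (illustrative values only).
bounds = [(-0.1, 0.1)] * 6
samples = [sample_uniform_reparam(lo, hi, rng) for lo, hi in bounds]
phi = [s[0] for s in samples]
```

A loss evaluated on the transformed input can then be differentiated with respect to each bound by the chain rule, multiplying the loss gradient with respect to $\phi_i$ by the two partial derivatives returned above.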

Here, we again make the simplest possible assumption that $p_\psi(\phi)$ factorises and each factor is given by a uniform distribution with learnable bounds, that is, minimal and maximal values,

\[
p_\psi(\phi) = \prod_{i=1}^{6} p_{\psi_i}(\phi_i), \qquad p_{\psi_i}(\phi_i) = \mathrm{Unif}(\phi_{i,\min}, \phi_{i,\max}).
\tag{4.46}
\]

Thus, the learnable parameters $\psi$ are given by $\psi = \{\phi_{i,\min}, \phi_{i,\max} \mid i = 1, \dots, 6\}$. In other words, for each affine parameter we learn the boundaries of the range within which two samples are said to be invariant or insensitive. For example, for rotations the boundary values might be $\phi_{\alpha,\min} = -15^\circ$ and $\phi_{\alpha,\max} = 25^\circ$, such that the augmentation set of an image would be given by all possible rotations between those angles (we have neglected the other five affine parameters in this example). In practice, we use the implementation of affine transformations used in spatial transformer networks (Jaderberg et al., 2015) from https://github.com/kevinzakka/spatial-transformer-network, which is fully differentiable with respect to the affine parameters.

Local deformations

As a second class of transformations we consider local deformations as introduced with the infiniteMNIST dataset (Loosli et al., 2007; Simard et al., 2000). The local deformations can be decomposed into random local deformations, local scalings, local rotations, and a thickening/thinning transformation. A transformed image $x_a$ is given by

\[
x_a = \sigma\!\left( x + f_x\, t_x(x) + f_y\, t_y(x) + \beta \sqrt{t_x(x)^2 + t_y(x)^2} \right)
\tag{4.47}
\]
\[
f_x = \alpha_x f_{\mathrm{rand},x} + \alpha_{\mathrm{rot}} f_{\mathrm{rot},x} + \alpha_{\mathrm{scale}} f_{\mathrm{scale},x}
\tag{4.48}
\]
\[
f_y = \alpha_y f_{\mathrm{rand},y} + \alpha_{\mathrm{rot}} f_{\mathrm{rot},y} + \alpha_{\mathrm{scale}} f_{\mathrm{scale},y}
\tag{4.49}
\]

Here, $\sigma$ is a clipping or squashing function to ensure that $x_a$ is in the same range as $x$; $f_{\mathrm{rand/rot/scale}}$ are vector fields for local random deformations, local rotations and local scalings with corresponding scale factors $\alpha_{x/y}$, $\alpha_{\mathrm{rot}}$ and $\alpha_{\mathrm{scale}}$; $t_{x/y}(x)$ denotes the image gradients of an image $x$ in direction $x/y$; and $\beta$ scales a thinning/thickening transformation.

The image gradients $t_{x/y}(x)$ can be computed by convolving the image $x$ with a derivative-of-Gaussian filter (Simard et al., 2000). However, to increase performance, we approximate them by convolutions with Sobel filters (Kanopoulos et al., 1988) in the $x/y$ direction, respectively. The vector fields for local rotations, $f_{\mathrm{rot},x/y}$, and scalings, $f_{\mathrm{scale},x/y}$, are fixed and can be derived as first-order approximations of the corresponding affine transformations. The vector fields for random local deformations, $f_{\mathrm{rand},x/y}$, are drawn i.i.d. from a diagonal standard Normal distribution and then correlated spatially by a Gaussian smoothing kernel of width $\sigma_d$.

Similar to the case of affine transformations, we place factorised uniform distributions on the scale parameters $\alpha$ and $\beta$ and learn their ranges together with the scale parameter $\sigma_d$ of the random local deformations.
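The local deformation of Equation (4.47) can be sketched in a few lines. The example below is a dependency-free illustration, not the implementation used in the experiments: it assumes images stored as nested lists with values in [0, 1], uses hard clipping for $\sigma$, replicates border pixels when filtering, and takes the combined vector fields $f_x$ and $f_y$ as given inputs (the random, smoothed fields are not generated here). All function names are hypothetical.

```python
import math

# 3x3 Sobel filters for the horizontal (x) and vertical (y) image gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, replicating border pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)   # clamp row index
                    jj = min(max(j + dj, 0), w - 1)   # clamp column index
                    acc += kernel[di + 1][dj + 1] * img[ii][jj]
            out[i][j] = acc
    return out


def deform(img, fx, fy, beta):
    """Apply x_a = clip(x + fx*tx + fy*ty + beta*sqrt(tx^2 + ty^2)) pixel-wise."""
    tx = filter3x3(img, SOBEL_X)
    ty = filter3x3(img, SOBEL_Y)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            value = (img[i][j] + fx[i][j] * tx[i][j] + fy[i][j] * ty[i][j]
                     + beta * math.sqrt(tx[i][j] ** 2 + ty[i][j] ** 2))
            out[i][j] = min(max(value, 0.0), 1.0)     # clipping plays the role of sigma
    return out


zeros = [[0.0] * 5 for _ in range(5)]
ones = [[1.0] * 5 for _ in range(5)]
edge = [[0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(5)]   # vertical edge
flat = [[0.5] * 5 for _ in range(5)]                   # constant image

thickened = deform(edge, zeros, zeros, 0.1)   # only the thickening term is active
unchanged = deform(flat, ones, ones, 1.0)     # zero gradients: nothing to deform
```

With zero vector fields and a positive $\beta$, the pixel just left of the edge brightens, which is the thickening effect; a constant image has zero gradients and is therefore left unchanged whatever the fields are.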

4.5.2 Recovering invariances in binary MNIST classification

As a first test, we demonstrate that our approach can recover the parameter of a known transformation in an odds-vs-even MNIST binary classification problem using the Pólya-Gamma likelihood from Section 4.4.6. We consider the regular MNIST dataset and rotate each example by a randomly chosen angle $\phi \in [-\alpha_{\mathrm{true}}, \alpha_{\mathrm{true}}]$ for $\alpha_{\mathrm{true}} \in \{90^\circ, 180^\circ\}$.

To create samples from the orbit $A(x)$, we limit ourselves to rotations, that is, we transform the training examples by affine transformations of the form

\[
A_\phi = \begin{pmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \end{pmatrix}, \qquad \phi \sim \mathrm{Uniform}([-\alpha, \alpha]).
\tag{4.50}
\]
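As an illustration of sampling from such a rotational orbit, the sketch below draws angles uniformly from $[-\alpha, \alpha]$ and rotates a small square image about its centre. It is a simplified stand-in under stated assumptions: it uses nearest-neighbour lookup to stay dependency-free, which, unlike the bilinear sampler of spatial transformer networks, is not differentiable in $\phi$, and the direction of rotation in pixel coordinates is a convention chosen here. The names `rotate_image` and `sample_orbit` are hypothetical.

```python
import math
import random


def rotate_image(img, phi):
    """Rotate a square image about its centre by phi (radians), nearest neighbour."""
    n = len(img)
    centre = (n - 1) / 2.0
    c, s = math.cos(phi), math.sin(phi)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):                # i indexes rows (y), j indexes columns (x)
        for j in range(n):
            x, y = j - centre, i - centre
            # Inverse mapping: find the source location by rotating by -phi.
            xs = c * x + s * y
            ys = -s * x + c * y
            js = int(round(xs + centre))
            is_ = int(round(ys + centre))
            if 0 <= is_ < n and 0 <= js < n:
                out[i][j] = img[is_][js]
    return out


def sample_orbit(img, alpha, num_samples, rng):
    """Draw rotated copies of img with angles phi ~ Uniform([-alpha, alpha])."""
    return [rotate_image(img, rng.uniform(-alpha, alpha))
            for _ in range(num_samples)]


image = [[0.0, 0.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
quarter_turn = rotate_image(image, math.pi / 2)
orbit = sample_orbit(image, math.radians(90), 4, random.Random(0))
```

In the quarter-turn example the single bright pixel moves from the middle of the right column to the middle of the bottom row, which corresponds to mapping the point (1, 0) to (0, 1) in centred coordinates.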


We choose $p(x_a \mid x)$ to be a uniform distribution over rotated images, leading to a rotational invariance, and use the variational lower bound to train the bounds of rotation $\alpha$. To perform well on this task, we expect the recovered bounds $\alpha$ to be at least as large as the true value $\alpha_{\mathrm{true}}$ to account for the rotational invariance. Too large values, i.e. $\alpha \approx 180^\circ$, should be avoided due to ambiguities between, for example, 6s and 9s.

[Figure 4.2 panels. Top: recovered angle (0° to 180°) against wall time (s) for $\alpha_{\mathrm{true}} = 90^\circ$ and $\alpha_{\mathrm{true}} = 180^\circ$, with initialisations $\alpha_{\mathrm{init}}$ of 5°, 45°, 90°, 135° and 175°. Bottom: test error and log marginal likelihood against wall time (s) for $\alpha_{\mathrm{true}} = 90^\circ$, with curves for $\alpha_{\mathrm{init}} = 45^\circ$, $\alpha_{\mathrm{init}} = 175^\circ$, fixed $\alpha_{\mathrm{true}}$, and RBF.]

Figure 4.2: Binary classification on the partially rotated (by up to ±90° or ±180°) MNIST dataset. Top: Recovered angles during optimisation for different initialisations $\alpha_{\mathrm{init}}$ for the two cases $\alpha_{\mathrm{true}} = 90^\circ$ and $\alpha_{\mathrm{true}} = 180^\circ$. Bottom: Evolution of the error on a test set and the log marginal likelihood bound throughout the optimisation.

We find that the trained GP models with invariances are able to approximately recover the true angles (Figure 4.2, top). When $\alpha_{\mathrm{true}} = 180^\circ$, the angle is under-estimated, whereas $\alpha_{\mathrm{true}} = 90^\circ$ is recovered well. Regardless, all models outperform the simple RBF GP by a large margin, both in terms of error and in terms of the marginal likelihood bound (Figure 4.2, bottom). These results show that our approach can be combined effectively with certain non-Gaussian likelihood models using the Pólya-Gamma trick. Note that the marginal likelihood also identifies the "correct" angle as it assigns the best (highest) values to the model with angle fixed to $\alpha_{\mathrm{true}}$.

4.5.3 Classification of MNIST digits

Next, we consider full MNIST classification using a Gaussian likelihood, and compare the non-invariant RBF kernel to various invariant kernels. Figure 4.3 shows that the GPs with invariant kernels clearly outperform the baseline RBF kernel. Both types of learned invariances, affine transformations and local deformations, lead to similar performance for a wide range of initial conditions. When constrained to rotational invariances only, the results are only slightly better than the baseline. This indicates that deformations (stretching, shear) are more important than rotations for MNIST. Crucially, we do not require a validation set, but can use the log marginal likelihood of the training data to monitor performance.

Table 4.1 lists the final test errors for the different models. In Figure 4.4 we show samples from $p(x_a \mid x)$ for the trained model that uses all affine transformations in its covariance function.

[Figure 4.3 panels: test error (left) and log marginal likelihood (right) against wall time (s), with curves for (non-invariant) RBF, rotation only, all affine, and local deformations.]

Figure 4.3: MNIST classification results using a Gaussian likelihood. Left: Test error. Right: Log marginal likelihood bound. All invariant models outperform the RBF baseline.

Invariances of kernel          Error in %
(non-invariant) RBF            2.15 ± 0.03
only rotations                 2.08 ± 0.06
all affine transformations     1.35 ± 0.07
local deformations             1.47 ± 0.05

Table 4.1: Final test error for MNIST classification results using a Gaussian likelihood.

4.5.4 Classification of rotated MNIST digits

We also consider the fully rotated MNIST dataset¹. In this case, we only run GP models that are invariant to affine transformations. In Figure 4.5 we compare general affine transformations (learning bounds for all six affine parameters), rotations with learned angle bounds (no scalings, shears, or translations), and fixed rotational invariance. We found that all invariant models outperform the baseline (RBF) by a large margin. However, the models with fixed angles (no free parameters in the transformation) outperform their learned counterparts. We attribute this to the optimisation dynamics, as the problem of optimising the variational, kernel, and transformation parameters jointly is more difficult than optimising only variational and kernel parameters for fixed transformations. We emphasise that the marginal likelihood bound does correctly identify the best-performing invariance in this case as well (Figure 4.5, right).

Figure 4.4: Samples from $p(x_a \mid x)$ describing the learned invariance for eight example MNIST digits $x$ (squares). The method becomes insensitive to the rotations and shears that are present in the training set.

[Figure 4.5 panels: test error (left) and log marginal likelihood (right) against wall time (s), with curves for (non-invariant) RBF, rotation (fixed at ±45°), rotation (fixed at ±179°), rotation (learned), and all affine transformations (learned).]

Figure 4.5: Rotated MNIST classification results. Left: Test error. Right: Log marginal likelihood bound. The optimiser has difficulty finding a good solution with the learned invariances, although the marginal likelihood bound correctly identifies the best model.
