Throughout this thesis, we have mostly concentrated on unsupervised neural networks. There are tasks to which these unsupervised neural networks can trivially be applied.

For instance, Burger et al. (2012), Xie et al. (2012) and Cho (2013) (Publication IX) recently showed that (denoising) autoencoders and restricted/deep Boltzmann machines can denoise large, corrupted images. The performance of these neural networks was shown to be comparable to, or sometimes better than, conventional methods of image denoising such as BM3D (Dabov et al., 2007) or K-SVD (Portilla et al., 2003).

A deep autoencoder initialized by a deep belief network was shown to excel at extracting low-dimensional binary codes for documents (Salakhutdinov and Hinton, 2009b). Also, Salakhutdinov et al. (2007) showed that an RBM can be successfully used for collaborative filtering.

These unsupervised models cannot, however, be used directly to perform a supervised task. This is clear, as none of these models has access to the known outputs of the training samples.

As in the layer-wise pretraining discussed earlier in this chapter, however, the unsupervised neural networks can be used to improve the discriminative performance of supervised models. In the layer-wise pretraining, recursively stacking shallow unsupervised neural networks was shown to extract better representations that are more suitable for classification, which may be improved further by finetuning the whole stack.

In the remainder of this section, we introduce approaches other than the previously discussed layer-wise pretraining that aim to improve the discriminative performance by using unsupervised neural networks. These approaches may use a separate unsupervised neural network, or combine a supervised and an unsupervised neural network.

5.2.1 Discriminative RBM and DBN

The most straightforward way to perform discriminative tasks with an unsupervised neural network is to model the joint probability distribution $p(x, y)$ of both the input and the output. Once the network is trained, one can utilize the joint distribution to perform a prediction on a new sample $x$.

Regardless of whether the task is classification or regression, the best output $\hat{y}$ given a new sample $x^*$ can be found by, for instance, the maximum a posteriori (MAP) estimate:

\[
\hat{y} = \arg\max_{y} \, p^*(x^*, y), \tag{5.1}
\]

where we have used the unnormalized probability $p^*$ to emphasize that there is no need to compute the potentially intractable normalization constant.

Alternatively, one may be interested in computing the expected value of the output

\[
\hat{y} = \frac{1}{Z} \sum_{y \in \mathcal{Y}} y \, p^*(x^*, y), \tag{5.2}
\]

where $Z$ and $\mathcal{Y}$ are the normalization constant and the set of all possible values for $y$, respectively. However, in the latter case, the normalization constant, which is often computationally intractable, has to be computed or estimated, which makes this approach less practical. Hence, in this section, we only focus on the MAP solution for the output $y$.

If $y$ can only have a finite number of possible outcomes (classification), we can simply evaluate $p(x^*, y)$ for all possible values of $y$ and choose the one with the largest value. Otherwise, it is possible to optimize $p(x^*, y)$ with respect to $y$ to compute the best possible $y$, although this may only find a local mode if $p(x^*, y)$ has more than one mode.
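The difference between the two predictors can be illustrated with a minimal sketch. It assumes that the set of outputs is finite, so that the normalization in Eq. (5.2) reduces to a sum of $p^*(x^*, y)$ over that set; the function names and the numbers are hypothetical and only serve as an illustration.

```python
import numpy as np

def map_output(log_p_star):
    """Index of the largest unnormalized log-probability, as in Eq. (5.1)."""
    # No normalization constant is needed: argmax is invariant to scaling.
    return int(np.argmax(log_p_star))

def expected_output(values, log_p_star):
    """Expected output as in Eq. (5.2), assuming a finite set of outputs."""
    # Subtract the maximum before exponentiating for numerical stability.
    w = np.exp(log_p_star - np.max(log_p_star))
    # Here the normalization is the sum over all candidate outputs.
    return float(np.sum(values * w) / np.sum(w))

# Hypothetical values of log p*(x*, y) for the outputs y in {0, 1, 2}.
values = np.array([0.0, 1.0, 2.0])
log_p_star = np.log(np.array([1.0, 6.0, 3.0]))

y_map = map_output(log_p_star)
y_mean = expected_output(values, log_p_star)
```

In this toy case the MAP solution only requires comparing the three unnormalized values, whereas the expected value requires summing over all of them, which is what becomes impractical when the normalization is intractable.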

Let us consider using a restricted Boltzmann machine (RBM, Section 4.4.2) for classification.

First, given a training set $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$, we turn each output $y^{(n)} \in \{1, 2, \ldots, q\}$ into a $q$-dimensional vector $\mathbf{y}^{(n)}$ whose $y^{(n)}$-th component is one and all other components are zero. With the transformed output vectors, we create a new training set $\tilde{D} = \{[x^{(n)}, \mathbf{y}^{(n)}]\}_{n=1}^{N}$ by concatenating $x^{(n)}$ and $\mathbf{y}^{(n)}$ for each $n$.
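The construction of the transformed training set can be sketched as follows. The helper names `one_hot` and `augment` and the toy data are assumptions made for illustration; labels are taken to lie in {1, ..., q} as in the text.

```python
import numpy as np

def one_hot(labels, q):
    """Map labels in {1, ..., q} to q-dimensional indicator vectors."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.shape[0], q))
    # Labels are 1-based in the text, array indices are 0-based.
    Y[np.arange(labels.shape[0]), labels - 1] = 1.0
    return Y

def augment(X, labels, q):
    """Concatenate each input with its one-hot label vector."""
    return np.hstack([X, one_hot(labels, q)])

# Hypothetical toy data: three binary inputs with labels in {1, 2, 3}.
X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
labels = [2, 3, 1]
D_tilde = augment(X, labels, q=3)
```

Each row of `D_tilde` is then treated as a single visible vector when training the RBM.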

Then we train an RBM with the transformed set $\tilde{D}$ (see Fig. 5.2(a)) using, for instance, either the stochastic approximation procedure (see Section 4.3.3) or contrastive divergence minimization (see Section 4.4.2). Recalling that the unnormalized probability of $(x, y)$ after marginalizing out the hidden units can be computed efficiently and exactly (see Section 4.4.2), we can predict the label of a new sample by Eq. (5.1).
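As a hedged illustration of this prediction step, the sketch below assumes a binary RBM with energy $-b^\top v - c^\top h - v^\top W h$ over the concatenated visible vector $v = [x, \mathbf{y}]$, for which the hidden units can be summed out in closed form. The parameters are random placeholders rather than trained values, and all function names are hypothetical.

```python
import numpy as np
from itertools import product

def log_p_star(v, W, b, c):
    """log p*(v) of a binary RBM with the hidden units summed out."""
    # Each hidden unit contributes log(1 + exp(c_j + (v W)_j)).
    return v @ b + np.sum(np.logaddexp(0.0, c + v @ W))

def log_p_star_brute(v, W, b, c):
    """The same quantity by explicit enumeration of all hidden states."""
    terms = [v @ b + h @ c + v @ W @ h
             for h in (np.array(bits, dtype=float)
                       for bits in product([0, 1], repeat=len(c)))]
    return np.logaddexp.reduce(terms)

def predict(x, W, b, c, q):
    """Evaluate log p*(x, y) for every label and return the best one."""
    scores = np.empty(q)
    for k in range(q):
        y = np.zeros(q)
        y[k] = 1.0
        scores[k] = log_p_star(np.concatenate([x, y]), W, b, c)
    return int(np.argmax(scores)) + 1, scores  # labels are 1-based

# Placeholder parameters (assumption: not trained, for illustration only).
rng = np.random.default_rng(0)
n_in, q, n_hid = 4, 3, 5
W = rng.normal(size=(n_in + q, n_hid))
b = rng.normal(size=n_in + q)
c = rng.normal(size=n_hid)

x_new = np.array([1.0, 0.0, 1.0, 1.0])
label, scores = predict(x_new, W, b, c, q)
```

The brute-force function is only included to check the closed form on a small model; its cost grows exponentially with the number of hidden units, whereas `log_p_star` is linear in it.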

Figure 5.2. Illustrations of (a) a discriminative restricted Boltzmann machine and (b) a discriminative deep belief network. Note that the 1-of-K coding is used for the output label $y$, which may take $K$ discrete values.

Larochelle and Bengio (2008) proposed a discriminative objective function for training this kind of RBM. The proposed objective function instead maximizes the conditional log-likelihood

\[
L_d(\theta) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}, \theta).
\]
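For a finite number of labels, the conditional probability follows from normalizing the unnormalized joint probabilities over the labels only. The sketch below evaluates the conditional log-likelihood under the same assumptions as a standard binary RBM over the concatenated input and one-hot label; the parameters and data are random placeholders, and the function names are hypothetical.

```python
import numpy as np

def label_scores(x, W, b, c, q):
    """log p*(x, y) for each of the q one-hot labels (hidden units summed out)."""
    # One row per candidate label: the input followed by a one-hot vector.
    V = np.hstack([np.tile(x, (q, 1)), np.eye(q)])
    return V @ b + np.logaddexp(0.0, c + V @ W).sum(axis=1)

def conditional_log_likelihood(X, labels, W, b, c, q):
    """Sum over samples of log p(y | x), with labels in {1, ..., q}."""
    total = 0.0
    for x, y in zip(X, labels):
        s = label_scores(x, W, b, c, q)
        # log p(y | x) = log p*(x, y) - log sum_y' p*(x, y')
        total += s[y - 1] - np.logaddexp.reduce(s)
    return total

# Placeholder parameters and data (assumption: for illustration only).
rng = np.random.default_rng(0)
n_in, q, n_hid = 2, 3, 4
W = rng.normal(size=(n_in + q, n_hid))
b = rng.normal(size=n_in + q)
c = rng.normal(size=n_hid)
X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
labels = [1, 3, 2, 1]

L_d = conditional_log_likelihood(X, labels, W, b, c, q)
```

Note that the sum in the normalization runs over the q labels only, so this objective can be evaluated exactly without the normalization constant of the joint distribution.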

Furthermore, they showed that a better classification performance can be achieved by maximizing the weighted sum of the log-likelihood and the conditional log-likelihood together.

A similar idea was also presented earlier for a deep belief network (DBN) by Hinton et al. (2006). Instead of augmenting the visible layer with a transformed label $y$, they augmented the penultimate layer. The augmented units corresponding to $y$ are only connected to the top layer with undirected edges. See Fig. 5.2(b) for an illustration.

This model can be trained by the procedure described in Section 4.5.2, albeit with a slight modification. During the first stage of layer-wise pretraining, we augment the posterior distribution of the penultimate layer with the labels of the training samples. During the second stage, where the up-down algorithm is used, the Gibbs sampling steps between the top two layers start from samples from the (approximate) posterior distribution, augmented with the labels of the samples in a minibatch.

Once training is over, we can classify a new sample $x$ easily by first obtaining the approximate (fully factorized) posterior means $\mu$ of the penultimate layer and then computing the unnormalized probabilities of the combination of $\mu$ and each possible label state $y$. The label that gives the largest unnormalized probability is chosen as the prediction $\hat{y}$.

Surprisingly, both of these approaches, which perform both generative $p(x, y)$ and discriminative $p(y \mid x)$ modeling, achieve classification performance comparable to, or often better than, that of models trained purely to perform discriminative modeling (Hinton et al., 2006; Larochelle and Bengio, 2008).

5.2.2 Deep Boltzmann Machine to Initialize an MLP

It is straightforward to initialize a multilayer perceptron (MLP) with a deep belief network (DBN) as well as a restricted Boltzmann machine (RBM). Once the parameters of those unsupervised neural networks are estimated, we can directly use them as initial parameters of an MLP. This corresponds to the layer-wise pretraining scheme discussed in Section 5.1. However, when it comes to a deep Boltzmann machine (DBM), one must take into account the nature of each layer receiving both bottom-up and top-down signals.

A naive way of utilizing a DBM for a discriminative task in this case is to forget about transforming it into an MLP, and simply use the approximate posterior means of hidden units as features (see, e.g., Montavon et al. (2012) and Publication VII). In other words, for each sample $x$ we compute the variational parameters $\mu$ by maximizing the variational lower bound in Eq. (5.9) with respect to them. Then the obtained variational parameters are used instead of the original sample. However, better discriminative performance is usually achieved when the model is specifically finetuned for the discriminative task.

Salakhutdinov and Hinton (2009a) proposed that the structure of an MLP be modified to simulate the top-down signal in a DBM. Given a DBM with $L$ hidden layers $\{h^{[l]}\}_{l=1}^{L}$ and a single visible layer $x$, let us construct an MLP with $L$ intermediate hidden layers $\{\tilde{h}^{[l]}\}_{l=1}^{L}$, a single output layer $\tilde{y}$ and a single visible layer $\tilde{x}$. The main goal of this construction is to make sure that a single forward pass results in the states of the units in the penultimate layer $\tilde{h}^{[L]}$ of the MLP being identical to the mean-field approximation $\mu^{[L]}$ of the corresponding DBM layer.

The fixed point of the variational parameters of the first hidden layer that locally maximizes the variational lower bound in Eq. (5.9) is

\[
\mu^{[1]} = \phi\left(W x + U^{[1]} \mu^{[2]}\right),
\]

where $\phi$ is a component-wise logistic sigmoid function. Then, if we let the visible layer of the MLP be $\tilde{x} = [x, \mu^{[2]}]$ and connect $x$ to the first hidden layer of the MLP by $W$ and $\mu^{[2]}$ by $U^{[1]}$, a single forward pass will make the activation $\tilde{h}^{[1]}$ of the first hidden layer of the MLP exactly $\mu^{[1]}$.

This applies similarly to all intermediate hidden layers of the DBM. For the $l$-th layer of the DBM, where $l < L-1$, we augment the $l$-th hidden layer of the MLP by appending $\mu^{[l+2]}$ to $\tilde{h}^{[l]}$. By connecting them to $\tilde{h}^{[l+1]}$ with the corresponding weights from the DBM, we can ensure that the activation of the $(l+1)$-th hidden layer of the MLP will initially be identical to $\mu^{[l+1]}$.

Since the last hidden layer of the DBM only receives the bottom-up signal, there is no need to construct the last hidden layer in this way. It is enough to connect $\tilde{h}^{[L]}$ with $\tilde{h}^{[L-1]}$ by $U^{[L-1]}$.

Figure 5.3. A deep Boltzmann machine with two hidden layers, on the left, is transformed to initialize a multilayer perceptron on the right. $\mu^{[2]}$ is a vector of the variational parameters of the second hidden layer of the DBM.

This way of constructing an MLP (see Fig. 5.3) guarantees that the activation of the last hidden layer of the MLP after a forward pass with the initialized weights will coincide with their variational parameters. From there on, we can finetune the model to optimize for classification performance.
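The construction can be checked numerically for a DBM with two hidden layers. The following is a minimal sketch under several assumptions: biases are omitted, the weight shapes are chosen so that $W x + U \mu^{[2]}$ is well defined, the weights are random placeholders rather than trained values, and the mean-field fixed point is found by simple alternating updates. All function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(x, W, U, n_iter=200):
    """Alternating mean-field updates for a two-hidden-layer DBM (no biases)."""
    mu2 = np.full(U.shape[1], 0.5)
    for _ in range(n_iter):
        mu1 = sigmoid(W @ x + U @ mu2)  # bottom-up and top-down signals
        mu2 = sigmoid(U.T @ mu1)        # last layer: bottom-up signal only
    return mu1, mu2

def mlp_forward(x, mu2, W, U):
    """Forward pass of the MLP initialized from the DBM weights."""
    x_tilde = np.concatenate([x, mu2])  # visible layer augmented with mu2
    W_in = np.hstack([W, U])            # x connected by W, mu2 connected by U
    h1 = sigmoid(W_in @ x_tilde)
    h2 = sigmoid(U.T @ h1)
    return h1, h2

# Placeholder weights (assumption: small random values, not a trained DBM).
rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(4, 6))   # first hidden layer (4) by visible (6)
U = 0.5 * rng.normal(size=(4, 3))   # first hidden layer (4) by second (3)
x = rng.integers(0, 2, size=6).astype(float)

mu1, mu2 = mean_field(x, W, U)
h1, h2 = mlp_forward(x, mu2, W, U)
```

Under these assumptions, `h1` and `h2` agree with `mu1` and `mu2` up to numerical precision, so finetuning starts from the mean-field solution. The output layer, which has no counterpart in the DBM, is left out of this sketch.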

For instance, Salakhutdinov and Hinton (2009a) and Hinton et al. (2012) showed that this way of initializing an MLP with a DBM improves the performance on handwritten digit recognition as well as 3-D object recognition tasks.
