
Two different perspectives from which deep neural networks can be understood have been discussed in this thesis. One is based on the idea of incrementally capturing the data manifold through stacking multiple layers of nonlinear hidden units. The other considers a feedforward computation of a deep neural network as performing an approximate inference of the posterior distribution over hidden units in higher layers.

The latter perspective essentially divides a deep neural network into two distinct parts. The first part is a latent variable model that models the distribution of training samples without labels. The second part then decides, based on the posterior distribution inferred for a given sample, which class the sample belongs to. For example, a forward pass up to the penultimate layer of a multilayer perceptron (MLP, see Section 3.1) would correspond to performing approximate inference, and the computation from the penultimate layer to the output layer would correspond to decision-making.

In this framework of a latent variable model, the most exact way of performing classification or decision-making is to marginalize out the hidden variables $h$ to obtain the conditional distribution of the missing variables $x_m$ given the states of the observed variables $x_o$:

\[
p(x_m \mid x_o) = \sum_{h} p(x_m \mid h) \, p(h \mid x_o).
\]
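To make the marginalization concrete, the following is a minimal sketch for a toy model with two binary hidden units and one binary missing variable. The probability tables and the function name `exact_predictive` are hypothetical and chosen only for illustration; they do not come from any model in this thesis.

```python
def exact_predictive(p_xm_given_h, p_h_given_xo):
    """Return p(xm | xo) = sum_h p(xm | h) * p(h | xo) as a dict keyed by xm."""
    predictive = {}
    for h, p_h in p_h_given_xo.items():
        for xm, p_xm in p_xm_given_h[h].items():
            predictive[xm] = predictive.get(xm, 0.0) + p_xm * p_h
    return predictive


# Hypothetical posterior p(h | xo) over two binary hidden units,
# for one fixed observation xo.
p_h_given_xo = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical conditional p(xm | h) over one binary missing variable.
p_xm_given_h = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.1, 1: 0.9},
}

predictive = exact_predictive(p_xm_given_h, p_h_given_xo)

# The sum visits every joint hidden state: 2**n terms for n binary units.
n_terms = len(p_h_given_xo)
print(predictive, n_terms)
```

The loop visits every joint configuration of the hidden units, so the number of terms doubles with each additional binary hidden unit. This is why the exact sum quickly becomes impractical for larger models.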

A deep neural network, however, replaces this with a parametric nonlinear function that computes

\[
\sum_{h} p(x_m \mid h) \, Q(h \mid x_o) = \mathbb{E}_{Q}\left[ p(x_m \mid h) \right]
\]

in a single sweep, since the exact marginalization is often computationally intractable. One consequence of assuming a simple, unimodal distribution $p(x_m \mid h)$ is that the approximate predictive distribution $\tilde{p}(x_m \mid x_o)$ loses most of the information in the true predictive distribution $p(x_m \mid x_o)$. This is due to the unimodality of $\tilde{p}(x_m \mid x_o)$, while the true predictive distribution could have many modes.
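The loss of modes can be illustrated with a small numerical sketch. It rests on an assumption that the text does not state explicitly: that a single feedforward sweep evaluates the unimodal conditional at one deterministic hidden state, here the mean $\mathbb{E}_Q[h]$. The Gaussian conditional, its parameters and all function names are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical model: one binary hidden unit h, Gaussian p(xm | h)
# whose mean is linear in h (-2 for h = 0, +2 for h = 1).
MEANS = {0: -2.0, 1: 2.0}
STD = 0.5
Q = {0: 0.5, 1: 0.5}  # hypothetical Q(h | xo)


def mixture_density(xm):
    """E_Q[p(xm | h)]: the full sum over both hidden states."""
    return sum(Q[h] * normal_pdf(xm, MEANS[h], STD) for h in Q)


def plug_in_density(xm):
    """p(xm | E_Q[h]): the conditional evaluated at the mean hidden state."""
    mean_h = sum(Q[h] * h for h in Q)
    mean_xm = MEANS[0] + (MEANS[1] - MEANS[0]) * mean_h
    return normal_pdf(xm, mean_xm, STD)


def count_modes(density, grid):
    """Count strict local maxima of the density on the grid."""
    values = [density(x) for x in grid]
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


grid = [i / 100.0 for i in range(-400, 401)]
n_modes_mixture = count_modes(mixture_density, grid)
n_modes_plug_in = count_modes(plug_in_density, grid)
print(n_modes_mixture, n_modes_plug_in)
```

Under these assumptions the plug-in density places its single peak between the two modes of the mixture, in a region where the mixture assigns very little probability.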

The inherent limitations of this approximate approach employed by deep neural networks are obvious.⁷ There is no guarantee that the parametric form a deep neural network uses for the approximate posterior distribution $Q$ is good enough to bring the above approximation close to the exact marginalization. Furthermore, the usual method of fitting the variational parameters of $Q$, namely minimizing the Kullback-Leibler divergence between $Q$ and the true posterior distribution $p(h \mid x_o)$, tends to find only a single mode of the true posterior distribution, and we usually cannot tell how representative that mode is. If the true posterior distribution is highly multi-modal, an approximation based on an arbitrary mode will generally be poor. Lastly, it is not clear how a deep neural network can cope with a more flexible setting where the observed components are not fixed a priori.⁸

⁷ Note that the discussion in this section has been highly motivated and influenced by Section 5 of (Bengio, 2013).
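The mode-seeking behaviour of this fitting procedure can be sketched numerically. The example below is a simplification made for illustration, not a procedure from this thesis: a one-dimensional bimodal density stands in for the true posterior, $Q$ is a single Gaussian, and a coarse grid search over its mean and standard deviation replaces gradient-based fitting of variational parameters. All names and numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_pdf(x):
    """Hypothetical bimodal density standing in for the true posterior."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


DX = 0.05
GRID = [-8.0 + DX * i for i in range(321)]
LOG_P = [math.log(target_pdf(x)) for x in GRID]


def kl_q_to_p(mean, std):
    """Riemann-sum estimate of KL(Q || p) for Q = N(mean, std**2)."""
    total = 0.0
    for x, log_p in zip(GRID, LOG_P):
        q = normal_pdf(x, mean, std)
        if q > 0.0:  # skip points where q underflows to zero
            total += q * (math.log(q) - log_p) * DX
    return total


# Coarse grid over the mean and standard deviation of Q.
candidates = [
    (-3.0 + 0.25 * i, 0.25 * j) for i in range(25) for j in range(1, 11)
]
best_mean, best_std = min(candidates, key=lambda ms: kl_q_to_p(*ms))
best_kl = kl_q_to_p(best_mean, best_std)

# A broad Q centred between the modes, for comparison.
broad_kl = kl_q_to_p(0.0, 2.0)
print(best_mean, best_std, best_kl, broad_kl)
```

In this toy setting the divergence is minimized by a narrow $Q$ sitting on one of the two modes, while a broad $Q$ that covers both modes is penalized for placing mass in the low-density region between them. Which of the two symmetric modes is chosen depends only on the search order.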

For each of these limitations, a way to overcome it has been proposed. For instance, one may bypass the problem of marginalization or approximate inference by directly mapping from an input $x_o$ to the distribution of $x_m$ given $x_o$, which may have been learned by another model such as a restricted Boltzmann machine (Mnih et al., 2011). In this way, a deep neural network can learn to approximate the true predictive distribution $p(x_m \mid x_o)$ without going through an extra step of approximation in the middle.

If the ultimate goal is to make a decision that maximizes the predictive performance, we still need to be able to evaluate, either approximately or exactly, the multi-modal predictive distribution quickly and well. This brings us back to one of the limitations of the current approach based on the approximate inference of hidden variables.

A Boltzmann machine provides a principled way to overcome the problem of having to use an approximate inference of the posterior distribution over hidden variables. Instead of a variational approximation, one may perform (asymptotically) exact inference using Markov chain Monte Carlo (MCMC) sampling such that

\[
p(x_m \mid x_o) \approx \frac{1}{T} \sum_{t=1}^{T} p(x_m \mid h_t),
\]

where $h_t$ is the $t$-th sample from $p(h \mid x_o)$. In fact, it is natural with a Boltzmann machine to consider any combination of observed and missing components. This approach is therefore tempting, but as the size of a model grows and the number of modes in the predictive distribution increases, it becomes impractical to use MCMC sampling for making a rapid decision.
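The sampling-based estimate can be sketched on a tiny restricted Boltzmann machine, small enough that the exact predictive probability can be enumerated for comparison. The weights, biases, chain length and function names below are hypothetical and serve only as an illustration. The observed units are clamped while block Gibbs sampling alternates between the hidden units and the missing unit.

```python
import math
import random
from itertools import product


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical toy RBM: visible units 0 and 1 are observed (xo),
# visible unit 2 is missing (xm); there are two hidden units.
W = [[1.5, -1.0], [-0.5, 1.0], [2.0, -2.0]]  # W[i][j]: visible i, hidden j
B = [0.0, 0.0, -0.2]  # visible biases
C = [-0.5, 0.3]  # hidden biases
N_VISIBLE, N_HIDDEN, MISSING = 3, 2, 2


def p_h_given_v(v):
    """Factorial conditional p(h_j = 1 | v) for every hidden unit."""
    return [
        sigmoid(C[j] + sum(v[i] * W[i][j] for i in range(N_VISIBLE)))
        for j in range(N_HIDDEN)
    ]


def p_xm_given_h(h):
    """p(xm = 1 | h) for the single missing visible unit."""
    return sigmoid(B[MISSING] + sum(W[MISSING][j] * h[j] for j in range(N_HIDDEN)))


def exact_predictive(xo):
    """p(xm = 1 | xo) by enumerating every hidden state."""
    weight = {0: 0.0, 1: 0.0}
    for xm in (0, 1):
        v = list(xo) + [xm]
        for h in product((0, 1), repeat=N_HIDDEN):
            neg_energy = (
                sum(v[i] * W[i][j] * h[j]
                    for i in range(N_VISIBLE) for j in range(N_HIDDEN))
                + sum(B[i] * v[i] for i in range(N_VISIBLE))
                + sum(C[j] * h[j] for j in range(N_HIDDEN))
            )
            weight[xm] += math.exp(neg_energy)
    return weight[1] / (weight[0] + weight[1])


def mcmc_predictive(xo, n_samples, burn_in, rng):
    """Average p(xm = 1 | h_t) over Gibbs samples h_t with xo clamped."""
    xm, total = 0, 0.0
    for t in range(burn_in + n_samples):
        h = [int(rng.random() < p) for p in p_h_given_v(list(xo) + [xm])]
        p_one = p_xm_given_h(h)
        if t >= burn_in:
            total += p_one
        xm = int(rng.random() < p_one)
    return total / n_samples


xo = (1, 0)
exact = exact_predictive(xo)
estimate = mcmc_predictive(xo, n_samples=20000, burn_in=500, rng=random.Random(0))
print(f"exact={exact:.4f} mcmc={estimate:.4f}")
```

Even in this toy case the estimate requires thousands of sequential sampling steps for a single prediction, which illustrates why the approach is hard to use when a rapid decision is needed.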

Hence, we want a radically new neural network that keeps the best of the two types of deep neural networks discussed throughout this thesis. The fast and efficient computation of feedforward neural networks (see Chapter 3) is required for rapid decision-making, while most of the information and structure contained in a complex multi-modal predictive distribution must be maintained, just as a Boltzmann machine is able to learn a multi-modal distribution (see Chapter 4). To the author's current knowledge, there is no such neural network at the moment.⁹ It is, hence, left for future research to build a neural network that combines these two very different characteristics.

⁸ In a classical setting of, for instance, classification, we know in advance that label components are not going to be observed but all other components are.

⁹ A few recent works show some promising directions using stochastic neural networks (see, e.g., Bengio and Thibodeau-Laufer, 2013; Tang and Salakhutdinov, 2013).

The ultimate goal of deep neural networks and the field of deep learning will be to build a large deep neural network that can learn

\[
f(x_o \mid \theta) = \arg\max_{x_m} p(x_m \mid x_o),
\]

where $\theta$ denotes a set of parameters, and the indices of observed and missing components are not fixed a priori. A deep neural network that computes this function will have to be powerful and flexible enough to consider all different possibilities or modes in the predictive distribution in a single sweep of the network in a feedforward manner.
