Recall from Section 1.2.2 that RNNs compute a mapping from the input sequence to a corresponding sequence of real-valued hidden state vectors of dimension $D$:

\[ \mathrm{RNN}([y_1, \dots, y_T]) = [h_1, \dots, h_T], \quad h_t \in \mathbb{R}^D. \tag{3.130} \]

The hidden state $h_t$ is a flexible representation, meant to summarize the sequence seen up to time $t$, that enables us to use the network for different tasks by defining an appropriate output layer. RNNs can be seen as non-linear dynamic systems that define the computation of the hidden state recursively. In graphical terms, they generalize the directed acyclic graph structure of feed-forward networks by allowing cycles that represent dependencies on the past state of the network. The RNNLM uses a simple network architecture that goes back to Elman [1990] and expresses the dependency on the past through the following recursive definition of the hidden state vectors:

\[
\begin{aligned}
a_t &= W_h h_{t-1} + W_{in} \delta_{y_t}, \\
h_t &= \sigma(a_t).
\end{aligned} \tag{3.131}
\]

The recurrence is initialized by a constant vector $h_0 = \epsilon \mathbf{1}$ with small $\epsilon \geq 0$. The matrices $W_h \in \mathbb{R}^{D \times D}$ and $W_{in} \in \mathbb{R}^{D \times I}$ are parameters of the RNN, and $\sigma(\cdot)$ is a non-linear function that is applied element-wise to its input, such as the logistic sigmoid, the hyperbolic tangent or, more recently, the rectified linear function. The input is presented to the network as one-hot encoded vectors denoted by $\delta_{y_t}$, in which case the matrix-vector product $W_{in} \delta_{y_t}$ reduces to selecting the $y_t$-th column of $W_{in}$. Thus, the columns of $W_{in}$ can be thought of as latent features describing the corresponding item. Note that the network can be trivially extended to accept arbitrary side information that characterizes the input at time $t$. In practice, more sophisticated architectures implementing (3.130) are in use [Hochreiter and Schmidhuber, 1997, Cho et al., 2014b]. To obtain a distribution over the next item, we linearly map the hidden state to $\mathbb{R}^I$ using a matrix $W_{out} \in \mathbb{R}^{I \times D}$ and pass the output vector $z_t \in \mathbb{R}^I$ through the softmax function from (3.129):

\[
\begin{aligned}
z_t &= W_{out} h_t, \\
P(y_{t+1} \mid y_{<t+1}) &= \sigma_m(z_t, y_{t+1}).
\end{aligned} \tag{3.132}
\]
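The forward equations (3.131) and (3.132) can be sketched in a few lines of NumPy. This is a minimal illustration, not the RNNLM implementation: the function names `softmax` and `rnn_forward`, the choice of `tanh` for $\sigma$, the random weights and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(y, W_in, W_h, W_out, h0, sigma=np.tanh):
    """Run the Elman recurrence over item indices y.

    Returns the hidden states h_1..h_T (T x D) and, for each step,
    the distribution over the next item (T x I).
    """
    h = h0
    hs, ps = [], []
    for y_t in y:
        # W_in @ one_hot(y_t) is just column y_t of W_in.
        a = W_h @ h + W_in[:, y_t]
        h = sigma(a)
        hs.append(h)
        ps.append(softmax(W_out @ h))
    return np.array(hs), np.array(ps)

# Toy setup (assumed sizes): D hidden units, I items.
rng = np.random.default_rng(0)
D, I = 4, 6
W_in = rng.normal(scale=0.1, size=(D, I))
W_h = rng.normal(scale=0.1, size=(D, D))
W_out = rng.normal(scale=0.1, size=(I, D))
h0 = 0.1 * np.ones(D)  # constant initial state, epsilon = 0.1 (assumed)

hs, ps = rnn_forward([2, 0, 5], W_in, W_h, W_out, h0)
```

Indexing the column `W_in[:, y_t]` avoids materializing the one-hot vector, which is the usual way such an input layer is implemented as an embedding lookup.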

This likelihood, together with the recursion in (3.131), enables us to sample sequences by drawing from the multinomial distribution (3.132) and presenting the sample as input to the network at the next timestep.
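The sampling procedure just described might look as follows. This is a sketch under stated assumptions: the function name `sample_sequence`, the use of `tanh`, seeding the sequence with a given first item, and the random toy weights are not taken from the text.

```python
import numpy as np

def sample_sequence(W_in, W_h, W_out, h0, first_item, length, rng, sigma=np.tanh):
    """Draw a sequence of item indices by feeding each sample back as input."""
    y = [first_item]
    h = h0
    for _ in range(length - 1):
        # One step of the recurrence (3.131) on the most recent item.
        h = sigma(W_h @ h + W_in[:, y[-1]])
        # Softmax over the output scores gives the next-item distribution (3.132).
        z = W_out @ h
        p = np.exp(z - z.max())
        p /= p.sum()
        # Draw the next item and present it as input at the next timestep.
        y.append(int(rng.choice(len(p), p=p)))
    return y

# Toy setup (assumed sizes and random weights).
rng = np.random.default_rng(0)
D, I = 4, 6
W_in = rng.normal(scale=0.5, size=(D, I))
W_h = rng.normal(scale=0.5, size=(D, D))
W_out = rng.normal(scale=0.5, size=(I, D))

seq = sample_sequence(W_in, W_h, W_out, 0.1 * np.ones(D),
                      first_item=2, length=8, rng=rng)
```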

The network is parameterized by $\theta = \{W_{in}, W_h, W_{out}\}$ and can be trained by maximum likelihood. The resulting continuous optimization problem is solved using stochastic gradient descent, where gradients are approximated using backpropagation through time [BPTT, Williams and Zipser, 1995]. BPTT, which we briefly describe below, amounts to unrolling the recurrence described by the forward equations (3.131) for a fixed number of steps.

Parameter Learning. We describe how the RNNLM processes a single sequence, as we will use this procedure as a building block for learning the model in Section 3.3.4.2. As loss function, Mikolov et al. [2010] use the log-probability of sequences (see Eq. 3.120), which is differentiable with respect to all parameter matrices. The network is trained using backpropagation through time [BPTT, Williams and Zipser, 1995]. BPTT computes the gradient by unrolling the RNN in time and treating it as a multi-layer feed-forward neural network with parameters tied across layers and error signals at every layer. For computational reasons, the unrolling is truncated to a fixed size $B$, a popular approximation that makes processing long sequences cheaper. This method is summarized in Algorithm 7. To train on a full corpus, Algorithm 7 is applied to individual sentences in a stochastic fashion. The learning rule in the original RNNLM is a gradient descent step with a scalar step size.¹⁰ The training process is regularized by using early stopping and small amounts of Tikhonov regularization on the network weights.

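Algorithm 7 processSequence()

Require: Sequence $y$, $\theta = \{W_{in}, W_h, W_{out}\}$, batch size $B$

Ensure: Updated parameters $\theta$

1: $\mathcal{B} \leftarrow$ Split $y$ into sub-sequences of length $B$

2: $h_0 \leftarrow \epsilon \mathbf{1}$

3: for $b \in \mathcal{B}$ do

4: $h_1, \dots, h_B \leftarrow \mathrm{RNN}(b, h_0; \theta)$ (Forward pass using Eq. 3.131)

5: $\nabla_\theta \log E \leftarrow \mathrm{BPTT}(b, h_1, \dots, h_B; \theta)$ (Backward pass, see below)

6: $\theta \leftarrow \mathrm{LearningRule}(\nabla_\theta \log E, \theta)$

7: $h_0 \leftarrow h_B$

8: end for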
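The control flow of Algorithm 7 can be sketched as below. Only the chunking, the carry-over of the hidden state and the one-update-per-chunk structure are shown. The forward pass, backward pass and learning rule are passed in as callables, and the toy stand-ins used in the demonstration (`toy_forward`, `toy_bptt`, `toy_rule`) are hypothetical placeholders, not an RNN.

```python
def process_sequence(y, theta, h_init, B, forward, bptt, learning_rule):
    """Truncated-BPTT control flow: one parameter update per chunk of length B."""
    # Step 1: split y into consecutive sub-sequences of length B
    # (the last chunk may be shorter).
    chunks = [y[i:i + B] for i in range(0, len(y), B)]
    h0 = h_init                              # step 2: initial hidden state
    for b in chunks:                         # step 3
        hs = forward(b, h0, theta)           # step 4: hidden states for the chunk
        grad = bptt(b, hs, theta)            # step 5: truncated gradient
        theta = learning_rule(grad, theta)   # step 6: parameter update
        h0 = hs[-1]                          # step 7: carry the last state forward
    return theta

# Toy stand-ins (hypothetical): the "hidden state" is a position counter and
# the "gradient" is the chunk length, which makes the control flow observable.
calls = []

def toy_forward(b, h0, theta):
    calls.append((tuple(b), h0))
    return [h0 + k + 1 for k in range(len(b))]

def toy_bptt(b, hs, theta):
    return float(len(b))

def toy_rule(grad, theta, step=0.1):
    return theta + step * grad

theta = process_sequence(list("abcdefg"), 0.0, 0, 3,
                         toy_forward, toy_bptt, toy_rule)
```

Because the state is carried across chunks but gradients are not, each update sees the full history in the forward pass while the backward pass stops at the chunk boundary. This is where the approximation in truncated BPTT comes from.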

Backpropagation Through Time. We derive the BPTT equations for the RNNLM for the gradient computations in Algorithm 7. This is done in the same manner as back-propagation is derived for feed-forward networks. The unrolled RNN differs from a feed-forward network by having input and error terms at each layer. Furthermore, gradients are approximate due to truncation. Note that modern neural network libraries do not require writing gradient code, which is especially useful for more complex RNN architectures. Nevertheless, we found it instructive to understand the basic computations involved in training neural networks.

¹⁰ There are more nuances to their algorithm, but for the purpose of our development this basic

Based on the forward equations (3.131) and (3.132), we define the sequence likelihood using the following quantities:

\[ p_t := [\sigma_m(z_t, i)]_{i \in I} \tag{3.133} \]

\[ \ell_t := \log \sigma_m(z_t, y_{t+1}) = \log p_{t, y_{t+1}} \tag{3.134} \]

\[ L := \sum_{t=0}^{B-1} \ell_t \tag{3.135} \]

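To compute $\nabla_\theta L$, it is convenient to first compute the gradients $\nabla_{a_t} L$, $\forall t$. The forward equations reveal the dependencies on a particular $a_t$ as being two-fold: both $\ell_t$ and $a_{t+1}$ contribute to $\nabla_{a_t} L$. Therefore, the gradient $\nabla_{a_t} L$ is given by (assuming row vectors)

\[ \nabla_{a_t} L = \nabla_{a_t} \ell_t + \sum_k \frac{\partial L}{\partial a_{t+1,k}} \frac{\partial a_{t+1,k}}{\partial a_t} = \nabla_{a_t} \ell_t + (\nabla_{a_{t+1}} L)(\Delta_{a_t} a_{t+1}) \tag{3.136} \]

The recursion is initialized by $\nabla_{a_{B-1}} L = \nabla_{a_{B-1}} \ell_{B-1}$, since only the output $\ell_{B-1}$ depends on the last activation. The gradient and Jacobian used above are given by: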

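\[ \nabla_{a_t} \ell_t = \sum_i \frac{\partial \ell_t}{\partial z_{t,i}} \frac{\partial z_{t,i}}{\partial a_t} = (\delta_{y_{t+1}} - p_t)^T W_{out} \, \mathrm{diag}\big(\sigma'(a_t)\big) \tag{3.137} \]

\[ \Delta_{a_t} a_{t+1} = W_h \, \mathrm{diag}\big(\sigma'(a_t)\big) \tag{3.138} \]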

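Now, the parameter gradients can be easily obtained using $\nabla_{W_\cdot} L = \sum_{t,k} \frac{\partial L}{\partial a_{t,k}} \frac{\partial a_{t,k}}{\partial W_\cdot}$ and

\[ \frac{\partial a_{t,k}}{\partial W_h} = \partial_{W_h} \, \delta_k^T (W_h h_{t-1} + W_{in} \delta_{y_t}) = \delta_k h_{t-1}^T \tag{3.139} \]

\[ \frac{\partial a_{t,k}}{\partial W_{in}} = \partial_{W_{in}} \, \delta_k^T (W_h h_{t-1} + W_{in} \delta_{y_t}) = \delta_k \delta_{y_t}^T \tag{3.140} \]

Plugging this in, we get

\[ \nabla_{W_h} L = \sum_t (\nabla_{a_t} L) \, h_{t-1}^T \tag{3.141} \]

\[ \nabla_{W_{in}} L = \sum_t (\nabla_{a_t} L) \, \delta_{y_t}^T \tag{3.142} \]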
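The recursion (3.136) with (3.137) and (3.138), and the accumulations (3.141) and (3.142), can be checked numerically with a short NumPy sketch. Several things here are assumptions made for the example rather than taken from the text: $\sigma = \tanh$ (so $\sigma'(a_t) = 1 - h_t^2$), the function names, the toy sizes, and the indexing in which step $t$ reads item `x[t]` and is scored against `targets[t]`. The output-weight gradient is written as the standard softmax-layer accumulation $\sum_t (\delta_{y_{t+1}} - p_t) h_t^T$, which is also an assumption of this sketch.

```python
import numpy as np

def forward(x, h_init, W_in, W_h, W_out):
    """Hidden states and next-item distributions for the input items x."""
    hs, ps, h = [], [], h_init
    for x_t in x:
        h = np.tanh(W_h @ h + W_in[:, x_t])
        z = W_out @ h
        p = np.exp(z - z.max())
        p /= p.sum()
        hs.append(h)
        ps.append(p)
    return hs, ps

def log_lik(x, targets, h_init, W_in, W_h, W_out):
    """L = sum_t log p_t[target_t], as in (3.135)."""
    _, ps = forward(x, h_init, W_in, W_h, W_out)
    return sum(np.log(p[y]) for p, y in zip(ps, targets))

def bptt(x, targets, h_init, W_in, W_h, W_out):
    """Gradients of L with respect to W_in, W_h and W_out."""
    hs, ps = forward(x, h_init, W_in, W_h, W_out)
    g_in, g_h, g_out = (np.zeros_like(W) for W in (W_in, W_h, W_out))
    g_a_next = None
    for t in reversed(range(len(x))):
        err = -ps[t]                      # delta_{y_{t+1}} - p_t
        err[targets[t]] += 1.0
        dsig = 1.0 - hs[t] ** 2           # tanh'(a_t)
        g_a = (err @ W_out) * dsig        # (3.137)
        if g_a_next is not None:
            g_a += (g_a_next @ W_h) * dsig  # (3.136) using the Jacobian (3.138)
        h_prev = hs[t - 1] if t > 0 else h_init
        g_h += np.outer(g_a, h_prev)      # (3.141)
        g_in[:, x[t]] += g_a              # (3.142): only column y_t is touched
        g_out += np.outer(err, hs[t])     # output layer (assumed standard form)
        g_a_next = g_a
    return g_in, g_h, g_out

# Toy problem (assumed sizes and random weights).
rng = np.random.default_rng(1)
D, I = 3, 5
W_in = rng.normal(scale=0.5, size=(D, I))
W_h = rng.normal(scale=0.5, size=(D, D))
W_out = rng.normal(scale=0.5, size=(I, D))
h_init = 0.1 * np.ones(D)
seq = [1, 4, 0, 2, 3]
x, targets = seq[:-1], seq[1:]

g_in, g_h, g_out = bptt(x, targets, h_init, W_in, W_h, W_out)

# Central finite difference on one entry of W_h as a sanity check.
eps = 1e-5
W_p, W_m = W_h.copy(), W_h.copy()
W_p[0, 1] += eps
W_m[0, 1] -= eps
numeric = (log_lik(x, targets, h_init, W_in, W_p, W_out)
           - log_lik(x, targets, h_init, W_in, W_m, W_out)) / (2 * eps)
print(abs(numeric - g_h[0, 1]))
```

The code stores gradients with respect to $a_t$ as 1-D arrays, so the outer products in (3.141) and (3.142) are formed with `np.outer` and no explicit row/column transposition is needed.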

The output weights are not affected by the recurrent structure. Thus, we only need to accumulate terms