

Recall from Section 1.2.2 that RNNs compute a mapping from the input sequence to a corresponding sequence of real-valued hidden state vectors of dimension $D$:

\[
\mathrm{RNN}([y_1, \dots, y_T]) = [h_1, \dots, h_T], \qquad h_t \in \mathbb{R}^D. \tag{3.130}
\]

The hidden state $h_t$ is a flexible representation meant to summarize the sequence seen up to time $t$; it lets us use the network for different tasks by defining an appropriate output layer. RNNs can be seen as nonlinear dynamical systems that define the computation of the hidden state recursively. In graphical terms, they generalize the directed acyclic graph structure of feed-forward networks by allowing cycles, which represent dependencies on the past state of the network. The RNNLM uses a simple network architecture that goes back to Elman [1990] and expresses the dependency on the past through the following recursive definition of the hidden state vectors:

\[
a_t = W_h h_{t-1} + W_{\mathrm{in}} \delta_{y_t}, \qquad h_t = \sigma(a_t). \tag{3.131}
\]

The recurrence is initialized by a constant vector $h_0 = \epsilon \mathbf{1}$ with small $\epsilon \ge 0$. The matrices $W_h \in \mathbb{R}^{D \times D}$ and $W_{\mathrm{in}} \in \mathbb{R}^{D \times I}$ are parameters of the RNN, and $\sigma(\cdot)$ is a nonlinear function applied element-wise to its input, such as the logistic sigmoid, the hyperbolic tangent or, more recently, the rectified linear function. The input is presented to the network as one-hot encoded vectors denoted by $\delta_{y_t}$, in which case the matrix-vector product $W_{\mathrm{in}} \delta_{y_t}$ reduces to selecting the $y_t$-th column of $W_{\mathrm{in}}$. Thus, the columns of $W_{\mathrm{in}}$ can be thought of as latent features describing the corresponding item. Note that the network can be trivially extended to accept arbitrary side information that characterizes the input at time $t$. In practice, more sophisticated architectures implementing (3.130) are in use [Hochreiter and Schmidhuber, 1997, Cho et al., 2014b]. To obtain a distribution over the next item, we can linearly map the hidden state to $\mathbb{R}^I$ using a matrix $W_{\mathrm{out}} \in \mathbb{R}^{I \times D}$ and pass the output vector $z_t \in \mathbb{R}^I$ through the softmax function from (3.129):

\[
z_t = W_{\mathrm{out}} h_t, \qquad P(y_{t+1} \mid y_{<t+1}) = \sigma_m(z_t, y_{t+1}). \tag{3.132}
\]

This likelihood, together with the recursion in (3.131), enables us to sample sequences by drawing from the multinomial (3.132) and presenting the sample as input to the network at the next time step.
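The forward recursion and the sampling loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: it assumes $\sigma = \tanh$, a toy model with $D = 3$ and $I = 4$, randomly drawn weights, and an initialization constant `eps = 0.1`. The helper names (`matvec`, `column`, `step`, `sample_sequence`) are made up for this example.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def column(W, j):
    """W times a one-hot vector for item j just selects column j of W."""
    return [row[j] for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def step(W_h, W_in, W_out, h_prev, y_t):
    """One application of (3.131) and (3.132), assuming sigma = tanh."""
    a = [r + c for r, c in zip(matvec(W_h, h_prev), column(W_in, y_t))]
    h = [math.tanh(x) for x in a]
    p = softmax(matvec(W_out, h))  # distribution over the next item
    return h, p

def sample_sequence(W_h, W_in, W_out, first_item, length, rng, eps=0.1):
    """Draw items one at a time, feeding each sample back as the next input."""
    h = [eps] * len(W_h)  # constant initial state (eps is an assumed value)
    seq = [first_item]
    for _ in range(length - 1):
        h, p = step(W_h, W_in, W_out, h, seq[-1])
        seq.append(rng.choices(range(len(p)), weights=p)[0])
    return seq

# Toy setup with illustrative sizes: D = 3 hidden units, I = 4 items.
rng = random.Random(0)
D, I = 3, 4

def rand(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_h, W_in, W_out = rand(D, D), rand(D, I), rand(I, D)
sample = sample_sequence(W_h, W_in, W_out, first_item=0, length=6, rng=rng)
```

Because the input is one-hot, `column` replaces the full matrix-vector product $W_{\mathrm{in}} \delta_{y_t}$, which is the column-selection view described above.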

The network is parameterized by $\theta = \{W_{\mathrm{in}}, W_h, W_{\mathrm{out}}\}$ and can be trained by maximum likelihood. The resulting continuous optimization problem is solved using stochastic gradient descent, where gradients are approximated using backpropagation through time [BPTT, Williams and Zipser, 1995]. BPTT, which we briefly describe below, amounts to unrolling the recurrence described by the forward equations (3.131) for a fixed number of steps.

Parameter Learning. We describe how the RNNLM processes a single sequence, as we will use this procedure as a building block for learning the model in Section 3.3.4.2. As loss function, Mikolov et al. [2010] use the log-probability of sequences (see Eq. 3.120), which is differentiable with respect to all parameter matrices. The network is trained using backpropagation through time [BPTT, Williams and Zipser, 1995]. BPTT computes the gradient by unrolling the RNN in time and treating it as a multi-layer feed-forward neural network with parameters tied across all layers and error signals at every layer. For computational reasons, the unrolling is truncated to a fixed size $B$, a popular approximation that makes processing longer sequences cheaper. This method is summarized in Algorithm 7. To train on a full corpus, Algorithm 7 is applied to individual sentences in a stochastic fashion. The learning rule in the original RNNLM is a gradient descent step with a scalar step size.¹⁰ The training process is regularized by early stopping and small amounts of Tikhonov regularization on the network weights.

Algorithm 7 processSequence()

Require: Sequence y, θ ={W_in, W_h, W_out}, batch size B

Ensure: Updated parameters θ

1: $\mathcal{B}$ ← Split y into sub-sequences of length B
2: h_0 ← ε1
3: for b ∈ $\mathcal{B}$ do
4:     h_1, . . . , h_B ← RNN(b, h_0; θ)    (Forward pass using Eq. 3.131)
5:     ∇_θ log E ← BPTT(b, h_1, . . . , h_B; θ)    (Backward pass, see below)
6:     θ ← LearningRule(∇_θ, θ)
7:     h_0 ← h_B
8: end for
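The control flow of Algorithm 7 can be sketched as follows. This is a skeleton under stated assumptions rather than the original code: the forward pass, backward pass and learning rule are passed in as callables with hypothetical signatures, and the toy stand-ins at the bottom exist only so the loop can be run end to end. They are not an RNN.

```python
def split_into_chunks(y, B):
    """Split a sequence into consecutive sub-sequences of length at most B."""
    return [y[i:i + B] for i in range(0, len(y), B)]

def process_sequence(y, theta, B, forward, bptt, learning_rule, h0):
    """Skeleton of Algorithm 7 with the three sub-routines supplied by the caller.

    Assumed (hypothetical) signatures:
      forward(chunk, h0, theta)  -> list of hidden states, one per item
      bptt(chunk, hs, theta)     -> gradient of the chunk log-likelihood
      learning_rule(grad, theta) -> updated parameters
    """
    for chunk in split_into_chunks(y, B):
        hs = forward(chunk, h0, theta)      # step 4: forward pass
        grad = bptt(chunk, hs, theta)       # step 5: truncated backward pass
        theta = learning_rule(grad, theta)  # step 6: parameter update
        h0 = hs[-1]                         # step 7: carry the state forward
    return theta

# Toy stand-ins (not a real RNN) so the skeleton is runnable.
def toy_forward(chunk, h0, theta):
    hs, h = [], h0
    for item in chunk:
        h = h + theta * item  # running scalar state
        hs.append(h)
    return hs

def toy_bptt(chunk, hs, theta):
    return float(len(chunk))  # pretend gradient

def toy_rule(grad, theta, step_size=0.1):
    return theta + step_size * grad  # ascent step on the log-likelihood

theta = process_sequence([1, 2, 3, 4, 5], 1.0, 2,
                         toy_forward, toy_bptt, toy_rule, h0=0.0)
```

The hidden state is carried from one chunk to the next, but gradients never flow across chunk boundaries. That is where the truncation approximation enters.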

Backpropagation Through Time. We derive the BPTT equations for the RNNLM that are needed for the gradient computations in Algorithm 7. The derivation proceeds in the same manner as that of back-propagation for feed-forward networks. The unrolled RNN differs from a feed-forward network by having input and error terms at each layer. Furthermore, gradients are approximate due to truncation. Note that modern neural network libraries do not require writing gradient code, which is especially useful for more complex RNN architectures. Nevertheless, we found it instructive to understand the basic computations involved in training neural networks.

¹⁰ There are more nuances to their algorithm, but for the purpose of our development this basic

Based on the forward equations (3.131) and (3.132), we define the sequence likelihood using the following quantities:

\begin{align}
p_t &:= [\sigma_m(z_t, i)]_{i \in I} \tag{3.133} \\
\ell_t &:= \log \sigma_m(z_t, y_{t+1}) = \log p_{t, y_{t+1}} \tag{3.134} \\
L &:= \sum_{t=0}^{B-1} \ell_t \tag{3.135}
\end{align}

To compute $\nabla_\theta L$, it is convenient to first compute the gradients $\nabla_{a_t} L$ for all $t$. The forward equations reveal that the dependency on a particular $a_t$ is two-fold: both $\ell_t$ and $a_{t+1}$ contribute to $\nabla_{a_t} L$. Therefore, the gradient $\nabla_{a_t} L$ is given by (assuming row vectors)

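\[
\nabla_{a_t} L = \nabla_{a_t} \ell_t + \sum_k \frac{\partial L}{\partial a_{t+1,k}} \frac{\partial a_{t+1,k}}{\partial a_t} = \nabla_{a_t} \ell_t + (\nabla_{a_{t+1}} L)(\Delta_{a_t} a_{t+1}). \tag{3.136}
\]

The recursion is initialized by $\nabla_{a_{B-1}} L = \nabla_{a_{B-1}} \ell_{B-1}$, since only the output $\ell_{B-1}$ depends on the last activation. The gradient and Jacobian used above are given by: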

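\begin{align}
\nabla_{a_t} \ell_t &= \sum_i \frac{\partial \ell_t}{\partial z_{t,i}} \frac{\partial z_{t,i}}{\partial a_t} = (\delta_{y_{t+1}} - p_t)^T W_{\mathrm{out}} \operatorname{diag}\big(\sigma'(a_t)\big) \tag{3.137} \\
\Delta_{a_t} a_{t+1} &= W_h \operatorname{diag}\big(\sigma'(a_t)\big) \tag{3.138}
\end{align}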
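As a worked instance, the derivative of the nonlinearity that enters the gradient (3.137) and the Jacobian (3.138) can be written in terms of the hidden state itself for two of the nonlinearities mentioned earlier. The symbol $\odot$ (element-wise product) is introduced here for illustration only.

```latex
% Hyperbolic tangent: h_t = \tanh(a_t)
\sigma'(a_t) = 1 - h_t \odot h_t
\quad\Longrightarrow\quad
\Delta_{a_t} a_{t+1} = W_h \operatorname{diag}\big(1 - h_t \odot h_t\big)

% Logistic sigmoid: h_t = 1 / (1 + e^{-a_t})
\sigma'(a_t) = h_t \odot (1 - h_t)
\quad\Longrightarrow\quad
\Delta_{a_t} a_{t+1} = W_h \operatorname{diag}\big(h_t \odot (1 - h_t)\big)
```

In both cases the backward pass can reuse the hidden states stored during the forward pass and does not need to keep the activations $a_t$.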

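Now, the parameter gradients can be obtained easily using

\[
\nabla_{W_\cdot} L = \sum_{t,k} \frac{\partial L}{\partial a_{t,k}} \frac{\partial a_{t,k}}{\partial W_\cdot}
\]

and

\begin{align}
\frac{\partial a_{t,k}}{\partial W_h} &= \frac{\partial}{\partial W_h} \delta_k^T \big(W_h h_{t-1} + W_{\mathrm{in}} \delta_{y_t}\big) = \delta_k h_{t-1}^T \tag{3.139} \\
\frac{\partial a_{t,k}}{\partial W_{\mathrm{in}}} &= \frac{\partial}{\partial W_{\mathrm{in}}} \delta_k^T \big(W_h h_{t-1} + W_{\mathrm{in}} \delta_{y_t}\big) = \delta_k \delta_{y_t}^T \tag{3.140}
\end{align}

Plugging this in, we get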

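\begin{align}
\nabla_{W_h} L &= \sum_t (\nabla_{a_t} L)\, h_{t-1}^T \tag{3.141} \\
\nabla_{W_{\mathrm{in}}} L &= \sum_t (\nabla_{a_t} L)\, \delta_{y_t}^T \tag{3.142}
\end{align}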
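The backward recursion (3.136) and the accumulations (3.141) and (3.142) can be checked numerically on a single chunk. The sketch below is an illustration under several assumptions: $\sigma = \tanh$, toy sizes $D = 3$ and $I = 4$, random weights, zero-based indexing in which `h0` is the state before the first input, and a made-up item sequence. All function names are hypothetical. Within one chunk nothing is truncated, so the analytic gradients should agree with central finite differences.

```python
import math
import random

def forward(W_h, W_in, W_out, ys, h0):
    """Run (3.131)-(3.132) with sigma = tanh; return hidden states and softmax outputs."""
    hs, ps, h = [], [], h0
    for y in ys:
        a = [sum(w * x for w, x in zip(row, h)) + W_in[k][y]
             for k, row in enumerate(W_h)]
        h = [math.tanh(x) for x in a]
        z = [sum(w * x for w, x in zip(row, h)) for row in W_out]
        m = max(z)
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        hs.append(h)
        ps.append([v / s for v in e])
    return hs, ps

def log_likelihood(ps, targets):
    """L = sum_t log p_{t, y_{t+1}}, as in (3.135)."""
    return sum(math.log(p[y]) for p, y in zip(ps, targets))

def bptt(W_h, W_in, W_out, ys, targets, h0):
    """Gradients of L with respect to W_h and W_in via (3.136)-(3.142)."""
    hs, ps = forward(W_h, W_in, W_out, ys, h0)
    D, I, B = len(W_h), len(W_out), len(ys)
    gW_h = [[0.0] * D for _ in range(D)]
    gW_in = [[0.0] * len(W_in[0]) for _ in range(D)]
    g_next = [0.0] * D  # gradient w.r.t. a_{t+1}; zero beyond the chunk end
    for t in reversed(range(B)):
        dsig = [1.0 - h * h for h in hs[t]]  # tanh'(a_t)
        err = [(1.0 if i == targets[t] else 0.0) - ps[t][i] for i in range(I)]
        # (3.137): (delta_{y_{t+1}} - p_t)^T W_out diag(sigma'(a_t))
        g = [sum(err[i] * W_out[i][k] for i in range(I)) * dsig[k]
             for k in range(D)]
        # (3.136): add (grad_{a_{t+1}} L) W_h diag(sigma'(a_t))
        for k in range(D):
            g[k] += sum(g_next[j] * W_h[j][k] for j in range(D)) * dsig[k]
        h_prev = hs[t - 1] if t > 0 else h0
        for k in range(D):
            for j in range(D):
                gW_h[k][j] += g[k] * h_prev[j]  # (3.141)
            gW_in[k][ys[t]] += g[k]             # (3.142)
        g_next = g
    return gW_h, gW_in

rng = random.Random(1)
D, I = 3, 4

def rand(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_h, W_in, W_out = rand(D, D), rand(D, I), rand(I, D)
seq = [0, 2, 1, 3, 2]                    # made-up item indices
ys, targets, h0 = seq[:-1], seq[1:], [0.1] * D
gW_h, gW_in = bptt(W_h, W_in, W_out, ys, targets, h0)

def numeric_grad(W, k, j, step=1e-6):
    """Central finite difference of L w.r.t. W[k][j] (W is W_h or W_in, edited in place)."""
    old = W[k][j]
    W[k][j] = old + step
    up = log_likelihood(forward(W_h, W_in, W_out, ys, h0)[1], targets)
    W[k][j] = old - step
    down = log_likelihood(forward(W_h, W_in, W_out, ys, h0)[1], targets)
    W[k][j] = old
    return (up - down) / (2 * step)

max_err_h = max(abs(gW_h[k][j] - numeric_grad(W_h, k, j))
                for k in range(D) for j in range(D))
max_err_in = max(abs(gW_in[k][j] - numeric_grad(W_in, k, j))
                 for k in range(D) for j in range(I))
```

The sketch covers only the recurrent and input weights. The gradient with respect to $W_{\mathrm{out}}$ is left out because it does not involve the recursion.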

The output weights are not affected by the recurrent structure. Thus, we only need to accumulate terms
