Several types of training algorithms are well known in the literature, and there is no clear winner among them. The most common method to minimize the total error is gradient descent, which is a first-order iterative optimisation algorithm. The methods used in this research effort are described in the next sub-sections, with the exception of ESN training, which is described in a separate section because of its importance. In the simplest approach to training neural networks, we consider a system that produces an output Y when an input X is given. Figure 5.11 shows such a simple system, where (x, y) represents one sample of training data. As discussed in Section 5.2.1, training requires
Fig. 5.11: Example diagram for a system with input X and output Y.
more than one sample of data to obtain good results. Therefore, the training data is defined by the input matrix $\tilde{X}$ and the target (output) matrix $\tilde{Y}$, both containing N samples of data. Note that the samples must be in the correct time order and that the training data should represent the system as well as possible. Training a neural network (Fig.
Fig. 5.12: Example diagram of training a neural network (blocks: System, Neural Network, Cost function E(w), Update w; signals: X, Y, Y(w), e(w)).
5.12) means that all weights in the vector w, which contains the connection weights and biases, are updated step by step so that the neural network output Y(w) matches the training target output Y. The objective of this optimisation is to minimize the error E(w) (cost function) between the neural network and system outputs. Training repeatedly adapts the weight vector w until one of two termination conditions is met: the maximum number of iterations (epochs) is reached, or the error is reduced to the goal $E_{stop}$.

Recurrent Neural Networks can be trained using several methods, which are explored in Jaeger (2005a). The most important and widely used are Back-propagation Through Time (BPTT), Real-Time Recurrent Learning (RTRL) and Extended Kalman Filtering (EKF) based techniques; the first two are the ones used in this research.
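The weight-update loop and its two termination conditions can be sketched as follows. This is a minimal illustration, not the implementation used in this work: a linear model stands in for the neural network, the cost is assumed to be the mean squared error, and the names `train`, `gamma`, `max_epochs` and `e_stop` are hypothetical.

```python
import numpy as np

def train(X, Y, gamma=0.1, max_epochs=1000, e_stop=1e-6):
    """Batch gradient descent on a linear model Y ~ X @ w (stand-in for a network)."""
    w = np.zeros((X.shape[1], Y.shape[1]))
    for epoch in range(1, max_epochs + 1):
        err = X @ w - Y                    # e(w): model output minus target
        E = float(np.mean(err ** 2))       # cost E(w), assumed mean squared error
        if E <= e_stop:                    # termination: error goal E_stop reached
            return w, E, epoch
        grad = 2.0 * X.T @ err / err.size  # gradient of E with respect to w
        w -= gamma * grad                  # update step
    E = float(np.mean((X @ w - Y) ** 2))
    return w, E, max_epochs                # termination: epoch limit reached
```

The function returns the epoch count so that the caller can tell which of the two conditions ended the run.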
BPTT
Back-propagation is a gradient-based technique used to train several types of neural networks, not only recurrent ones. Back-propagation Through Time is an adaptation of the well-known back-propagation training method for feed-forward networks. It builds on multi-layer perceptrons and therefore employs at least three layers of neurons (input, hidden and output), with each unit connected to every unit of the previous layer. Detailed work on BPTT can be found in Werbos (1990), Beaufays et al. (1994) and Guo (2013). In a feed-forward network, the error derivatives with respect to the weights of one layer can be expressed completely in terms of the error derivatives from the layer above. Recurrent neural networks do not have this ordered layering, because their neurons do not form a directed acyclic graph (DAG). A transformation is therefore needed to convert an RNN into an equivalent feed-forward neural network; this process is called unrolling or unfolding and was shown in Fig. 5.8. The steps of the BPTT training method are as follows.
1. For a sample n, the activations are computed using a forward pass as

$$x_i^{m+1}(n) = f\left( \sum_{j=1,\dots,N^m} w_{ij}^{m} x_j^{m}(n) \right) \tag{5.9}$$

2. Going backward through m = k + 1, ..., 1, compute for each unit $x_i^m$ the error propagation term $\delta_i^m(n)$, as seen in equation 5.10 for the output layer and equation 5.11 for the hidden layers,

$$\delta_i^{k+1}(n) = (d_i(n) - y_i(n)) \left. \frac{\partial f(u)}{\partial u} \right|_{u = z_i^{k+1}} \tag{5.10}$$

$$\delta_j^{m}(n) = \sum_{i=1}^{N^{m+1}} \delta_i^{m+1}(n)\, w_{ij}^{m} \left. \frac{\partial f(u)}{\partial u} \right|_{u = z_j^{m}} \tag{5.11}$$

where 5.12 is the internal state of the neuron $x_i^m$. This is the error back-propagation pass. The error propagation term $\delta_i^m(n)$ represents the error gradient with respect to the potential of the neuron $x_i^m$ (5.13).

$$z_i^{m}(n) = \sum_{j=1}^{N^{m-1}} x_j^{m-1}(n)\, w_{ij}^{m-1} \tag{5.12}$$

$$\left. \frac{\partial E}{\partial u} \right|_{u = z_j^{m}} \tag{5.13}$$

3. Finally, an adjustment to the connection weights is made according to

$$\text{new } w_{ij}^{m-1} = w_{ij}^{m-1} + \gamma \sum_{n=1}^{T} \delta_i^{m}(n)\, x_j^{m-1}(n) \tag{5.14}$$
After every such epoch, the error is computed as

$$E = \sum_{n=1,\dots,T} \lVert d(n) - y(n) \rVert^2 = \sum_{n=1,\dots,T} E(n) \tag{5.15}$$

Training stops when the error falls below a predetermined threshold or when the number of iterations exceeds a predetermined maximum.
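The forward pass, the backward pass and the batch update (Eqs. 5.9 to 5.15) can be sketched for a layered (already unrolled) network as below. This is an illustrative sketch under stated assumptions: the activation f is taken to be tanh, biases are omitted, `weights[m]` is assumed to map layer m to layer m + 1, and the function names are hypothetical rather than taken from the implementation used in this work.

```python
import numpy as np

def forward(weights, x):
    """Eq. 5.9: activations of every layer for one input vector x."""
    activations = [x]
    for W in weights:
        activations.append(np.tanh(W @ activations[-1]))
    return activations

def backward(weights, activations, d):
    """Eqs. 5.10 and 5.11: error propagation terms for each non-input layer."""
    y = activations[-1]
    deltas = [(d - y) * (1.0 - y ** 2)]    # output layer; tanh'(z) = 1 - tanh(z)^2
    for m in range(len(weights) - 1, 0, -1):
        x_m = activations[m]
        # hidden layer m: back-propagate the deltas of the layer above
        deltas.insert(0, (weights[m].T @ deltas[0]) * (1.0 - x_m ** 2))
    return deltas                          # deltas[m] belongs to activations[m + 1]

def train_epoch(weights, X, D, gamma):
    """Eq. 5.14 (batch update, in place) and Eq. 5.15 (error before the update)."""
    updates = [np.zeros_like(W) for W in weights]
    E = 0.0
    for x, d in zip(X, D):
        acts = forward(weights, x)
        deltas = backward(weights, acts, d)
        for m, delta in enumerate(deltas):
            updates[m] += np.outer(delta, acts[m])
        E += float(np.sum((d - acts[-1]) ** 2))
    for W, U in zip(weights, updates):
        W += gamma * U
    return E
```

Calling `train_epoch` repeatedly until the returned error falls below a threshold, or until an epoch limit is hit, reproduces the stopping rule described above.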
The basic gradient descent approach (and its back-propagation implementation) is notorious for slow convergence, because the learning rate γ must typically be chosen small to avoid instability. Many speed-up techniques are described in the literature (Plagianakos et al., 2001), e.g. dynamic learning rate adaptation schemes. One of the problems with this method is the selection of a suitable network topology (number and size of hidden layers), which can be addressed using prior information, systematic search and/or intuition. Like all gradient-descent techniques on error surfaces, back-propagation finds only a local error minimum. This problem can be mitigated by adding noise during training to avoid getting stuck in poor minima, by repeating the entire learning process from different initial weight settings, or by using task-specific prior information to start from an already plausible set of weights. The method is easy to implement, yet considerable expertise and experience are required for good results in non-trivial tasks.

An extended implementation of BPTT was developed in order to compare the training methodologies used in this research effort. It makes use of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and is similar to Nawi et al. (2006). BFGS is an iterative method for non-linear optimisation problems. It is a quasi-Newton method that approximates the Hessian matrix through rank-two updates built from gradient evaluations, so second-order information is used without computing second derivatives. The gradient is calculated using a BPTT technique based on Werbos (1990). Experimental results using this algorithm are shown in Chapter 6.
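To illustrate how BFGS builds its Hessian approximation from gradients alone, the following sketch implements the standard inverse-Hessian update with a simple backtracking line search. It is a hedged example, not the implementation used in this work: the function name `bfgs` and the callback `f_and_grad` are hypothetical, and in the setting described above the callback would return the network error and its BPTT gradient, whereas any differentiable function can be used to try the sketch.

```python
import numpy as np

def bfgs(f_and_grad, w0, max_iter=200, tol=1e-8):
    """Minimise f with BFGS; f_and_grad(w) must return (value, gradient)."""
    w = np.asarray(w0, dtype=float)
    H = np.eye(w.size)                     # inverse-Hessian approximation
    f, g = f_and_grad(w)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                         # quasi-Newton search direction
        step = 1.0
        while True:                        # backtracking (Armijo) line search
            w_new = w + step * p
            f_new, g_new = f_and_grad(w_new)
            if f_new <= f + 1e-4 * step * (g @ p) or step < 1e-12:
                break
            step *= 0.5
        s, y = w_new - w, g_new - g
        sy = s @ y
        if sy > 1e-12:                     # curvature condition keeps H positive definite
            rho = 1.0 / sy
            I = np.eye(w.size)
            # BFGS update of H: a rank-two correction built from s and y
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        w, f, g = w_new, f_new, g_new
    return w, f
```

Only gradient differences `y` and parameter differences `s` enter the update, which is why no second derivatives of the error are required.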
RTRL
Real-Time Recurrent Learning (RTRL) is a gradient-descent method which computes the exact error gradient at every time step. This method has been analysed and reviewed in Chow et al. (1998), Williams et al. (1989) and Cios et al. (1997). Differentiating the network activations of the internal units (Eq. 5.7) and output units (Eq. 5.8) with respect to the weights shows how a weight change affects the network dynamics. The derivative of an internal or output unit with respect to a weight $w_{kl}$ is given by equation 5.16, where $w_{kl}$ ranges over all input, internal and output unit weights. Equation 5.16 constitutes an (N + L)-dimensional discrete-time linear dynamical system with time-varying coefficients, with dynamical variable 5.17.

$$\frac{\partial v_i(n+1)}{\partial w_{kl}} = f'(z_i(n)) \left[ \left( \sum_{j=1}^{N+L} w_{ij} \frac{\partial v_j(n)}{\partial w_{kl}} \right) + \delta_{ik} v_l(n) \right] \tag{5.16}$$

$$\left( \frac{\partial v_1}{\partial w_{kl}}, \dots, \frac{\partial v_{N+L}}{\partial w_{kl}} \right) \tag{5.17}$$

where i = 1, ..., N + L, $z_i(n)$ is the unit potential, $\delta_{ik}$ is the Kronecker delta, and the term $\delta_{ik} v_l(n)$ represents the effect of the weight $w_{kl}$ on the unit k. Since the initial state of the network is independent of the connection weights, the system can be initialised by setting 5.17 to 0. The computation forward in time can then be done by iterating 5.16 simultaneously with the recurrent neural network dynamics 5.7 and 5.8, and the error gradient can be calculated as

$$\frac{\partial E}{\partial w_{kl}} = 2 \sum_{n=1}^{T} \sum_{i=N}^{N+L} (v_i(n) - d_i(n)) \frac{\partial v_i(n)}{\partial w_{kl}} \tag{5.18}$$

The new weights after a complete iteration or epoch can then be computed with a standard batch gradient descent step, as shown in 5.19, where γ is the learning rate.

$$\text{new } w_{kl} = w_{kl} - \gamma \frac{\partial E}{\partial w_{kl}} \tag{5.19}$$

An alternative update scheme is gradient descent on the current output error at each time step (Eq. 5.20). Note that $w_{kl}$ is assumed to be constant in this derivation, so a small learning rate must be used. Equation 5.20 is known as real-time recurrent learning.

$$w_{kl}(n+1) = w_{kl}(n) - \gamma \sum_{i=1}^{L} (v_i(n) - d_i(n)) \frac{\partial v_i(n)}{\partial w_{kl}} \tag{5.20}$$
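The sensitivity recursion and the accumulated gradient can be sketched as follows. This is a minimal sketch under explicit assumptions, since Eqs. 5.7 and 5.8 are not reproduced here: the network is taken to be v(n+1) = tanh(W v(n) + w_in u(n)) with a scalar input, the last `n_out` units are treated as outputs, only the recurrent matrix `W` is differentiated, and all names are hypothetical.

```python
import numpy as np

def rtrl_gradient(W, w_in, u, d, n_out):
    """Return (E, dE/dW) for the assumed model v(n+1) = tanh(W v(n) + w_in u(n))."""
    M = W.shape[0]
    v = np.zeros(M)
    P = np.zeros((M, M, M))                # P[i, k, l] = dv_i/dw_kl, zero at n = 0
    grad = np.zeros_like(W)
    eye = np.eye(M)                        # Kronecker delta
    E = 0.0
    for n in range(len(u)):
        v_new = np.tanh(W @ v + w_in * u[n])
        fprime = 1.0 - v_new ** 2          # f'(z_i(n)) for tanh
        # Eq. 5.16: sum_j w_ij * dv_j/dw_kl plus the term delta_ik * v_l(n)
        P = fprime[:, None, None] * (np.einsum("ij,jkl->ikl", W, P)
                                     + eye[:, :, None] * v[None, None, :])
        v = v_new
        err = v[-n_out:] - d[n]            # output error at this time step
        E += float(err @ err)
        # Eq. 5.18: accumulate 2 * sum_i err_i * dv_i/dw_kl over the output units
        grad += 2.0 * np.einsum("i,ikl->kl", err, P[-n_out:])
    return E, grad
```

The array `P` has one entry per unit and per weight, which makes the cost of each time step visible: the `einsum` over `W` and `P` touches every unit for every weight.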
In RTRL the computational cost of each update step is very high, because an (N + L)-dimensional system (Eq. 5.16) must be solved for each of the weights; this method is therefore recommended only for small networks.

An implementation of the Levenberg–Marquardt (LM) algorithm was made as a comparison of training methodologies for RNNs. LM is a second-order optimisation method that uses the Jacobian matrix to approximate the Hessian matrix; here the Jacobian matrix is calculated using the RTRL method from Williams et al. (1989). The LM algorithm (Levenberg et al., 1944; Marquardt, 1963) is used to solve non-linear least-squares problems and is also known as the damped least-squares (DLS) method.
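The damped least-squares step at the core of LM can be sketched as below. This is an illustrative sketch only: the names `levenberg_marquardt`, `residual` and `jacobian` are hypothetical, the damping schedule (multiply or divide `lam` by 10) is one common choice rather than the one used in this work, and the Jacobian is supplied by the caller, whereas in the setting described above it would come from RTRL.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, w0, lam=1e-2, max_iter=100, tol=1e-10):
    """Minimise sum(residual(w)**2); jacobian(w)[i, j] = d residual_i / d w_j."""
    w = np.asarray(w0, dtype=float)
    r = residual(w)
    cost = float(r @ r)
    for _ in range(max_iter):
        J = jacobian(w)
        A = J.T @ J                        # Jacobian-based approximation of the Hessian
        g = J.T @ r                        # gradient of the cost (up to a factor of 2)
        step = np.linalg.solve(A + lam * np.eye(w.size), -g)
        r_new = residual(w + step)
        cost_new = float(r_new @ r_new)
        if cost_new < cost:                # accept: reduce damping (closer to Gauss-Newton)
            converged = cost - cost_new < tol
            w, r, cost = w + step, r_new, cost_new
            lam *= 0.1
            if converged:
                break
        else:                              # reject: increase damping (closer to gradient descent)
            lam *= 10.0
    return w, cost
```

For example, fitting the two parameters of a curve a·exp(b·x) to samples only requires passing the residual vector and its two-column Jacobian.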