MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung
Backprop for recurrent
networks
Steady state
• Reward is an explicit function of x.
• The steady state of a recurrent network.
x
i= f W
ijx
j+ b
ij
∑
max
W ,bR x ( )
Recurrent backpropagation
• Find steady state
x = f ( Wx + b )
• Calculate slopes
D = diag { f ′ ( Wx + b ) }
∂ R
−1
− W
T) s =
• Solve for s
( D
∂ x
• Weight update
ΔW = η sx
TSensitivity lemma
∂ R ∂ R
= x
∂ W
ij∂ b
i jInput as a function of output
• What input b is required to make x a steady state?
b
i= f
−1( x
i) − ∑ W x
ij jj
• This is unique, even when output is not a unique function of the input!
Jacobian matrix
b
i= f
−1( x
i) − ∑ W x
ij jj
∂ b
ix
= f
−1′ ( ) δ − W
∂ x
j i ij ij= ( D
−1− W )
ijChain rule
∂ R
= ∑ ∂ R ∂ b
i∂ x
j i∂ b
i∂ x
j= ∑ ∂ R ( D
−1− W )
∂ b
iji i
∂ R
∂ R
= ( D
−1− W
T)
∂ x ∂ b
Trajectories
• Initialize at x(0)
• Iterate for T time steps
x
i( ) t = f W
ijx
j( t − 1 ) + b
ij
∑
max
W ,bR x ( ( ) 1 ,K , x T ( ) )
Unfold time into space
• Multilayer perceptron
– Same number of neurons in each layer – Same weights and biases in each layer
(weight-sharing)
( )
W ,b( )
W ,b W ,b( )
x 0 ⎯ ⎯→ x 1 ⎯ ⎯→L ⎯ ⎯→ x T
Backpropagation through time
• Initial condition x(0)
x ( t ) = f ( Wx ( t − 1 ) + b ( t ) )
• Compute
R / ∂ x ( t )
• Final condition s(T+1)=0
∂ R
( ) = D t W
Ts t + 1 ) + D t
s t ( ) ( ( )
∂ x t ( )
Δ W = η ∑ s t ( ) x t ( − 1 )
TΔ b = η ∑ s t ( )
t t
Input as a function of output
x ( t ) = f ( Wx ( t − 1 ) + b ( t ) )
( ) = f
−1( ( ) ) − Wx t − 1 )
b t x t (
x ( 1 ) , x ( 2 ) ,K , x ( T − 1 ) , x ( T )
b ( 1 ) , b ( 2 ) ,K , b ( T − 1 ) , b ( T )
Jacobian matrix
( ) = f
−1( x t (
t
b t ( ) ) − Wx t − 1 )
−1
t
∂ b
i( )
= δ
tt'( D ( ) ) − W δ
∂ x
j( ) t '
ij ij t−1,t'( ) = diag { f Wx t − 1 ) + b t
D t ( ( ( ) ) }
Chain rule
∂ R
∂ x
j( ) t ′ =
∂ R
∂ b
i( ) t
∂ b
i( ) t
∂ x
j( ) t ′
i,t ′
∑
= s
i( ) t ′
i
∑ ( D−1( ) t ′ )
ij − s
i ( t ' +1 ) W
ij
i
∑
∂ R
∂ x t ( ) = D
−1