Backpropagation learning
MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung
Simple vs. multilayer
perceptron
Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.
Generalized delta rule
• Delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers
Outline
• The algorithm
• Derivation as a gradient algoritihm
• Sensitivity lemma
Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons
x
0 W⎯ → ⎯
1,b1x
1 W⎯ → ⎯
2 ,b2L
W⎯ → ⎯
L,bLx
Lx
il= f W
ijlx
jl−1+ b
ilj=1 nl−1
⎛ ∑
⎝ ⎜ ⎞
⎠
Reward function
• Depends on activity of the output layer only.
• Maximize reward with respect to weights and biases.
R x ( )
LExample: squared error
• Square of desired minus actual output, with minus sign.
R x ( )
L= − 1
2 ( d
i− x
iL)
2i=1 nL
∑
Forward pass
For l = 1 to L,
u
il= W
ijlx
jl−1+ b
ilj=1 nl−1
∑
x
il= f u ( )il
Sensitivity computation
• The sensitivity is also called “delta.”
s
iL= ′ f u ( )iL ∂ ∂ R
x
iL= ′ f u ( )iL ( d
i − x
iL)
Backward pass
for l = L to 2
s
jl−1= ′ f u ( )jl−1 s
il
i=1 nl
∑ W
ijlLearning update
• In any order
∆W
ijl= η s
ilx
jl−1∆b
il= η s
ilBackprop is a gradient update
• Consider R as function of weights and biases.
∂ R
∂ W
ijl= s
ilx
jl−1∂ R
∂ b
il= s
il∆W
ijl= η ∂ R
∂ W
ijl∆b
il= η ∂ R
∂ b
ilSensitivity lemma
• Sensitivity matrix = outer product
– sensitivity vector – activity vector
• The sensitivity vector is sufficient.
• Generalization of “delta.”
∂ R
∂ W
ijl= ∂ R
∂ b
ilx
jl−1Coordinate transformation
u
il= W
ijlf u ( )jl−1 + b
il
j=1 nl−1
∑
∂ u
il∂ u
jl−1= W
ijlf u ′ ( )jl−1
∂ R
∂ u
il= ∂ R
∂ b
ilOutput layer
x
iL= f u ( )iL
u
iL= W
ijLx
jL−1+ b
iLj
∑
∂ R
∂ b
iL= ′ f u ( )iL ∂ ∂ R
x
iLChain rule
• composition of two functions
u
l−1→ u
l→ R u
l−1→ R
∂ R
∂ u
jl−1= ∂ R
∂ u
il∂ u
il∂ u
jl−1i
∑
∂ R
∂ b
jl−1= ∂ R
∂ b
ilW
ijlf u ′ ( )jl−1
i
∑
Computational complexity
• Naïve estimate
– network output: order N
– each component of the gradient: order N – N components: order N2
• With backprop: order N
Biological plausibility
• Local: pre- and postsynaptic variables
• Forward and backward passes use same weights
• Extra set of variables
x
jl−1 Wijl
⎯ → ⎯ x
il, s
jl−1 Wijl
← ⎯
⎯ s
ilBackprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
– train network
– compare hidden neurons with those found in the brain.
LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs
Machine learning revolution
• Gradient following
– or other hill-climbing methods
• Empirical error minimization