Hidden layer problem

(1)

Backpropagation learning

MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung

(2)

Simple vs. multilayer

perceptron

(3)

Hidden layer problem

• Radical change for the supervised learning problem.

• No desired values for the hidden layer.

• The network must find its own hidden layer activations.

(4)

Generalized delta rule

• Delta rule only works for the output layer.

• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers

(5)

Outline

• The algorithm

• Derivation as a gradient algoritihm

• Sensitivity lemma

(6)

Multilayer perceptron

• L layers of weights and biases

• L+1 layers of neurons

x

^{0 W}

⎯ → ⎯

¹^,b¹

x

^{1 W}

⎯ → ⎯

² ^,b²

L

^W

⎯ → ⎯

^L^,b^L

x

^L

x

_i^l

= f W

_ij^l

x

_j^l⁻¹

+ b

_i^l

j=1 n_l−1

⎛ ∑

⎝ ⎜ ⎞

⎠

(7)

Reward function

• Depends on activity of the output layer only.

• Maximize reward with respect to weights and biases.

R x ( )

^L

(8)

Example: squared error

• Square of desired minus actual output, with minus sign.

R x ( )

^L

^{= −} ¹

2 ( d

_i

− x

_i^L

)

²

i=1 n_L

∑

(9)

Forward pass

For l = 1 to L,

u

_i^l

= W

_ij^l

x

_j^l−1

+ b

_i^l

j=1 n_l−1

∑

x

_i^l

= f u ( )

_i^l

(10)

Sensitivity computation

• The sensitivity is also called “delta.”

s

_i^L

= ′ f u ( )

_i^L

_∂ ^∂ ^R

x

_i^L

= ′ f u ( )

_i^L

( ^d

ⁱ

⁻ ^x

ⁱ^L

)

(11)

Backward pass

for l = L to 2

s

_j^l⁻¹

= ′ f u ( )

_j^l⁻¹

^s

ⁱ^l

i=1 n_l

∑ ^W

^ij^l

(12)

Learning update

• In any order

∆W

_ij^l

= η s

_i^l

x

_j^l⁻¹

∆b

_i^l

= η s

_i^l

(13)

Backprop is a gradient update

• Consider R as function of weights and biases.

∂ R

∂ W

_ij^l

= s

_i^l

x

_j^l−1

∂ R

∂ b

_i^l

= s

_i^l

∆W

_ij^l

= η ∂ ^R

∂ W

_ij^l

∆b

_i^l

= η ∂ ^R

∂ b

_i^l

(14)

Sensitivity lemma

• Sensitivity matrix = outer product

– sensitivity vector – activity vector

• The sensitivity vector is sufficient.

• Generalization of “delta.”

∂ R

∂ W

_ij^l

= ∂ R

∂ b

_i^l

x

_j^l−1

(15)

Coordinate transformation

u

_i^l

= W

_ij^l

f u ( )

_j^l−1

⁺ ^b

ⁱ^l

j=1 n_l₋₁

∑

∂ u

_i^l

∂ u

_j^l⁻¹

= W

_ij^l

f u ′ ( )

_j^l⁻¹

∂ R

∂ u

_i^l

= ∂ R

∂ b

_i^l

(16)

Output layer

x

_i^L

= f u ( )

_i^L

u

_i^L

= W

_ij^L

x

_j^L⁻¹

+ b

_i^L

j

∑

∂ R

∂ b

_i^L

= ′ f u ( )

_i^L

_∂ ^∂ ^R

x

_i^L

(17)

Chain rule

• composition of two functions

u

_l−1

→ u

_l

→ R u

_l−1

→ R

∂ R

∂ u

_j^l⁻¹

= ∂ R

∂ u

_i^l

∂ u

_i^l

∂ u

_j^l⁻¹

i

∑

∂ R

∂ b

_j^l⁻¹

= ∂ R

∂ b

_i^l

W

_ij^l

f u ′ ( )

_j^l⁻¹

i

∑

(18)

Computational complexity

• Naïve estimate

– network output: order N

– each component of the gradient: order N – N components: order N²

• With backprop: order N

(19)

Biological plausibility

• Local: pre- and postsynaptic variables

• Forward and backward passes use same weights

• Extra set of variables

x

_j^l^{−1 W}^ij

l

⎯ → ⎯ x

_i^l

, s

_j^l^{−1 W}^ij

l

← ⎯

⎯ s

_i^l

(20)

Backprop for brain modeling

• Backprop may not be a plausible account of learning in the brain.

• But perhaps the networks created by it are similar to biological neural networks.

• Zipser and Andersen:

– train network

– compare hidden neurons with those found in the brain.

(21)

LeNet

• Weight-sharing

• Sigmoidal neurons

• Learn binary outputs

(22)

Machine learning revolution

• Gradient following

– or other hill-climbing methods

• Empirical error minimization

Hidden layer problem

Backpropagation learning

Simple vs. multilayer

perceptron