• No se han encontrado resultados

Hidden layer problem

N/A
N/A
Protected

Academic year: 2023

Share "Hidden layer problem"

Copied!
22
0
0

Texto completo

(1)

Backpropagation learning

MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung

(2)

Simple vs. multilayer

perceptron

(3)

Hidden layer problem

• Radical change for the supervised learning problem.

• No desired values for the hidden layer.

• The network must find its own hidden layer activations.

(4)

Generalized delta rule

• Delta rule only works for the output layer.

• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers

(5)

Outline

• The algorithm

• Derivation as a gradient algoritihm

• Sensitivity lemma

(6)

Multilayer perceptron

L layers of weights and biases

L+1 layers of neurons

x

0 W

⎯ → ⎯

1,b1

x

1 W

⎯ → ⎯

2 ,b2

L

W

⎯ → ⎯

L,bL

x

L

x

il

= f W

ijl

x

jl−1

+ b

il

j=1 nl−1

⎛ ∑

⎝ ⎜ ⎞

(7)

Reward function

• Depends on activity of the output layer only.

• Maximize reward with respect to weights and biases.

R x ( )

L

(8)

Example: squared error

• Square of desired minus actual output, with minus sign.

R x ( )

L

= − 1

2 ( d

i

x

iL

)

2

i=1 nL

(9)

Forward pass

For l = 1 to L,

u

il

= W

ijl

x

jl−1

+ b

il

j=1 nl−1

x

il

= f u ( )

il

(10)

Sensitivity computation

• The sensitivity is also called “delta.”

s

iL

= ′ f u ( )

iL

R

x

iL

= ′ f u ( )

iL

( d

i

x

iL

)

(11)

Backward pass

for l = L to 2

s

jl−1

= ′ f u ( )

jl−1

s

il

i=1 nl

W

ijl

(12)

Learning update

• In any order

W

ijl

= η s

il

x

jl−1

b

il

= η s

il

(13)

Backprop is a gradient update

• Consider R as function of weights and biases.

R

W

ijl

= s

il

x

jl−1

R

b

il

= s

il

W

ijl

= η ∂ R

W

ijl

b

il

= η ∂ R

b

il

(14)

Sensitivity lemma

• Sensitivity matrix = outer product

– sensitivity vector – activity vector

• The sensitivity vector is sufficient.

• Generalization of “delta.”

R

W

ijl

= ∂ R

b

il

x

jl−1

(15)

Coordinate transformation

u

il

= W

ijl

f u ( )

jl−1

+ b

il

j=1 nl−1

u

il

u

jl−1

= W

ijl

f u ′ ( )

jl−1

R

u

il

= ∂ R

b

il

(16)

Output layer

x

iL

= f u ( )

iL

u

iL

= W

ijL

x

jL−1

+ b

iL

j

R

b

iL

= ′ f u ( )

iL

R

x

iL

(17)

Chain rule

• composition of two functions

u

l−1

u

l

R u

l−1

R

R

u

jl−1

= ∂ R

u

il

u

il

u

jl−1

i

R

b

jl−1

= ∂ R

b

il

W

ijl

f u ′ ( )

jl−1

i

(18)

Computational complexity

• Naïve estimate

– network output: order N

– each component of the gradient: order N N components: order N2

• With backprop: order N

(19)

Biological plausibility

• Local: pre- and postsynaptic variables

• Forward and backward passes use same weights

• Extra set of variables

x

jl−1 Wij

l

⎯ → ⎯ x

il

, s

jl−1 Wij

l

← ⎯

s

il

(20)

Backprop for brain modeling

• Backprop may not be a plausible account of learning in the brain.

• But perhaps the networks created by it are similar to biological neural networks.

• Zipser and Andersen:

– train network

– compare hidden neurons with those found in the brain.

(21)

LeNet

• Weight-sharing

• Sigmoidal neurons

• Learn binary outputs

(22)

Machine learning revolution

• Gradient following

– or other hill-climbing methods

• Empirical error minimization

Referencias

Documento similar

Cuatro lóbulos Lóbulo frontal Lóbulo poscentral Lóbulo parietal Lóbulo occipital Sistema límbico A mígdala Hipocampo Fórni x Corteza cingulada Septu m Cuerpos mamilares