Tema 2 - Aprendizaje Supervisado - RN.pdf

(1)

Aprendizaje Supervisado

(2)

Artificial Neural Networks

(3)

Introduction

`

General, robust, and practical method for learning

` Real values ` Real values

` Discrete values ` Vector values

`

Inspired by biological learning systems

` Very complex webs of interconnected neurons ` Very complex webs of interconnected neurons. ` Human brain

` Network 1011 neurons

E h d 104

` Each connected on average to 104 neurons.

` Neuron switching times 10-3 seconds (very slow compared to computers)

` 10-1 second to recognize a familiar face (very fast compared to ` 10-1 second to recognize a familiar face (very fast compared to

computers)

Sequence of neuron firings cannot be more than a few hundreds due to

(4)

Typical neural network representation

( )1

1 1

w

( )1

2 1

w

( )2

1 1

w

( )2

2 1 w 1 x 2 x yˆ

( )1

1

w

( )1

1 r

w

( )1

1 2

w

( )1

2 2

w

( )2

1

w

( )2

1

t

w

( )2

1 2

w

( )2

2 2

w

( )3

1 1

w

( )3

1 2 w 3 x y

( )1

2 r

w

( )

2

( )2

2 t

w

( ) 2

( )3

1 t w ( ) 2 C d n

x 1( )₁

n

w

( )1

2 n

w

( )1

r n

w

( )2

1 t

w

( )2

2 r

w

( )2

t r w ( ) t Capa de

(5)

(6)

Appropriate problems for ANN learning

` Artificial neural networks are suitable for noisy and complex sensor data.

` Characteristics:

` Instances are represented by many attribute-value pairs.

` Target function output may be discrete-valued, real-valued, or vector of several

real or discrete valued attributes.

` Training examples may contain errors.

` Long training times are acceptable.

` Fast evaluation of learned target function may be required.

(7)

Perceptrons

p

(8)

Perceptron

` Input: vector of real-values.

` Calculates a linear combination of these inputs

` Output:

` 1 if the result is greater than some threshold ` 1 if the result is greater than some threshold ` -1 otherwise

` Equation:

(

)

⎧ + + + + >

= 1 if w w x w x L w x 0

(9)

Perceptron

Perceptron …

` w₀ is usually negative (- w₀) as it is a threshold that must be surpassed to in order for the perceptron to output 1.

` If we add an additional constant X₀ = 1 then we can write: ₀

0 >

∑

=

n

i wixi

` Or in vector form:

⋅

>

₀

→ →

x

w

⎞

⎛

→ → →

` So the perceptron function can be written:

⎟

⎠

⎞

⎜

⎝

⎛ ⋅

=

→ → →

x

w

x

o

(

)

sgn

⎧ 1 if y > 0

` where: ⎩ ⎨ ⎧ − > = otherwise y if y 1 0 1 ) sgn(

` So the space of candidate hypothesis considered in perceptron learning are the set of weights:

( ) ⎭ ⎬ ⎫ ⎩ ⎨ ⎧ _∈ _ℜ

= → → n +1

(10)

Representational power of perceptrons

`

Represents hyperplane decision surfaces in n-dimensional

spaces.

` 1 for instances on one side of the hyperplane. ` -1 for instances on the other side.

(11)

Representational

Representational …

`

A single perceptron can be used to represent many boolean

functions

functions.

` Usually 1 if TRUE and -1 if FALSE.

` All primitive booleans functions: AND, OR, NAND, NOR.p , , ,

` Very easy if all input weights all set to the same value (e.g. 1) and setting the threshold w₀ accordingly.

`

Every

boolean function can be represented by some network

of interconnected units

based on

primitive boolean functions

of interconnected units

based on

primitive boolean functions

` SOP or POS (two levels deep).

`

An input to a boolean function can be negated by

changing

(12)

Perceptron training rule

1.

Begin with random weights.

2.

Apply iteratively the perceptron to each training example.

1 M dif erce tr n ei hts hene er it misclassifies an e am le 1. Modify perceptron weights whenever it misclassifies an example. 2. Use the perceptron training rule:

where: i i i

w w

w ← + Δ

(

)

_i

i t o x

w = −

Δ η

` t is the target output for current training example ` o is the output generated by the perceptron

`

η

is the learning rate – moderates degree to which weights change ` is the learning rate – moderates degree to which weights change

usually a small value (0.1) and sometimes is made to decay as iterations

increase.

η

3.

Repeat step 2 until perceptron classifies all training examples

(13)

Perceptron training rule

`

Converges toward succesful weight values within a finite

number of applications of the perceptron training rule.

`

Provided:

` Training examples are linearly separable

(14)

Gradient descent and Delta rule

`

Converges toward a best-fit approximation to the target

function when training examples are not linearly

separable.

`

Uses the

gradient descent

to search the hypothesis

space of possible weight vectors.

`

This rule is the basis of the

Backpropagation

algorithm

(15)

Gradient descent and Delta rule

Gradient descent and Delta rule …

` Unthresholded perceptron

→ → →

` Linear unit: correspond to the first stage of perceptron

→ → →

⋅

=

w

x

o

(

)

first stage of perceptron without the threshold.

` Training error ` Training error

` Of a hypothesis relative to the training examples

D f l

∑

∈

−

≡

D d d d

o

t

w

E

(

)

2

1 )

(

` D: set of training examples ` t_d: target output

(16)

Gradient descent and Delta rule

Gradient descent and Delta rule …

`

Gradient descent search determines a weight vector that

minimizes

E

by:

1. Starting with an arbitrary initial weight vector.

2. Repeatedly modifying it in small steps.

3. At each step, the weight vector is altered in the direction that

produces the steepest descent along the error surface produces the steepest descent along the error surface.

(17)

Gradient descent and Delta rule

Gradient descent and Delta rule …

`

Gradient of

E

with respect to is:

` Vector that specifies the direction that produces the steepest increase

)

(

→

∇

E

w

→

w

` Vector that specifies the direction that produces the steepest increase in E.

T i i l f

di

d

i

→ →

Δ

→

`

Training rule for

gradient descent

is:

where:

Δ

+

←

w

)

(

→

∇

−

=

Δ

w

η

E

w

`

For the component form it can be rewritten as:

)

(

∇

Δ

w

η

E

w

→ → → ` where: i i

i

w

→ → →

Δ

+

←

∑

∈

−

=

Δ

D d id d d

i

t

o

x

w

η

(

)

∈D

(18)

Gradient descent and Delta rule

Gradient descent and Delta rule …

Gradient-Descent (training_examples, )

η

_→ _→

Each training example is a pair of the form , where is

the vector of input values, and

t

is the target output value, is

the learning rate.

t

x,

_x

η

the learning rate.

1.

Initialize each

w

_i

to some small random value.

2.

Until the termination condition is met, Do

,

1. Initialize each to zero.

2. For each in training_examples, Do

I h i h i d h

i w Δ t x, → →

1. Input the instance to the unit and compute the output o 2. For each linear unit weight w_i, Do

x i x o t wi

wi ← Δ + ( − )

Δ

η

3. For each linear unit weight w_i, Do

Δ

+

←

i x o t wi

wi ← Δ + ( )

Δ

η

(19)

Gradient descent and Delta rule

Gradient descent and Delta rule …

`

Advantages

` Search strategy for large or infinite hypothesis space.

` Hypothesis space contains continously parameterized hypothesis (like

ei hts in a linear nit) weights in a linear unit)

` Error can be differentiated with respect to these hypothesis

parameters.

`

Disadvantages

` Converging to a local minimun can be quite slow – thousand of

descent steps.

` No guarantee of finding the global minimum with multiple

(20)

Gradient descent and Delta rule

Gradient descent and Delta rule …

`

Stochastic gradient descent (incremental gradient descent)

` Alternative to allieviate disadvanteges of gradient descent ` Alternative to allieviate disadvanteges of gradient descent.

` Instead of updating weigths after summing over all training examples,

i ht d t d ft h t i i l i t d

weights are updated after each training example is presented.

` Delta rule (LMS rule, Adaline rule, Widrow-Hoff rule):

i

x

o

t

wi

←

(

−

)

Δ

η

` So the error is defined over each individual training example:

1

→

2

)

(

2

1 )

(

_d _d

d

w

t

o

E

=

−

(21)

Gradient descent and Delta rule

Gradient descent and Delta rule …

Stochastic-Gradient-Descent (training_examples, )

_→

η

_→

Each training example is a pair of the form , where is

the vector of input values, and

t

is the target output value, is

the learning rate.

t

x,

_x

η

the learning rate.

1.

Initialize each

w

_i

to some small random value.

2.

Until the termination condition is met, Do

,

1. Initialize each to zero.

2. For each in training_examples, Do

I h i h i d h

i

w

Δ

t x,

→

1. Input the instance to the unit and compute the output o 2. For each linear unit weight w_i, Do

x

t i

i wi (t o)x_i

(22)

Gradient descent and Delta rule …

Perceptron training rule

D lt l

Perceptron training rule

Delta rule

`

Based on error in the

h

h ld d

`

Based on error in the

h

h ld d

thresholded

perceptron

output.

unthresholded

output.

perceptron

`

Converges after a finite

number of iterations.

`

Converges asymptotically,

probably requiring

`

Hypothesis perfectly

p

y q

g

unbounded time.

classifies training data

provided training examples

are linearly separable.

`

Converges regardless of

training data being linearly

separable.

(23)

Multilayer networks and the

y

Backpropagation algorithm

(24)

The sigmoid threshold unit

` Basis for constructing multilayer networks ` Linear units produce linear functions even with

multilayer networks.

(

)

→ →

⋅

=

w

x

o

σ

multilayer networks.

` Sigmoid units are capable of representing nonlinear functions.

)

(

w

x

o

σ

1

functions.

` Provides a smooth differentiable threshold function.

It is has a very simple derivative.

Output is between 0.0 and 1.0

y

e

y

o

₋

+

=

1

1 )

(25)

Backpropagation algorithm

`

Learns the weights of a multilayer network.

`

Employs gradient descent to minimize squared error

p y g

q

between network’s output and target.

→

₁

`

Error

E

is redefined as:

∑ ∑

∈ ∈ →

−

=

D

d k outputs

kd

o

t

w

E

(

)

2

1 )

(

`

Hypothesis space is defined over all possible values of

(26)

Backpropagation(training_examples, ,n_in, n_out, n_hidden) – stochastic version

Each training example is a pair of the form , where is the vector of network input values, and is the vector of target network values

→ → t x, → η → x

and is the vector of target network values.

n_in is the number of network inputs, n_hidden the number of units in the hidden layer, and n_out the number of output units.

The input from unit i into unit j is denoted x_ji, and the weight from unit i to unit j is denoted w_ji.

t

j j

` Create a feed-forward network with n_in inputs, n_hidden units, and n_out units. ` Initialize all network weights to small random numbers (e g -0 5 to 0 5) ` Initialize all network weights to small random numbers (e.g. 0.5 to 0.5). ` Until the termination condition is met, Do

` For each in training_examples, Do

` Propagate the input forward through the network:

→ →

t x,

` Propagate the input forward through the network:

1. Input the instance to the network and compute the output o_uof every unit uin the network.

` Propagate the errors backward through the network:

→ x ) )( ( δ

2. For each network output unit k, calculate its error term

3. For each hidden unit h, calculate its error term

) )(

1

( _k _k _k

k

k ←o −o t −o

δ

∑

∈ − ← outputs k k kh h h

h o o w δ

δ (1 )

4. Update each network weight w_ji

where

ji ji

ji w w

w ← +Δ

x w =ηδ

(27)

Backpropagation algorithm

Backpropagation algorithm …

`

The true gradient of

E

(non-stochastic) can be obtained by

performing before updating weight values

_∑

performing before updating weight values.

`

Number of iterations can be thousands of times

∑

∈D d

ji jx

δ

`

Number of iterations can be thousands of times.

`

Termination conditions:

`

Termination conditions:

` Fixed number of iterations.

` Error on training examples falls below a threshold.

` Error on a validation set of examples meets a criterion.

` Important: ` Important:

(28)

Backpropagation algorithm

Backpropagation algorithm …

`

Momentum

` Variation of backpropagation (most common).

` Weight update depends partially on previous iteration.

( )

=

+

Δ

(

1 )

Δ

w

( )

n

=

ηδ

x

+

α

Δ

w

(

n

−

1 )

Δ

w

_ji

n

ηδ

_j

x

_ji

α

w

_ji

n

` is a constant called momentum

` Si il b ll lli d f d k i h b ll lli i

α 0 ≤α <1

` Similar to a ball rolling down a surface and keeping the ball rolling in

(29)

Backpropagation algorithm

Backpropagation algorithm …

`

Arbitrary acyclic networks

` A network of m layers and r units is computed by using:

(

)

_∑

=

o

1 o

w

δ

` For any acyclic network, regardless of whether the units are

(

)

_∑

+ ∈

−

=

1

m layer s s sr r r

r

o

w

δ

` For any acyclic network, regardless of whether the units are

arrenged in uniform layers, the following is used:

(

o

)

_∑

w

o

δ

1

` h Do st ea ( ) i th t f it i di t l

(

)

( )

∑

∈

−

=

r Downstream s s sr r r

r

o

w

δ

1

` where Downstream(r) is the set of units immediately

(30)

Backpropagation algorithm

Backpropagation algorithm …

` Convergence and local minima

` Backpropagation over multilayer networks is only guaranteed to ` Backpropagation over multilayer networks is only guaranteed to

converge toward some local minima.

` Error surface in multilayer networks can contain many local minima.

` Backpropagation is in practice highly effective.

If a local minima is found in one weight the most probable is that this is not a local minima

for other weights, allowing backpropagation to get out of the local minima.

` Gradient descent over complex error surfaces is poorly understood.

` C h i ti t id l l i i

` Common heuristics to avoid local minima are:

` Use momentum for weight update. ` Use stochastic gradient descent.

` Train multiple networks using the same data but different random weights ` Train multiple networks using the same data but different random weights.

Select network with best performance.

(31)

Backpropagation algorithm

Backpropagation algorithm …

`

Representational power

` Boolean functions ` Boolean functions

` Generate an input unit that is active only when an specific value is input to the network (AND).

` Implement an output unit as an OR ` Implement an output unit as an OR.

` Continous functions

` E b d d ti f ti b i t d ith

` Every bounded continuous function can be approximated with an arbitrarily small error by a network of two (2) layers (Cybenko, 1989).

Sigmoid units at hidden layer. Linear units at output layer. Linear units at output layer.

` Arbitrary functions

` Any function can be approximated to arbitrary accuracy by a network of ` Any function can be approximated to arbitrary accuracy by a network of

three (3) layers (Cybenko, 1988).

(32)

Backpropagation algorithm

Backpropagation algorithm …

`

Hypothesis space

` Every possible assignment of network weights.

` n-dimensional space of the n network weights.

` Continuous space

` Well-defined error gradient.

`

Inductive Bias

(33)

Backpropagation algorithm

Backpropagation algorithm …

`

Hidden layers

` Backpropagation has the ability of discovering useful

intermediate representations at the hidden units.

` Weight-tuning procedure is free to set weights that define

(34)

(35)

Backpropagation algorithm

Backpropagation algorithm …

`

Generalization, overfitting and stopping criterion

` Stopping the weight update cycle until the error E is below a

threshold is not a good solution.

B k i i ibl

` Backpropagation is succeptible to

overfitting.

` Overfitting occurs when the algorithm starts learning very ` Overfitting occurs when the algorithm starts learning very

(36)

Backpropagation algorithm

Backpropagation algorithm …

`

Generalization, overfitting and stopping criterion…

` Solutions: ` Solutions:

` Weight decay – decrease each weight by some small factor during each iteration.

` Set of validation data

One of the most succesful One of the most succesful.

Algorithm monitors error with respecto to validation set. Uses training set to update weights.

Should use the number of iterations that produces the lowest error over the Should use the number of iterations that produces the lowest error over the

validation set.

Store a set of weights for training and another with the best-performing weights so

(37)

Example: Face Recognition

`

Classify camera images of faces of people in various

poses.

` 20 different people, 32 images per person.

` Expression varied (happy, sad, angry neutral).

` Direction looking varied (left, right, straight, ahead). ` Wearing sun glasses or not

` Wearing sun glasses or not.

` 624 greyscale images, resolution 120x128.

(38)

Example: Face Recognition

Example: Face Recognition …

`

Possible target functions:

` Identity of person.

` Direction person facing.

` Gender of person.

` Wearing glasses? ` Long hair?

(39)

Example: Face Recognition

Example: Face Recognition …

`

On a set of 260 images, classification is 90%.

`

Design choices:

` Input encodingp g

` Extract edges? Regions? Objects?

Variable number of features

Is a problem (ANN has fixed number of inputs).

` Image of 30x32 pixels was used.

` Output encodingp g

` One of the four possible facing values.

1-of-n encoding.

More degrees of freedom to represent output.

Diff i l b b d fid l

Difference in value between outputs can be used as a confidence value.

` One unit is an option but not optimal.

(40)

Example: Face Recognition

Example: Face Recognition …

`

Design choices:

` Network graph structure:

` How many units?

One or two layers of sigmoid units. One or two layers of sigmoid units.

Some times three layers – more than 3 layers are not used in practice. Hidden layer units were 3.

Experiments with 30 units only 2% higher in accuracy. Experiments with 30 units only 2% higher in accuracy.

` How to interconnect them?

Most common: layered network with feedforward connections from every Most common: layered network with feedforward connections from every

unit in one layer to every unit in the next.

` Time in a Sun Sparc5 workstation ` Time in a Sun Sparc5 workstation

(41)

Example: Face Recognition

Example: Face Recognition …

`

Design choices:

` Learning algorithm parameters:

` Learning rate 0.3 ` Momentum: 0.3

Lower values of both parameters produced almost same results but longer

training time.

High values of both parameters, training fails to converge.

` Full gradient descent was used (not stochastic) ` Full gradient descent was used (not stochastic)

` Weights of output values were initialized to small random values.g p

(42)

(43)

Advanced topics in ANN

p

(44)

Alternative error functions

`

Adding a penalty term for weight magnitude.

` Forces to seek weights vectors with small magnitudes ` Forces to seek weights vectors with small magnitudes. ` Reduces the risk of over fitting.

∑ ∑

∑

→

E( ) 1 ( )2 2

`

Minimizing the cross entropy of the network with respect to

∑ ∑

∑

∈ ∈ + − ≡ D

d k outputs i j

ji kd

kd o w

t w E , 2 2 ) ( 2 1 ) (

γ

`

Minimizing the cross entropy of the network with respect to

the target values.

` Networks outputs probability estimates.

∑

∈

−

+

−

D d d d d

d

o

t

o

t

log

(

1 )

log(

1 )

` o_d is the probability estimate output by the network for training example d.

(45)

Recurrent networks

`

Recurrent networks are:

` artificial neural networks that apply to time series data and

f k i i h i h

` use outputs of network units at time t as the input to other

(46)