Aprendizaje Supervisado
Aprendizaje Supervisado
Artificial Neural Networks
Introduction
Introduction
`
General, robust, and practical method for learning
` Real values ` Real values
` Discrete values ` Vector values
`
Inspired by biological learning systems
` Very complex webs of interconnected neurons ` Very complex webs of interconnected neurons. ` Human brain
` Network 1011 neurons
E h d 104
` Each connected on average to 104 neurons.
` Neuron switching times 10-3 seconds (very slow compared to computers)
` 10-1 second to recognize a familiar face (very fast compared to ` 10-1 second to recognize a familiar face (very fast compared to
computers)
Sequence of neuron firings cannot be more than a few hundreds due to
Typical neural network representation
Typical neural network representation
( )1
1 1
w
( )1
2 1
w
( )2
1 1
w
( )2
2 1 w 1 x 2 x yˆ
( )1
1
w
( )1
1 r
w
( )1
1 2
w
( )1
2 2
w
( )2
1
w
( )2
1
t
w
( )2
1 2
w
( )2
2 2
w
( )3
1 1
w
( )3
1 2 w 3 x y
( )1
2 r
w
( )
2
( )2
2 t
w
( ) 2
( )3
1 t w ( ) 2 C d n
x 1( )1
n
w
( )1
2 n
w
( )1
r n
w
( )2
1 t
w
( )2
2 r
w
( )2
t r w ( ) t Capa de
Appropriate problems for ANN learning
Appropriate problems for ANN learning
` Artificial neural networks are suitable for noisy and complex sensor data.
` Characteristics:
` Instances are represented by many attribute-value pairs.
` Target function output may be discrete-valued, real-valued, or vector of several
real or discrete valued attributes.
` Training examples may contain errors.
` Long training times are acceptable.
` Long training times are acceptable.
` Fast evaluation of learned target function may be required.
Perceptrons
p
Perceptron
Perceptron
` Input: vector of real-values.
` Calculates a linear combination of these inputs
` Output:
` 1 if the result is greater than some threshold ` 1 if the result is greater than some threshold ` -1 otherwise
` Equation:
(
)
⎧ + + + + >= 1 if w w x w x L w x 0
Perceptron
Perceptron …
` w0 is usually negative (- w0) as it is a threshold that must be surpassed to in order for the perceptron to output 1.
` If we add an additional constant X0 = 1 then we can write: 0
0 >
∑
=n
i wixi
` Or in vector form:
⋅
>
0
→ →
x
w
⎞
⎛
→ → →` So the perceptron function can be written:
⎟
⎠
⎞
⎜
⎝
⎛ ⋅
=
→ → →x
w
x
o
(
)
sgn
⎧ 1 if y > 0
` where: ⎩ ⎨ ⎧ − > = otherwise y if y 1 0 1 ) sgn(
` So the space of candidate hypothesis considered in perceptron learning are the set of weights:
( ) ⎭ ⎬ ⎫ ⎩ ⎨ ⎧ ∈ ℜ
= → → n +1
Representational power of perceptrons
Representational power of perceptrons
`
Represents hyperplane decision surfaces in n-dimensional
spaces.
` 1 for instances on one side of the hyperplane. ` -1 for instances on the other side.
Representational
Representational …
`
A single perceptron can be used to represent many boolean
functions
functions.
` Usually 1 if TRUE and -1 if FALSE.
` All primitive booleans functions: AND, OR, NAND, NOR.p , , ,
` Very easy if all input weights all set to the same value (e.g. 1) and setting the threshold w0 accordingly.
`
Every
boolean function can be represented by some network
of interconnected units
based on
primitive boolean functions
of interconnected units
based on
primitive boolean functions
` SOP or POS (two levels deep).`
An input to a boolean function can be negated by
changing
Perceptron training rule
Perceptron training rule
1.Begin with random weights.
2.
Apply iteratively the perceptron to each training example.
1 M dif erce tr n ei hts hene er it misclassifies an e am le 1. Modify perceptron weights whenever it misclassifies an example. 2. Use the perceptron training rule:
where: i i i
w w
w ← + Δ
(
)
ii t o x
w = −
Δ η
` t is the target output for current training example ` o is the output generated by the perceptron
`
η
is the learning rate – moderates degree to which weights change ` is the learning rate – moderates degree to which weights changeusually a small value (0.1) and sometimes is made to decay as iterations
increase.
η
3.
Repeat step 2 until perceptron classifies all training examples
Perceptron training rule
Perceptron training rule
`
Converges toward succesful weight values within a finite
number of applications of the perceptron training rule.
`
Provided:
` Training examples are linearly separable
Gradient descent and Delta rule
Gradient descent and Delta rule
`
Converges toward a best-fit approximation to the target
function when training examples are not linearly
separable.
`
Uses the
gradient descent
to search the hypothesis
space of possible weight vectors.
`
This rule is the basis of the
Backpropagation
algorithm
Gradient descent and Delta rule
Gradient descent and Delta rule …
` Unthresholded perceptron
→ → →
` Linear unit: correspond to the first stage of perceptron
→ → →
⋅
=
w
x
x
o
(
)
first stage of perceptron without the threshold.
` Training error ` Training error
` Of a hypothesis relative to the training examples
D f l
∑
∈−
≡
D d d do
t
w
E
(
)
22
1
)
(
` D: set of training examples ` td: target output
Gradient descent and Delta rule
Gradient descent and Delta rule …
`
Gradient descent search determines a weight vector that
minimizes
E
by:
1. Starting with an arbitrary initial weight vector.
2. Repeatedly modifying it in small steps.
3. At each step, the weight vector is altered in the direction that
produces the steepest descent along the error surface produces the steepest descent along the error surface.
Gradient descent and Delta rule
Gradient descent and Delta rule …
`
Gradient of
E
with respect to is:
` Vector that specifies the direction that produces the steepest increase
)
(
→
∇
E
w
→
w
` Vector that specifies the direction that produces the steepest increase in E.
T i i l f
di
d
i
→ →Δ
→`
Training rule for
gradient descent
is:
where:
Δ
+
←
w
w
w
)
(
→∇
−
=
Δ
w
η
E
w
`
For the component form it can be rewritten as:
)
(
∇
Δ
w
η
E
w
→ → → ` where: i i
i
w
w
w
→ → →Δ
+
←
∑
∈−
=
Δ
D d id d di
t
o
x
w
η
(
)
∈DGradient descent and Delta rule
Gradient descent and Delta rule …
Gradient-Descent (training_examples, )
η
→ →Each training example is a pair of the form , where is
the vector of input values, and
t
is the target output value, is
the learning rate.
t
x,
x
η
the learning rate.
1.
Initialize each
w
ito some small random value.
2.Until the termination condition is met, Do
,
1. Initialize each to zero.
2. For each in training_examples, Do
I h i h i d h
i w Δ t x, → →
1. Input the instance to the unit and compute the output o 2. For each linear unit weight wi, Do
x i x o t wi
wi ← Δ + ( − )
Δ
η
3. For each linear unit weight wi, Do
Δ
+
←
i x o t wiwi ← Δ + ( )
Δ
η
Gradient descent and Delta rule
Gradient descent and Delta rule …
`
Advantages
` Search strategy for large or infinite hypothesis space.
` Hypothesis space contains continously parameterized hypothesis (like
ei hts in a linear nit) weights in a linear unit)
` Error can be differentiated with respect to these hypothesis
parameters.
`
Disadvantages
` Converging to a local minimun can be quite slow – thousand of
descent steps.
` No guarantee of finding the global minimum with multiple
Gradient descent and Delta rule
Gradient descent and Delta rule …
`
Stochastic gradient descent (incremental gradient descent)
` Alternative to allieviate disadvanteges of gradient descent ` Alternative to allieviate disadvanteges of gradient descent.
` Instead of updating weigths after summing over all training examples,
i ht d t d ft h t i i l i t d
weights are updated after each training example is presented.
` Delta rule (LMS rule, Adaline rule, Widrow-Hoff rule):
i
x
o
t
wi
←
(
−
)
Δ
η
` So the error is defined over each individual training example:
1
→
2
)
(
2
1
)
(
d dd
w
t
o
E
=
−
Gradient descent and Delta rule
Gradient descent and Delta rule …
Stochastic-Gradient-Descent (training_examples, )
→η
→Each training example is a pair of the form , where is
the vector of input values, and
t
is the target output value, is
the learning rate.
t
x,
x
η
the learning rate.
1.
Initialize each
w
ito some small random value.
2.Until the termination condition is met, Do
,
1. Initialize each to zero.
2. For each in training_examples, Do
I h i h i d h
i
w
Δ
t x,
→
→
1. Input the instance to the unit and compute the output o 2. For each linear unit weight wi, Do
x
t i
i wi (t o)xi
Gradient descent and Delta rule …
Perceptron training rule
D lt l
Perceptron training rule
Delta rule
`
Based on error in the
h
h ld d
`Based on error in the
h
h ld d
thresholded
perceptron
output.
unthresholded
output.
perceptron
`
Converges after a finite
number of iterations.
`Converges asymptotically,
probably requiring
`
Hypothesis perfectly
p
y q
g
unbounded time.
classifies training data
provided training examples
are linearly separable.
`
Converges regardless of
training data being linearly
separable.
Multilayer networks and the
y
Backpropagation algorithm
The sigmoid threshold unit
The sigmoid threshold unit
` Basis for constructing multilayer networks ` Linear units produce linear functions even with
multilayer networks.
(
)
→ →
⋅
=
w
x
o
σ
multilayer networks.
` Sigmoid units are capable of representing nonlinear functions.
)
(
w
x
o
σ
1
functions.
` Provides a smooth differentiable threshold function.
It is has a very simple derivative.
Output is between 0.0 and 1.0
y
e
y
o
−+
=
1
1
)
Backpropagation algorithm
Backpropagation algorithm
`
Learns the weights of a multilayer network.
`
Employs gradient descent to minimize squared error
p y g
q
between network’s output and target.
→
1
`
Error
E
is redefined as:
∑ ∑
∈ ∈ →
−
=
D
d k outputs
kd
kd
o
t
w
E
(
)
22
1
)
(
`
Hypothesis space is defined over all possible values of
Backpropagation(training_examples, ,nin, nout, nhidden) – stochastic version
Each training example is a pair of the form , where is the vector of network input values, and is the vector of target network values
→ → t x, → η → x
and is the vector of target network values.
nin is the number of network inputs, nhidden the number of units in the hidden layer, and nout the number of output units.
The input from unit i into unit j is denoted xji, and the weight from unit i to unit j is denoted wji.
t
j j
` Create a feed-forward network with nin inputs, nhidden units, and nout units. ` Initialize all network weights to small random numbers (e g -0 5 to 0 5) ` Initialize all network weights to small random numbers (e.g. 0.5 to 0.5). ` Until the termination condition is met, Do
` For each in training_examples, Do
` Propagate the input forward through the network:
→ →
t x,
` Propagate the input forward through the network:
1. Input the instance to the network and compute the output ouof every unit uin the network.
` Propagate the errors backward through the network:
→ x ) )( ( δ
2. For each network output unit k, calculate its error term
3. For each hidden unit h, calculate its error term
) )(
1
( k k k
k
k ←o −o t −o
δ
∑
∈ − ← outputs k k kh h hh o o w δ
δ (1 )
4. Update each network weight wji
where
ji ji
ji w w
w ← +Δ
x w =ηδ
Backpropagation algorithm
Backpropagation algorithm …
`
The true gradient of
E
(non-stochastic) can be obtained by
performing before updating weight values
∑
performing before updating weight values.
`
Number of iterations can be thousands of times
∑
∈D d
ji jx
δ
`
Number of iterations can be thousands of times.
`
Termination conditions:
`Termination conditions:
` Fixed number of iterations.
` Error on training examples falls below a threshold.
` Error on a validation set of examples meets a criterion.
` Important: ` Important:
Backpropagation algorithm
Backpropagation algorithm …
`
Momentum
` Variation of backpropagation (most common).
` Weight update depends partially on previous iteration.
( )
=
+
Δ
(
1
)
Δ
w
( )
n
=
ηδ
x
+
α
Δ
w
(
n
−
1
)
Δ
w
jin
ηδ
jx
jiα
w
jin
` is a constant called momentum
` Si il b ll lli d f d k i h b ll lli i
α 0 ≤α <1
` Similar to a ball rolling down a surface and keeping the ball rolling in
Backpropagation algorithm
Backpropagation algorithm …
`
Arbitrary acyclic networks
` A network of m layers and r units is computed by using:
(
)
∑
=
o
1
o
w
δ
δ
` For any acyclic network, regardless of whether the units are
(
)
∑
+ ∈−
=
11
m layer s s sr r rr
o
o
w
δ
δ
` For any acyclic network, regardless of whether the units are
arrenged in uniform layers, the following is used:
(
o
)
∑
w
o
δ
δ
1
` h Do st ea ( ) i th t f it i di t l
(
)
( )∑
∈−
=
r Downstream s s sr r rr
o
o
w
δ
δ
1
` where Downstream(r) is the set of units immediately
Backpropagation algorithm
Backpropagation algorithm …
` Convergence and local minima
` Backpropagation over multilayer networks is only guaranteed to ` Backpropagation over multilayer networks is only guaranteed to
converge toward some local minima.
` Error surface in multilayer networks can contain many local minima.
` Backpropagation is in practice highly effective.
If a local minima is found in one weight the most probable is that this is not a local minima
for other weights, allowing backpropagation to get out of the local minima.
` Gradient descent over complex error surfaces is poorly understood.
` C h i ti t id l l i i
` Common heuristics to avoid local minima are:
` Use momentum for weight update. ` Use stochastic gradient descent.
` Train multiple networks using the same data but different random weights ` Train multiple networks using the same data but different random weights.
Select network with best performance.
Backpropagation algorithm
Backpropagation algorithm …
`
Representational power
` Boolean functions ` Boolean functions
` Generate an input unit that is active only when an specific value is input to the network (AND).
` Implement an output unit as an OR ` Implement an output unit as an OR.
` Continous functions
` E b d d ti f ti b i t d ith
` Every bounded continuous function can be approximated with an arbitrarily small error by a network of two (2) layers (Cybenko, 1989).
Sigmoid units at hidden layer. Linear units at output layer. Linear units at output layer.
` Arbitrary functions
` Any function can be approximated to arbitrary accuracy by a network of ` Any function can be approximated to arbitrary accuracy by a network of
three (3) layers (Cybenko, 1988).
Backpropagation algorithm
Backpropagation algorithm …
`
Hypothesis space
` Every possible assignment of network weights.
` n-dimensional space of the n network weights.
` Continuous space
` Well-defined error gradient.
`
Inductive Bias
Backpropagation algorithm
Backpropagation algorithm …
`
Hidden layers
` Backpropagation has the ability of discovering useful
intermediate representations at the hidden units.
` Weight-tuning procedure is free to set weights that define
Backpropagation algorithm
Backpropagation algorithm …
`
Generalization, overfitting and stopping criterion
` Stopping the weight update cycle until the error E is below a
threshold is not a good solution.
B k i i ibl
` Backpropagation is succeptible to
overfitting.
` Overfitting occurs when the algorithm starts learning very ` Overfitting occurs when the algorithm starts learning very
Backpropagation algorithm
Backpropagation algorithm …
`
Generalization, overfitting and stopping criterion…
` Solutions: ` Solutions:
` Weight decay – decrease each weight by some small factor during each iteration.
` Set of validation data
One of the most succesful One of the most succesful.
Algorithm monitors error with respecto to validation set. Uses training set to update weights.
Should use the number of iterations that produces the lowest error over the Should use the number of iterations that produces the lowest error over the
validation set.
Store a set of weights for training and another with the best-performing weights so
Example: Face Recognition
Example: Face Recognition
`
Classify camera images of faces of people in various
poses.
` 20 different people, 32 images per person.
` Expression varied (happy, sad, angry neutral).
` Direction looking varied (left, right, straight, ahead). ` Wearing sun glasses or not
` Wearing sun glasses or not.
` 624 greyscale images, resolution 120x128.
Example: Face Recognition
Example: Face Recognition …
`
Possible target functions:
` Identity of person.
` Direction person facing.
` Gender of person.
` Wearing glasses? ` Long hair?
Example: Face Recognition
Example: Face Recognition …
`
On a set of 260 images, classification is 90%.
`
Design choices:
` Input encodingp g
` Extract edges? Regions? Objects?
Variable number of features
Is a problem (ANN has fixed number of inputs).
` Image of 30x32 pixels was used.
` Output encodingp g
` One of the four possible facing values.
1-of-n encoding.
More degrees of freedom to represent output.
Diff i l b b d fid l
Difference in value between outputs can be used as a confidence value.
` One unit is an option but not optimal.
Example: Face Recognition
Example: Face Recognition …
`
Design choices:
` Network graph structure:
` How many units?
One or two layers of sigmoid units. One or two layers of sigmoid units.
Some times three layers – more than 3 layers are not used in practice. Hidden layer units were 3.
Experiments with 30 units only 2% higher in accuracy. Experiments with 30 units only 2% higher in accuracy.
` How to interconnect them?
Most common: layered network with feedforward connections from every Most common: layered network with feedforward connections from every
unit in one layer to every unit in the next.
` Time in a Sun Sparc5 workstation ` Time in a Sun Sparc5 workstation
Example: Face Recognition
Example: Face Recognition …
`
Design choices:
` Learning algorithm parameters:
` Learning rate 0.3 ` Momentum: 0.3
Lower values of both parameters produced almost same results but longer
training time.
High values of both parameters, training fails to converge.
` Full gradient descent was used (not stochastic) ` Full gradient descent was used (not stochastic)
` Weights of output values were initialized to small random values.g p
Advanced topics in ANN
p
Alternative error functions
Alternative error functions
`
Adding a penalty term for weight magnitude.
` Forces to seek weights vectors with small magnitudes ` Forces to seek weights vectors with small magnitudes. ` Reduces the risk of over fitting.
∑ ∑
∑
→
E( ) 1 ( )2 2
`
Minimizing the cross entropy of the network with respect to
∑ ∑
∑
∈ ∈ + − ≡ Dd k outputs i j
ji kd
kd o w
t w E , 2 2 ) ( 2 1 ) (
γ
`
Minimizing the cross entropy of the network with respect to
the target values.
` Networks outputs probability estimates.
∑
∈−
−
+
−
D d d d dd
o
t
o
t
log
(
1
)
log(
1
)
` od is the probability estimate output by the network for training example d.
Recurrent networks
Recurrent networks
`
Recurrent networks are:
` artificial neural networks that apply to time series data and
f k i i h i h
` use outputs of network units at time t as the input to other