Principal component analysis

(1)

Principal component analysis

MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung

(2)

Hypothesis: Hebbian synaptic plasticity enables perceptrons

to perform principal

component analysis

(3)

Outline

• Variance and covariance

• Principal components

• Maximizing parallel variance

• Minimizing perpendicular variance

• Neural implementation

– covariance rule

(4)

Principal component

• direction of maximum variance in the input space

• principal eigenvector of the covariance matrix

• goal: relate these two definitions

(5)

Variance

• A random variable fluctuating about its mean value.

• Average of the square of the fluctuations.

δ x = x − x

δ x

( )

²

⁼ ^x

²

⁻ ^x

²

(6)

Covariance

• Pair of random variables, each fluctuating about its mean value.

• Average of product of fluctuations.

δ x

₁

= x

₁

− x

₁

δ x

₂

= x

₂

− x

₂

δ x

₁

δ x

₂

= x

₁

x

₂

− x

₁

x

₂

(7)

Covariance examples

(8)

Covariance matrix

• N random variables

• NxN symmetric matrix

• Diagonal elements are variances

C

_ij

= x

_i

x

_j

− x

_i

x

_j

(9)

Principal components

• eigenvectors with k largest eigenvalues

• Now you can calculate them, but what do they mean?

(10)

Handwritten digits

(11)

Principal component in 2d

(12)

One-dimensional projection

(13)

Covariance to variance

• From the covariance, the variance of any projection can be calculated.

• Let w be a unit vector

w

^T

x

( )

²

⁻ ^w

^T

^x

²

⁼ ^w

^T

^Cw

= w

_i

C

_ij

w

_j

ij

∑

(14)

Maximizing parallel variance

• Principal eigenvector of C

– the one with the largest eigenvalue.

w

^*

= arg max

w:w =1

w

^T

Cw

λ

_max

( ) C ⁼ ^max

w:w =1

w

^T

Cw

= w

^*T

Cw

^*

(15)

Orthogonal decomposition

(16)

Total variance is conserved

• Maximizing parallel variance =

Minimizing perpendicular variance

(17)

Rubber band computer

(18)

Correlation/covariance rule

• presynaptic activity x

• postsynaptic activity y

• Hebbian

∆w = η yx

∆w = η ( y − y ) ( ^x ⁻ ^x )

(19)

Stochastic gradient ascent

• Assume data has zero mean

• Linear perceptron

y = w

^T

x ∆w = η xx

^T

w E = 1

2 w

^T

Cw C = xx

^T

∆w = η∂ ^E

∂ w

(20)

Preventing divergence

• Bound constraints

(21)

Oja’s rule

• Converges to principal component

• Normalized to unit vector

∆w = η ( yx − y

²

w )

(22)

Multiple principal components

• Project out principal component

• Find principal component of remaining variance

(23)

clustering vs. PCA

• Hebb: output x input

• binary output

– first-order statistics

• linear output

– second-order statistics

(24)

Data vectors

• x_a means ath data vector

– ath column of the matrix X.

• X_ia means matrix element X_ia

– ith component of x_a

• x is a generic data vector

• x_i means ith component of x

(25)

Correlation matrix

• Generalization of second moment

x

_i

x

_j

= 1

m X

_ia

a=1

∑

m

^X

^ja

xx

^T

= 1

m XX

^T

Principal component analysis

Principal component analysis

Hypothesis: Hebbian synaptic plasticity enables perceptrons

to perform principal

component analysis

Outline

Principal component

Variance

δ x = x − x

δ x

( )

= x

− x

Covariance

δ x

= x

− x

δ x

= x

− x

δ x

δ x

= x

x

− x

x

Covariance examples

Covariance matrix

C

= x

x

− x

x

Principal components

Handwritten digits

Principal component in 2d

One-dimensional projection

Covariance to variance

w

x

( )

− w

x

= w

Cw

= w

C

w

∑

Maximizing parallel variance

w

= arg max

w

Cw

λ

( ) C = max

w

Cw

= w

Cw

Orthogonal decomposition

Total variance is conserved

Rubber band computer

Correlation/covariance rule

∆w = η yx

∆w = η ( y − y ) ( x − x )

Stochastic gradient ascent

y = w

x ∆w = η xx

w E = 1

2 w

Cw C = xx

∆w = η∂ E

∂ w

Preventing divergence

Oja’s rule

∆w = η ( yx − y

w )

Multiple principal components

clustering vs. PCA

⁼ ^x

⁻ ^x

⁻ ^w

^x

⁼ ^w

^Cw

( ) C ⁼ ^max

∆w = η ( y − y ) ( ^x ⁻ ^x )

∆w = η∂ ^E

^X