
If the analysed signal consists of a superposition of features at arbitrary locations, the model used to learn these features must have enough free parameters to represent them. In general, this means that at least one feature has to be learned for each feature present. In the standard sparse coding model, however, features have to be learned at all possible shifts, so the number of features to be learned is much larger than the number of features in the signal. If the standard sparse coding model does not have enough free parameters to represent the features in the signal, not all features are learned; instead, some learned features have to model more than one feature in the observation.
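To make the parameter count concrete, the following minimal sketch builds the dictionary a standard (non shift-invariant) model would need if every feature had to be available at every shift. The function name `all_shift_dictionary`, the use of circular shifts via `np.roll`, and the sizes chosen are assumptions made for illustration only; they are not taken from the text.

```python
import numpy as np

def all_shift_dictionary(features):
    """Stack every circular shift of every feature as a separate atom.

    features : array of shape (num_features, length)
    returns  : array of shape (num_features * length, length)
    """
    num_features, length = features.shape
    atoms = [np.roll(f, shift) for f in features for shift in range(length)]
    return np.stack(atoms)

# Hypothetical example: 3 underlying features of length 16.
features = np.random.default_rng(0).standard_normal((3, 16))
dictionary = all_shift_dictionary(features)

# A shift-invariant model would learn 3 features; the standard model
# needs one atom per feature and shift, i.e. 3 * 16 atoms.
num_atoms_standard = dictionary.shape[0]
```

With these assumed sizes, the standard model needs 16 times as many atoms as there are underlying features, which is the gap the paragraph above refers to.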

In this section we study the influence of the number of features used in the traditional sparse coding model when this number is smaller than the number of features in the signal. We assume here that the observed signal follows the model $x = \sum_k a_k s_k + \epsilon$. $\hat{a}_{\hat{k}}$ and $\hat{s}_{\hat{k}}$ are used to denote the learned features and their coefficients. We use $k$ to index the features and the associated shifts of the underlying process, while $\hat{k}$ indexes the learned features.

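The expected ML estimate of a feature $\hat{a}_{\check{k}}$ w.r.t. the distribution of the data, i.e. w.r.t. the distribution of $\epsilon$ and $s$, is the value for which the expected gradient is zero. We can write this expected gradient as:

\[
\langle \Delta \hat{a}_{\check{k}} \rangle_{p(\epsilon, s)} = \left\langle \mu \sigma_\epsilon^{-2} \int \Big( \sum_k a_k s_k - \sum_{\hat{k}} \hat{a}_{\hat{k}} \hat{s}_{\hat{k}} + \epsilon \Big) \hat{s}_{\check{k}} \, p(\hat{s} \mid x, \hat{A}) \, d\hat{s} \right\rangle_{p(\epsilon, s)} .
\]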
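As a rough illustration of the inner integral in this gradient, the sketch below replaces the integral over the posterior of the learned coefficients with an average over samples. The function name `expected_gradient`, its argument layout, and the assumption that posterior samples are already available are all hypothetical; the text does not specify how such samples would be obtained.

```python
import numpy as np

def expected_gradient(x, A_hat, s_hat_samples, k_check, mu=1.0, sigma_eps=1.0):
    """Monte Carlo estimate of the gradient for the learned feature k_check.

    x             : observed signal, shape (dim,)
    A_hat         : learned features as columns, shape (dim, n_learned)
    s_hat_samples : assumed samples from p(s_hat | x, A_hat),
                    shape (n_samples, n_learned)
    """
    # Residual x - sum_k a_hat_k s_hat_k for every posterior sample.
    residuals = x[None, :] - s_hat_samples @ A_hat.T
    # Weight each residual by the coefficient of the feature being updated.
    weighted = residuals * s_hat_samples[:, [k_check]]
    # Average over samples and apply the scaling mu / sigma_eps^2.
    return mu / sigma_eps**2 * weighted.mean(axis=0)

# Toy usage with made-up numbers.
x = np.array([1.0, 0.0])
A_hat = np.eye(2)
samples = np.array([[0.5, 0.0]])
grad = expected_gradient(x, A_hat, samples, k_check=0)
grad_at_perfect_fit = expected_gradient(x, A_hat, np.array([[1.0, 0.0]]), k_check=0)
```

The outer expectation over the data distribution would correspond to averaging this quantity over many observed signals. When the sampled coefficients reconstruct `x` exactly, the residual and hence the gradient vanish.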

Note the use of $\check{k}$ to index the particular feature for which we evaluate the gradient and the corresponding coefficient $\hat{s}_{\check{k}}$, while $k$ indexes the true features in the generative model. $\hat{k}$ indexes all of the estimated features and coefficients. Using the abbreviation

\[
T = \mu \sigma_\epsilon^{-2} \Big( \sum_k a_k s_k - \sum_{\hat{k}} \hat{a}_{\hat{k}} \hat{s}_{\hat{k}} + \epsilon \Big) \hat{s}_{\check{k}}
\]

we can write this as:

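\[
\int\!\!\int\!\!\int T \, p(\hat{s} \mid x, \hat{A}) \, p(\epsilon, s) \, d\hat{s} \, ds \, d\epsilon
= \int\!\!\int\!\!\int T \, p(\hat{s} \mid \epsilon, \hat{A}, A, s) \, p(\epsilon) \, p(s) \, d\hat{s} \, ds \, d\epsilon ,
\]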

where the last step is possible as $s$, $A$ and $\epsilon$ define $x$ and as $\epsilon$ is assumed to be independent of $s$. Setting the gradient to zero and rearranging gives:

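\[
\hat{a}_{\check{k}} \langle \hat{s}_{\check{k}} \hat{s}_{\check{k}} \rangle_{p(\hat{s} \mid A, \hat{A})}
= a_{\bar{k}} \langle s_{\bar{k}} \hat{s}_{\check{k}} \rangle_{p(s, \hat{s} \mid A, \hat{A})}
+ \sum_{k \neq \bar{k}} a_k \langle s_k \hat{s}_{\check{k}} \rangle_{p(s, \hat{s} \mid A, \hat{A})}
- \sum_{\hat{k} \neq \check{k}} \hat{a}_{\hat{k}} \langle \hat{s}_{\hat{k}} \hat{s}_{\check{k}} \rangle_{p(\hat{s} \mid A, \hat{A})}
+ \langle \epsilon \hat{s}_{\check{k}} \rangle_{p(\hat{s} \mid A, \hat{A})} ,
\]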

where we have introduced the index $\bar{k}$ to label the true feature and coefficient associated with the feature and coefficient to be learned, i.e. we assume that feature $\hat{a}_{\check{k}}$ converges to feature $a_{\bar{k}}$. If we assume that $\hat{s}_{\check{k}}$ is

CHAPTER 3. SHIFT-INVARIANT SPARSE CODING 59

to the assumed independence of the individual $\hat{s}_{\hat{k}}$. So we are left with:

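\[
\hat{a}_{\check{k}} \langle \hat{s}_{\check{k}} \hat{s}_{\check{k}} \rangle_{p(\hat{s} \mid A, \hat{A})}
= a_{\bar{k}} \langle s_{\bar{k}} \hat{s}_{\check{k}} \rangle_{p(s, \hat{s} \mid A, \hat{A})}
+ \sum_{k \neq \bar{k}} a_k \langle s_k \hat{s}_{\check{k}} \rangle_{p(s, \hat{s} \mid A, \hat{A})} .
\]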

In order for a feature $\hat{a}_{\check{k}}$ to converge to a feature $a_{\bar{k}}$, we require the correlation between $\hat{s}_{\check{k}}$ and $s_k$ to be zero for all $k \neq \bar{k}$.
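The effect of a non-zero correlation with a second true coefficient can be checked numerically. The sketch below is a toy simulation under loudly stated assumptions: two true features, unit-scale Laplacian coefficients, Gaussian noise, and a single learned coefficient that is simply set to $s_1 + 0.5\, s_2$ as a stand-in for the outcome of an imperfect inference step. None of these choices come from the text. Under them, the stationary feature should be the correlation-weighted mixture $0.8\, a_1 + 0.4\, a_2$ rather than $a_1$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 200_000, 8

A = rng.standard_normal((2, dim))                  # rows: true features a_1, a_2
s = rng.laplace(size=(n_samples, 2))               # independent, zero-mean coefficients
eps = 0.1 * rng.standard_normal((n_samples, dim))  # noise, independent of s
x = s @ A + eps                                    # x = sum_k a_k s_k + eps

# Assumed (hypothetical) inference outcome: the learned coefficient
# depends on both true coefficients.
s_hat = s[:, 0] + 0.5 * s[:, 1]

# Stationary point of the update for a single learned feature:
# a_hat <s_hat^2> = <x s_hat>, estimated from the samples.
a_hat = (s_hat @ x) / np.sum(s_hat ** 2)

# Prediction from the correlations: <s_1 s_hat> = v, <s_2 s_hat> = 0.5 v,
# <s_hat^2> = 1.25 v, where v is the coefficient variance.
var_s = 2.0                                        # variance of a unit-scale Laplacian
predicted = (var_s * A[0] + 0.5 * var_s * A[1]) / (1.25 * var_s)
```

In this toy setting `a_hat` agrees with `predicted` up to sampling error, so the learned feature is a weighted average of both true features; it would equal `A[0]` only if the correlation with the second coefficient were zero.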

If the number of features used to model a signal is less than the number of features in the signal at all locations, then dependencies between $\hat{s}_{\check{k}}$ and several $s_k$ have to occur. Dependencies can also occur as a result of the inference process or the approximations to the learning rule used.

To analyse the possible dependencies which can occur due to the incorrect model size, we assume that all learned features have converged to some of the true features. The dependency between $\hat{s}_{\hat{k}}$ and $s_k$ (and therefore the exact form of the averaging process described above) then depends on which of the features $a_k$ are modelled by each feature $\hat{a}_{\hat{k}}$. The feature chosen to model a feature which has not been learned depends on the decrease in reconstruction error when using this feature. The highest decrease in this error is achieved by modelling a feature in the signal with the same feature at the exact location. If this feature is not available at this location, a feature at a different location or a different feature has to be used.

The following list gives three forms of dependency that can occur, together with their influence on the learned features:

• A feature can be modelled with a slightly shifted version of itself. If several slightly shifted features are modelled by a single feature, then the average update of this feature is a low-pass filtered version of the true feature.

• A windowed periodic feature can be modelled with a version of itself which is shifted by multiples of the period. A weighted averaging over such feature shifts leads to a windowing of the learned feature.

• A missing feature can also be modelled with a different feature. The chosen feature is likely to share a strong frequency component and is at a location at which both features have the same phase for this component. Averaging then increases this frequency component but might decrease other frequency components, as the phase for those other components might not match.
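The first two effects in the list above can be illustrated with a small numerical sketch. Everything concrete here is an assumption chosen for illustration: the Gaussian-windowed cosine used as the feature, the circular shifts via `np.roll`, the averaging weights, and the helper name `average_over_shifts`. Averaging over shifts of one sample with weights (0.25, 0.5, 0.25) multiplies the spectrum by $\cos^2(\pi f)$, a low-pass gain; averaging over shifts by whole periods leaves the carrier untouched and only replaces the window by an average of shifted windows.

```python
import numpy as np

def average_over_shifts(signal, shifts, weights):
    """Weighted average of circularly shifted copies of `signal`."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * np.roll(signal, d) for w, d in zip(weights, shifts))

n, period = 64, 8                                   # n is a multiple of the period
t = np.arange(n)
window = np.exp(-0.5 * ((t - n / 2) / 6.0) ** 2)    # assumed Gaussian window
carrier = np.cos(2 * np.pi * t / period)            # periodic component
feature = window * carrier                          # windowed periodic feature

# Effect 1: averaging slightly shifted copies acts as a low-pass filter.
smoothed = average_over_shifts(feature, [-1, 0, 1], [0.25, 0.5, 0.25])
freqs = np.fft.rfftfreq(n)                          # cycles per sample
gain = np.cos(np.pi * freqs) ** 2                   # 1 at DC, 0 at Nyquist
lowpass_ok = np.allclose(np.abs(np.fft.rfft(smoothed)),
                         gain * np.abs(np.fft.rfft(feature)))

# Effect 2: averaging copies shifted by multiples of the period only
# changes the window; the carrier stays in phase.
period_shifts = [-period, 0, period]
rewindowed = average_over_shifts(feature, period_shifts, [1, 2, 1])
new_window = average_over_shifts(window, period_shifts, [1, 2, 1])
window_ok = np.allclose(rewindowed, new_window * carrier)
```

Both checks hold exactly in this circular-shift setting. With non-circular shifts or other weights the details would differ, but the qualitative picture of filtering and re-windowing should be similar.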

This seems to suggest that if the number of features to be learned is less than the number of features in the signal, windowed and filtered features emerge. However, the above derivation uses the traditional sparse coding formulation. If shift-invariance is explicitly enforced and if the inference process is working correctly (i.e. the $s_k$ are uncorrelated with $\hat{s}_{\hat{k}}$ for all but one pair of coefficients), then the first two effects (i.e. the filtering and the windowing) cannot occur.
