CHAPTER 6: DOCUMENTARY SOURCES AND APPENDICES
3. APPENDICES
Stoianov and Zorzi [160] trained a Deep Belief Network (DBN) to reconstruct its input, which consisted of binary images comprising rectangular objects of different sizes. They showed that visual numerosity emerges as a statistical property of those images, without any preprocessing normalization mechanism and without any information about numerosity during the training phase. There has been considerable research comparing (sparse) Restricted Boltzmann Machines¹ to neural coding in vision (Lee, Ekanadham, and Ng [88]; Bhand et al. [14]). Stoianov and Zorzi’s [160] Deep Belief Network therefore inherits the same neural plausibility. Moreover, to stress biological plausibility, the greedy pre-training scheme was not followed by the backpropagation fine-tuning typical of the standard deep networks used in machine learning. However, from a developmental point of view the pre-training scheme appears rather implausible, since all numerosities are presented in bulk. A more cognitively plausible solution would require a learning process in which random numerosity samples are presented incrementally rather than all at once. Such a solution does not seem technically out of reach and could, in principle, be used to explain the progressive sharpening of the Weber fraction observed in developmental numerical psychology. Interestingly, the numerosity detectors were found only in the higher layer, and their activation pattern reflected monotonic coding. As we have seen in chapter 4, this alone is able to explain distance and size effects, and it might be used as the input representation to generate a number line coding (as we have seen in the Verguts and Fias [179] model). To sound a note of caution, we remind the reader that deep learning models have high capacity and adapt to the statistics of the data; it is therefore interesting to see whether a model trained with natural images is affected in its ability to learn to represent numerosity. At the
7.2. STOIANOV & ZORZI, (2012)
Figure 7.1: Input examples and architecture used
present stage we will ignore this issue and instead look at the invariance principle the numerosity units are sensitive to.
7.2.1 The Model
The input consists of 52,100 binary images of 30×30 pixels, each containing from 1 to 32 randomly placed, non-overlapping rectangular shapes (Figure 7.1, bottom). Matlab and Python scripts that generate the dataset as described in the supplementary information of Stoianov and Zorzi [160] are available in the online support material of this thesis². The architecture used by the authors is a parallel implementation of Geoffrey Hinton’s original code and is freely available on their website³. Our implementation will soon be added to the GitHub page associated with this thesis⁴.
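To make the description of the stimuli concrete, the following is a minimal sketch of how one such image could be generated. It is not the authors' procedure nor the script from the support material: the function name `make_image`, the uniform choice of rectangle sizes, the `max_side` limit, and the one-pixel gap between objects are all assumptions made for illustration.

```python
import numpy as np

def make_image(n_objects, size=30, max_side=3, rng=None, max_tries=10_000):
    """Place n_objects non-overlapping filled rectangles on a size x size binary grid.

    Assumption: rectangle sides are drawn uniformly from 1..max_side; the
    original dataset controls object size differently.
    """
    rng = np.random.default_rng() if rng is None else rng
    img = np.zeros((size, size), dtype=np.uint8)
    placed, tries = 0, 0
    while placed < n_objects:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all objects; reduce max_side")
        h, w = rng.integers(1, max_side + 1, size=2)
        r = rng.integers(0, size - h + 1)
        c = rng.integers(0, size - w + 1)
        # Inspect the candidate area plus a one-pixel margin, so that
        # rectangles neither overlap nor touch each other.
        r0, r1 = max(r - 1, 0), min(r + h + 1, size)
        c0, c1 = max(c - 1, 0), min(c + w + 1, size)
        if img[r0:r1, c0:c1].any():
            continue  # position is occupied, draw again
        img[r:r + h, c:c + w] = 1
        placed += 1
    return img

# Example: one image with 8 objects, flattened to a 900-dimensional vector
example = make_image(8, rng=np.random.default_rng(0)).reshape(-1)
```

Rejection sampling is used here only because it is short; for the densest images a more careful placement strategy may be needed.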
The network architecture (Figure 7.1) comprises one visible layer, in which the
² Available at https://github.com/bramacchino/numberSense/tree/master/inputs/sz2012.
³ http://ccnl.psy.unipd.it/research/deeplearning.
⁴ At the time of writing I’m still unable to replicate their results fully. The invariance property is therefore only hypothesized on the basis of the network description. As soon as I am able to test the invariance property, the code will be added.
CHAPTER 7. COMPUTATIONAL MODELS OF VISUAL NUMEROSITY
vectorized input is clamped, and two hierarchically organized hidden layers. In particular, the architecture might be seen as an auto-encoder consisting of two stacked RBMs. Each RBM is formed by a visible and a hidden layer of binary units. The units in the hidden layer fire with a probability that is the logistic function of their weighted input. The input layer of the first RBM comprises 900 binary units fully connected to a hidden layer of 80 binary units, which in turn serves as the visible layer of the second RBM, whose hidden layer has 400 units. The output layer represents a dimensionality-reduced version of the input layer.
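The bottom-up pass through the two stacked RBMs can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the weights are untrained random placeholders, and the helper names `logistic`, `sample_hidden`, and `propagate` are hypothetical. Only the layer sizes (900, 80, 400) and the logistic firing rule come from the description above.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Return firing probabilities and sampled binary states of a hidden layer."""
    p = logistic(v @ W + b)                      # logistic of the weighted input
    h = (rng.random(p.shape) < p).astype(float)  # stochastic binary units
    return p, h

def propagate(v, weights, biases, rng):
    """Clamp v on the visible layer and sample each hidden layer in turn."""
    states = v
    for W, b in zip(weights, biases):
        _, states = sample_hidden(states, W, b, rng)
    return states

rng = np.random.default_rng(0)
sizes = [900, 80, 400]  # visible layer, first hidden layer, second hidden layer
# Assumption: small random weights and zero biases stand in for trained parameters.
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

v = rng.integers(0, 2, size=900).astype(float)  # a random binary "image"
top = propagate(v, weights, biases, rng)        # binary states of the top layer
```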
The network is trained to maximize the product of the probabilities assigned to the training set (i.e. to generate the sensory data), or equivalently to minimize the average negative log-likelihood. This in turn is achieved by adjusting the weights (and biases) via (stochastic) gradient descent. The derivative gives two terms, called the positive and the negative gradient. The first depends on the observations, whilst the latter depends only on the model. Because the model-dependent term is expensive to compute exactly, learning is achieved via Contrastive Divergence (CD) (Hinton, Osindero, and Teh [69]), which approximates it. Given an input vector $v_i^{+}$, first the feature detectors $h_j^{+}$ are activated (“positive” phase). Starting from stochastically selected binary states of the feature detectors (using their state $h_j^{+}$ as the probability of turning them on), CD then infers a reconstructed input vector $v_i^{-}$, which is used in turn to reactivate the feature detectors $h_j^{-}$ (“negative” phase). The weights $w_{ij}$ are updated by a small learning fraction $\eta$ of the difference between the input-output correlations measured in the positive and the negative phases:
$$\delta w_{ij} = \eta \left( v_i^{+} h_j^{+} - v_i^{-} h_j^{-} \right)$$
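A single Contrastive Divergence step for one RBM, following the update rule above, might look like the sketch below. Several details are assumptions not stated in the text: the function name `cd1_step`, the averaging of correlations over a mini-batch, the use of probabilities rather than sampled states for the reconstruction and for the negative-phase hidden units, and the analogous update of the biases.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_pos, W, b_vis, b_hid, eta, rng):
    """One CD-1 update. v_pos has shape (batch, n_visible), W (n_visible, n_hidden)."""
    n = v_pos.shape[0]
    # Positive phase: activate the feature detectors from the data.
    h_pos = logistic(v_pos @ W + b_hid)
    # Stochastic binary states, using h_pos as the probability of turning on.
    h_state = (rng.random(h_pos.shape) < h_pos).astype(float)
    # Negative phase: reconstruct the input, then reactivate the detectors.
    v_neg = logistic(h_state @ W.T + b_vis)
    h_neg = logistic(v_neg @ W + b_hid)
    # delta w_ij = eta * (v+_i h+_j - v-_i h-_j), averaged over the batch.
    W_new = W + eta * (v_pos.T @ h_pos - v_neg.T @ h_neg) / n
    b_vis_new = b_vis + eta * (v_pos - v_neg).mean(axis=0)
    b_hid_new = b_hid + eta * (h_pos - h_neg).mean(axis=0)
    return W_new, b_vis_new, b_hid_new

rng = np.random.default_rng(0)
batch = (rng.random((16, 900)) < 0.1).astype(float)  # placeholder binary inputs
W = rng.normal(0.0, 0.01, (900, 80))
b_vis, b_hid = np.zeros(900), np.zeros(80)
W, b_vis, b_hid = cd1_step(batch, W, b_vis, b_hid, eta=0.1, rng=rng)
```

In greedy layer-wise pre-training, this step would be repeated over the training set for the first RBM, whose hidden activations then become the training data for the second.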
7.2.2 Invariance principle
In the supplementary information of Stoianov and Zorzi [160], the authors provide a mathematical description of the learned model that helps us assess the invariance principle. Most of the first hidden layer (HL1) units are center-surround detectors that uniformly cover the image space. The first layer consists of linear operations (2D Gaussian filters, sigma = 2, and spatial integration) followed by a nonlinear operation (a standard logistic function, f):
$$O_{ij} = f\left( \sum W_{ij} \circ I + 1 \right)$$
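The response of a single HL1 unit, as described above, can be sketched numerically. This is only an illustration of the stated operations (a 2D Gaussian filter with sigma = 2 centred on the unit's location, spatial integration, then the logistic function). The unnormalized filter, the positive sign of its weights, the additive constant of 1, and the names `gaussian_filter_2d` and `unit_response` are assumptions.

```python
import numpy as np

def gaussian_filter_2d(center, size=30, sigma=2.0):
    """Unnormalized 2D Gaussian weights centred on pixel (row, col) = center."""
    rows, cols = np.mgrid[0:size, 0:size]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def unit_response(image, center, bias=1.0, sigma=2.0):
    """Logistic of the spatially integrated, Gaussian-weighted image.

    Assumes a square image; bias=1.0 mirrors the constant in the expression above.
    """
    W = gaussian_filter_2d(center, image.shape[0], sigma)
    drive = (W * image).sum() + bias  # element-wise product, then summation
    return 1.0 / (1.0 + np.exp(-drive))

image = np.zeros((30, 30))
baseline = unit_response(image, center=(15, 15))  # response to an empty image
image[14:17, 14:17] = 1.0                         # a 3x3 object at the centre
driven = unit_response(image, center=(15, 15))    # larger than baseline
```

Under these assumptions the unit responds mostly to objects within a few pixels of its centre, which is what lets a population of such units tile the image space.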
The numerosity detectors found in the second hidden layer (HL2) are spatially selective as well (2D Gaussian filters, sigma=10). They receive positive input from HL1 units, and inhibition from HL1 units that were found to encode cumulative area (c). That