6.2.1 Extending the Scope of the Model

I have shown here that the same sparse coding model successfully employed by Olshausen and Field to model V1 can also be fruitfully applied to a much higher level of the visual hierarchy. That is not to say that the top and bottom of the hierarchy are the only places where sparse coding may be advantageous. In Chapter 3 I provided evidence that sparse codes may be advantageous from a metabolic point of view, and in Chapters 4 and 5 I argued for their computational utility. It is reasonable to expect that the principles of sparse coding described in this work could be fruitfully applied throughout the visual hierarchy to enhance the coding efficiency of visual inputs. In its current state, the feature selectivity of the HMAX network used to generate most of the results of Chapter 5 is simply memorized from a random selection of images. That is, in each S layer (aside from S1, in which V1-like oriented bar filters are used), each neuron is given weights by propagating some image patch through the network up to that point and memorizing the resulting pattern of activity on its afferents as a template feature. While this method should capture the statistics of natural images, it makes no effort to build a particularly efficient representation as a sparse coding network does, and it must make use of millions of neurons in the intermediate layer in order to capture enough image features to support recognition. It is therefore reasonable to expect that applying the coding strategy described in this thesis throughout the hierarchy of a simple-complex processing network like HMAX (that is, at each simple cell stage) could provide a performance improvement both in fidelity of representation and in number of coding units required.
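The template memorization described above can be illustrated with a minimal sketch. It is not the HMAX implementation used in this work: the function names, the toy random "afferent patterns", and the Gaussian tuning function with sharpness parameter `beta` are all assumptions made for illustration.

```python
import math
import random


def imprint_templates(afferent_patterns, n_templates, seed=0):
    """Store randomly chosen afferent activity patterns verbatim as templates."""
    rng = random.Random(seed)
    chosen = rng.sample(afferent_patterns, n_templates)
    return [list(pattern) for pattern in chosen]


def s_unit_response(afferents, template, beta=1.0):
    """Response of one S unit: Gaussian tuning around its template (assumed form)."""
    sq_dist = sum((a - t) ** 2 for a, t in zip(afferents, template))
    return math.exp(-beta * sq_dist)


# Toy stand-in for the activity that image patches evoke in the previous layer.
rng = random.Random(1)
patterns = [[rng.random() for _ in range(8)] for _ in range(50)]

templates = imprint_templates(patterns, n_templates=5)
# Present the first memorized pattern again: the unit that stored it responds maximally.
responses = [s_unit_response(templates[0], t) for t in templates]
```

Because each template is simply a copy of one observed pattern, nothing in this procedure encourages the templates to be complementary or efficient, which is the limitation the paragraph above points to.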

The primary obstacle to this approach is one of available computational resources: the intermediate layers of the vision model used here consist of millions of simulated neurons, and so the model is only tractable because these neurons operate in a purely feedforward fashion. By contrast, interactions between neurons in the same layer are crucially important to our sparse coding scheme, and so a more efficient means for computing the equilibrium of the network (and thus computing the representation) would be required. A few ideas may be of use here. First, if we assume the sparse coding network will truly learn a more efficient code for image features, we may be able to reduce the number of representing units. Second, we can exploit the sparse

speed computation at the recognition stage, though this will not improve training speed. Finally, it may be possible to impose a sparse set of interconnections between neurons even during training rather than the full connectivity used in this work.
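To make the cost of within-layer interactions concrete, the following sketch iterates a small network with lateral inhibition until it settles. It is a generic stand-in in the spirit of locally competitive sparse coding dynamics, not the network used in this work; the function names, the soft-threshold nonlinearity, the step size, and the `prune` parameter (which drops weak lateral connections) are all assumptions.

```python
def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def soft_threshold(u, lam):
    """Shrink u toward zero by lam; values within [-lam, lam] become exactly 0."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def settle(basis, x, lam=0.1, step=0.1, n_iter=500, prune=0.0):
    """Iterate leaky units with lateral inhibition until they approximately settle.

    basis : list of basis vectors (one per coding unit)
    prune : hypothetical knob; lateral weights with magnitude <= prune are dropped
    """
    n = len(basis)
    drive = [dot(phi, x) for phi in basis]  # feedforward input to each unit
    # Lateral weight between two units = overlap of their basis vectors.
    lateral = [[dot(basis[i], basis[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    lateral = [[w if abs(w) > prune else 0.0 for w in row] for row in lateral]

    u = [0.0] * n  # internal state of each unit
    for _ in range(n_iter):
        a = [soft_threshold(v, lam) for v in u]  # current outputs
        u = [u[i] + step * (drive[i] - u[i] - dot(lateral[i], a)) for i in range(n)]
    return [soft_threshold(v, lam) for v in u]


# Three units coding a 2-D input; the third overlaps with the first two.
basis = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
code_full = settle(basis, [1.0, 0.0])                # all lateral connections kept
code_pruned = settle(basis, [1.0, 0.0], prune=0.8)   # every lateral weight dropped here
```

With full lateral connectivity only the best-matching unit remains active in this toy case, whereas dropping every lateral weight leaves the overlapping unit active as well. Each iteration costs on the order of the number of retained lateral connections, which is why both fewer units and sparser interconnections would reduce the computational burden.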

6.2.2 Multi-Modal Perception

One of the most striking results from human MTL reported by Quian Quiroga and colleagues was a neuron that, in addition to its robust invariant response to various images of the actress Halle Berry, responded vigorously to the letter string “Halle Berry” (Quian Quiroga et al., 2005). Furthermore, pilot data from continuing studies in this area reveal MTL cells that respond strongly to the name of their preferred stimulus spoken aloud by a computer (R. Quian Quiroga, personal communication).

One intriguing area of future work is therefore to extend the computational work described in Chapters 4 and 5 to other forms of sensory input. In principle the same machinery for sparse coding should be sufficient: at the level of abstraction of the inputs to the sparse coding model, namely image features, nothing is specifically designed or tuned to the visual mode. Furthermore, evidence of neural plasticity across brain areas suggests that it may be worthwhile to seek general computational structures that apply across sensory modes (Pascual-Leone, Amedi, Fregni & Merabet, 2005, and references therein). The same methodology applied to an invariant representation of written or spoken words may be successful in extracting the sparse structure present therein.

Though written words (text) enter the brain through the visual system, evidence from fMRI experiments suggests that specialized machinery for the holistic processing of words develops in the Visual Word Form Area as reading skills are acquired (Gaillard et al., 2006; McCandliss, Cohen & Dehaene, 2003, and references therein; but see criticisms of this view in Price and Devlin (2003)). For this reason it may be appropriate to treat written words as a distinct sensory mode and investigate the application of our coding methodology to it. To generate a representation of text invariant to transformations such as changes in scale and font, one may either apply one of the many sophisticated optical character recognition (OCR) systems currently available to any input images containing words, or simply apply the model directly to text represented as such. A minor distinction between this mode of input and the vision system model described above is the crucial importance of the spatial relationship between letters: in the vision model excellent performance is possible even when most spatial information is discarded, while in text rearranging letters generally destroys the meaning (though some robustness to this is present, as long as the first and last letters of a word are preserved).

In the auditory domain, one could use existing models of auditory language processing to project auditory signals into a space wherein the same word spoken by different individuals or otherwise manipulated will produce a similar representation. Smith and Lewicki (2005) discuss methods for developing efficient, shift-invariant representations for natural sounds using spiking models, among them a sparse generative model much like that employed by Olshausen and Field in the visual domain. Such a representation can then be fed into the sparse coding network, which could extract structure, such as commonly used words, from the input data stream.

The goal of this line of inquiry would be to replicate the multi-modal response characteristics observed in the human MTL recordings. Further, significant evidence from fMRI and psychophysical experiments indicates that this type of cross-modal interaction plays an important role in perception (Shimojo & Shams, 2001, and references therein). A likely strategy here would be to feed into the sparse coding network not inputs from a single sensory mode, but simultaneous inputs from multiple modes. For example, one would present the image of Halle Berry together with her name spoken aloud, each processed through the appropriate invariance model. The model will then be able to associate the inputs across modes, in essence allowing each mode to act as a supervisory signal for the other(s). Computational studies have already shown the utility of such multi-modal “self-supervision” in performing the unsupervised clustering task of learning vowels from spoken English using both auditory and visual information (Coen, 2006).
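One simple way to present simultaneous inputs from several modes to a single coding network is to concatenate the per-mode feature vectors, as in the sketch below. The function names, the toy feature values, and the choice to normalize each mode to unit length before concatenation (so that no mode dominates merely through its scale) are assumptions, not part of the method described in this work.

```python
import math


def unit_norm(features):
    """Scale a feature vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(f * f for f in features))
    return [f / norm for f in features] if norm > 0 else list(features)


def joint_input(*mode_features):
    """Concatenate simultaneously presented, per-mode feature vectors into one input."""
    joint = []
    for features in mode_features:
        joint.extend(unit_norm(features))
    return joint


# Hypothetical outputs of a visual and an auditory invariance model
# for one paired presentation (image plus spoken name).
image_features = [3.0, 0.0, 4.0]
audio_features = [0.0, 2.0]
paired = joint_input(image_features, audio_features)
```

A coding unit trained on such joint vectors has weights spanning both segments, so after learning it could in principle be driven by either segment alone, which is the kind of cross-modal response the recordings suggest.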
