8. APPENDICES
8.1 Appendix A Survey: Selection of topics for the virtual learning objects
The idea of Fisher kernels has been around since 1998 and was quickly taken up by other researchers, who applied it to classification problems in biology, speech, vision and text. In this section, we group the uses of Fisher kernels by application to see how the method has been utilized and how it has evolved over time. After Jaakkola et al. (Jaakkola et al., 2000) showed that the Fisher kernel derived from hidden Markov models (HMMs) significantly improves on previous methods of protein domain classification, Moreno and Rifkin (Moreno & Rifkin, 2000) adopted the method for large-scale web audio classification. The underlying probability distribution from which the Fisher vectors were drawn was a Gaussian mixture model. Smith and Niranjan (Smith & Niranjan, 2001) gave further experimental justification for using the Fisher kernel in audio classification, emphasizing that the Fisher kernel limits the dimensionality of the feature space and thereby provides beneficial regularization, particularly when the two classes are hard to separate. Smith and Gales (Smith & Gales, 2002) extended the standard likelihood-based score space of the Fisher kernel to a likelihood-ratio-based score space, and showed that it outperforms both the classical score space and HMMs trained to maximise likelihood on a speech recognition task.
Further research soon showed that when the data is costly to label, or is only partially labelled, the Fisher kernel can still be deployed efficiently with an SVM that uses a transductive inference learning scheme (Joachims, 1999b). A case study showing the successful use of Fisher kernels with labelled and unlabelled data from the Medline database of abstracts is given by Goutte et al. (Goutte et al., 2002). Vinokourov and Girolami (Vinokourov & Girolami, 2001) also applied the Fisher kernel to the document classification problem, where the Fisher vectors were derived from a probabilistic hierarchical clustering model that was a mixture of standard multinomial and probabilistic latent semantic analysis models. Elkan (Elkan, 2005) investigated the Dirichlet compound multinomial (DCM) distribution for the derivation of the Fisher kernel and showed better document classification results than the alternative kernels. Chappelier and Eckard (Chappelier & Eckard, 2009) modelled documents through probabilistic latent semantic indexing (PLSI) and introduced a new, rigorous development of the Fisher kernel for PLSI by addressing the significant role of the Fisher information matrix and its relationship to the proposed kernel. Other application areas where Fisher kernels were quickly pursued include logical sequence classification (Kersting & Gartner, 2004), topic-based text segmentation (Sun et al., 2008), sign language recognition (Aran & Akarun, 2010) and currency prediction (Fletcher & Shawe-Taylor, 2013). This recent work on currency prediction allows canonical market microstructure models, built around three main families (autoregressive conditional duration models, Poisson processes and Wiener processes), to be used efficiently within the discriminative learning framework via Fisher kernels.
For the object classification problem, Holub et al. (Holub et al., 2005) were the first to highlight the performance gains on standard object recognition data sets from CalTech by successfully combining the probabilistic constellation model with Fisher kernels. Following them, Perronnin and Dance (Perronnin & Dance, 2007) applied the Fisher kernel framework to a visual vocabulary of low-level feature vectors extracted from images and modelled via a Gaussian mixture model (GMM). They showed that the proposed approach is actually a generalization of the popular bag-of-visual-words (BoW) approach: for the same vocabulary size N and descriptor dimensionality D, the gradient representation of the Fisher kernel has a much higher dimensionality, (2 × D + 1) × N − 1, than the histogram representation (N). In the case of a Gaussian mixture model, the BoW approach is directly related to the Fisher kernel when only the gradients with respect to the weight parameters w_i are considered: both then capture the 0-th order statistics (word counting). However, the derivatives with respect to the means and standard deviations also capture the 1st and 2nd order statistics, thus enriching the overall representation of the images even with compact vocabularies. This increase in dimensionality makes the image representation more informative even when the available vocabulary is limited, thus leading to a computationally attractive approach. See Figure 3.11 for an illustration of the BoW model.
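The gradient representation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Perronnin and Dance: the function names `gmm_posteriors` and `fisher_score_gmm` are hypothetical, the GMM is assumed to have diagonal covariances, the sum-to-one constraint on the weights is handled by differencing against the first component (which is what leaves N − 1 weight gradients), and no Fisher-information normalization is applied.

```python
import numpy as np

def gmm_posteriors(X, w, mu, sigma):
    """Soft assignments gamma[t, k] = p(k | x_t) for a diagonal-covariance GMM."""
    z = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]       # (T, N, D)
    log_gauss = (-0.5 * np.sum(z ** 2, axis=2)
                 - np.sum(np.log(sigma), axis=1)[None, :]
                 - 0.5 * X.shape[1] * np.log(2.0 * np.pi))          # (T, N)
    log_joint = np.log(w)[None, :] + log_gauss
    log_joint -= log_joint.max(axis=1, keepdims=True)               # numerical stability
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

def fisher_score_gmm(X, w, mu, sigma):
    """Gradient of the log-likelihood of descriptors X w.r.t. the GMM parameters.

    X: (T, D) descriptors; w: (N,) weights; mu, sigma: (N, D) means and
    standard deviations. Returns a vector of length (2 * D + 1) * N - 1.
    """
    gamma = gmm_posteriors(X, w, mu, sigma)                         # (T, N)
    diff = X[:, None, :] - mu[None, :, :]                           # (T, N, D)

    # 0-th order statistics (soft word counts). Differencing against the
    # first component accounts for the weights summing to one.
    g_w_full = gamma.sum(axis=0) / w
    g_w = g_w_full[1:] - g_w_full[0]                                # (N - 1,)

    # 1st order statistics: gradient w.r.t. the means.
    g_mu = np.einsum('tk,tkd->kd', gamma, diff / sigma[None] ** 2)  # (N, D)

    # 2nd order statistics: gradient w.r.t. the standard deviations.
    g_sigma = np.einsum('tk,tkd->kd', gamma,
                        diff ** 2 / sigma[None] ** 3 - 1.0 / sigma[None])

    return np.concatenate([g_w, g_mu.ravel(), g_sigma.ravel()])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D, T = 4, 3, 50                     # toy vocabulary size, descriptor dim, descriptors
    w = np.full(N, 1.0 / N)
    mu = rng.normal(size=(N, D))
    sigma = np.ones((N, D))
    X = rng.normal(size=(T, D))
    fv = fisher_score_gmm(X, w, mu, sigma)
    print(fv.shape)                        # (27,) = ((2 * 3 + 1) * 4 - 1,)
```

Using only the `g_w` part corresponds to the soft word-counting view that relates the representation to BoW; the `g_mu` and `g_sigma` parts add the first- and second-order information.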
Figure 3.11: Diagram illustrating the main idea of the bag-of-words (BoW) model of image representation. Local descriptors are extracted from the image and each descriptor is assigned to its closest visual word in a visual vocabulary: a codebook obtained offline by clustering a large set of descriptors with k-means. A trend in BoW approaches is to have multiple combinations of patch detectors, descriptors and spatial pyramids. Systems following this paradigm have consistently performed the best in the successive PASCAL VOC evaluations, yet the Fisher kernel has been shown to outperform this classical model for the advantages mentioned in the text.
Since then, the Fisher kernel has been tested for classification on many large-scale object recognition data sets such as CalTech-256, PASCAL VOC 2007, PASCAL VOC 2008 and ImageNet LSVRC 2012 (Perronnin et al., 2010b; Sanchez & Perronnin, 2011; Csurka & Perronnin, 2011; Sanchez et al., 2013). It has consistently proven to be empirically better than the state-of-the-art bag-of-words (BoW) model of object recognition (Csurka & Perronnin, 2011) in several respects. First, it provides a more general way to define a kernel from a generative process of the data. Second, it can be computed from much smaller vocabularies, since it does not rely on the total number of occurrences of each visual word but rather encodes additional information about the distribution of the descriptors. This results in lower computational cost. Third, its classification performance ranks among the best in a wide range of problems, despite relying on simple linear classifiers. A significant benefit of linear classifiers is that they are very efficient to evaluate and to learn (linear in the number of training samples) using techniques such as stochastic gradient descent (SGD) learning (Bottou et al., 2008). Thus, Fisher vectors serve as an efficient alternative to BoW histograms. Currently, the second best performance on the ImageNet-10K classification task, after the deep convolutional network (Krizhevsky et al., 2012), is achieved by Fisher kernels (Sanchez et al., 2013) derived from a Gaussian mixture model built for SIFT, LBP and GIST data descriptors.
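To make the efficiency argument for linear classifiers concrete, the following is a minimal sketch of a linear SVM trained by stochastic gradient descent on the L2-regularized hinge loss. It is an illustration only, not the training procedure of any of the cited systems: the names `train_linear_svm_sgd` and `predict`, the decaying step-size schedule and all default hyperparameters are assumptions. Each epoch visits every training sample once, so the cost of learning grows linearly with the number of samples.

```python
import numpy as np

def train_linear_svm_sgd(X, y, lam=1e-3, epochs=20, eta0=0.1, seed=0):
    """Linear SVM via SGD on the L2-regularized hinge loss.

    X: (n, d) feature vectors (e.g. Fisher vectors); y: (n,) labels in {-1, +1}.
    lam, epochs and eta0 are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass = n constant-cost updates
            t += 1
            eta = eta0 / (1.0 + lam * eta0 * t)   # decaying step size
            margin = y[i] * (X[i] @ w + b)
            w *= (1.0 - eta * lam)            # gradient step on the regularizer
            if margin < 1.0:                  # hinge loss is active for this sample
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

def predict(X, w, b):
    """Evaluation is a single dot product per sample."""
    return np.where(X @ w + b >= 0.0, 1, -1)
```

Evaluation costs one dot product per sample, which is part of what makes dense, high-dimensional representations such as Fisher vectors workable with linear models.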
Despite the various advantages that the Fisher kernel paradigm offers, it also suffers from a limitation in comparison to the BoW approach: while the latter is typically quite sparse because it is based on counts, the Fisher vector (FV) is mostly dense (Sanchez et al., 2013). This leads to storage as well as input/output issues, which make it impractical for large-scale applications. This computational difficulty is addressed by compressing Fisher vectors through PCA or hash kernels (Shi et al., 2009) and then coding them with product quantizers (Jegou et al., 2011) to retain the advantages of a high-dimensional representation. These improvements have been shown to work very well in terms of recognition performance without paying an expensive price in terms of memory and I/O usage. It is also important to note that most of the literature ignores the Fisher information matrix F in the construction of the Fisher kernel. This invertible covariance matrix of the Fisher scores is considered asymptotically immaterial (Jaakkola & Haussler, 1998) and is often ignored in practice. The resulting practical Fisher kernel (Shawe-Taylor & Cristianini, 2004) thus replaces the Fisher information matrix with an identity matrix and simply uses the gradients as features without any further rescaling or normalization. In some works, it is replaced by a diagonal approximation of the Fisher information matrix, which is easier to compute than the full d × d matrix (Perronnin & Dance, 2007; Nyffenegger et al., 2006).
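For reference, the three kernel variants discussed above can be written side by side. The notation is an assumption introduced here for illustration: U_x denotes the Fisher score of a sample x under a generative model p(x | θ) with d parameters, and F is the Fisher information matrix as in the text.

```latex
% Fisher score and Fisher information matrix
U_x = \nabla_{\theta} \log p(x \mid \theta), \qquad
F = \mathbb{E}_{x \sim p(x \mid \theta)}\left[ U_x U_x^{\top} \right]

% Fisher kernel with the full d x d information matrix
K(x, y) = U_x^{\top} F^{-1} U_y

% Practical Fisher kernel: F replaced by the identity matrix
K_{\mathrm{prac}}(x, y) = U_x^{\top} U_y

% Diagonal approximation: only the d diagonal entries F_{ii} are needed
K_{\mathrm{diag}}(x, y) = \sum_{i=1}^{d} \frac{U_{x,i} \, U_{y,i}}{F_{ii}}
```

The diagonal form avoids estimating and inverting a d × d matrix, which is why it is the cheaper choice when d is large.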
The literature discussed above highlights the significance of the Fisher kernel in different applications, yet we emphasize that none of the previous work has shown the utility of Fisher kernels for restricted Boltzmann machines (RBMs). In this work, we have attempted to bridge the gap between the widely used deep generative models and the discriminative kernel paradigm by deriving Fisher kernels from RBMs, and have shown that the shortcomings of compact generative models can be overcome when a Fisher kernel is derived from them for the classification task.
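As a rough illustration of what a Fisher score drawn from an RBM looks like, the sketch below computes the gradient of log p(v) for a tiny binary RBM. It is not necessarily the construction used in this work. All function names are hypothetical, the energy is assumed to be E(v, h) = −b·v − c·h − vᵀWh, and the model expectation is computed exactly by enumerating every visible state, which is feasible only for a handful of visible units; for realistic sizes that term has to be approximated, for example by sampling.

```python
import itertools
import numpy as np

def all_visible_states(n_vis):
    """Every binary visible configuration (2 ** n_vis rows); tiny RBMs only."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n_vis)))

def unnorm_log_p(V, W, b, c):
    """Unnormalized log p(v) = b.v + sum_j softplus(c_j + v.W[:, j])."""
    return V @ b + np.sum(np.logaddexp(0.0, c + V @ W), axis=-1)

def log_prob(v, W, b, c):
    """Exact log p(v), with the partition function obtained by enumeration."""
    all_lp = unnorm_log_p(all_visible_states(len(b)), W, b, c)
    m = all_lp.max()
    log_z = m + np.log(np.sum(np.exp(all_lp - m)))
    return unnorm_log_p(v, W, b, c) - log_z

def sufficient_stats(v, W, c):
    """Data-dependent gradient terms, ordered as [W (row-major), b, c]."""
    h = 1.0 / (1.0 + np.exp(-(c + v @ W)))        # E[h | v]
    return np.concatenate([np.outer(v, h).ravel(), v, h])

def model_expectation(W, b, c):
    """Expectation of the same statistics under the model distribution p(v)."""
    V = all_visible_states(W.shape[0])
    lp = unnorm_log_p(V, W, b, c)
    p = np.exp(lp - lp.max())
    p /= p.sum()
    stats = np.array([sufficient_stats(v, W, c) for v in V])
    return p @ stats

def rbm_fisher_score(v, W, b, c, expected_stats):
    """Gradient of log p(v) w.r.t. (W, b, c): data term minus model term."""
    return sufficient_stats(v, W, c) - expected_stats

def practical_fisher_kernel(u_x, u_y):
    """Fisher kernel with the information matrix replaced by the identity."""
    return float(u_x @ u_y)
```

Because the model term is the same for every sample, it can be computed once and reused for all Fisher scores.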