We first consider the problem of character recognition on the MNIST data set, which contains 28×28 grayscale images of handwritten digits derived from a larger database called NIST (LeCun et al., 1998). The database has 10 classes (the digits 0 to 9), with 60,000 training and 10,000 test images.
Figure 5.3: Sample of binary digits taken from the MNIST handwritten digits data set.
5.4.1.1 Image Preprocessing and Data Modelling
The digit images are first converted into binary images and then passed to the visible layer of 784 (28×28) units. Each unit in the model has a sigmoidal activation function $\sigma(x) = \frac{1}{1+\exp(-x)}$ that acts on the input arriving from the opposite layer. Thus, the hidden and visible units are updated according to the conditional distributions specified in Equations 4.11 and 4.12. The number of epochs for stochastic gradient descent learning of the parameters was fixed at 10. Other parameters that are significant for building and training this generative model are the learning rate (0.005), initial momentum (0.5), final momentum (0.9), weight decay penalty (0.0002) and batch size (see footnote 5). A guide to initializing and optimizing these parameters is given by Hinton (2010). We used contrastive divergence (CD-1), explained in Section 4.4.1, to approximate the gradient of the log-likelihood function of the RBM, and updated the model parameters $\theta = \{W, a, b\}$ via the following rule:

$$\theta \leftarrow \theta + \eta \, \nabla_\theta \log P(v; \theta),$$

where

$$\log P(v) = \log \frac{\sum_{h} \exp(-E(v, h))}{\sum_{v,h} \exp(-E(v, h))} = \log \sum_{h} \exp(-E(v, h)) - \log \sum_{v,h} \exp(-E(v, h)).$$
The energy function $E(v, h)$ of the binary-binary RBM and the corresponding probability distributions used to maximize the likelihood of the data have been described previously in Section 4.4.
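The CD-1 update described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes the common conditionals $p(h = 1 \mid v) = \sigma(vW + a)$ and $p(v = 1 \mid h) = \sigma(hW^T + b)$, with `a` the hidden biases and `b` the visible biases; Equations 4.11 and 4.12 are not reproduced in this section, so that convention is an assumption. Momentum and weight decay are omitted for brevity, and the function name `cd1_update` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, a, b, lr=0.005, rng=None):
    """One CD-1 parameter update for a binary-binary RBM.

    v0 : (batch, n_v) binary data
    W  : (n_v, n_h) weights
    a  : (n_h,) hidden biases   (assumed convention)
    b  : (n_v,) visible biases  (assumed convention)
    Returns updated copies of (W, a, b).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: hidden probabilities and a sample given the data.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)

    # Negative phase: a single Gibbs step (the "1" in CD-1).
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(v0.dtype)
    ph1 = sigmoid(v1 @ W + a)

    # Data statistics minus one-step reconstruction statistics.
    n = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / n
    da = (ph0 - ph1).mean(axis=0)
    db = (v0 - v1).mean(axis=0)

    # Gradient ascent on the approximate log-likelihood.
    return W + lr * dW, a + lr * da, b + lr * db
```

The reconstruction statistics stand in for the intractable model expectation, which is why CD-1 only approximates the log-likelihood gradient.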
5.4.1.2 Results
We use three kinds of classifier to gauge the performance of this generative model: the first is a maximum-likelihood classifier, the second a discriminative classifier based on the Fisher kernel, and the third a ClassRBM. To classify digits with the likelihood-based approach, we train a separate RBM on each class of digits. The partition function $Z(\theta) = \sum_{v} \sum_{h} \exp(-E(v, h; \theta))$ of each probability model is estimated through annealed importance sampling (AIS) (Salakhutdinov & Murray, 2008), and the label of each test example is then estimated via Equation 5.6. For the Fisher kernel calculation, we pool the training data from all classes and train a single RBM with the optimal parameters. The same training data used to train the RBM was also used to train the SVM, with the Fisher kernel calculated as follows:
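$$K(x_i, x_j) = \phi_{x_i}^{T} \phi_{x_j}, \quad \text{where } x \longrightarrow \phi_x.$$

The Fisher score $\phi_x$ is derived from the generative model as

$$\nabla_\theta \log P(x_n \mid \theta) = \left[\, S[n] \mid Q[n] \mid U[n] \,\right],$$

where

$$
\begin{aligned}
S[n] &= \nabla_W \log P(x_n \mid \theta) = \langle v h^{T} \rangle_{P_{\text{data}}} - \langle v h^{T} \rangle_{P_{\text{model}}}, \\
Q[n] &= \nabla_a \log P(x_n \mid \theta) = \langle h \rangle_{P_{\text{data}}} - \langle h \rangle_{P_{\text{model}}}, \\
U[n] &= \nabla_b \log P(x_n \mid \theta) = \langle v \rangle_{P_{\text{data}}} - \langle v \rangle_{P_{\text{model}}}.
\end{aligned}
\tag{5.8}
$$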
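As a rough illustration of Equation 5.8, the sketch below builds one Fisher vector per input by concatenating the three gradient blocks and then forms the linear kernel between them. It rests on two assumptions: the data expectation for a single example is taken with the hidden units replaced by their conditional probabilities $\sigma(vW + a)$, and the model expectations are approximated by averaging over samples drawn from the model (the array `V_model` is hypothetical; the text does not say how these expectations were obtained). The function names are illustrative only.

```python
import numpy as np


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def fisher_scores(V, V_model, W, a):
    """Fisher vectors [S | Q | U] for each row of V.

    V       : (N, n_v) binary inputs
    V_model : (M, n_v) samples assumed to come from the RBM (e.g. Gibbs chains)
    W       : (n_v, n_h) weights
    a       : (n_h,) hidden biases (assumed convention)
    Returns an array of shape (N, n_v * n_h + n_h + n_v).
    """
    ph = sigmoid(V @ W + a)            # E[h | v] for each data point
    ph_m = sigmoid(V_model @ W + a)    # E[h | v] for each model sample

    # Model expectations, approximated by sample averages.
    S_model = V_model.T @ ph_m / len(V_model)   # <v h^T>_model
    Q_model = ph_m.mean(axis=0)                 # <h>_model
    U_model = V_model.mean(axis=0)              # <v>_model

    # Per-example data statistics minus the model expectations.
    S = V[:, :, None] * ph[:, None, :] - S_model   # (N, n_v, n_h)
    Q = ph - Q_model
    U = V - U_model

    return np.concatenate([S.reshape(len(V), -1), Q, U], axis=1)


def fisher_kernel(Phi_a, Phi_b):
    """Linear kernel K(x_i, x_j) = phi_i^T phi_j between two sets of Fisher vectors."""
    return Phi_a @ Phi_b.T
```

With 784 visible and 10 hidden units, each row of the returned array has 8634 entries, which matches the corresponding entry of Table 5.4.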
The derivation of these gradients is shown in Appendix A. Figure 5.4(a) shows the classification performance achieved by each of these methods as the learning capacity of the RBM is increased by adding hidden units. For the same experiment, Figure 5.4(b) shows the CPU time consumed by each of the competing algorithms at different scales. The Fisher kernel derived from the classical RBM significantly improves on the performance attained by the RBM generative model through the maximum-likelihood approach, and is also much better than the ClassRBM at small scale. As the model becomes shallow with an increasing number of hidden units, the derived Fisher kernel shows a trend of overfitting due to the massive number of model parameters, which makes the Fisher vectors extremely large (of the order of $10^6$ components at 6000 hidden units) and prevents the classifier from generalizing well despite regularization. This experiment was carried out on the full MNIST training and test sets, with the SVM for the Fisher kernel trained through the stochastic gradient descent (SGD) learning approach suggested by Bottou et al. (2008).

Footnote 5: For the ClassRBM on the MNIST task, a batch size of 10 was maintained, as suggested in the literature, whereas for the RBM generative and Fisher kernel RBM models the full batch was used for model training.

Figure 5.4: Comparison of the classification performance achieved by the RBM generative model (η = 0.005), Fisher kernel RBM (η = 0.005) and ClassRBM (η = 0.05) on the MNIST data set. The overall computation time of training and testing is also shown on a logarithmic scale. Panels: (a) classification accuracy; (b) overall time complexity on a logarithmic scale.

Table 5.4: Growth of the Fisher vector length for the MNIST data set ($l = n_v + n_h + n_v \times n_h$).

| No. of hidden units | 1 | 10 | 100 | 1000 | 6000 |
|---|---|---|---|---|---|
| Fisher vector length | 1569 | 8634 | 79284 | 785784 | 4710784 |
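The lengths in Table 5.4 follow directly from $l = n_v + n_h + n_v \times n_h$ with $n_v = 784$. The short sketch below reproduces them and estimates the raw storage for 60,000 training vectors, assuming dense storage at 8 bytes per value. The helper names are illustrative only.

```python
def fisher_vector_length(n_hidden, n_visible=784):
    """Length of the Fisher vector: visible biases + hidden biases + weights."""
    return n_visible + n_hidden + n_visible * n_hidden


def storage_bytes(n_hidden, n_samples=60_000, bytes_per_value=8):
    """Bytes needed to hold one dense Fisher vector per training example."""
    return fisher_vector_length(n_hidden) * bytes_per_value * n_samples


for h in (1, 10, 100, 1000, 6000):
    length = fisher_vector_length(h)
    total = storage_bytes(h)
    print(
        f"h={h:5d}  length={length:8d}  "
        f"per-vector={length * 8 / 1e3:10.1f} kB  "
        f"total={total / 2**30:9.2f} GiB"
    )
```

Because the weight block grows as $n_v \times n_h$, the storage cost scales almost linearly with the number of hidden units once $n_h$ exceeds a few units.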
We emphasize the need for an online approach to training SVMs with Fisher vectors from an RBM, since storing and retrieving them becomes extremely costly with batch algorithms as the size of the data and of the RBM model increases (Sanchez et al., 2013). Table 5.4 shows how the number of parameters grows as we increase the number of hidden units of the RBM. To gauge the storage cost, consider a double-precision floating-point number of 8 bytes: a single signature of 79284 variables obtained from an RBM with 100 hidden units requires 634 KB of storage. For the whole MNIST training set of 60,000 data points, the storage required is therefore 35.4 GB (taking 1 GB as $2^{30}$ bytes). As we scale the model to 6000 hidden units, at which state-of-the-art methods have shown their best performance on MNIST (Larochelle & Bengio, 2008), the storage requirements of the Fisher vectors rise to approximately 2 TB. Note that this is not purely a storage issue, since handling terabytes of dense data makes experimentation very difficult, if not impractical. Techniques such as decomposition methods (Osuna et al., 1997) and shrinking (Joachims, 1999a) offer a way to avoid computing the full kernel matrix unnecessarily; however, reading and writing large Fisher vectors from and to the hard disk may take a significant amount of time without performing any useful calculation. To address this storage issue for high-dimensional Fisher vectors, compression techniques such as PQ encoding, locality-sensitive hashing and spectral hashing have recently been introduced (Sanchez & Perronnin, 2011). Another way of resolving this computational issue is to use the stochastic gradient descent (SGD) learning rule for training SVMs, so that the classifier learns its parameters on mini-batches of Fisher vectors of a size convenient for processing. We used this learning rule for the SVM in all of our experiments except for the Caltech 101 database, where the classification accuracy obtained through the sequential minimal optimization (SMO) algorithm was better than that of the SGD optimizer. A summary of the classification results on the digits database is shown in Table 5.5, where the Fisher kernel performance is compared with other state-of-the-art accuracies. The proposed method does not surpass the best reported performances, yet it gives results in the same league in a very small compute time.

Figure 5.5: Comparison of the CPU time taken by all techniques during the training phase; the zoomed view for small-scale models on MNIST is shown on the left-hand side. Panels: (a) zoomed view of training time complexity; (b) overall training time complexity.

Figure 5.6: Comparison of the CPU time taken by all techniques during the test phase; the zoomed view for small-scale models on MNIST is shown on the left-hand side. Panels: (a) zoomed view of test time; (b) overall test time.

Table 5.5: Performance achieved by state-of-the-art methods on the full MNIST digits data set.

| Algorithm | % Error |
|---|---|
| SVM (Gaussian kernel, c = 4, γ = 0.031, input = image pixels) | 4.51% |
| SVM (linear kernel, input = image pixels) | 2.33% |
| K-nearest neighbor (Euclidean, L2; k = 1; input = image pixels) | 5% |
| K-nearest neighbor (Euclidean, L2; k = 1; input = Fisher scores from RBM (10 hid. units)) | 4.94% |
| Convolutional neural network (CNN) (Ciresan et al., 2012) | 0.23% |
| Deep belief network (DBN) (Hinton et al., 2006) | 1.25% |
| Discriminative RBM (η = 0.05, h = 500) (Larochelle & Bengio, 2008) | 1.81% |
| ClassRBM (η = 0.005, h = 6000) | 3.39% |
| SVM Fisher kernel (h = 10) | 9% |
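The idea of training an SVM with SGD on mini-batches of Fisher vectors, as discussed above, can be sketched as follows. This is a generic binary linear SVM trained on the L2-regularised hinge loss, not the exact optimiser or settings used in the experiments. The regularisation strength `lam`, the initial step size `eta0` and the decaying schedule $\eta_t = \eta_0 / (1 + \lambda \eta_0 t)$ are assumptions chosen for illustration. A 10-class problem such as MNIST would additionally need a multi-class scheme such as one-versus-rest, which is not shown.

```python
import numpy as np


def sgd_linear_svm(batches, n_features, lam=1e-4, eta0=0.1, epochs=5):
    """Binary linear SVM trained by mini-batch SGD on the regularised hinge loss.

    batches    : re-iterable collection of (X, y) mini-batches, e.g. a list;
                 X has shape (m, n_features), y holds labels in {-1, +1}.
                 In practice this could be a loader that streams Fisher
                 vectors from disk so they never sit in memory all at once.
    Returns the weight vector w and the bias term.
    """
    w = np.zeros(n_features)
    bias = 0.0
    t = 0
    for _ in range(epochs):
        for X, y in batches:
            t += 1
            eta = eta0 / (1.0 + lam * eta0 * t)   # decaying step size (assumed)

            # Only examples that violate the margin contribute to the hinge gradient.
            margin = y * (X @ w + bias)
            active = margin < 1.0

            grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
            grad_b = -y[active].sum() / len(y)

            w -= eta * grad_w
            bias -= eta * grad_b
    return w, bias
```

Only one mini-batch needs to be resident in memory at a time, which is the property that makes this style of training attractive when the full set of Fisher vectors is too large to store densely.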