
The second model focuses on capturing the shapes of image features, leveraging the Local Binary Patterns (LBP) technique [109]. Although LBP was originally developed for texture classification [122], it has also proven effective for face recognition [2][168][179].

Briefly, in its original form (but not in the proposed approach), an LBP is a property of a pixel. All neighbouring pixels sampled at equal intervals on a circle of a given radius are examined, and a binary string is constructed in which a neighbour is assigned 1 if its intensity is greater than that of the middle pixel and 0 if it is equal or lower. Only “uniform” bit-strings, that is, binary strings with two or fewer 0-to-1 or 1-to-0 transitions, are considered, and each is assigned to a category given by the number of 1s in the string. Consequently, LBPs tend to capture curves, peaks, edges and troughs in images [109]. In the proposed approach, it is the LBP shapes formed by the labels, not the pixels, that are of most interest.
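The original, pixel-based LBP described above can be sketched as follows. This is only an illustrative sketch, not the implementation used in [109]: it assumes a radius-1 neighbourhood approximated by the 8 cells of the 3 × 3 ring (no interpolation), a clockwise bit order starting at the top-left, and a circular bit-string when counting transitions. The names `lbp_bits`, `transitions` and `uniform_category` are hypothetical.

```python
# Assumed bit order: clockwise (row, col) offsets starting at the top-left neighbour.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_bits(image, row, col):
    """Bits for an interior pixel: 1 if the neighbour is brighter than the centre, else 0."""
    centre = image[row][col]
    return [1 if image[row + dr][col + dc] > centre else 0
            for dr, dc in NEIGHBOUR_OFFSETS]


def transitions(bits):
    """Count 0-to-1 and 1-to-0 changes, treating the bit-string as circular (assumption)."""
    return sum(bits[i] != bits[(i + 1) % len(bits)] for i in range(len(bits)))


def uniform_category(bits):
    """Number of 1s for a uniform pattern, or None if the pattern is non-uniform."""
    return sum(bits) if transitions(bits) <= 2 else None


edge = [[9, 9, 1],
        [1, 5, 1],
        [1, 1, 1]]
print(uniform_category(lbp_bits(edge, 1, 1)))  # uniform pattern with two 1s
```

A pattern such as 10100000 has four transitions under this convention and would therefore be discarded as non-uniform.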

Here, the standard LBP approach is modified to treat neighbouring labels as pixels and to convert the 8-bit binary string into a decimal number, for example 00000011 = 3. This assigns all possible shapes within the 3 × 3 grid to only 256 different bins, which can then be turned into a frequency histogram by applying the procedure to the entire image label grid. Figure 7.4 depicts an example.

Figure 7.4: Representing shapes with LBP-like approach. A 256 dimension histogram is used to capture the frequency of all possible shapes from image labels.

One advantage of this approach is that it is not limited to shapes formed by any particular features. Because only the middle label is used as the reference against which its neighbouring labels are compared, shapes formed by any labels can be captured. In addition, in the original LBP approach a neighbour is assigned 1 if its intensity exceeds that of the middle pixel, and 0 otherwise. This convention was not adopted because, even though labels are represented by numbers, they have no ordinal relationship. For example, label 100 is not greater than label 2, as the labels represent different types of features rather than pixel intensities. Instead, the neighbours of the reference label are only checked for matches with it, since only matching labels are meaningfully related.
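The label-based variant can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the exact implementation evaluated in this chapter: the bit order (clockwise from the top-left, most significant bit first) and the decision to skip border cells are assumptions, and the names `shape_code` and `shapes_histogram` are hypothetical.

```python
# Assumed bit order: clockwise (row, col) offsets starting at the top-left neighbour.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def shape_code(labels, row, col):
    """8-bit code for an interior cell: a bit is 1 only if the neighbour label matches the centre."""
    centre = labels[row][col]
    code = 0
    for dr, dc in NEIGHBOUR_OFFSETS:
        bit = 1 if labels[row + dr][col + dc] == centre else 0
        code = (code << 1) | bit  # append the bit, most significant first
    return code


def shapes_histogram(labels):
    """256-bin frequency histogram of shape codes over all interior cells of the label grid."""
    hist = [0] * 256
    for row in range(1, len(labels) - 1):
        for col in range(1, len(labels[0]) - 1):
            hist[shape_code(labels, row, col)] += 1
    return hist


grid = [[7, 3, 3],
        [4, 4, 3],
        [4, 3, 3]]
print(shape_code(grid, 1, 1))  # 3, i.e. 00000011: only the last two neighbours match
```

Because the comparison is an equality test, the code is unaffected by the numeric values chosen for the labels; relabelling the codebook leaves every shape code unchanged.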

7.3 Evaluation

In this section, the datasets used to evaluate the proposed algorithms are described first, followed by the experiments performed and their results.

7.3.1 Datasets

The proposed algorithms were evaluated on three popular datasets: Caltech101 (Section 4.1), Graz02 (Section 4.2) and MIT 15 Scenes (Section 4.3).

7.3.2 Methods

The experimental setup and results are reported in this section. Multi-class classification is performed with an SVM classifier trained with the SMO learning algorithm, using the default parameters specified in WEKA V.3.5.5 [167]. All experiments are repeated ten times with different randomly selected training and testing images.

7.3.3 Experimental Results

The final result is reported as the mean and standard deviation of the accuracy over the individual runs. Results using only the proposed models are shown first, followed by results from combining the proposed models with the original frequency histogram and SPM.
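The evaluation protocol (ten runs on different random splits, reported as mean and standard deviation) can be illustrated with the sketch below. It is not the WEKA pipeline used in the experiments: `train_and_score` is a hypothetical callback standing in for training the classifier and returning its test accuracy, `n_train` is an assumed parameter because the training set sizes are not given in this section, and the use of the sample standard deviation is also an assumption.

```python
import random
import statistics


def repeated_random_splits(samples, train_and_score, n_train, runs=10, seed=0):
    """Run `runs` random train/test splits and return (mean accuracy, standard deviation)."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    accuracies = []
    for _ in range(runs):
        shuffled = samples[:]          # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        accuracies.append(train_and_score(train, test))
    # statistics.stdev is the sample standard deviation (assumption, see lead-in)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def dummy_scorer(train, test):
    """Placeholder for a real classifier: always reports 50% accuracy."""
    return 0.5


mean_acc, std_acc = repeated_random_splits(list(range(20)), dummy_scorer, n_train=15)
print(f"{mean_acc:.2%} ±{std_acc:.1f}")
```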

7.3.4 Discussion

Table 7.1: Results for Caltech101, the proposed methods combined with original SPM.

Spatial Pyramid Matching (SPM), L = 2

Method                              Accuracy
BOW Baseline, M = 200               54.90%
Pairs Frequency + SPM, M = 200      52.49% ±0.9
Pairs Frequency + SPM, M = 400      54.44% ±0.8
Pairs Frequency + SPM, M = 600      54.36% ±1.1
Shapes Frequency + SPM, M = 200     53.68% ±0.9
Shapes Frequency + SPM, M = 400     53.82% ±0.9
Shapes Frequency + SPM, M = 600     53.55% ±1.0

For such an elegant and simple attempt at capturing spatial information, the pairs frequency method is fairly effective across all three datasets. When combined with the BOW frequency histogram, it repeatedly achieved considerable improvements over the original BOW work. The method is fundamentally the same as the BOW frequency histogram but differs in what it tries to capture: instead of single features, it counts the frequency of pairs of features occurring in close proximity.
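To make the idea of counting pairs concrete, the sketch below builds a pairs histogram from a grid of labels. It is an illustration only and may differ from the definition given earlier in the chapter: it assumes that "close proximity" means 8-neighbour adjacency on the label grid and that pairs are unordered. The name `pairs_histogram` is hypothetical.

```python
from collections import Counter

# Half of the 8-neighbourhood, so that each adjacent pair of cells is visited exactly once.
FORWARD_OFFSETS = [(0, 1), (1, -1), (1, 0), (1, 1)]


def pairs_histogram(labels):
    """Count unordered pairs of labels found in adjacent cells of the label grid."""
    rows, cols = len(labels), len(labels[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in FORWARD_OFFSETS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Sort so that (1, 2) and (2, 1) fall into the same bin.
                    pair = tuple(sorted((labels[r][c], labels[nr][nc])))
                    counts[pair] += 1
    return counts


grid = [[1, 2],
        [2, 1]]
print(pairs_histogram(grid))
```

Under these assumptions a codebook of size M yields M(M + 1)/2 possible bins, which illustrates why the pairs representation grows quickly with the codebook size.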

The shapes frequency method, on the other hand, did not perform as well as the pairs frequency method, usually underperforming BOW by a few percent. The motivation behind this approach was to capture the shape of features using the LBP scheme. The main reason for the poor performance may be that only 256 bins, rather than M × 256 bins, are used to represent all possible shapes, so the histogram carries no information about which of the M labels forms each pattern. Another reason may be the size of the image patches and of the codebook. The image patch size in this work is 16 × 16, which may be too large to capture distinctive image features for the LBP method to exploit. As for the codebook, since M is relatively small, too many dissimilar image features might have been assigned the same label. This is a major disadvantage for ‘strict’ edge-capturing methods like LBP.
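The difference between 256 bins and M × 256 bins can be illustrated with a small sketch. The label-specific variant shown here was not evaluated in this chapter; it is a hypothetical construction used only to show what information the pooled histogram discards. Each observation is assumed to be a (centre label, shape code) pair with labels in 0..M-1 and codes in 0..255.

```python
def pooled_histogram(observations):
    """256 bins: the shape code is counted regardless of which label formed it."""
    hist = [0] * 256
    for _label, code in observations:
        hist[code] += 1
    return hist


def label_specific_histogram(observations, codebook_size):
    """M x 256 bins (hypothetical variant): one block of 256 shape bins per centre label."""
    hist = [0] * (codebook_size * 256)
    for label, code in observations:
        hist[label * 256 + code] += 1
    return hist


# The same shape (code 3) formed by two different labels, 0 and 5.
observations = [(0, 3), (5, 3)]
pooled = pooled_histogram(observations)
specific = label_specific_histogram(observations, codebook_size=10)
print(pooled[3], specific[0 * 256 + 3], specific[5 * 256 + 3])  # 2 1 1
```

In the pooled histogram the two observations are indistinguishable, whereas the label-specific histogram keeps them apart at the cost of M times as many dimensions.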

Table 7.2: Results for MIT 15 Scenes, the proposed methods combined with original SPM.

Spatial Pyramid Matching (SPM), L = 2

Method                              Accuracy
BOW Baseline                        79.4%
Pairs Frequency + SPM, M = 200      80.93% ±1.1
Pairs Frequency + SPM, M = 400      81.54% ±1.3
Pairs Frequency + SPM, M = 600      80.56% ±1.1
Shapes Frequency + SPM, M = 200     77.3% ±0.9
Shapes Frequency + SPM, M = 400     78.23% ±1.5
Shapes Frequency + SPM, M = 600     77.45% ±1.1

Table 7.3: Results for GRAZ-02 (Bike), the proposed methods combined with original SPM.

Spatial Pyramid Matching (SPM), L = 2

Method                              Accuracy
BOW Baseline                        66.34%
Pairs Frequency + SPM, M = 200      70.11% ±1.8
Pairs Frequency + SPM, M = 400      71.49% ±1.6
Pairs Frequency + SPM, M = 600      70.1% ±1.7
Shapes Frequency + BOW, M = 200     65.92% ±1.9
Shapes Frequency + BOW, M = 400     65.11% ±1.7
Shapes Frequency + BOW, M = 600     64.12% ±1.6

7.4 Conclusion

The goal in this work is to capture geometric information between image features, thus improving the bag-of-words model for object recognition. To this end, two novel spatial information capturing approaches were proposed: pairs frequency and shapes frequency.

Both the pairs frequency models, when combined with the BOW model, have outperformed the original BOW method by approximately 2 to 3% across three diverse datasets. The LBP representation of the shapes frequency, however, did not perform as well.

In [81], Lazebnik et al. found that their spatial pyramid matching scheme is most effective when M = 200. Although they tried different codebook sizes, they did not report any performance gains.

For the proposed methods in this chapter, the experimental results across all three datasets consistently show that the methods work best when M = 400. Perhaps the main reason is that if the codebook is too small, too many unrelated patches are grouped together, whereas if it is too large, similar features are no longer treated as the same.

The proposed approaches, though similar to earlier work by Savarese et al. [132] and Wang et al. [164], are different in many respects. The correlograms approach proposed by Savarese et al. captures the distribution of distances between all pairs of image features. These measurements are then used for classification tasks. Interestingly, in their paper, correlograms perform much worse than the standard BOW model. In comparison, the pairs-of-feature approach captures spatial information between image features, which is more reliable and efficient.

Wang et al. [164] use quite a different approach to that described in this chapter, representing objects using histograms of oriented gradients. These histograms incorporate detailed spatial distributions of object colour across different parts of the object. However, this method relies on objects having similar poses and on the images being of good quality. Under more realistic conditions, texture and shape information are either non-existent or unreliable due to low image quality. Moreover, the authors of that paper use low-level oriented gradients, whereas the proposed approaches use higher-level image features in the form of SIFT keypoints. Thus, the effectiveness of their method for the real-world object categorization problem is unclear.

Log-Polar-Based Image Subdivision and Representation

In the previous chapter, two novel approaches for capturing spatial information for the BOW model were presented. The pairs frequency approach was shown to improve on the popular spatial pyramid matching scheme.

In this chapter, new methods for exploiting spatial relationships between image features, based on binned log-polar grids, are presented. These methods work by partitioning the image into grids of different scales and orientations and computing histograms of local features within each grid. Experimental results show that the proposed approaches lead to performance improvements over the SPM scheme on three diverse datasets.
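As a rough illustration of what binning local features on a log-polar grid involves, consider the sketch below. It is not the method developed in this chapter: the grid centre, the radii `r_min` and `r_max`, the numbers of radial and angular bins, and the function names are all assumptions chosen for the example. Features are assumed to be (x, y, label) triples with labels in 0..M-1.

```python
import math


def log_polar_bin(x, y, cx, cy, r_min, r_max, n_radial, n_angular):
    """Return (radial, angular) bin indices of a point, or None if it lies beyond r_max."""
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    if r > r_max:
        return None
    if r <= r_min:
        radial = 0  # everything inside the innermost radius shares the first ring
    else:
        # Rings are equally spaced in log(r) between r_min and r_max.
        frac = math.log(r / r_min) / math.log(r_max / r_min)
        radial = min(int(frac * n_radial), n_radial - 1)
    theta = math.atan2(dy, dx) % (2 * math.pi)  # angle mapped into [0, 2*pi)
    angular = min(int(theta / (2 * math.pi) * n_angular), n_angular - 1)
    return radial, angular


def log_polar_histograms(features, centre, r_min, r_max, n_radial, n_angular, codebook_size):
    """One label histogram per log-polar bin; bins are stored ring by ring."""
    hists = [[0] * codebook_size for _ in range(n_radial * n_angular)]
    for x, y, label in features:
        position = log_polar_bin(x, y, centre[0], centre[1],
                                 r_min, r_max, n_radial, n_angular)
        if position is not None:
            radial, angular = position
            hists[radial * n_angular + angular][label] += 1
    return hists


features = [(30, 30, 2), (-30, 30, 0), (200, 0, 1)]  # the last point lies outside r_max
hists = log_polar_histograms(features, (0, 0), r_min=1, r_max=100,
                             n_radial=2, n_angular=4, codebook_size=3)
```

Because ring width grows with radius, features near the centre are binned finely while those far away are pooled coarsely; varying the centre, `r_max` or the angular offset would give grids of different scales and orientations.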