• No se han encontrado resultados

6. Resultados experimentales

7.2. Trabajos futuros

Recent work by Sivic and Zisserman (2003) on video and slightly earlier work by West- macott and Lewis (2003), showed a new approach to object matching within images and video footage. The approach was based on an analogy with classical text retrieval using a vector-space model. This section shows how this analogy can be applied to content based retrieval from still frames using local descriptors generated from salient regions.

4.2.1 Applying Text Retrieval Techniques to Image Retrieval using Salient Regions

In this section, the ideas and methods described above for text retrieval are taken and applied to image retrieval. The analogy used is that an image is a document, and consists of multiple terms, or ‘visual’ words. In the previous chapters, the use of saliency as a means to build image descriptions was discussed. In order to build the ‘visual’ words for an image, it is suggested that each word is formed from a local description of the image in a salient region.

4.2.1.1 Building visual words: Vector Quantisation

One immediately obvious problem with taking local descriptors to represent words is that, depending on the descriptor, there is a possibility that two very similar image patches will have slightly different descriptors, and thus there is a possibility of having an absolutely massive vocabulary of words to describe the image. A standard way to get around this problem is to apply vector quantisation to the descriptors to quantise them into a known set of descriptors. This known set of descriptors then forms the vocabulary of ‘visual’ words that describes the image. The process is essentially the equivalent of stemming, where the vocabulary consists of all the possible stems.

The next problem is that of how to design a vector quantiser. Sivic and Zisserman (2003) selected a set of video frames from which to train their vector quantiser, and used the

k-means clustering algorithm to find clusters of local descriptors within the training set of frames. The centroids of these clusters then became the ‘visual’ words representing the entire possible vocabulary. The vector quantiser then worked by assigning local descriptors to the closest cluster.

In the QMNS algorithm of Westmacott and Lewis (2003), RGB colour-space was quan- tised by splitting into regular hypercubes. The RGB-space was split into 4 bins per dimension, resulting in 64 bins total. These 64 bins correspond to a ‘visual’ vocab- ulary of 64 terms. Westmacott and Lewis also indexed bi-modal colour distributions as RGBRGB pairs and tri-modal distributions as RGBRGBRGB triples as 4096- and 262144-term vocabularies respectively.

The two approaches outlined above work well in different situations. The clustering- based approach is able to cope with very high-dimensional feature vectors, such as in the 128-dimensional SIFT features. The second approach only works well in low-dimensional spaces; for example, if we were to split each dimension of the SIFT features into 4 bins, we would have a vocabulary of 4128 1077 terms, which would be far to big to handle

practically.

In this work, both approaches were used. In order to create vocabularies for the SIFT descriptors, the batchk-means clustering algorithm was used: vocabularies were created for each of the image-sets by randomly sampling 100,000 SIFT features and clustering for a number of different k values with randomly chosen start points. The clustering was performed a number of different times for each k value in order to select the best vocabulary. Each image in the image-sets then had its SIFT descriptors quantised by assigning the descriptor to the closest cluster. In order to create visual terms from the dominant colour descriptor described in 3.3, the second quantisation approach was used. Instead of indexing the raw RGB values, the colours are converted to Hue and Saturation values and these are quantised. This is to enable partial illumination invariance within the descriptors, as well as make the colour-space more like the perceptual space. The Hue and Saturation values are quantised into 60 terms by binning the Hue into 30o segments, and the Saturation into 5 bins, as shown in Figure 4.3. Colour pairs are represented as terms in a 3600-term vocabulary. In order to keep the vocabulary size down, salient regions with more than two dominant colours are represented by the two most-dominant colours within the region.

Zipf Again. Given the ubiquity of Zipf’s law in natural languages, it is interesting to see if it holds for the pseudo-artificial vocabulary of visual words created by the vector quantisation of feature vectors. It would also be interesting to investigate whether Zipf’s law can be used in choosing the optimal size of vocabulary. Figure 4.4 illustrates the

Figure 4.3: Illustration of how the hue and saturation are quantised to form a vocab- ulary of colour ‘visual’ terms. Each segment of the colour wheel represents a ‘visual’

term. 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0 2000 4000 6000 8000 10000 12000 Word Frequency

Word Rank (sorted by Frequency)

3000 Word Vocabulary 6000 Word Vocabulary 12000 Word Vocabulary

Figure 4.4: Rank-frequency curves for the ‘visual’ words of varying size vocabularies. The curves are generally Zipfian in nature, but the smaller vocabulary shows a large

drop-off at its tail, possibly indicating that the vocabulary is too small.

rank-frequency curves calculated over the entire Washington data-set using a range of vocabulary sizes. It is interesting to note that each of the curves is approximately Zipfian, although the tail end of the smaller vocabulary curves has a noticeable non- Zipfian drop-off. A question arises as to whether this could be an indication that the vocabulary is somehow deficient, due to a lack of more descriptive visual terms.

Figure 4.5: Generating vectors of occurrences of ‘visual’ terms from an image.

4.2.1.2 Image Retrieval based on visual words

Given the lists of ‘visual’ words for each image in the data-set, the next stage is to calculate a word-frequency vector to represent each image. The overall process of getting from an image to a vector of word occurrences is shown in Figure 4.5.

The Classical Approach Thetf-idfweighting (Equation 2.1) is used to weight each element of the word-frequency vector. In order to perform actual retrieval, a query vector, Vq is constructed from the query image, and all the documents in the database

are ranked by the normalised scalar product (Equation 2.2) between the query vectorq

and each document vector Vd.

Stop Lists and Spatial Consistency. As with text retrieval, it is possible to apply a stop-list analogy to the ‘visual’ words. Currently, this has not been implemented,

however, it is easy to see that by adding the most common ‘visual’ words to a stop-list, much of the noise from non-discriminatory words will be removed.

It is also possible to apply constraints based on the spatial arrangement of the salient regions. This is akin to text-search methodologies where the rank of the document is increased if the query terms occur close together in that document. In terms of images, the spatial arrangement could be rigid, and tested by checking for consistent homographies between clusters of matches as in Chapter 5. Alternatively this could be measured loosely by just requiring that neighbouring matches in the query lie in a surrounding area in the retrieved image.

The Latent Semantic Indexing Approach The LSI approach to text retrieval can also be applied to image retrieval using salient regions. Given the list of ‘visual’ words for each image in the data-set, it is possible to construct the term-document matrix, apply log-entropy weighting and decompose into subspace, just as one would for text documents.