• No se han encontrado resultados

3.1 Clases de arbitraje

3.1.4 Por la naturaleza del conflicto

Bag of visual words was introduced in retrieval by Sivic and Zisserman in their sem- inal “Video-Google” [123] work, where they presented image/video retrieval system inspired by text retrieval. Concurrently, Csurka et al. [24] showed the interest of this ap- proach for image classification. Since then, over the years it has been one of the most popular and successful approaches proposed in the field of visual recognition, which has been extended and improved in many ways for image retrieval [60, 64] and clas- sification [107, 163]. It has been also employed by many methods for object localiza- tion [76,137,142] and action classification [34,44,79,146].

From local descriptors to global representation is a 3-step process: (1) Visual Codebook Creation, (2) Encoding and (3) Aggregation. These are common for BOW and other related global representations. To our knowledge, the NeTra toolbox [86] is the first work to follow this pipeline.

Visual vocabulary creation:The first step is to build a codebook or a visual vocabulary. It consists of a set of reference vectors known as “visual words”. These are created from a large set of local descriptors often using an unsupervised clustering approach to parti- tion the descriptor space into, say, K clusters. They are associated with K representative vectors (µ1, µ2, ..., µK), referred to as visual words. A visual vocabulary or codebook was initially built by using k-means clustering [24,123,152]. Later its variants were in- troduced such as more efficient hierarchical k-means [99] or Gaussian Mixture Model (GMM) [107] to obtain a probabilistic interpretation of the clustering. It is well known that the classification accuracy improves when increasing the size of the codebook, as experimentally shown in some works [21, 140]. There are other approaches to learn a codebook such as meanshift-clustering [68] and sparse dictionary learning [87,149].

Also methods with supervised dictionary learning have been proposed [73,96,108,152].

Encoding local descriptors:The local descriptors are encoded using the visual vocabu- lary. The simplest way is hard assignment where each descriptor is assigned to its near- est visual word. This vector quantization (VQ) is the simplest and probably the most popular encoding. Though this representation is very compact, VQ is lossy and leads to quantization errors. Choosing the vocabulary size (K ) is a trade-off between quantiza- tion noise and descriptor noise. Depending on the size (for large K or small clusters) we may have very similar descriptors assigned to different visual words. Conversely, with larger clusters very different descriptors may get assigned to the same cluster.

Several better encoding techniques have been proposed to limit quantization errors. For instance, coding techniques based on soft assignment such as [108, 152], Kernel code- book [112,141] or sparse-coding [15,87,149,158]. In Kernel codebook, each descriptor is softly assigned to all the visual words with weights proportional to exp(−d22), where

d is the distance between cluster center and descriptor. Soft-assignment has also been used with a sparsity constraint on the reconstruction weight in Mairal et al. [87] and Yang et al. [158]. Locality-constrained linear coding by Wang et al. [149] reconstructs a local descriptor by using a soft-assignment over a few of the nearest visual words from the codebook. Another coding technique, namely Super-vector coding [163], is similar to Fisher and VLAD, which we discuss in the next sub-sections.

Aggregating local descriptors:The final step is to aggregate the encoded features into a global vector representation. The features are pooled in one of the two ways: sum pool- ing or max pooling. In sum-pooling, the encoded features are additively combined into a bin or a part of the final vector. For example, in case of BOW histogram it is simply adding the count of features belonging to bin corresponding to each visual word. For max pooling, the highest value across the features is assigned for each bin, as done in Yang et al. [158]. Avila et al. [9] propose an improved pooling approach called BOSSA (Bag Of Statistical Sampling Analysis). In this work, the local descriptors are pooled (sum pooling) per cluster based on their distances from the cluster center and the result- ing histograms from all the clusters are concatenated. Recently, this approach is further improved in an approach coined BossaNova [8].

The bag-of-words is an order-less representation and thus invariant to the layout of the given image or video. A standard way of incorporating weak geometry is to divide the image into spatial grid (or spatio-temporal in case of video) and aggregate each spa- tial cell separately. This was first introduced by Lazebnik et al. [81] as spatial pyramids consisting of several layers, where in layer l the image is divided in 2l × 2l, each such cell being described by a histogram. Since then, many different types of spatial grids have been used, e.g., 3× 1 and various spatio-temporal grids for videos. The concept can be used for any of the encodings mentioned here, including VLAD and Fisher, by

Enhanced image and video representation for visual recognition

computing one encoding for each spatial region and then stacking the results.

In the case of image classification, the aforementioned global vectors are used to train a model. For retrieval with BOW, the following process is followed.

BOW in retrieval: matching with Visual Words

BOW histogram or a vector of visual words frequencies is L2 normalized. The rare visual words are assumed to be more discriminative and are assigned higher weights; this is done according to the tf-idf (term frequency–inverse document frequency) scheme [123]. The similarity measure between two BOW vectors is, most often, cosine similarity. Vo- cabulary size (K ) is a parameter which is usually very large for image retrieval (typically 20, 000 or larger). With around few thousand descriptors per image the BOW histograms are very sparse, which enables very efficient retrieval, thanks to inverted file indexing. Inverted file is a set of inverted lists, where each list is associated with a visual word. All the local descriptors corresponding to the same visual word are stored in the same list with their image ids. Another option is to store image id and number of descriptors in the image belonging to that particular visual word.

Documento similar