patches encoding limb sections and other key body parts. The standard bag of features completely ignores spatial information; we find, however, that spatial information is absolutely critical for inferring body pose from an image. We thus include the image coordinates as two extra dimensions⁴ in the patch descriptors while clustering to obtain the part-dictionary. Samples of clusters obtained in this fashion are shown in figure 6.3. In our experiments, however, quantizing patches using these centers proves to be incapable of capturing sufficient information to successfully regress pose.
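As a rough illustration of clustering with the image coordinates appended as two extra dimensions, the sketch below augments each descriptor with centred, scaled (x, y) positions and runs a plain k-means. The function names, the `weight` value, the number of clusters and the random stand-in data are all assumptions made for the example; they are not taken from the experiments described here.

```python
import numpy as np

def augment_with_position(descriptors, coords, weight):
    """Append centred, scaled (x, y) image coordinates to each descriptor.

    `weight` balances position against appearance; its value here is an
    illustrative assumption and would normally be tuned empirically.
    """
    coords = np.asarray(coords, dtype=float)
    centred = coords - coords.mean(axis=0)
    scale = centred.std(axis=0)
    scale[scale == 0] = 1.0          # guard against a degenerate axis
    return np.hstack([descriptors, weight * centred / scale])

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd-style k-means; returns the centres and the last assignment."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

# Random stand-ins for 128D SIFT descriptors and their patch positions.
rng = np.random.default_rng(0)
descriptors = rng.random((200, 128))
coords = rng.integers(0, 100, size=(200, 2))

X = augment_with_position(descriptors, coords, weight=2.0)
centers, labels = kmeans(X, k=8)
```

Each resulting centre has 130 dimensions: a 128D appearance part and a 2D position part, so clusters are tied to image regions as well as to appearance.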

For a comparison with our NMF based encoding (described below), we also independently cluster patches at each of the L locations on the images to identify representative configurations of the body parts that are seen in these locations. Each image patch is then represented by softly vector quantizing the SIFT descriptor by voting into each of its corresponding k-means centers, i.e. as a sparse vector of similarity weights computed from each cluster center using a Gaussian kernel. Again, such a representation gives poor predictions of pose. Below we describe an alternative encoding that proves to be much more effective. Results from the different representations are presented in § 6.4.
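The soft vector quantization described above can be sketched as follows. The kernel width `sigma` and the `keep` truncation (used here to make the vector sparse) are assumptions for illustration; the text does not specify how the width was chosen or how sparsity was obtained.

```python
import numpy as np

def soft_vq(descriptor, centers, sigma, keep=None):
    """Gaussian-kernel similarity of one descriptor to each cluster centre.

    If `keep` is given, all but the `keep` largest weights are set to zero
    (a hypothetical way of making the code sparse).
    """
    d2 = ((centers - descriptor) ** 2).sum(axis=1)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    if keep is not None:
        smallest = np.argsort(weights)[:len(weights) - keep]
        weights[smallest] = 0.0
    return weights

# Three toy 2D centres; the query coincides with the first one.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
code = soft_vq(np.array([0.0, 0.0]), centers, sigma=1.0, keep=2)
```

In this toy case the nearest centre receives weight 1, the second receives exp(-0.5), and the distant third is truncated to zero.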

6.3 Building tolerance to clutter

The dense representation provides a rich set of features that are robust to lighting and slight positional variations, and that compactly encode the contents of each patch. However, features from both the foreground (from which pose is to be estimated) and the background (which is assumed not to contain any useful pose information) are represented in a similar manner. Ideally, a single learning algorithm would learn to key on the relevant features, without being confused by the clutter in the background, and so successfully recover 3D pose from such an incompletely specified representation. For example, we have seen that the Relevance Vector Machine is, to some extent, capable of achieving this in the form of implicit feature selection (§ 3.5.1). Here, we do this in a separate phase by re-encoding the patches to explicitly remove irrelevant components from their descriptor vectors. We find that, given a training set of foreground-background labeled images, Non-negative Matrix Factorization can be usefully exploited for this purpose.

6.3.1 Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) is a recent method that has been used to exploit latent structure in data to find part based representations [85, 59]. NMF factorizes a non-negative data matrix V as a product of two lower-rank matrices W and H, both of which are constrained to be non-negative:

$$V_{d \times n} \approx W_{d \times p} H_{p \times n}, \qquad p \le d, n, \qquad V_{ij}, W_{ij}, H_{ij} \ge 0 \qquad (6.2)$$

If the columns of V consist of feature vectors, W can be interpreted as a set of basis vectors, and H as the corresponding coefficients needed to reconstruct the original data. Each column $v_i$ of V is thus represented as $v_i = \sum_j w_j h_{ji}$, where $w_j$ is the j-th column of W. Unlike other linear decompositions such as PCA or ICA [158], this purely additive representation (there is no subtraction) tends to pull out local fragments that occur consistently in the data, giving a sparse set of basis vectors.
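For concreteness, the factorization V ≈ WH can be computed with the multiplicative update rules of Lee and Seung for the squared reconstruction error, which keep both factors non-negative. This is one common algorithm, shown as a minimal sketch; the text does not state which NMF solver was actually used, and the function name, iteration count and toy data are assumptions.

```python
import numpy as np

def nmf(V, p, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative V (d x n) as W (d x p) @ H (p x n).

    Uses multiplicative updates for the Frobenius-norm objective; `eps`
    only guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    d, n = V.shape
    W = rng.random((d, p)) + eps
    H = rng.random((p, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W, H stay >= 0.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Toy data: a non-negative matrix of rank at most 4 (columns = feature vectors).
rng = np.random.default_rng(1)
V = rng.random((20, 4)) @ rng.random((4, 50))

W, H = nmf(V, p=4)
err_final = relative_error(V, W, H)
err_first = relative_error(V, *nmf(V, p=4, n_iter=1))
```

Comparing `err_final` with `err_first` (same initialization, one iteration only) shows the reconstruction error shrinking as the updates proceed, while every entry of W and H remains non-negative.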

⁴ These two extra dimensions must usually be appropriately centred and scaled by some weight to balance their effect with respect to the original descriptor. A suitable weight is generally obtained by empirically trying different values.


Figure 6.4: Exemplars, or basis vectors, extracted from SIFT descriptors over 4000 image patches located close to the right shoulder. The corresponding block is shown in figure 6.5. (Left) Representative examples selected by k-means. (Right) Much sparser basis vectors obtained by non-negative matrix factorization. These capture important contours encoding a shoulder, unlike the denser examples given by k-means.

In an attempt to identify the meaningful components of the descriptors at each patch location, we collect all the descriptors $v^l_i$ from a given location $l$ in the training set into a matrix $V^l$ and decompose it using NMF. The results of applying NMF to the 128D descriptor space at a given patch location are shown in figure 6.4. Besides capturing the local edges representative of human contours, the NMF bases allow us to compactly code each 128D SIFT descriptor directly by its corresponding vector of basis coefficients, denoted here by $h(v^l)$, giving a significant reduction in dimensionality. This serves as a nonlinear image coding that retains good locality for each patch, and the image is now represented by concatenating the coefficient vectors for the descriptors at all grid locations:

$$\phi(z) \equiv \left( h(v^1)^\top, h(v^2)^\top, \ldots, h(v^L)^\top \right)^\top \qquad (6.3)$$

Having once estimated the basis W (for each image location) from a training set, we keep it fixed when we compute the coefficients for test images. In our case, we find that the performance tends to saturate at about 30-40 basis elements per grid patch.
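The encoding of a test image with fixed bases can be sketched as below: for each grid location, only the coefficient vector is updated while W stays untouched, and the per-location coefficients are concatenated as in (6.3). The multiplicative update, the helper names `encode` and `image_feature`, and the random stand-in bases and descriptors are assumptions for illustration; only the sizes (128D descriptors, 30 basis elements) echo the text.

```python
import numpy as np

def encode(v, W, n_iter=200, eps=1e-9):
    """Non-negative coefficients h with v ≈ W @ h, keeping the basis W fixed."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    WtW = W.T @ W
    Wtv = W.T @ v
    for _ in range(n_iter):
        h *= Wtv / (WtW @ h + eps)   # multiplicative update on h only
    return h

def image_feature(descriptors, bases):
    """phi(z): concatenate h(v^l) over all L grid locations."""
    return np.concatenate([encode(v, W) for v, W in zip(descriptors, bases)])

# Random stand-ins: L grid locations, one learned basis per location.
rng = np.random.default_rng(0)
L, d, p = 6, 128, 30
bases = [rng.random((d, p)) for _ in range(L)]
descriptors = [rng.random(d) for _ in range(L)]

phi = image_feature(descriptors, bases)
```

With these sizes the image feature has L × p = 180 dimensions, compared with L × 128 = 768 for the raw concatenated descriptors.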

An interesting advantage of using NMF to represent image patches is its ability to selectively encode the components of a descriptor that are contributed by the foreground, hence effectively rejecting background. We find that by learning the bases W from a set of clean images (containing no background clutter), and using these only additively (with NMF) to reconstruct images with clutter, only the edge features corresponding to the foreground are reconstructed, while suppressing features in unexpected parts of the image. This happens because constructing the bases from a large number of background-free images of people forces them to consist of components of the descriptors corresponding to consistently occurring human parts. Now when used to reconstruct patches from cluttered images, these basis elements can add up to, at best, reconstruct the foreground components. Some examples illustrating this phenomenon are shown in figure 6.5.
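A deliberately idealized toy example may help make this argument concrete. Here the "clean" bases occupy only the first few descriptor dimensions and the clutter occupies the rest, so a purely additive reconstruction cannot express the clutter at all. Real descriptors overlap far more than this, and the basis, the coefficients and the clutter values below are invented for the illustration.

```python
import numpy as np

def encode(v, W, n_iter=500, eps=1e-9):
    """Non-negative coefficients h with v ≈ W @ h, keeping the basis W fixed."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    WtW, Wtv = W.T @ W, W.T @ v
    for _ in range(n_iter):
        h *= Wtv / (WtW @ h + eps)
    return h

# Two overlapping "foreground" basis vectors in an 8D descriptor space.
W = np.zeros((8, 2))
W[[0, 1], 0] = 1.0
W[[1, 2], 1] = 1.0

foreground = W @ np.array([2.0, 1.0])       # what a clean image would give
background = np.zeros(8)
background[4:] = [1.0, 0.5, 2.0, 1.5]       # clutter in dimensions 4..7
cluttered = foreground + background

h = encode(cluttered, W)
reconstruction = W @ h
```

The reconstruction matches the foreground component closely and is exactly zero in the clutter dimensions, because no non-negative combination of the bases can produce energy there.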
