
Fig. 4.1 and Fig. 4.2 compare the results of the proposed approach with those of normalized cuts with post-segmentation segment matching. The segmentation obtained by normalized cuts is inconsistent across frames. Our method significantly outperforms both normalized cuts and GMM-EM¹ and returns video segmentations that are both of high quality and consistent across frames (i.e., the same object is consistently assigned to the same cluster, denoted by the same color, throughout the whole video sequence). Fig. 4.3 shows the human ratings for the ballet sequence (see Fig. 4.1). In this quantitative assessment of segmentation quality, the SPMM greatly outperforms the normalized cut method².

As a sanity check, we also compared the segmentation results of our method with and without temporal coherence. Using temporal coherence significantly improved the segmentation quality. Please refer to the supplemental material of [AZMP07] for the complete video sequence as well as other videos.

[Figure 4.3 plot: "Ballet video sequence"; bars: Our, Ncut; ratings: Good, OK, Bad]

Figure 4.3: Human Ratings. Six people rated the video segmentation results of a subset of all the frames in the “ballet” sequence. As for the results in Section 3.5, the possible ratings were: good, OK, or bad. The plots show the rating statistics for the SPMM with video coherence (top bar) and for the normalized cut (bottom bar). Each bar is split into three parts whose sizes correspond to the fraction of images assigned to the corresponding rating. Better overall performance corresponds to less red and more blue. Our method clearly outperforms normalized cut.

1 The GMM-EM model we use for comparison in our experiment is closely related to the model of Khan and Shah [KS01], with the main difference that no information of local velocity is used in the clustering. The segmentation obtained by GMM-EM in our comparisons is consistent across frames but is of poor quality due to the complex shapes of the segments.

2For the video sequence of Fig. 4.1 the GMM-EM method fails to converge. Therefore, no human ratings

Chapter 5

Segmenting Image Collections

We can extend the probabilistic model of Chapter 3 for the simultaneous segmentation of an image collection. When all the images in the collection share objects that have similar characteristics (see Fig. 5.2, top row) we can improve the segmentation by sharing information across images. For example, in Fig. 5.2, since all the pictures show a person’s head (and shoulders), it is possible to use the consistency of these elements’ appearance (color, shape, position) across images to improve segmentation quality, as well as provide coherent segment labels across images.

5.1 Semi-parametric LDA model (SP-LDA)

Hence, we propose the new probabilistic model of Fig. 5.1, where K segments are shared across a collection of M images. These shared segments are described by the distributions $f^{s}_{k}$, with k the segment label and the superscript s indicating that the distribution is “shared”. We also assume that each image has H additional segments that are not shared across the collection. These image-specific segments are described by the distributions $f^{ns}_{h,m}$ where

Figure 5.1: Semi-parametric Latent Dirichlet Allocation model (SP-LDA) for joint segmentation of image collections (see Section 5.1). As in Fig. 3.1, the gray node $x_{mn}$ represents the observed quantities (feature vector n for image m in the collection). The node $c_{mn}$ represents the segment assignment for the observation $x_{mn}$. The node $\theta_m$ represents the mixing coefficients for each segment in image m. The rounded box α is the hyperparameter of the Dirichlet distribution of $\theta_m$. The inner plate represents the $N_m$ pixels in image m, while the outer plate represents all the M images in the collection. The K distributions $f^{s}_{k}$ model the recurring objects in the collection and are shared across all the images. The H distributions $f^{ns}_{h,m}$ are local to each image, i.e., independent of the rest of the collection, and represent the image-specific segments.

Since these distributions are not shared across images we use the superscript ns for them. Given K and H, the total number of segments in each image is K + H. If we set the number of shared segments K to zero we obtain the single-image case, while if H is set to zero we force all the segments in an image to be shared across the collection; in Section 5.2 we will explore the effect of different choices.

We represent both the shared distributions $f^{s}_{k}$ and the image-specific ones $f^{ns}_{h,m}$ using the semi-parametric representation described in Section 3.2.3. We call the probabilistic model of Fig. 5.1 with the semi-parametric representation semi-parametric latent Dirichlet allocation (SP-LDA). For the shared distributions $f^{s}_{k}$

as providing a prior or bias toward a particular region of the feature space (the position and color of the segments' pixels). This bias represents appearance and shape properties of the common objects in all the images.

To perform inference, we use the sampling method developed for the single-image case (see Section 3.4.1), with the exception that the parameters of the Gaussian terms of shared segments are computed using observations from all the images. The non-parametric terms of the shared segments are computed independently for each image as for the single-image algorithm.
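The difference between shared and image-specific Gaussian terms can be illustrated with a minimal sketch. It assumes that each image is given as an array of feature vectors together with the current segment assignments; the function names and the toy data are hypothetical and are not taken from the original implementation, and the sampler itself (Section 3.4.1) is not reproduced here.

```python
import numpy as np

def gaussian_params(features):
    """Mean and covariance of an (n, d) array of feature vectors."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def shared_gaussian_params(features_per_image, labels_per_image, k):
    """Pool the observations assigned to shared segment k over ALL images
    before estimating the Gaussian parameters (hypothetical helper)."""
    pooled = np.concatenate(
        [x[c == k] for x, c in zip(features_per_image, labels_per_image)]
    )
    return gaussian_params(pooled)

rng = np.random.default_rng(0)
# Two toy "images", each with 200 five-dimensional features (x, y, R, G, B).
features = [rng.normal(size=(200, 5)) for _ in range(2)]
# Toy segment assignments with two segments per image.
labels = [rng.integers(0, 2, size=200) for _ in range(2)]

mean, cov = shared_gaussian_params(features, labels, k=0)
print(mean.shape, cov.shape)  # (5,) (5, 5)
```

For an image-specific segment, the same `gaussian_params` call would be applied to the observations of a single image only.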

5.2 Experiments

To study the performance of the SP-LDA model of Fig. 5.1 we consider a collection of 30 images, all showing the face (and the shoulders) of different people in different indoor scenes (varying background). To determine which parts of the image are assigned to a shared segment and which parts to a not-shared segment, we test different values of K (number of shared segments) and H (number of image-specific segments).

Fig. 5.2 shows six images from the collection (first row), their ground truth segmentation (second row)¹ of the face (blue segment), and several segmentation results for different values of H and K. When no information is shared among the images (third and sixth rows) the resulting segmentation is not precise in selecting the face. Often it merges the face with part of the scene background, particularly when only 2 segments are used (third row). Moreover, the segment containing the face is not consistently labeled across the images (see sixth row). When one or more segments are shared across the images, they are assigned to the recurring elements of the collection: the face and the shoulders. This results in both an improvement in the segmentation of the face and a consistent labeling of the segment of a recurring object across different images. In particular, when one segment is shared and one is image-specific (K = 1, H = 1) the face and the shoulders are almost always assigned to the shared segment (yellow), while the remaining part of the scene is assigned to the image-specific segment (red), as shown in the fourth row. When there are two shared segments and an image-specific one (K = 2, H = 1) the segmentation of the face improves further. One of the shared segments captures the faces (red) and the other the shoulders (yellow), which are no longer grouped together with the face (seventh row). Again the rest of the scene is assigned to the image-specific segment (green). Finally, we observe that forcing all the segments to be shared (fifth and eighth rows) results in worse segmentation than the case with image-specific segments. This is most likely a result of the mismatch between the model, which assumes all segments are recurring, and the dataset, which shows faces (a recurring object) on varying backgrounds.

[Figure 5.2 rows: Input Image; Ground truth Segmentation; No Sharing (K = 0, H = 2); Sharing (K = 1, H = 1); Sharing (K = 2, H = 0); No Sharing (K = 0, H = 3); Sharing (K = 2, H = 1); Sharing (K = 3, H = 0)]

Figure 5.2: Segmenting an image collection. First row: six examples out of a collection of 30 images of faces on different backgrounds. Second row: corresponding ground truth segmentation of the face. Rows three to five: binary segmentations with different numbers of shared segments. Rows six to eight: segmentation in three segments with different numbers of shared segments. K is the number of shared segments and H is the number of image-specific ones.

[Figure 5.3 plot: axes Recall and Precision; iso-F curves from F=0.1 to F=0.6]

Figure 5.3: Precision/recall for the face collection. Different markers correspond to the performance of the SP-LDA model (Fig. 5.1) for different settings of the parameters K (number of shared segments) and H (number of image-specific segments). The green curves correspond to precision/recall values with the same harmonic mean (F measure [Rij79]).

1 The ground truth considers only the face and disregards other parts of the person like the neck and the

The qualitative observations for Fig. 5.2 are confirmed by the precision/recall results presented in Fig. 5.3. Without sharing (i.e., setting K = 0) we have the lowest performance² (black and magenta circles). These results are almost equivalent to a random guess, since the face will have random labels across the images. Performance improves when we share information for some segments, and one segment is image-specific. In particular the K = 2, H = 1 case gives the best results (red triangle). Finally, for a fixed number of total segments, sharing all the segments (green and cyan crosses), i.e., setting H = 0, always results in worse performance than keeping one segment image-specific, i.e., H = 1. This can be seen by comparing the positions of crosses and triangles.
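As an illustration of how such precision/recall values and their harmonic mean (the F measure) can be computed for one image, the following is a minimal sketch. It assumes a label map and a boolean ground-truth mask, and picks the face segment as the label with the largest overlap with the ground truth; the function names and the toy arrays are hypothetical, and the special case in which a single shared segment is taken to be the face segment is not handled.

```python
import numpy as np

def face_label(segmentation, ground_truth):
    """Label whose pixels overlap the ground-truth mask the most."""
    labels = np.unique(segmentation)
    overlaps = [np.sum((segmentation == l) & ground_truth) for l in labels]
    return labels[int(np.argmax(overlaps))]

def precision_recall_f(segmentation, ground_truth):
    """Precision, recall and F measure (harmonic mean) for the face segment."""
    predicted = segmentation == face_label(segmentation, ground_truth)
    true_pos = np.sum(predicted & ground_truth)
    if true_pos == 0:  # no overlap at all: avoid dividing by zero
        return 0.0, 0.0, 0.0
    precision = true_pos / predicted.sum()
    recall = true_pos / ground_truth.sum()
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy 4x4 image: the face covers 4 pixels.
ground_truth = np.zeros((4, 4), dtype=bool)
ground_truth[1:3, 1:3] = True
# Segment 1 covers 6 pixels, including all 4 face pixels.
segmentation = np.zeros((4, 4), dtype=int)
segmentation[1:3, 1:4] = 1

p, r, f = precision_recall_f(segmentation, ground_truth)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 1.0 0.8
```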

The computational cost of performing inference on the model of Fig. 5.1 is linear in the number of images and in the total number of segments K +H in each image. Using our C++ implementation of the sampler it takes about 185 sec. per image per segment on a 2.50GHz Intel Xeon machine.
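A back-of-the-envelope estimate follows directly from these figures. The sketch below assumes the cost is exactly linear with the reported constant of 185 seconds per image per segment; the function name is hypothetical, and actual running times will depend on the hardware and the data.

```python
SECONDS_PER_IMAGE_PER_SEGMENT = 185  # figure reported in the text

def estimated_runtime_seconds(num_images, num_shared, num_specific):
    """Linear cost model: images x (K + H) segments x constant."""
    return SECONDS_PER_IMAGE_PER_SEGMENT * num_images * (num_shared + num_specific)

# The face collection with K = 2 shared and H = 1 image-specific segments.
seconds = estimated_runtime_seconds(num_images=30, num_shared=2, num_specific=1)
print(seconds)                      # 16650
print(f"{seconds / 3600:.1f} h")    # 4.6 h
```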

The SP-LDA model can handle images like the ones in Fig. 5.2. For more complex situations, with many more recurring objects that might not appear in all the images of the collection, the inference algorithm for the SP-LDA fails to converge. For this more general problem we present a new model in Section 6 that can handle variable content in images and is capable of modeling the appearance of more general categories.

2 To decide which segment label corresponds to the face segment, we select the segment with the largest overlap with the ground truth. However, when a single segment is shared we assume that segment to correspond to the face segment.

Chapter 6

Learning Categorical Segments in Image Collections

In the SP-LDA model of Section 5 we used mean and covariance of the semi-parametric distributions as shared statistics for the position/RGB value across images. For the collection of faces we considered in our experiments this is a good modeling choice since the recurring object (the face) has similar shape and color in all the images. However, for recurring objects with textured appearance and varying position and shape, a more complex representation is required.

6.1 Modeling recurring segments

Inspired by the “bag-of-words” approach [FFP05, SRE+05] we extend the model in Fig. 3.1 by adding new observed variables $w_{mn}$ that represent the visual words associated with an observation. These new discrete random variables are sampled from K different multinomial distributions $\phi_k$ (topic distributions) which model the visual words’ statistics for each of the K segments. Fig. 6.1 shows the graphical representation of the extended model.

Figure 6.1: The affinity-based LDA model (A-LDA) for learning categorical segments (see Section 6). The two gray nodes $x_{mn}$ and $w_{mn}$ represent the observed quantities in the model: the feature vector (position and color) and the visual word associated with each pixel, respectively. The nodes $c_{mn}$, $f_{k,m}$, $\phi_k$, and $\theta_m$ are hidden quantities that represent the segment assignment for $x_{mn}$ and $w_{mn}$, the probability density of the feature vectors in segment k of image $I_m$, the visual words distribution for segment k, and the sizes of the segments in image m, respectively. The two squares with rounded corners α and ε represent the hyperparameters of the Dirichlet distributions over $\theta_m$ and $\phi_k$, respectively. Finally, K is the number of segments, $N_m$ is the number of pixels in image m, and M is the number of images in the collection.

The model represents a collection of M images. An image is represented by $N_m$ regularly spaced observations (e.g., one sample per pixel). At the n-th observation of image m we measure a feature vector $x_{mn}$, e.g., the pixel’s position and RGB values. We further extract a fixed size image patch centered at the n-th pixel and assign to it a “visual word” $w_{mn}$. In our implementation, the dictionary of visual words is obtained by vector-quantizing a subset of all the descriptors of the patches extracted from all the images. The $w_{mn}$ variable of an observation is the label of the dictionary entry closest to the descriptor associated with the observation.
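The dictionary construction and word assignment described above can be sketched as follows. The text only states that the dictionary is obtained by vector quantization; the use of plain k-means (Lloyd's algorithm) with Euclidean distance, the function names, the dictionary size, and the random toy descriptors are all assumptions made for illustration.

```python
import numpy as np

def assign_words(descriptors, codewords):
    """Index of the nearest codeword (Euclidean distance) for each descriptor."""
    diffs = descriptors[:, None, :] - codewords[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def build_dictionary(descriptors, num_words, num_iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed quantizer); returns (num_words, d) codewords."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=num_words, replace=False)
    codewords = descriptors[start].astype(float)  # fancy indexing yields a copy
    for _ in range(num_iters):
        words = assign_words(descriptors, codewords)
        for j in range(num_words):
            members = descriptors[words == j]
            if len(members) > 0:  # keep the old codeword if a cluster is empty
                codewords[j] = members.mean(axis=0)
    return codewords

rng = np.random.default_rng(1)
descriptors = rng.normal(size=(500, 16))   # toy patch descriptors
subset = descriptors[::5]                  # dictionary built from a subset only
codewords = build_dictionary(subset, num_words=8)
w = assign_words(descriptors, codewords)   # one visual word per observation
print(codewords.shape, w.shape)  # (8, 16) (500,)
```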

Each image is formed by K regions (segments) whose visual words statistics are shared across images. Segment k in image m has a probability distribution $f_{k,m}$ of feature vector

of an object, which is captured by the $\phi_k$ distributions, is similar in all images. On the other hand, the position of an object in a particular image can be assumed independent of the position in other images. For example, a car can appear in various image locations. However, its overall appearance, as described by the visual words, is the same in all images. We model the segment distributions $f_{k,m}$ using the nonparametric model proposed in Chapter 3, while for $\phi_k$ we use an LDA model, as proposed in [FFP05] and [SRE+05]. Thus if we remove the $x_{mn}$ node from the graphical model we obtain the LDA model. Removing the $w_{mn}$ node from the model yields a collection of M independent models, like the ones described in Chapter 3. We call this new model affinity-based latent Dirichlet allocation (A-LDA) since we are using the affinities between pixels (see Eq. 3.3) to describe the segment distributions $f_{k,m}$.

In the A-LDA model, visual words are grouped by segments. This enables learning topics that are related to object parts rather than to whole scenes, as is done with the “bag of words” representation of whole images [FFP05]. A key aspect of the proposed model is that the densities $f_{k,m}$ allow grouping of all the visual words generated from the corresponding topic distribution $\phi_k$ into a single image segment. Moreover, it is possible to enforce different grouping properties by choosing different forms for the densities $f_{k,m}$. Assuming a Gaussian distribution over the pixel positions in the image, as in Sudderth et al. [STFW05], results in a spatially elliptical cluster of visual words generated from the topic $\phi_k$. Assuming a non-parametric distribution (see Section 3.2.1) results in a more complex grouping based on color information as well as position in the image.

An important remark is that the A-LDA model assumes that the feature vector $x_{mn}$ and the visual word $w_{mn}$ of a given pixel are independent given the topic assignment $c_{mn}$ for the pixel. It also assumes that visual words are independent given their hidden labels. These two assumptions are theoretically incorrect. The two random variables $w_{mn}$ and $x_{mn}$ are correlated, since both depend on the image patch centered on pixel n. The same is true for the visual words of close (overlapping) patches. However, ignoring these dependencies results in a simpler probabilistic model.
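Written out, the conditional independence assumption corresponds to the factorization below, which also gives the form of the posterior over the segment assignment implied by the graphical model of Fig. 6.1. This is a sketch under the stated assumptions; the notation $\theta_{m,k}$ for the k-th component of $\theta_m$ and $\phi_k(w)$ for the probability of word $w$ under topic k is introduced here for illustration.

```latex
p(x_{mn}, w_{mn} \mid c_{mn} = k) = f_{k,m}(x_{mn}) \, \phi_k(w_{mn}),
\qquad
p(c_{mn} = k \mid x_{mn}, w_{mn}, \theta_m) \propto \theta_{m,k} \, f_{k,m}(x_{mn}) \, \phi_k(w_{mn}).
```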

The densities $f_{k,m}$ and the distributions $\phi_k$ have complementary roles in the model. The density $f_{k,m}$ models segment k in a specific image m, and it forces pixels with high affinity to be grouped together. The multinomials $\phi_k$ couple together segments in different images of the collection, i.e., they force segments in different images to have the same visual words statistics. All the multinomial coefficients of the $\phi_k$ are sampled from the same prior distribution, a symmetric Dirichlet distribution [BNJ03] with (scalar) parameter ε:

$$\phi_k \sim \mathrm{Dir}(\varepsilon), \qquad w_{mn} \mid \phi_k \sim \mathrm{Multinomial}(\phi_k). \tag{6.1}$$

The K topic/segment distributions are not image-specific like the densities $f_{k,m}$, but rather are shared within the entire collection. This allows coupling segment appearance statistics across multiple images based on the distribution of visual words they contain. However, in a particular image of a collection there may be objects that do not appear in other images.
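The generative step of Eq. 6.1 can be simulated in a few lines. The sketch below is only illustrative: the dictionary size `V`, the value of the Dirichlet parameter `eps`, and the uniformly drawn segment assignments `c` are assumptions (in the model the assignments are governed by $\theta_m$), and none of these values come from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, eps = 3, 50, 0.5   # segments, dictionary size, Dirichlet parameter (assumed)

# phi_k ~ Dir(eps): one distribution over the V visual words per segment.
phi = rng.dirichlet(np.full(V, eps), size=K)   # shape (K, V), rows sum to 1

# Placeholder segment assignments for 1000 observations (uniform here).
c = rng.integers(0, K, size=1000)

# w_mn | phi_k ~ Multinomial(phi_k): draw one visual word per observation.
w = np.array([rng.choice(V, p=phi[k]) for k in c])
print(phi.shape, w.shape)  # (3, 50) (1000,)
```

Smaller values of `eps` concentrate each topic on fewer visual words, while larger values make the topics closer to uniform.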

model gives similar results to the one of Fig. 6.1.