In this section, we compare the quality of the proposals generated by our three main methods. Table 6.6 shows the results of our ACCV work (Chapter 4), WACV work (Chapter 5) and ICCV work (Chapter 6) on the Cityscapes dataset. Our ICCV work achieves the best performance. With 1,000 proposals, it improves the AR of the initial SharpMask proposals by 25.7% (from 0.160 to 0.201), almost twice the 13.8% improvement (from 0.160 to 0.182) obtained in our WACV work. This indicates that our FFD deformation network is considerably more powerful than the earlier affine transformation regression network. In addition, our last two methods clearly outperform our ACCV work, which demonstrates that this series of works has progressively advanced the performance of object segment proposal generation. Note that we compute the AR between IoU 0.5 and 1 for this comparison, to be consistent with the evaluation metrics used in previous works.
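The relative improvements quoted above can be checked with a few lines of arithmetic. The sketch below uses only the rounded AR values given in the text; the function and variable names are illustrative and not part of the original work. With the rounded values the ICCV gain comes out at about 25.6%, so the reported 25.7% was presumably computed from unrounded AR values.

```python
def relative_improvement(baseline: float, refined: float) -> float:
    """Relative gain of `refined` over `baseline`, in percent."""
    return 100.0 * (refined - baseline) / baseline


# Rounded AR values at 1,000 proposals, as quoted in the text.
sharpmask_ar = 0.160  # initial SharpMask proposals
wacv_ar = 0.182       # after affine refinement (WACV work)
iccv_ar = 0.201       # after FFD refinement (ICCV work)

wacv_gain = relative_improvement(sharpmask_ar, wacv_ar)
iccv_gain = relative_improvement(sharpmask_ar, iccv_ar)
gain_ratio = iccv_gain / wacv_gain

print(f"WACV gain: {wacv_gain:.2f}%")
print(f"ICCV gain: {iccv_gain:.2f}%")
print(f"ratio:     {gain_ratio:.2f}x")
```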

6.4 Conclusion

In this work, we address the problem of object-mask registration, which aims to align a shape mask to a target object instance. To this end, we take a transformation-based approach that predicts a 2D non-rigid spatial transform and warps the shape mask onto the target object. In particular, we propose a deep spatial transformer network that learns free-form deformations (FFDs) to non-rigidly warp the shape mask, based on a multi-level dual mask feature pooling strategy. Our network is fully differentiable and can therefore be trained in an end-to-end manner. We evaluate our FFD network on the task of refining a set of object segment proposals, and our approach achieves state-of-the-art performance on the Cityscapes, PASCAL VOC and MSCOCO datasets.

Figure 6.8: Qualitative examples for segment proposal refinement on Cityscapes. Red: original object mask. Green: aligned mask.

96 Deep Free-Form Deformation Network for Object-Mask Registration

Figure 6.9: Qualitative examples for segment proposal refinement on Cityscapes. Red: original object mask. Green: aligned mask.

Figure 6.10: Qualitative results on PASCAL VOC. Red: original object mask. Green: aligned mask.

Figure 6.11: Qualitative results on PASCAL VOC. Red: original object mask. Green: aligned mask.

Figure 6.12: Qualitative results on MSCOCO. Red: original object mask. Green: aligned mask.

Chapter 7

Conclusion and Future Direction

In this thesis, we mainly investigate the problem of object proposal generation. We have developed and implemented several algorithms to generate better object proposals, especially segment proposals. This final chapter summarizes the main contributions of this thesis and closes with suggestions of possible directions for future work.

7.1 Main Contributions

Object proposal generation has become a critical step in many computer vision tasks, such as object detection and object instance segmentation. This thesis extends object proposal generation to stereo images, proposes incorporating geometric information, semantic context and representation learning into proposal generation, and develops two transformation-based methods to refine segment proposals. In particular, we focus on three main aspects of the problem of object proposal generation: 1) generating object bounding box proposals for stereo images with geometric features and semantic context, 2) generating object segment proposals for stereo images with learned representations and a learned grouping process, and 3) learning to warp object segment proposals.

We first consider the problem of generating bounding box proposals with additional geometric information and semantic context for stereo images in Chapter 3. We compute a new objectness score for each initial bounding box proposal based on three types of features: a CNN feature, a geometric feature computed from the depth map, and a semantic context feature from pixel-wise scene labelling. We train an efficient random forest classifier to predict the objectness score. To refine the location of each proposal, we also learn a set of bounding box location regressors that fine-tune the positions of the re-ranked object proposals. We evaluate our method on the KITTI dataset and achieve a high recall rate with a fraction of the initial proposals, outperforming the state-of-the-art.

In Chapter 4, we move our focus to the problem of generating object segment proposals for stereo images. We propose to exploit both deep features and depth cues in segment proposal generation. For each image region, we extract a descriptor from convolutional feature maps and geometry maps, which encodes the image region with multi-level and multi-modal information. We learn a similarity network to estimate the affinity between two adjacent regions and, based on the predicted affinity score, we sequentially merge regions from a segmentation hierarchy to produce segment proposals. We also learn a ranking network to predict the objectness score for each segment proposal. The learned representation and perceptual grouping strategy bring a significant boost to the performance of segment proposal generation. Experiments on the Cityscapes dataset show that our approach achieves much better average recall than the state-of-the-art and that depth cues can improve the ranking of proposals.
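The sequential merging step can be illustrated with a small sketch. It is not the implementation used in Chapter 4: the learned similarity network is replaced by a hypothetical stand-in (`toy_affinity`, the negative difference of mean feature values), and regions are represented simply as sets of superpixel ids. The sketch repeatedly merges the adjacent pair with the highest affinity and records each merged region as a proposal.

```python
def greedy_merge(regions, adjacency, affinity):
    """Sequentially merge the most similar pair of adjacent regions.

    regions:   dict mapping region id -> frozenset of superpixel ids
    adjacency: set of frozenset({id_a, id_b}) pairs of adjacent regions
    affinity:  callable(region_a, region_b) -> float (higher = more similar)
    Returns the merged regions (proposals) in the order they were created.
    """
    regions = dict(regions)
    adjacency = set(adjacency)
    next_id = max(regions) + 1
    proposals = []
    while adjacency:
        # Sort the candidates first so that ties are broken deterministically.
        candidates = sorted(adjacency, key=sorted)
        pair = max(candidates,
                   key=lambda p: affinity(*(regions[i] for i in sorted(p))))
        a, b = sorted(pair)
        merged = regions[a] | regions[b]
        proposals.append(merged)
        # Neighbours of either merged region become neighbours of the new one.
        rewired = set()
        for p in adjacency:
            if p != pair:
                rewired.add(frozenset(next_id if i in pair else i for i in p))
        del regions[a], regions[b]
        regions[next_id] = merged
        adjacency = rewired
        next_id += 1
    return proposals


# Toy data: four superpixels in a chain 0-1-2-3, one scalar feature each.
features = {0: 1.0, 1: 1.1, 2: 5.0, 3: 5.2}


def toy_affinity(region_a, region_b):
    """Hypothetical stand-in for the learned similarity network."""
    mean_a = sum(features[i] for i in region_a) / len(region_a)
    mean_b = sum(features[i] for i in region_b) / len(region_b)
    return -abs(mean_a - mean_b)


initial_regions = {i: frozenset({i}) for i in features}
initial_adjacency = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
proposals = greedy_merge(initial_regions, initial_adjacency, toy_affinity)
print([sorted(p) for p in proposals])
```

In this toy example the two visually similar pairs are merged first and the whole chain last, so each intermediate region is a candidate proposal that a ranking model could then score.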

An alternative way to generate better object segment proposals is to refine an initial set of object segments. Chapter 5 presents an efficient object segment refinement method that learns spatial transforms to improve the pixel-level accuracy of the object proposals. We design a new mask pooling strategy to encode the misalignment between the segment mask and the object region. We apply the mask pooling to the hypercolumn feature maps and extract features at different levels for each segment mask. Based on these features, we design and train a deep network to predict the affine transformation parameters that warp the initial segment masks towards the groundtruth object regions. We evaluate our approach on the Cityscapes and PASCAL VOC datasets. The results demonstrate that our method consistently improves the IoU quality of the object segment proposals over state-of-the-art methods.
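To make the warping step concrete, the following sketch applies a given affine transform to a mask using inverse mapping and bilinear sampling. It is a simplified illustration under stated assumptions: the transform is expressed in pixel coordinates rather than the normalized coordinates a spatial transformer typically uses, the parameters in `theta` are hand-picked rather than predicted by a network, and samples outside the mask are treated as zero.

```python
import math


def bilinear_sample(mask, x, y):
    """Sample `mask` at continuous (x, y); outside the mask counts as 0."""
    h, w = len(mask), len(mask[0])
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0

    def px(r, c):
        return mask[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return ((1 - wx) * (1 - wy) * px(y0, x0)
            + wx * (1 - wy) * px(y0, x0 + 1)
            + (1 - wx) * wy * px(y0 + 1, x0)
            + wx * wy * px(y0 + 1, x0 + 1))


def warp_affine(mask, theta):
    """Warp `mask` with a 2x3 affine matrix `theta` (assumed convention:
    theta maps each OUTPUT pixel (x, y) to the INPUT location to sample)."""
    (a, b, tx), (c, d, ty) = theta
    h, w = len(mask), len(mask[0])
    return [[bilinear_sample(mask, a * x + b * y + tx, c * x + d * y + ty)
             for x in range(w)]
            for y in range(h)]


mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shift_right = [(1.0, 0.0, -1.0), (0.0, 1.0, 0.0)]       # sample one pixel to the left
shift_half = [(1.0, 0.0, -0.5), (0.0, 1.0, 0.0)]        # half-pixel shift

same = warp_affine(mask, identity)
shifted = warp_affine(mask, shift_right)
blurred = warp_affine(mask, shift_half)
```

Because bilinear sampling is piecewise linear in the sampling coordinates, a warp of this kind admits gradients with respect to the transform parameters, which is what allows such a warp to sit inside a trainable network.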

In Chapter 6, we propose a deep learning approach to address the object-mask alignment problem and apply it to the task of refining a set of segment proposals. Aligning a shape mask to object instances is a commonly used strategy in object segmentation, and it can also be used in object proposal generation. We build a deep free-form deformation (FFD) network to solve this problem. Our FFD network learns a non-rigid 2D transform that warps the mask onto the target object. It consists of two modules. The first module computes multi-level features based on a dual mask feature pooling method to encode the shape information of the initial mask and the image cues around the object. The second module predicts a non-rigid transform through regression and then applies the transform to the initial mask, using a grid generator and a bilinear sampler, to produce the final warped object mask. Both modules are differentiable, so the entire network can be trained in an end-to-end fashion. We evaluate the FFD network on the task of refining a set of object segment proposals. Experiments on the challenging Cityscapes, PASCAL VOC and MSCOCO datasets show that our approach achieves state-of-the-art performance.
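The grid generator and bilinear sampler can be sketched as follows. This is a deliberately simplified stand-in for the deformation model of Chapter 6, not a reproduction of it: control-point offsets on a coarse lattice are interpolated bilinearly to a dense displacement field (an FFD formulation would more commonly use B-spline basis functions), the offsets are hand-picked rather than regressed by a network, and all function names are illustrative.

```python
import math


def bilinear(grid, x, y):
    """Bilinearly interpolate `grid` at continuous (x, y); outside counts as 0."""
    h, w = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0

    def at(r, c):
        return grid[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return ((1 - wx) * (1 - wy) * at(y0, x0)
            + wx * (1 - wy) * at(y0, x0 + 1)
            + (1 - wx) * wy * at(y0 + 1, x0)
            + wx * wy * at(y0 + 1, x0 + 1))


def lattice_warp(mask, dx, dy):
    """Warp `mask` with control-point offsets `dx`, `dy` (same lattice shape).

    Grid generator: each output pixel is mapped into lattice coordinates, the
    offsets are interpolated there, and added to the pixel position.
    Sampler: the mask is read at the displaced position with bilinear weights.
    Assumes the mask has at least two rows and two columns.
    """
    h, w = len(mask), len(mask[0])
    rows, cols = len(dx), len(dx[0])
    out = []
    for y in range(h):
        v = y * (rows - 1) / (h - 1)          # lattice row coordinate
        row = []
        for x in range(w):
            u = x * (cols - 1) / (w - 1)      # lattice column coordinate
            sx = x + bilinear(dx, u, v)
            sy = y + bilinear(dy, u, v)
            row.append(bilinear(mask, sx, sy))
        out.append(row)
    return out


mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]

zeros = [[0.0, 0.0], [0.0, 0.0]]
minus_one = [[-1.0, -1.0], [-1.0, -1.0]]

unchanged = lattice_warp(mask, zeros, zeros)       # zero offsets: identity warp
shifted = lattice_warp(mask, minus_one, zeros)     # uniform offsets: pure translation
```

With non-uniform offsets the same code bends different parts of the mask by different amounts, which is the extra flexibility a lattice-based warp has over a single global affine transform.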