training data, since temporal dynamics in video offer rich information to distinguish objects from background and estimate their shapes more accurately. Souly et al. (2017) employ Generative Adversarial Networks (GANs) in semi-supervised learning for semantic segmentation, leveraging available image-labeled data and additional synthetic data to improve on fully supervised methods.
Employing attention maps has been shown to improve weakly supervised semantic segmentation (Roy and Todorovic, 2017; Wei et al., 2017). Roy and Todorovic (2017) model visual attention maps using the rectified Gaussian distribution, resulting in an improved spatial smoothness of attention maps per object class. Wei et al. (2017) propose an adversarial erasing scheme in order to obtain better attention maps which in turn provide better cues for the training.
The works most closely related to the method proposed in Chapter 5 are Wei et al. (2015) and Chaudhry et al. (2017), which also use saliency as a cue to improve weakly supervised semantic segmentation. However, they differ from the approach proposed in Chapter 5 in a number of ways. In contrast to our work, Wei et al. (2015) use curriculum learning to expose the segmentation convnet first to simple images (single object category) and later to more complex ones (multiple objects). For saliency they use a manually crafted class-agnostic method, while we use a deep learning based one, which provides better cues. Their training procedure uses ∼40k additional images of the classes of interest crawled from the web; we do not use such class-specific external data. Compared to the work of Wei et al. (2015) we report significantly better results, showing in a better light the potential of saliency as additional information to guide weakly supervised semantic object labelling.
Most recently, Chaudhry et al. (2017) have proposed to combine saliency and attention maps to boost performance. They use fully convolutional attention maps to localize the class-specific regions and a hierarchical approach to discover the class-agnostic salient regions that estimate the extent of the object. These two cues are then combined to obtain an approximate pixel-level, class-specific ground truth to train a segmentation network. In contrast to the approach proposed in Chapter 5, they use additional supervision in the form of class-agnostic segmentation masks to train a saliency detector and employ a more powerful ResNet architecture (He et al., 2016). The seminal work of Vezhnevets et al. (2011) proposed to use “objectness” maps from bounding boxes to guide the semantic segmentation task. Because they are derived from bounding boxes, these maps end up being diffuse; in contrast, the saliency maps in Chapter 5 provide sharper object boundaries, thus giving better information to guide the semantic labeller.
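The idea of fusing a class-agnostic saliency map with class-specific attention maps into approximate ground truth can be illustrated with a minimal sketch. This is not the procedure of Chaudhry et al. (2017) or of Chapter 5; the function name `combine_cues`, the two thresholds and the ignore label 255 are assumptions chosen only for illustration. Saliency decides where an object is, attention decides which class it is, and pixels where the two cues disagree are left unlabelled.

```python
IGNORE = 255      # assumed convention: pixels excluded from the training loss
BACKGROUND = 0

def combine_cues(attention, saliency, sal_thresh=0.5, att_thresh=0.5):
    """Fuse per-class attention maps with one saliency map.

    attention: dict mapping class id -> 2D list of scores in [0, 1]
    saliency:  2D list of class-agnostic scores in [0, 1]
    Returns a 2D list of pseudo-labels (hypothetical fusion rule).
    """
    height, width = len(saliency), len(saliency[0])
    labels = [[BACKGROUND] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if saliency[y][x] < sal_thresh:
                continue  # not salient: keep the background label
            # Pick the class whose attention is strongest at this pixel.
            best_class, best_score = max(
                ((c, amap[y][x]) for c, amap in attention.items()),
                key=lambda item: item[1],
            )
            # Salient but no confident class: mark as ignore.
            labels[y][x] = best_class if best_score >= att_thresh else IGNORE
    return labels

saliency = [[0.9, 0.8, 0.1],
            [0.7, 0.6, 0.0]]
attention = {
    1: [[0.9, 0.2, 0.0], [0.8, 0.1, 0.0]],
    2: [[0.1, 0.7, 0.9], [0.0, 0.3, 0.0]],
}
pseudo_labels = combine_cues(attention, saliency)
print(pseudo_labels)
```

In this toy input the top-right pixel has strong attention for class 2 but low saliency, so it stays background, while the salient pixel without a confident class is marked as ignore.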
2.3 Instance segmentation
In contrast to instance-agnostic semantic labelling, which groups pixels by object class, instance segmentation groups pixels by object instance. Instance segmentation is a challenging task because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
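The difference between the two output formats can be made concrete with a small sketch. The helper `to_semantic`, the toy maps and the class id 15 are hypothetical; the point is only that collapsing instance ids to class ids discards the information of which pixels belong to which object.

```python
def to_semantic(instance_map, instance_class):
    """Collapse an instance-id map into a class-id map (0 = background)."""
    return [[instance_class.get(i, 0) for i in row] for row in instance_map]

# Two separate objects (ids 1 and 2) that share the same class.
instance_map = [[1, 1, 0, 2, 2],
                [1, 1, 0, 2, 2]]
instance_class = {1: 15, 2: 15}  # hypothetical class id, e.g. "person"

semantic_map = to_semantic(instance_map, instance_class)

num_instances = len({i for row in instance_map for i in row if i != 0})
num_classes = len({c for row in semantic_map for c in row if c != 0})
print(semantic_map, num_instances, num_classes)
```

Both objects receive the same value in `semantic_map`, so the semantic labelling alone cannot tell that the image contains two objects.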
and Gool, 2015; Hosang et al., 2015). Some methods first rely on detecting individual objects (Girshick et al., 2014; Dai et al., 2016c; Girshick, 2015; Ren et al., 2015), for which a segmentation mask is then produced. Given a bounding box (e.g. selected by a detector), GrabCut (Rother et al., 2004) variants can be used to obtain an instance segmentation, e.g. (Lempitsky et al., 2009; Cheng et al., 2015a; Taniai et al., 2015; Tang et al., 2015; Yu et al., 2015; Xu et al., 2017).
Earlier methods (Dai et al., 2015b; Hariharan et al., 2014, 2015) make use of bottom-up segments (Pont-Tuset et al., 2016; Uijlings et al., 2013; Krähenbühl and Koltun, 2015, 2014). Hariharan et al. (2014) employ Fast-RCNN bounding boxes (Girshick, 2015) and build a multi-stage pipeline to extract CNN features and segment the object. This framework was later improved by the use of Hypercolumn features (Hariharan et al., 2015) and the utilization of a fully convolutional network (FCN) to encode class-specific shape priors (Li et al., 2016a). Arnab and Torr (2016) further reason about multiple object proposals to handle occlusions where single objects are split into multiple disconnected patches.
DeepMask (Pinheiro et al., 2015) and follow-up works (Pinheiro et al., 2016; Dai et al., 2016a) learn to generate segment proposals using deep CNNs, which are then classified by Fast-RCNN (Girshick, 2015) and refined to achieve better segmentation boundaries. Similarly, Dai et al. (2016b) propose a complex multiple-stage cascade that predicts instance masks from bounding-box proposals and semantically labels the masks in sequence. Zagoruyko et al. (2016) use a modified R-CNN model (Girshick et al., 2014) to propose instance bounding boxes, followed by further refinement to obtain instance-level object masks. Ultimately, these approaches suffer from the fact that they predict a binary mask within the bounding-box proposals, making the system slower and less accurate.
Li et al. (2017c) propose to combine the object detection approach of Dai et al. (2016c) and the segment proposals of Dai et al. (2016a) for fully convolutional instance segmentation (FCIS), predicting a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object boxes, masks and semantic classes, making the system fast. However, this approach may exhibit errors and spurious edges on overlapping instances. Bai and Urtasun (2017) combine intuitions from the classical watershed transform and deep learning to produce an energy map of the image where object instances are represented as energy basins. This method has constant runtime regardless of the number of object instances.
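To illustrate how an energy map can yield instances, the following sketch cuts a toy energy map at a single level and labels each resulting 4-connected basin. It is not the network or post-processing of Bai and Urtasun (2017); the function `basins`, the toy values and the convention that lower energy lies deeper inside a basin are assumptions made for this example.

```python
def basins(energy, cut):
    """Label 4-connected regions whose energy is below `cut`.

    Returns (label map, number of basins); 0 marks ridge or background.
    """
    height, width = len(energy), len(energy[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if energy[y][x] >= cut or labels[y][x]:
                continue  # ridge/background, or already assigned
            count += 1
            labels[y][x] = count
            stack = [(y, x)]
            while stack:  # flood fill the current basin
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and energy[ny][nx] < cut and not labels[ny][nx]):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count

# Two basins separated by a ridge of height 5 (toy values).
energy = [[9, 9, 9, 9, 9, 9, 9],
          [9, 2, 1, 5, 1, 2, 9],
          [9, 2, 1, 5, 2, 1, 9],
          [9, 9, 9, 9, 9, 9, 9]]

labels, count = basins(energy, cut=4)
print(count, labels[1])
```

With the cut below the ridge height the two basins are labelled separately; raising the cut above the ridge (for example `cut=6`) merges them into one region, which shows why the level of the cut matters in this simplified setting.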
Most recently, Mask-RCNN (He et al., 2017) extends Faster-RCNN (Ren et al., 2015) by adding a branch for predicting segmentation masks on each Region of Interest (RoI) in parallel with the existing branch for classification and bounding box regression. The mask branch is a small FCN applied to each RoI, predicting an object mask in a pixel-to-pixel manner. The parallel prediction makes the system simpler and more flexible.
In Chapter 4 we explore weakly supervised training of an instance segmentation convnet. To the best of our knowledge there is no previous work on predicting object masks in a weakly supervised fashion. We use DeepMask (Pinheiro et al., 2015) as a reference implementation for this task. In addition we re-purpose the DeepLabv2