

This section describes our method for detecting structural edges in a scene. To extract these edges, we require a spatial operator that performs detection on the minimal depth encoding. The operator must account for all surface shapes, such as two corrugated iron fences that abut at an angle, or a corner in rippled curtains (see Figure 4-6). In addition, sensor noise is complex and scale is problematic; in short, “mathematics has nothing to say about scale” (O. Faugeras [47]). A rippling curtain does have changing curvature, but it is the joint between the surfaces that humans would consider structurally salient for most tasks (see Figure 4-6). For these reasons, low-level edge operators, such as those that respond to surface irregularities, are generally unsuitable for the task.

Hence we apply a deep CNN, which takes high-level semantic and contextual information into account, as the spatial operator for detecting structurally salient edges. An advantage of deep CNNs for such problems is that the encoding weighs depth values from the entire image and so supports a multi-scale framework. Further, contour processing generally employs a broader region of support to suppress noise as well as a local gradient operator to find the edge.

4.4.1 Network Architecture

We use the VGG-16 fully convolutional network as the base architecture for testing the DSD encoding. Since our main contribution is the DSD encoding, the selection of VGG-16 provides a fair comparison of this encoding with existing methods. We trim the fully connected layers of VGG and incorporate deep supervision by adding a side output to the last convolutional layer of each of the five VGG blocks, as in [157].
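As a rough illustration of where the five side outputs sit, the sketch below lists the resolution of each one. It assumes the usual VGG-16 layout with a 2×2, stride-2 max pooling between consecutive blocks; the function name and the 480×640 input size are hypothetical and chosen only for the example.

```python
def side_output_shapes(height, width, num_blocks=5):
    """Return (stride, (h, w)) for the side output of each VGG block.

    Assumption: a stride-2 max pooling separates consecutive blocks, so the
    side output of block k (1-based) sits at stride 2 ** (k - 1).
    """
    shapes = []
    for block in range(num_blocks):
        stride = 2 ** block
        # Input sizes divisible by 16 avoid any rounding ambiguity here.
        shapes.append((stride, (height // stride, width // stride)))
    return shapes


# Hypothetical input size, used only for illustration.
for stride, (h, w) in side_output_shapes(480, 640):
    print(f"stride {stride:2d}: {h} x {w} (up-sample by {stride})")
```

Under these assumptions, each side output has to be up-sampled by its stride before it can be compared with the ground truth at input resolution.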

[Figure 4-4 panels: (a) RGB, (b) DSD, (c) HHA [60], (d) Ground truth, (e) Ours (DSD), (f) HHA [60]]

Figure 4-4. Edge maps obtained by applying the Canny filter to the DSD and HHA [60] depth encodings, with the best Canny parameters found via grid search. This gives a measure of how effectively each feature encodes geometric surface information for edge detection. Note that in the DSD encoding, surfaces bordering structurally significant boundaries generally have greater contrast.

We will now give a brief overview of the objective function of the network. For more details, please see [157].

Let $\mathbf{W}$ denote the collection of standard network parameters. Suppose we have $M$ side output layers, where each side output is associated with a classifier with corresponding weights $\mathbf{w} = (\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(M)})$. For a given input $X = \{x_j, j = 1, \ldots, |X|\}$ and ground truth $Y = \{y_j, j = 1, \ldots, |X|\}$, the image-level loss function for the side outputs is given by:

\[ \mathcal{L}_{\text{side}}(\mathbf{W}, \mathbf{w}) = \sum_{m=1}^{M} \alpha_m \, l_{\text{side}}^{(m)}\bigl(\mathbf{W}, \mathbf{w}^{(m)}\bigr). \tag{4.4} \]

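The individual loss function $l_{\text{side}}^{(m)}$ for side output $m$ is defined as the balanced cross-entropy loss:

\[ l_{\text{side}}^{(m)}\bigl(\mathbf{W}, \mathbf{w}^{(m)}\bigr) = -\beta \sum_{j \in Y_+} \log \Pr\bigl(y_j = 1 \mid X; \mathbf{W}, \mathbf{w}^{(m)}\bigr) - (1 - \beta) \sum_{j \in Y_-} \log \Pr\bigl(y_j = 0 \mid X; \mathbf{W}, \mathbf{w}^{(m)}\bigr), \tag{4.5} \]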

where $\beta = |Y_-|/|Y|$ and $1 - \beta = |Y_+|/|Y|$, with $Y_+$ and $Y_-$ denoting the edge and non-edge ground truth label sets, respectively, as in [157].

$\Pr\bigl(y_j = 1 \mid X; \mathbf{W}, \mathbf{w}^{(m)}\bigr)$ is computed as the sigmoid function $\sigma(a_j^{(m)}) \in [0, 1]$ on the activation value at pixel $j$. For each side output layer, this gives an edge map prediction $\hat{Y}_{\text{side}}^{(m)} = \sigma(\hat{A}_{\text{side}}^{(m)})$, where $\hat{A}_{\text{side}}^{(m)} \equiv \{a_j^{(m)}, j = 1, \ldots, |Y|\}$ are the activations of the side output of layer $m$.
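To make the balanced cross-entropy of Eq. (4.5) concrete, here is a minimal pure-Python sketch for a single side output. The function names, the `eps` guard against `log(0)` and the toy values are assumptions made for illustration; $\beta$ is passed in as an argument and simply plays the role it has in Eq. (4.5).

```python
import math


def sigmoid(a):
    """Logistic function mapping an activation to a probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-a))


def balanced_cross_entropy(activations, labels, beta, eps=1e-12):
    """Class-balanced cross-entropy for one side output (cf. Eq. 4.5).

    activations: flat list of per-pixel activations a_j
    labels:      flat list of ground-truth labels y_j in {0, 1}
    beta:        weight on the edge term; (1 - beta) weights the non-edge term
    eps:         small constant (an assumption) that avoids log(0)
    """
    loss = 0.0
    for a, y in zip(activations, labels):
        p = sigmoid(a)  # Pr(y_j = 1 | X; W, w^(m))
        if y == 1:
            loss -= beta * math.log(p + eps)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p + eps)
    return loss


# Toy example: four pixels, one of which is an edge.
toy_activations = [-2.0, -1.0, 0.5, 1.5]
toy_labels = [0, 0, 0, 1]
print(balanced_cross_entropy(toy_activations, toy_labels, beta=0.75))
```

In a real network the activations would come from the side output layer and the loss would be computed on tensors by a deep learning framework; the loop here only spells out the arithmetic.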

The side output predictions are combined by adding a weighted fusion layer to the network and simultaneously learning the fusion weight during training. The loss function for this fusion layer is given by:

\[ \mathcal{L}_{\text{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{h}) = \mathrm{Dist}(Y, \hat{Y}_{\text{fuse}}), \]

where $\hat{Y}_{\text{fuse}} \equiv \sigma\bigl(\sum_{m=1}^{M} h_m \hat{A}_{\text{side}}^{(m)}\bigr)$ and $\mathbf{h} = (h_1, \ldots, h_M)$ is the fusion weight. $\mathrm{Dist}(\cdot, \cdot)$ is the distance between the prediction and the ground truth map, which is set as the cross-entropy loss.

[Figure 4-5 diagram labels: DSD feature input, convolution + batchnorm + ReLU, max pooling, side output, deep supervision cost, fusion, output, ground truth; backbone: VGG16]

Figure 4-5. An overview of our edge detection system. Our DSD encoding of the depth map is the input to a fully convolutional VGG16 network with deep supervision, as in [157]. We add a batch normalization layer after every convolutional layer to speed up convergence.

The combined loss function for the network is thus given by:

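\[ \mathcal{L}(\mathbf{W}, \mathbf{w}, \mathbf{h}) = \mathcal{L}_{\text{side}}(\mathbf{W}, \mathbf{w}) + \mathcal{L}_{\text{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{h}). \]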

During training, this objective function is minimised via standard stochastic gradient descent with back-propagation, using batch normalization to speed up convergence. An illustration of the network architecture is shown in Figure 4-5.
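The combined objective can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function names, the `eps` guard, the two-side-output toy setup and all numeric values are assumptions, and a real implementation would operate on tensors and obtain gradients automatically.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def weighted_ce(probs, labels, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Cross-entropy summed over pixels, with separate class weights."""
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            loss -= w_pos * math.log(p + eps)
        else:
            loss -= w_neg * math.log(1.0 - p + eps)
    return loss


def fuse(side_activations, h):
    """Per-pixel fused prediction: sigmoid(sum_m h_m * a_j^(m))."""
    num_pixels = len(side_activations[0])
    return [
        sigmoid(sum(h_m * acts[j] for h_m, acts in zip(h, side_activations)))
        for j in range(num_pixels)
    ]


def total_loss(side_activations, labels, alpha, h, beta):
    """Side loss (balanced cross-entropy) plus fusion loss (cross-entropy)."""
    side = sum(
        a_m * weighted_ce([sigmoid(a) for a in acts], labels, beta, 1.0 - beta)
        for a_m, acts in zip(alpha, side_activations)
    )
    fused = weighted_ce(fuse(side_activations, h), labels)
    return side + fused


# Toy example with M = 2 side outputs over three pixels (values are made up).
acts = [[2.0, -1.0, -3.0], [1.0, -2.0, 0.5]]
print(total_loss(acts, [1, 0, 0], alpha=[1.0, 1.0], h=[0.5, 0.5], beta=0.5))
```

Because the fusion weights enter the loss through `fuse`, they receive gradients like any other parameter, which is what allows them to be learned jointly with the rest of the network.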

We merge the output depth edge maps with RGB edge maps from the HED architecture [157] in order to assess the contribution of the system as part of an RGB-D edge detector. When merging depth edge maps with RGB edge maps, we first take the product of the fusion output with all the up-sampled side outputs, since this produces the best results. We observe that the later side outputs produce more semantically meaningful output, with some false positives due to blurry edges from up-sampling, whereas the earlier side outputs have excellent edge localization but a high number of false positives due to incorrect edge detections within non-boundary regions. Thus, taking the product of all layers reduces false positives while ensuring that the meaningful edges retain a high response. Multiplying the side outputs in this way increases the F-score but decreases average precision. However, when merging with the RGB saliency map, average precision is not reduced.
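The product step can be illustrated with a small pure-Python sketch. It assumes every map has already been up-sampled to a common size and holds probabilities in [0, 1]; the function name and the toy values are hypothetical, and the subsequent merge with the RGB edge map is not shown.

```python
def merge_by_product(fused, side_outputs):
    """Element-wise product of the fusion map with every side-output map.

    fused:        2-D list of probabilities in [0, 1]
    side_outputs: list of 2-D lists with the same shape as `fused`
    """
    merged = [row[:] for row in fused]  # copy so the input is not modified
    for side in side_outputs:
        for r, row in enumerate(side):
            for c, value in enumerate(row):
                merged[r][c] *= value
    return merged


# Toy 1 x 2 maps: the first pixel is a true edge, the second a false positive
# that only the early (well-localized but noisy) side output responds to.
fused_map = [[0.9, 0.8]]
early_side = [[0.9, 0.7]]
late_side = [[0.8, 0.1]]
print(merge_by_product(fused_map, [early_side, late_side]))
```

In this toy case the true edge keeps a comparatively high response (about 0.65) while the false positive is suppressed (about 0.06), which mirrors the behaviour described above.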