
Next, we performed a set of experiments using the Photorealistic Virtual City (PVC) dataset [147]. The PVC dataset is a synthetic wide baseline stereo dataset that was designed to study how feature matching performance using local feature descriptors degrades under controlled changes in lighting, scene, and viewing conditions. Unfortunately, dense ground truth is difficult and expensive to gather for such evaluations in a controlled manner. Instead, this dataset uses a photorealistic virtual world to gain complete and repeatable control of the environment in order to evaluate image features. Raytraced rendering is used to study the effects of controlled changes in viewpoint and illumination on descriptor performance. This synthetic dataset has been validated by comparing matching performance on rendered imagery with matching performance on actual imagery of the same scene, and the results showed approximately equivalent performance. This justifies the use of a synthetic dataset to predict performance on natural imagery.

Figure 3.17: Example ground truth correspondence for rendered imagery in the PVC dataset. Shown is a random subset of five hundred pixels, with colors encoding corresponding pixels between the right and left images. (top left) camera 1 with translation correspondence (bottom left) camera 2 with rotation correspondence (top right) camera 3 with translation and rotation correspondence (bottom right) camera 4 with translation and rotation correspondence.

environment. Each scene is represented by an image sequence from a camera translating along a city street, such that the images overlap. At each translation, the camera also rotates by plus and minus 22.5 degrees to provide both orientation change and translation change. Finally, at each camera pose, the lighting is varied by time of day on a sunny August day, at five different times at two-hour intervals from 9am to 5pm. This provides controlled lighting changes for each scene. No additional noise is added to the rendered imagery, which provides an idealized controlled scenario to evaluate matching using local feature descriptors.
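To make the size of the capture grid concrete, the following is a minimal sketch that enumerates the conditions described above. The five times of day follow from the text; the inclusion of a forward-facing (0 degree) pose alongside the plus and minus 22.5 degree rotations, the function name, and the number of positions are assumptions for illustration only.

```python
from itertools import product

# Five times of day at two-hour intervals from 9am to 5pm (from the text).
HOURS = tuple(range(9, 18, 2))  # 9, 11, 13, 15, 17

# Assumption: the unrotated pose is kept alongside the +/-22.5 degree rotations.
YAW_OFFSETS_DEG = (-22.5, 0.0, 22.5)


def capture_conditions(n_positions):
    """Enumerate (position, yaw, hour) for a hypothetical number of positions."""
    return [
        {"position": p, "yaw_deg": yaw, "hour": hour}
        for p, yaw, hour in product(range(n_positions), YAW_OFFSETS_DEG, HOURS)
    ]
```

Under these assumptions, each position along the street contributes 3 x 5 = 15 rendered images per camera.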

Figure 3.17 shows example imagery from this dataset. We show the ground truth correspondence between a subsampled set of pixels in the right image and the corresponding pixels in the left image, determined from the ground truth range from the virtual city. Correspondences are encoded by color, such that red pixels in the right match red pixels in the left. This image shows examples from each of four cameras, with correspondences consistent with translation only, rotation only, and translation and rotation. These images are all shown at the constant time of day of 9am.

The experimental protocol for evaluation on this dataset was greedy matching score given exact correspondence. For each overlapping image pair, we extracted the ground truth correspondence for each pixel in the right image to the corresponding pixel in the left image. We selected a set of 500 correspondences at random, and extracted local feature descriptors at the corresponding locations in both the left and right image. The descriptors were computed at canonical scale and rotation. Finally, we computed exhaustive pairwise distances between descriptors, and performed greedy bipartite matching to assign matches from the right to the left. We define a correct match to be a match to within 10 pixels of the ground truth correspondence. The matching score is the total number of correct matches divided by the total number of matches.
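The protocol above can be sketched as follows. This is an illustrative reading, not the released evaluation code: the function names are hypothetical, the distance is assumed to be Euclidean, greedy bipartite matching is assumed to mean accepting candidate pairs in order of increasing distance while skipping any point already matched, and "within 10 pixels" is taken as inclusive.

```python
import numpy as np


def greedy_bipartite_matching(dist):
    """Pair rows (right image) with columns (left image) in ascending distance order."""
    n_right, n_left = dist.shape
    used_right = np.zeros(n_right, dtype=bool)
    used_left = np.zeros(n_left, dtype=bool)
    matches = []
    # Flattened indices of the distance matrix, smallest distance first.
    for flat in np.argsort(dist, axis=None):
        i, j = divmod(int(flat), n_left)
        if used_right[i] or used_left[j]:
            continue  # each point may be matched at most once
        used_right[i] = used_left[j] = True
        matches.append((i, j))
        if len(matches) == min(n_right, n_left):
            break
    return matches


def matching_score(desc_right, desc_left, xy_left, tol=10.0):
    """Fraction of greedy matches within tol pixels of the ground truth.

    Row k of desc_right and row k of desc_left / xy_left are assumed to
    describe the same ground truth correspondence.
    """
    # Exhaustive pairwise Euclidean distances (assumption: real valued descriptors).
    diff = desc_right[:, None, :] - desc_left[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    matches = greedy_bipartite_matching(dist)
    # Right point i truly maps to left point i, but was assigned to left point j.
    correct = sum(
        np.linalg.norm(xy_left[j] - xy_left[i]) <= tol for i, j in matches
    )
    return correct / len(matches)
```

With 500 sampled correspondences, the distance matrix is 500 x 500, so the exhaustive computation stays cheap.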

This experimental protocol enables isolated analysis of the effect of the descriptors alone on matching performance. Recall that the VGG-Affine dataset evaluation includes both local feature detectors, which provide affine invariant keypoints, and local feature descriptors to compute a final matching score. This score is affected by the quality and accuracy of the keypoint extraction, which conflates the contributions of descriptor performance and detector performance to the matching score. In this evaluation, we decouple the detectors and descriptors by using the ground truth correspondences as "detectors", then computing the descriptors at these ground truth correspondences. Therefore, the matching performance is a function of the descriptors only, which allows conclusions to be drawn about their effect in isolation.

In this section, we show the matching performance as a function of translation, rotation, or translation and rotation. Furthermore, we show the mean matching performance and the matching performance as a function of time of day. We compare the performance of the nested shape descriptor (NSD) with DAISY [42], SIFT [35], ORB [45], BRISK [43] and FREAK [46]. In this descriptor comparison, DAISY and SIFT are real valued descriptors, while ORB, BRISK, FREAK and NSD are binary valued. The DAISY descriptor was specifically designed and optimized [41] for wide baseline stereo matching. Our results show that NSD outperforms all other descriptors in all experiments, which provides a basis of confidence for concluding that the NSD is a state-of-the-art descriptor for wide baseline stereo matching. Our experimental evaluation code is available for download at https://github.com/jebyrne/PhotorealisticVirtualCity.
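The split between real valued and binary valued descriptors matters for the pairwise distance step. The text does not state which distance was used for each descriptor; the sketch below assumes the common convention of Euclidean distance for real valued descriptors and Hamming distance for binary descriptors stored as packed bytes. The function names are illustrative.

```python
import numpy as np


def euclidean_distances(a, b):
    """Pairwise L2 distances between real valued descriptors of shape (n, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def hamming_distances(a, b):
    """Pairwise Hamming distances between binary descriptors.

    a, b: uint8 arrays of shape (n, n_bytes), each byte holding 8 packed bits.
    """
    xor = np.bitwise_xor(a[:, None, :], b[None, :, :])
    # Count the differing bits for every pair of descriptors.
    return np.unpackbits(xor, axis=2).sum(axis=2)
```

Either function returns a distance matrix that can be passed unchanged to a greedy matcher.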

3.4.5.1 Translation Evaluation

First, we performed an evaluation for wide baseline binocular stereo. For each camera, we consider pairs of images that overlap and that are related by a translation only, such as the correspondences shown in figure 3.17 (top left). This scenario models a calibrated and rectified wide baseline stereo configuration such that epipolar lines are aligned with scanlines. Figure 3.18 shows the mean matching score for each descriptor over all cameras and times of day, as well as the mean matching score as a function of time of day. In all cases, NSD outperforms the other local descriptors, including the DAISY descriptor that was specifically designed for wide baseline stereo matching.

Figure 3.18: Photorealistic Virtual City - Translation only results. (left) Aggregate (right) Time of day

Figure 3.19 shows the detailed matching performance for each pair of overlapping images at a given position for each camera in the dataset. The plots in figure 3.18 (left) were constructed by computing the mean over all four plots in this figure. This shows that the NSD consistently outperforms the other descriptors across all cameras.
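The aggregation step can be sketched as below. The text says only that the aggregate is the mean over the four per-camera plots; this sketch assumes that means averaging the per-camera means for a single descriptor, which differs from pooling all image pairs when cameras contribute different numbers of pairs. The function name and input layout are hypothetical.

```python
import numpy as np


def aggregate_over_cameras(per_camera_scores):
    """Mean of per-camera mean matching scores for a single descriptor.

    per_camera_scores: dict mapping a camera name to its per-pair scores.
    """
    camera_means = [float(np.mean(s)) for s in per_camera_scores.values()]
    return float(np.mean(camera_means))
```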

3.4.5.2 Rotation Evaluation

Next, we performed an evaluation for a rotational homography. For each camera, we consider pairs of images that are formed by rotating the camera by plus and minus 22.5 degrees in yaw. An example of this rotation scenario is shown in figure 3.17 (top right). Figure 3.20 (left) shows the mean matching score over all cameras for each descriptor, and Figure 3.20 (right) shows the mean matching score as a function of time of day. In this scenario, NSD outperforms all other descriptors, although the performance of DAISY is quite close.
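For a camera that rotates about its optical center, pixels in the two views are related by the homography H = K R K^-1, where K is the intrinsic matrix and R the rotation. The following is a minimal sketch for a pure yaw rotation. The focal length and principal point used with it are hypothetical values, not the PVC camera parameters, and the sign convention of the yaw angle is an assumption.

```python
import numpy as np


def yaw_homography(yaw_deg, focal_px, cx, cy):
    """Homography induced by a pure yaw rotation of a pinhole camera."""
    t = np.radians(yaw_deg)
    K = np.array([[focal_px, 0.0, cx],
                  [0.0, focal_px, cy],
                  [0.0, 0.0, 1.0]])
    # Rotation about the vertical (y) axis of the camera.
    R = np.array([[np.cos(t), 0.0, np.sin(t)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return K @ R @ np.linalg.inv(K)


def apply_homography(H, xy):
    """Map a pixel (x, y) through H using homogeneous coordinates."""
    x, y, w = H @ np.array([xy[0], xy[1], 1.0])
    return np.array([x / w, y / w])
```

Under this model the principal point shifts horizontally by the focal length times tan(22.5 degrees), roughly 0.41 focal lengths, which indicates how much image overlap is lost between the rotated views.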

Figure 3.21 shows the matching performance per image for each camera. This shows that the NSD consistently outperforms DAISY on each image and not just in aggregate performance.

Figure 3.19: Photorealistic Virtual City - Translation only results per location

3.4.5.3 Translation and Rotation

Next, we performed an evaluation for a combined rotational homography and translation. For each camera, we consider pairs of images that are formed by rotating the camera by plus 22.5 degrees, then translating the camera and rotating by minus 22.5 degrees. This scenario is the combination of the translation only and rotation only cases evaluated above.

Figure 3.22 (left) shows the mean matching score for each descriptor over all cameras. Figure 3.22 (right) shows the mean matching score for each descriptor as a function of the time of day.

Figure 3.23 shows the detailed results for translation and rotation. This shows that the NSD consistently outperforms DAISY on each image and not just in aggregate performance.

Figure 3.20: Photorealistic Virtual City - Rotation only results. (left) Aggregate (right) Time of day
