As in the original PTAM [54], the tracker is responsible for real-time camera pose estimation and for selecting the keyframes to be used for map construction. Figure 5.5 describes the tracking procedure in detail. To start, the tracker requires an estimate of the current camera pose. In the original PTAM this estimate is generated by applying a decaying velocity motion model to the previous camera pose estimate. In CCTAM a more accurate estimate of the current pose is provided via forward prediction using the EKF and MAV motion model (described in Section 5.5.3). From this pose estimate the tracker then determines which map-points should be visible in the current camera image. This procedure has the largest effect on the tracking runtime, as it scales linearly with the size of the map. However, as will be demonstrated in practical experiments, tracking time remains near constant for map sizes of ≈ 300 keyframes and ≈ 20000 map-points. This is a sufficient size with which to map an area of 20 × 20 metres (also demonstrated in the experimental evaluation).

Figure 5.5: The CCTAM tracking process
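The visibility test described above can be illustrated with a minimal sketch. It assumes an undistorted pinhole camera rather than the projection function of Section 2.2, and the names `world_to_camera`, `project_pinhole` and `visible_points` are hypothetical helpers introduced only for illustration. The single loop over all map-points is what makes the cost linear in the size of the map.

```python
def world_to_camera(point_world, rotation, translation):
    """Apply p_cam = R * p_world + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a point in camera coordinates to pixel coordinates.

    Assumption: ideal pinhole model with no lens distortion.
    Returns None for points on or behind the image plane.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def visible_points(map_points, rotation, translation, intrinsics, width, height):
    """Return (index, (u, v)) for every map-point that projects into the image."""
    visible = []
    for index, point in enumerate(map_points):  # one pass: O(number of map-points)
        point_cam = world_to_camera(point, rotation, translation)
        pixel = project_pinhole(point_cam, *intrinsics)
        if pixel is None:
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            visible.append((index, pixel))
    return visible
```

A full implementation would typically add further checks, for example on viewing angle or scale, before a point is accepted as a candidate; these are omitted here.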
The tracker then creates an image pyramid from the current frame and extracts visual features at each level of the pyramid. CCTAM uses the AGAST [73] detector to extract corner features at each level. As detailed in Section 2.3.1, it is a more general detector than the original FAST and as such does not require re-training to maximise performance. Additionally, the dual decision tree approach provides some level of robustness to self-similar structures. Such structures, for example repeating patterns in indoor environments or grass and tarmac in outdoor environments, present a problem for Visual SLAM as they may lead to bad correspondences.
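As an illustration of the pyramid construction, the sketch below halves the resolution at each level by averaging 2 × 2 pixel blocks. The averaging scheme, the default of 4 levels and the function names `half_sample` and `build_pyramid` are assumptions made for this example and are not taken from the CCTAM implementation. In the tracker, a corner detector (AGAST in CCTAM) would then be run on every level returned.

```python
def half_sample(image):
    """Halve the resolution of a greyscale image (list of rows) by 2x2 averaging."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, levels=4):
    """Return [level 0 (full resolution), level 1, ...]; each level is half the previous."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(half_sample(pyramid[-1]))
    return pyramid
```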
The tracker uses a two-stage tracking process. First, a small set of map-points (50-100) which should appear in the coarsest levels of the image pyramid is searched for. This is done by using the estimated camera pose and the 3D position of each map-point to re-project the point into the current image with the projection function described in Section 2.2. If an AGAST corner is found within a small radius of the re-projected image coordinates, it is a possible match for the map-point. To verify this, an 8 × 8 patch around the detected feature point is compared to the corresponding patch in the source keyframe for that particular map-point. However, because the viewpoint may have changed from the original keyframe, an affine warp is applied to the source patch [54]. The affine warp matrix A is given by:
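\[
A = \begin{bmatrix}
\dfrac{\partial u_t}{\partial u_s} & \dfrac{\partial u_t}{\partial v_s} \\[2ex]
\dfrac{\partial v_t}{\partial u_s} & \dfrac{\partial v_t}{\partial v_s}
\end{bmatrix}
\tag{5.1}
\]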
where $(u_s, v_s)$ are the pixel coordinates of the source pixel and $(u_t, v_t)$ are the pixel coordinates of the target pixel. A is computed by projecting unit pixel displacements from the plane of the source patch into the current target frame. The determinant of A is used to decide in which pyramid level a map-point should be searched for: it corresponds to the area, in square pixels, that a single source pixel occupies at the original image resolution. Each pyramid level halves the image resolution, so det(A)/4 is the corresponding area at the first pyramid level, det(A)/16 at the second, and so on. Therefore, if 4 pyramid levels are used, the patch is searched for in the level l ∈ {0, ..., 3} at which det(A)/4^l is closest to one, as in PTAM [54]. The warped patch is then compared to the target patch using the Zero-Mean Sum of Squared Differences (ZSSD), which provides some robustness to lighting changes. This procedure is repeated for all AGAST features within a small region of the predicted image coordinates, and the feature with the lowest ZSSD that is beneath a predefined threshold is taken as a match.

Each match represents an observation of a map-point for which we have an estimate of the 3D position. This gives a set of correspondences between 3D world points and 2D image points, which is exactly the perspective-n-point problem described in Section 2.4.1. To obtain the most accurate solution the linear solution is not employed; instead the problem is solved using non-linear least squares, minimising the sum of the re-projection errors as discussed in Section 2.4.4. To improve robustness to outliers, the standard re-projection error is replaced by one of the robust cost functions described in Section 2.4.8. In this work we used the Tukey cost function; however, in our tests similar performance was obtained with the Huber and Cauchy cost functions.

Once a pose update has been successfully computed on the small set of coarse features, a fine-grained search is carried out on a larger set of points (1000-5000) from all pyramid levels. After the final pose update is complete, the tracking quality is assessed to determine if re-localisation is necessary.
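The pyramid level selection and the ZSSD patch comparison described above can be sketched as follows. This is an illustrative example only: the function names, the representation of a patch as a list of rows and the threshold value passed by the caller are assumptions, and the affine warp itself is taken as already applied to the source patch.

```python
def select_pyramid_level(det_a, n_levels=4):
    """Pick the level l at which det(A) / 4**l is closest to one.

    det(A) is the area (in full-resolution pixels) covered by one source
    pixel; each level halves the resolution, dividing that area by 4.
    """
    return min(range(n_levels), key=lambda level: abs(det_a / 4.0 ** level - 1.0))


def zssd(patch_a, patch_b):
    """Zero-Mean Sum of Squared Differences between two equally sized patches.

    Subtracting each patch's mean makes the score invariant to a uniform
    brightness offset between the two patches.
    """
    a = [value for row in patch_a for value in row]
    b = [value for row in patch_b for value in row]
    if len(a) != len(b):
        raise ValueError("patches must have the same number of pixels")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum(((x - mean_a) - (y - mean_b)) ** 2 for x, y in zip(a, b))


def best_match(warped_patch, candidates, threshold):
    """Return the candidate corner with the lowest ZSSD below `threshold`.

    `candidates` is a list of (corner, patch) pairs for the corners found
    near the predicted image position. Returns None if nothing qualifies.
    """
    best_corner, best_score = None, threshold
    for corner, patch in candidates:
        score = zssd(warped_patch, patch)
        if score < best_score:
            best_corner, best_score = corner, score
    return best_corner
```

For example, a candidate patch that differs from the warped source patch only by a constant intensity offset yields a ZSSD of zero and is therefore accepted for any positive threshold.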
The tracking quality heuristic is based on the fraction of the map-points expected to be visible in the current frame that are successfully observed. Tracking quality is also used to determine if a new keyframe should be added to the map: if the tracking quality is high enough and the distance to the nearest keyframe is sufficient, a new keyframe is added.

Figure 5.6: The CCTAM mapping process
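The tracking quality heuristic and the keyframe decision can be sketched as below. The threshold values (`lost_below`, `good_above`, `min_keyframe_dist`) and the function names are hypothetical placeholders chosen for illustration; the values used in CCTAM are not stated in this section.

```python
def tracking_quality(n_found, n_expected):
    """Fraction of map-points expected in the frame that were actually matched."""
    if n_expected <= 0:
        return 0.0
    return n_found / n_expected


def assess_frame(n_found, n_expected, dist_to_nearest_keyframe,
                 lost_below=0.1, good_above=0.5, min_keyframe_dist=0.2):
    """Decide whether to re-localise and whether to add a keyframe.

    All three thresholds are illustrative assumptions, not CCTAM's values.
    """
    quality = tracking_quality(n_found, n_expected)
    return {
        "quality": quality,
        # Very few expected points were found: tracking is considered lost.
        "relocalise": quality < lost_below,
        # Good tracking and far enough from existing keyframes: extend the map.
        "add_keyframe": quality >= good_above
        and dist_to_nearest_keyframe >= min_keyframe_dist,
    }
```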