7.5 Risk quantification
7.5.1 "No-risk" scenario:
show a density change over the pixels. The geometry problem in the z-axis is a well-known problem in computer vision [122]. WIDE discriminates noise, but at the cost of missing other desirable detections. Each of the four techniques analysed over the last two sections has strengths in some scenarios and drawbacks in others. The most consistent is the greyscale technique, but it introduced noise into the walkway scenario and interfered with the person in the occlusion scenario. For a static camera environment, the choice of technique is therefore scenario specific.
The experiments in sections 3.3 and 3.4 apply to the static camera environment, and many background subtraction techniques are available for this application, some of which are discussed in Chapter 2. The occlusion scenario (figures 3.4, 3.8 and 3.12) had noise introduced by camera shake: the platform was moving, so the background appears as a detection with the background subtraction techniques. Using detection algorithms on a UAV will encounter the same movement problem, as the aircraft is not still and will be traversing terrain. This leads to a more complex scenario: detecting objects from a moving camera, where background subtraction will detect the background as moving, as well as the objects in the foreground. The next set of experiments explores motion estimation, a technique that warps consecutive frames into the same perspective and stitches them together so that background subtraction can be performed on the static overlapping areas of the frames.
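The effect of camera motion on background subtraction can be illustrated with a small synthetic sketch. It is not the method used in the experiments: it assumes a purely horizontal camera shift of a known number of pixels (`dx`), simple frame differencing in place of a full background model, and hypothetical function names. Without compensation most of the textured background is flagged as changed; after aligning the overlapping columns only the new object remains.

```python
import numpy as np

def frame_difference(prev, curr, threshold=25):
    """Boolean mask of pixels whose absolute intensity change exceeds threshold."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > threshold

def compensated_difference(prev, curr, dx, threshold=25):
    """Difference only the columns both frames share after a known camera shift.

    Assumption: the scene content moved dx pixels to the left between frames,
    so column c of curr shows what column c + dx of prev showed.
    """
    overlap_prev = prev[:, dx:]
    overlap_curr = curr[:, :curr.shape[1] - dx]
    return frame_difference(overlap_prev, overlap_curr, threshold)

# Synthetic textured terrain, viewed through a window that shifts by dx pixels.
rng = np.random.default_rng(0)
scene = rng.integers(0, 256, size=(60, 100), dtype=np.uint8)
dx = 4
prev = scene[:, :80].copy()
curr = scene[:, dx:80 + dx].copy()
curr[20:30, 40:50] = 255  # an object that appears only in the current frame

naive = frame_difference(prev, curr)               # background flagged as moving
aligned = compensated_difference(prev, curr, dx)   # only the object remains

print("fraction flagged without compensation:", naive.mean())
print("pixels flagged after compensation:", int(aligned.sum()))
```

In a real sequence the shift is not known in advance and is not purely translational, which is why the alignment has to be estimated from key points and a homography.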
3.5 Motion Estimation Accuracy
As described in Chapter 2, motion estimation warps two or more consecutive frames into the same perspective and stitches them together to create a static overlapping region. Autonomous Real-Time Object Detection (ARTOD) is a method that extends motion estimation to include background subtraction [126], applied to the static region. This approach uses the Recursive Density Estimation (RDE) background subtraction method [5]. Motion estimation introduces artificial noise at the stitching boundary; at the 3-sigma threshold, RDE excludes most of the artificial noise from its detections. The drawback of having a threshold (as seen in section 3.3) is that pixels that make up an object of interest can be suppressed, reducing the clarity of the object detections. RDE increases the overall frame processing time by between 20 and 50 ms per frame, depending on the processing cores and image dimensions used. The motion estimation components use the majority of the processing time; each component typically takes between 50 and 100 ms per frame. The objective of the experiments in this section is to explore the accuracy and computational performance of each component,
and to identify any trade-offs. The artificial noise introduced by motion estimation is due to the discrete nature of pixels and the double-precision result usually associated with geometric calculations [1]. Warped pixel positions fall at sub-pixel locations, which causes mismatches in alignment and introduces the noise. Further artificial errors are introduced when the alignment function is not precise enough: even without the sub-pixel problem, pixel localisation can be inaccurate. This noise can be minimised by optimising the alignment function, which is done by optimising key point localisation and matching. Adding extra optimisation methods increases the processing resource requirements and therefore the processing time. The processing speed can be optimised by minimising the complexity or the number of key point localisation and matching processes. To explore the optimisation characteristics of motion estimation, experimentation was conducted on the following components of motion estimation:
• Key point Detection
• Key point Matching
• Key point Filtering
• Homography (affine transform)
Both the RANSAC method for selecting keypoints for the homography matrix and the homography generation are based on sound mathematical principles [36] [41]. These components also contribute the least in terms of processing time consumed. It was therefore decided not to experiment with modifying them, as the components listed above have a greater impact on both processing time and matrix accuracy.
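For reference, homography generation from a set of matched points can be sketched with the direct linear transform (DLT). This is an illustrative NumPy sketch rather than the implementation used in this work: it assumes the outliers have already been removed (it performs no RANSAC sampling), it omits the coordinate normalisation that a robust implementation would normally add, and the function names and test values are hypothetical.

```python
import numpy as np

def estimate_homography(src, dst):
    """Least-squares homography H with dst ~ H @ src, via the DLT.

    src, dst: (N, 2) arrays of matched points, N >= 4, not all collinear.
    Assumes the bottom-right entry of H is non-zero so it can be scaled to 1.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical ground-truth warp and five exact correspondences.
H_true = np.array([[1.02, 0.01, 5.3],
                   [-0.02, 0.98, -2.7],
                   [1e-4, -2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [30, 55]], dtype=float)
dst = apply_homography(H_true, src)

H_est = estimate_homography(src, dst)
print(np.round(H_est, 4))
```

The mapped coordinates in `dst` are real-valued, which is the source of the sub-pixel alignment problem described above.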
Key Point Detection
The accuracy of the homography matrix (the matrix used to warp a frame into the perspective of a reference frame) is determined by the accuracy and validity of the key points that are detected and by the matching algorithm used to associate key points between two frames. In the work by Sadeghi-Tehran and Angelov [126], the SIFT algorithm from [78] is used as the keypoint detector. This experimentation will explore the use of different octave values with SIFT ([126] does not specify the octaves used for the experimentation). Additionally, different keypoint recognition algorithms will be explored. The four keypoint methods experimented with here are SIFT [78], SURF [14], BRISK [74], and ORB [2]. These key point methods are used because their development is chronologically progressive, and the code to run these algorithms is readily available in OpenCV (the Application Programming Interface
(API) that is used by this project). The OpenCV implementation of each should provide a consistent code base, so that differences in implementation do not artificially affect the running time.
Key Point Matching
The matching process is slower if the video stream represents an environment that produces a large number of key points. A brute force matching approach leads to a fast matching result but may require greater filtering post-matching. The brute force matcher takes each sample from the first frame and compares it with all samples in the second set using some distance calculation (typically Euclidean); the closest match (shortest distance measure) is returned. A number of different feature comparators could be used to conduct keypoint matching [73, 77, 89, 2]; however, FLANN (the Fast Library for Approximate Nearest Neighbors) is readily available in the software API being used and forms a consistent code set for the experiment. The objective here is to experiment with the effect of matching algorithms on the speed and accuracy of the final stitching process, not to appraise the matching algorithms themselves. The FLANN-based method [91] matches keypoints using approximate nearest-neighbour search over an index built from the descriptors, rather than an exhaustive comparison.
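The brute force procedure described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the OpenCV matcher used in the experiments: descriptors are plain float vectors, the distance is Euclidean, and the function name and the toy two-dimensional descriptors are hypothetical.

```python
import numpy as np

def brute_force_match(desc_a, desc_b):
    """For each descriptor in desc_a, find its nearest descriptor in desc_b.

    Returns a list of (index_a, index_b, distance) tuples.
    The cost grows with len(desc_a) * len(desc_b).
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return [(i, int(j), float(dists[i, j])) for i, j in enumerate(nearest)]

# Toy descriptors: the third entry of frame_a has no good partner in frame_b,
# but it is still assigned its closest available match.
frame_a = [[0, 0], [10, 10], [5, 0]]
frame_b = [[9, 9], [0, 1], [50, 50]]

matches = brute_force_match(frame_a, frame_b)
for i, j, d in matches:
    print(f"A[{i}] -> B[{j}]  distance {d:.2f}")
```

Because every descriptor always receives a match, poor matches such as the third one here have to be removed afterwards.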
Key Point Filtering
The matching process can produce tens or hundreds of matches, some of which are outliers: they are not close in distance, but are matched because they are the closest match from the available keypoints. This experiment looks at the effect of keypoint match filtering, using two different types of match filter, and how they affect the accuracy of the end result. The filtering process removes matches that are outliers based on some distance measure. The first filter is a simple match filter that uses a distance threshold: the distance measure of a match must be below this threshold, otherwise the match is discarded. This can be useful in scenarios where the motion of the scene between frames is known, or is constrained such that a threshold does not exclude valid matches. The second filter, for use when the motion is unknown, calculates cross matches of keypoints. The matches from frame 1 to frame 2 are calculated, and then the matches from frame 2 to frame 1 are calculated. Only the keypoint matches that agree in both directions are retained, with the remainder discarded. If there are only a few keypoint detections in a scene, this method can lead to over-filtering, such that there are not enough keypoints remaining to construct the homography matrix.
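Both filters can be expressed compactly. The sketch below is illustrative only and assumes matches are stored as (query index, matched index, distance) tuples; the function names, the threshold value and the example matches are hypothetical.

```python
def threshold_filter(matches, max_distance):
    """Keep only matches whose distance is below max_distance."""
    return [m for m in matches if m[2] < max_distance]

def cross_check_filter(matches_12, matches_21):
    """Keep frame 1 -> frame 2 matches that are confirmed in the reverse direction.

    matches_12: (index_in_frame1, index_in_frame2, distance) tuples
    matches_21: (index_in_frame2, index_in_frame1, distance) tuples
    """
    reverse = {j: i for j, i, _ in matches_21}
    return [(i, j, d) for i, j, d in matches_12 if reverse.get(j) == i]

# Example: keypoint 2 of frame 1 is matched to keypoint 1 of frame 2 only
# because nothing closer exists; the reverse match disagrees.
matches_12 = [(0, 1, 1.00), (1, 0, 1.41), (2, 1, 5.10)]
matches_21 = [(0, 1, 1.41), (1, 0, 1.00), (2, 1, 57.28)]

by_threshold = threshold_filter(matches_12, max_distance=2.0)
by_cross_check = cross_check_filter(matches_12, matches_21)

print(by_threshold)
print(by_cross_check)

# A homography needs at least four correspondences, so a caller would
# typically check len(by_cross_check) >= 4 before estimating it.
```

In this example both filters reject the same outlier, but the threshold filter depends on a distance scale chosen in advance, whereas the cross-check does not.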
Homography
The sub-pixel alignment problem introduces artificial noise. This experiment considers interpolation as a method of improving the alignment and so minimising the artificial noise created by the stitching process. Simple nearest-neighbour interpolation is used in the ARTOD proposal, which is insufficient to avoid misalignments of pixels; these result in false detections around the edges of objects in video sequences moving in more than one plane. Because the warped frame does not align to discrete pixel positions, using bi-cubic interpolation [64] during the stitching of frames could improve the alignment in sequences with more than one plane of motion. This method has been selected because it is efficient on modern hardware, so it could improve accuracy whilst being unlikely to introduce a large performance penalty.
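The difference between the two interpolation schemes at a sub-pixel location can be shown with a small sketch. It is an illustration under stated assumptions rather than the stitching code itself: it samples a single point instead of warping a whole frame, it uses the cubic convolution kernel with coefficient a = -0.5 (library implementations may use a different coefficient), it clamps indices at the image border, and all function names are hypothetical.

```python
import math
import numpy as np

def cubic_kernel(t, a=-0.5):
    """Cubic convolution weight for a sample at distance t (in pixels)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return 0.0

def clamp(v, lo, hi):
    return min(max(v, lo), hi)

def sample_nearest(img, x, y):
    """Value of the pixel closest to the real-valued location (x, y)."""
    r = clamp(int(round(y)), 0, img.shape[0] - 1)
    c = clamp(int(round(x)), 0, img.shape[1] - 1)
    return float(img[r, c])

def sample_bicubic(img, x, y):
    """Weighted sum over the 4 x 4 neighbourhood around (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for j in range(-1, 3):
        for i in range(-1, 3):
            r = clamp(y0 + j, 0, img.shape[0] - 1)
            c = clamp(x0 + i, 0, img.shape[1] - 1)
            weight = cubic_kernel(x - (x0 + i)) * cubic_kernel(y - (y0 + j))
            value += float(img[r, c]) * weight
    return value

# Horizontal intensity ramp: the true intensity at column x is 10 * x.
img = np.tile(10.0 * np.arange(8), (8, 1))

x, y = 3.4, 2.5  # a warped pixel position that falls between pixel centres
nearest_value = sample_nearest(img, x, y)
bicubic_value = sample_bicubic(img, x, y)
print("nearest:", nearest_value, "bicubic:", bicubic_value, "true:", 10 * x)
```

On this ramp nearest-neighbour sampling snaps to the value at column 3, an error of 4 intensity levels, whereas the bi-cubic sample recovers the intermediate value. Errors of the former kind along object edges are what appear as false detections after background subtraction.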
The Video Sequences
Helicopter chase
This is a video sequence in which the camera moves in one plane of motion (translational), following a motorbike and a car. This sequence has been selected because translational motion is less prone to noise on stitching and the moving objects have limited complexity (two objects, mostly moving in a straight line).
Fig. 3.13 Helicopter chase scene with a motorbike and car
Street panning
This is a video sequence from a camera at a fixed position rotating about the y-axis. This sequence has been selected because stitching of rotational motion is more prone to artificial errors than translational movement (because a 3-D component must be taken into account). The beginning of the sequence has no moving objects, so misalignments and noise can be seen more clearly. Later in the sequence there are three moving cars, which test the noise performance when motion is introduced.