Incremental Self-calibrated Reconstruction from Video
by
Rafael Lemuz López
Thesis submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF SCIENCE WITH A SPECIALTY IN COMPUTER SCIENCE
at the
Instituto Nacional de Astrofísica, Óptica y Electrónica
April 2008, Tonantzintla, Puebla
Supervised by:
Dr. Miguel Octavio Arias Estrada, INAOE
© INAOE 2008
The author grants INAOE permission to reproduce and distribute copies of this thesis in whole or in part.
Summary
Self-calibrated 3D reconstruction algorithms deal with the problem of recovering the three-dimensional structure of a scene and the camera motion from 2D images. A distinctive property of self-calibrated reconstruction methods is that camera calibration (the estimation of the intrinsic camera parameters: focal length, principal point, and radial lens distortion; and of the extrinsic parameters: orientation and position) is computed using geometric information contained in the projective images of real scenes. Algorithms that solve 3D reconstruction problems rely heavily on finding correct matches between salient features that correspond to the same scene elements in different images. Using this correspondence data, a projective estimate of the 3D scene structure and camera motion is computed. Finally, geometric constraints are used to estimate the camera parameters and to upgrade the projective model to a metric one.
This thesis proposes new algorithms for problems that arise in self-calibrated reconstruction methods, including salient point detection, robust feature matching, and projective reconstruction. An improved salient point detection algorithm is proposed that ranks interest points more closely to the intuitive notion of a corner by directly computing the angular difference between dominant edges. A robust feature matching algorithm is also proposed that merges spatial and appearance properties of putative match candidates, increasing the number of correct matches and discarding false match pairs. In addition, a projective reconstruction algorithm is proposed that selects on-line the frames that contribute most to the projective reconstruction process; this overcomes one of the intrinsic limitations of factorization-like algorithms, namely key frame selection in the self-calibrated 3D pipeline. A full 3D reconstruction pipeline is developed with the proposed algorithms. Promising results are shown, and the contributions and limitations of this work are discussed.
Abstract
Self-calibrated 3D reconstruction algorithms address the problem of recovering the 3D information of a scene and the camera motion from images. A distinctive property of self-calibrated reconstruction methods is that the intrinsic camera parameters (focal length, principal point, and even radial distortion) and the extrinsic parameters (the orientation and position of the camera relative to the scene) are computed using geometric information intrinsically contained in the images of a real static scene. That is, these methods do not use additional tools such as feedback motors for computing the focal length or prefabricated calibration patterns.
However, the self-calibrated reconstruction process depends heavily on identifying corresponding points between image regions that represent the same scene element captured from different viewpoints. Using only point correspondences, a first estimate of the scene structure and the camera motion is obtained that does not preserve distances and angles, called a projective reconstruction. Subsequently, by making some assumptions and imposing constraints on some camera parameters, the projective model is upgraded to a Euclidean model that differs from the real scene by a scale factor and the original orientation.
In this thesis new algorithms are proposed for the self-calibrated reconstruction problem, in particular for the problems of interest point detection, correspondence search, and projective reconstruction.
An interest point detection algorithm is proposed that ranks the detected points better according to the intuitive notion of a corner by directly computing the angular difference between the dominant edges. A new correspondence search algorithm is proposed that integrates spatial and appearance properties into a similarity metric between candidate correspondence points. The new algorithm increases the number of correspondence pairs and at the same time reduces matching errors. In addition, a projective reconstruction algorithm is proposed that selects at run time the images that contribute most during the reconstruction process, in order to overcome one of the inherent limitations of projective reconstruction algorithms based on the factorization method: the selection of the most important frames during the complete self-calibrated reconstruction process. Finally, promising results are shown and the contributions and limitations of this work are discussed.
Acknowledgements
There are many people who have provided guidance and support throughout the years and whom I wish to thank. First, my advisor, Miguel Octavio Arias Estrada, who has guided me through these years and has taught me what it means to be a researcher. Second, Patrick Hebert, who showed me the significance of clear and precise communication of research results. I want to thank Professors Leopoldo Altamirano Robles, Olac Fuentes Chaves and Aurelio López López, who had a great impact on my academic and professional skills by giving me the opportunity to interact with them during my stay at INAOE.
I thank Eliezer Jara for teaching me systematic analysis in laboratory practice and for sharing his invaluable experience in building prototypes for diverse computer vision applications, which has had an enormous impact on my professional formation. I also want to thank the interesting people I have met along the way, with whom I had the opportunity to interact through informal discussions and some of whom provided support and encouragement: Blanca, Rita, Irene, Luis, Jorge, and Marco Aurelio. I especially want to express my gratitude to Carlos Guillen for the hours invested in clarifying some mathematical concepts during the last year.
I also thank the members of the LVSN lab at Laval University, in particular Jean-Daniel Deschênes and Jean-Nicolas Ouellet, for making the visit to Quebec so pleasant.
Finally, I want to acknowledge the support given by the technical staff of INAOE, in particular the people of the computer science department.
This research was done with the financial support of CONACYT scholarship grant 184921.
Dedication
To my parents and brothers ....
1 Introduction
1.1 Overview of 3D reconstruction from video
1.1.1 Interest point detector
1.1.2 Matching correspondence
1.1.3 Projective reconstruction
1.1.4 Self-Calibration
1.1.5 Rectification
1.1.6 Dense Stereo Reconstruction
1.2 Objectives
1.2.1 Main Objective
1.2.2 Particular Objectives
1.3 Contributions
1.3.1 Robust feature matching
1.3.2 Incremental 3D reconstruction by inter-frame selection
1.4 Organization of the Thesis
1.5 Conclusions
2 Multiple View Geometry
2.1 Preliminaries
2.1.1 Homogeneous Coordinates
2.2 Camera Models
2.2.1 Perspective model
2.2.2 Orthographic Model
2.2.3 Lens Distortion
2.3 Multiple View Constraints
2.3.1 Two view Geometry
2.3.2 Fundamental Matrix estimation
2.3.3 Planar Homography
2.3.4 Homography estimation
Number of Measurements
2.3.5 Projective Reconstruction
Merging Projective matrices using Epipolar Geometry
The Factorization Method
Non-linear Bundle Adjustment
2.3.6 Incremental Projective Reconstruction
2.4 3D Scene Reconstruction
2.4.1 Camera Calibration
2.4.2 Triangulation
2.4.3 Survey of Camera Calibration
Photogrammetric calibration
Self-calibration
2.4.4 Absolute Conic
2.5 Stratified Self-calibration
2.5.1 Affine Stratification
2.6 RANSAC computation
2.7 Conclusions
3 The Correspondence Problem
3.1 Introduction
3.2 Feature Correspondence Overview
3.3 Salient point detection
3.3.1 Pioneer Feature Detectors
First Derivative Methods
Second derivative methods
Local energy methods
Detectors of junction regions
3.3.2 Invariant Feature Detectors
3.4 Salient point Descriptor
3.4.1 SIFT descriptor
3.5 Matching salient points
3.6 Geometric Constraints for Matching
3.7 The importance of Gaussian Integration Scale and Derivative filters
3.8 Cov-Harris: Improved Harris corner Detection
3.8.1 Segmentation of Partial Derivatives
3.8.2 Edge direction estimation by Covariance Matrix
3.8.3 Ranking Corner Points by the Angular difference between dominant edges
3.9 Discussion
4 IC-SIFT: Robust Feature Matching Algorithm
4.1 Introduction
4.2 Related Work
4.2.1 Scale Invariant Feature Transform
4.2.2 Iterative Closest Point ICP
4.3 IC-SIFT: Iterative Closest SIFT
4.3.1 Finding Initial Matching Pairs
4.3.2 Matching SIFT features: adding a weighted distance factor
4.3.3 Differencing Registration Error
4.4 Robust feature Matching Experimental Results
4.5 Discussion
5 A new Incremental Projective Factorization Algorithm
5.1 Introduction
5.2 Related Work
5.3 Projective Factorization
5.4 Proposed Incremental Projective Reconstruction Algorithm
5.4.1 Domain Reduction by inter-frame Selection
5.4.2 Incremental Projective Reconstruction Algorithm
5.5 Incremental Projective Reconstruction Experimental Results
5.5.1 Incremental Projective Reconstruction Accuracy
5.5.2 Processing Time
5.5.3 Real Image Sequence experiments
5.5.4 Conclusions
6 Implementation and Experimental Results
6.1 Self-calibrated reconstruction from video experiments
6.2 Salient Point detection
6.3 Salient point detection by Harris algorithm
6.4 Matching restricted list to estimate geometric constraints
6.4.1 Robust fundamental matrix estimation
6.4.2 Enforcing Epipolar Constraint for semi-dense matching
6.5 Projective and Euclidean Reconstruction
6.6 Discussion
7 Conclusions
7.1 Summary of contributions
7.1.1 Robust feature matching for wide separated views
7.1.2 Incremental 3D reconstruction by inter-frame selection
7.1.3 Robust feature matching on video sequences
7.2 Future work
7.2.1 Tracking algorithm with motion blur
7.2.2 Inter-frame selection removing critical configurations
7.2.3 Collaborative structure from motion
7.2.4 Real-time processing
Introduction
Recovering three-dimensional information about a scene from multiple images captured with a camera is one of the fundamental problems of computer vision.
There are numerous methods to deal with this problem. They can be classified into different taxonomies according to their intrinsic properties: by the kind of sensor (sonar, range laser, fringe projectors and inertial measurement units), by whether the scene is changed by modifying the lighting conditions (passive versus active methods), or by the source of information analyzed to extract depth (shadows, texture, contour, geometry, focus, defocus, symmetry, disparity, reciprocity, light fields and photometry). A further distinction is whether the scene remains static or is dynamic while the information is processed. When video cameras are used to recover depth information, reconstruction methods are called pre-calibrated if the parameters of the camera image formation mapping are known, and self-calibrated if they are unknown.
The choice of method depends on the requirements of the specific problem, which range over accuracy, precision, processing speed, mobility, accessibility to information sources, modification of natural ambient light, dimension constraints and budget, to mention just a few. The ideal method for a particular application is a trade-off between these and other, less obvious constraints, for example the need for portability, whether human user interaction is allowed, the need for a specific 3D model representation (depth map, voxels, mesh, level sets or vector fields), and the amount and quality of the generated information; some applications require a special model representation and a full description of the scene, while for others a sparse model representation is enough. A notable body of work that highlights the importance of using the same data representation throughout the whole process, from 3D reconstruction and partial view registration to rendering and visualization, is presented in [THL02, THL03, THDL04], where a common framework based on vector fields allows real-time reconstruction from range curves with a hand-held scanner.
An in-depth description of 3D reconstruction methods is beyond the scope of this thesis; we refer the interested reader to recent surveys in different computer vision domains [Cur, SCMS01a, Héb01, SCD+06].
This thesis deals with estimating the structure of a scene from images by self-calibrated methods. The approach has attracted the attention of numerous research groups in recent years because it can extract three-dimensional information from a set of images without previous knowledge of the camera. The problem is also called Structure from Motion (SfM), and simultaneous localization and mapping (SLAM) in the robotics literature. In the last few years important progress has been made in this research area, but the problem is still hard to solve, and no method can be applied to general scenes while fulfilling most of the requirements listed in the previous paragraphs. Assuming a static scene viewed by a camera undergoing rigid motion, the problem has been formulated in several ways. State of the art research has focused on individual image processing stages and on developing robust high-level stages that recover the unknown camera parameters for different camera models using only a set of images as input.
Some properties of the self-calibrated reconstruction method that highlight its advantages over more sophisticated methods with expensive set-ups (for example laser range finders, pattern projectors, lighting arrays, Global Positioning Systems and Inertial Measurement Units) are described here. Several of them follow from the fact that self-calibrated reconstruction can recover structure and motion using only one moving camera:
• Automatic recovery of the camera location and orientation with respect to the scene, up to a Euclidean transformation.
• The possibility of computing an estimate of 3D models from a set of images taken with the same camera without further information.
• Low cost, since the widespread use of video cameras in recent years has driven their price down.
• Allows the three-dimensional reconstruction of close and far viewed scenes (indoor and outdoor model generation).
• Portability, mobility and less energy consumption.
The resolution and image quality of low-cost cameras, such as those used in cellular phones, are increasing, which makes them feasible for self-calibrated 3D model reconstruction.
However, some drawbacks of self-calibration methods compared with those that use specialized setups are:
• Self-calibration requires textured surfaces to model a scene, so it cannot cope with scenes of homogeneous texture.
• A model is recovered with a sparse set of 3D points instead of dense depth maps.
• The recovered 3D models are of lower quality than those obtained with more complex hardware components, such as structured-light methods.
• Strong dependence on establishing correspondences between salient points in images that represent the same scene element, even between widely separated views.
1.1 Overview of 3D reconstruction from video
The methods to recover 3D models from images taken with an uncalibrated camera [MHOP01, ST01, RP05, MP05, HZ00b, GSV01] presume that a sequence of images is available. They rely on the assumption that the scene remains static while the capturing camera moves around the scene to be modeled. An important requirement of self-calibrated reconstruction methods is that the scene contains mostly distinctive image regions that can be distinguished in different views. The main processing steps involved in self-calibrated reconstruction are shown in Figure 1.1.
The first step is to identify salient points in the images. Pioneering approaches to self-calibrated reconstruction used standard corner detection algorithms, but more recently the need for affine invariant point detection algorithms has emerged and important progress has been made in this area. Invariant salient point detection is needed because even small viewpoint changes during image capture modify the appearance of salient points, owing to varying lighting conditions and projective deformation during the image formation process.
After a set of salient points has been identified, the next step is to find, for each salient point in the first image, the feature points in subsequent images that correspond to the same scene element. This is called the correspondence problem. An important assumption made during this step is that images do not differ too much between consecutive frames. This allows the search space for corresponding features in different images to be restricted, and the features to be matched using cross-correlation methods. However, since camera motion is unconstrained and unknown, more sophisticated approaches have appeared that use invariant feature point descriptors, reduce the search space with geometric constraints, and compute robust estimates to select only the best match candidates.
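As an illustration of the cross-correlation matching mentioned above, the following minimal sketch scores candidate patches with normalized cross-correlation and keeps the best one. It is not the matching method proposed in this thesis: the function names `ncc` and `best_match` and the toy intensity values are assumptions made for the example, and patches are flattened to plain lists.

```python
import math


def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equal-length intensity lists."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    dev_a = [v - mean_a for v in patch_a]
    dev_b = [v - mean_b for v in patch_b]
    denom = math.sqrt(sum(v * v for v in dev_a) * sum(v * v for v in dev_b))
    if denom == 0:
        return 0.0  # a flat patch has no structure to correlate
    return sum(a * b for a, b in zip(dev_a, dev_b)) / denom


def best_match(patch, candidates):
    """Return (index, score) of the candidate with the highest NCC score."""
    scores = [ncc(patch, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores[best]


# Toy data (hypothetical): the second candidate is the reference patch
# under a gain and offset change, which NCC is designed to ignore.
reference = [10, 20, 30, 40]
candidates = [[40, 30, 20, 10], [25, 45, 65, 85], [10, 10, 10, 10]]
index, score = best_match(reference, candidates)
```

Because the score is normalized by the patch means and variances, it tolerates affine changes of brightness but not the geometric deformations that motivate the invariant descriptors discussed in the text.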
The third stage assumes that the correspondence problem has been solved and that a measurement matrix with true salient point matches between all features has been built. An estimate of camera motion and scene structure is then recovered. However, since the camera parameters are unknown, this estimate is a projective representation of the real metric scene.
Thus, the next step is to find a projective mapping that transforms the projective reconstruction into a metric reconstruction. There are two main approaches to this problem. The first is to explicitly estimate the affine transformation by finding the plane at infinity, and then to use the absolute conic (a special conic that lies in the plane at infinity) to find the camera parameters by imposing restrictions on them (e.g. rectangular or square pixels, constant aspect ratio, and principal point in the middle of the image).
The model recovered up to this stage consists of a sparse set of points that differs from the real scene points by a scale factor, which can be resolved if one real distance between two salient points of the scene is known. Moreover, once the camera parameters are known, a dense reconstruction of the scene can be computed using standard calibrated stereo reconstruction frameworks.
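The scale ambiguity can be made concrete with a short sketch, under the assumption that the reconstructed points are stored as 3D tuples and that the true distance between two of them has been measured in the scene. The name `scale_reconstruction` and the sample coordinates are hypothetical.

```python
import math


def scale_reconstruction(points, i, j, known_distance):
    """Rescale points so that the distance between points i and j equals known_distance."""
    current = math.dist(points[i], points[j])
    if current == 0:
        raise ValueError("the two reference points coincide")
    s = known_distance / current
    return [tuple(s * c for c in p) for p in points], s


# Hypothetical reconstruction in arbitrary units; points 0 and 1 are
# known to be 0.5 m apart in the real scene.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
metric_model, scale = scale_reconstruction(model, 0, 1, 0.5)
```

A single measured distance fixes the scale of every point at once, because the remaining ambiguity is one global factor.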
If a dense map is needed, the camera parameters of a pair of images can be used to compute a rectification that aligns the images in such a way that corresponding points can be found by searching along a line. Then, robust dense stereo matching algorithms can establish correspondences for almost every pixel between the images. However, even with this geometric restriction the problem is difficult because of image areas that lack texture and occluded image areas.
Figure 1.1, taken from [PGV+04], illustrates the steps to achieve self-calibrated 3D modeling from video. Different state of the art methods have their own specific components but follow a similar pipeline.
The following subsections give an overview of the stages of the multiple view reconstruction method and the algorithms commonly used.
Figure 1.1: The steps to achieve self-calibration from multiple images taken from [PGV+04].
1.1.1 Interest point detector
The first step consists of automatically detecting ‘interest points’, image points that are sufficiently different from their neighboring pixels.
Numerous algorithms have been proposed to extract interest points from images. Different properties of the region around a point are used to define which points in an image are ‘interesting’. Some detectors find points of highly varying texture, while others locate corner points. Corner points are formed where two or more non-parallel edges meet. An edge in an image is a sharp variation of the intensity function. Edges usually define the boundary between two different objects or between parts of the same object.
In general, interest point detectors find image areas with high variance in at least two directions. The variances along different directions, computed using all pixels in a window centered on a point, are good measures of its distinctness. The Harris and Stephens corner detector [HS88] is usually selected for this task, since the corner response that the Harris operator estimates through eigenvalue analysis becomes invariant to scale when pyramidal processing is used, as in [Lin98, MS02]. Although there are alternatives such as Sojka [Soj03], SUSAN [SB97], and KLT [KT91, ST94], a recent study of the stability and localization properties of the corners extracted by different algorithms suggests that the KLT and Harris corner detectors are the most suitable for tracking features in long sequences [TS04]. State of the art algorithms have extended the Harris algorithm to make it stable under affine image transformations [MS02, TG04, MS05a] and applicable in scenarios where small viewpoint changes modify the local appearance of salient points [Low99].
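The requirement of high variation in two directions can be sketched as follows. The code accumulates the second-moment matrix M of the image gradients over a window and evaluates the Harris and Stephens response det(M) - κ trace(M)². The gradient samples are toy values and κ = 0.04 is a commonly used setting assumed here, not a value taken from this thesis; the Gaussian weighting of the window is omitted for brevity.

```python
import math


def structure_tensor(gradients):
    """Sum the outer products of (Ix, Iy) gradient pairs over a window."""
    a = sum(ix * ix for ix, iy in gradients)
    b = sum(ix * iy for ix, iy in gradients)
    c = sum(iy * iy for ix, iy in gradients)
    return a, b, c  # entries of the symmetric matrix [[a, b], [b, c]]


def eigenvalues(a, b, c):
    """Eigenvalues of [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2.0
    dev = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean + dev, mean - dev


def harris_response(gradients, kappa=0.04):
    """Corner response: positive at corners, negative on edges, zero on flat areas."""
    a, b, c = structure_tensor(gradients)
    det = a * c - b * b
    trace = a + c
    return det - kappa * trace * trace


# Toy windows (hypothetical gradient samples).
flat = [(0.0, 0.0)] * 4                                      # no intensity variation
edge = [(1.0, 0.0)] * 4                                      # variation in one direction
corner = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]    # variation in two directions
```

For the `corner` window both eigenvalues are large and the response is positive, while for the `edge` window one eigenvalue vanishes and the response is negative.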
1.1.2 Matching correspondence
After detecting interest points, the next step is to track those features across different images in a video sequence. The goal is to find for every interest point in the first image the corresponding point in subsequent images associated with the same scene element.
The correspondence problem has been studied in depth in two different setups. In the ‘stereo’ correspondence problem, the camera motion is restricted to be mainly translational and the images of the same scene are pre-aligned, which limits the search for corresponding points to the same image row; see [SS02, BBH03] for recent reviews. Even under these constraints the problem is difficult to solve because of image noise, object occlusions, varying lighting conditions, specular highlights, shadows, and motion blur.
A harder setup, ‘wide baseline matching’, occurs when the images are captured under large and unknown camera motion, which introduces perspective effects, varying scale, and stronger variations in lighting conditions.
The problem of finding correspondences in video streams is known in the literature as multi-feature tracking. Although many tracking algorithms exist [SPFP96, HB96, FTTR99, SHF01], the Kanade-Lucas-Tomasi tracking algorithm [KT91] is commonly used. When only a few images of the object or the scene are available, wide-baseline matching methods can be used [ZDFL95a, FTG03]. These methods match images using affine invariant regions, which are robust but computationally more expensive [MS05a].
1.1.3 Projective reconstruction
Projective reconstruction is the best that can be done without camera calibration or additional metric information about the scene [Tri97]. Thus, knowing only feature correspondences, the recovered camera pose and scene structure differ from the metric reconstruction by a projective transformation.
There are two kinds of methods (with many variants) for the projective reconstruction step: those based on epipolar geometry and those based on factorization. In methods based on epipolar geometry [FLM92, GSV01, RP05, MP05], the first two images are used to initialize a reference frame. The world frame is aligned with the first camera, and from the third image onward the rotation part obtained from each new fundamental matrix is aligned with that of the previous image. The epipolar geometry based method estimates camera motion and 3D structure for each view. When the last image has been processed, a nonlinear optimization algorithm can refine the camera matrices and the 3D structure.
Factorization methods solve the projective reconstruction problem using a data matrix (the image coordinates of the corresponding points in all the images). The data matrix is factorized using singular value decomposition (SVD) into two matrices, which represent object shape and camera motion respectively. The factorization method, first developed for the orthographic projection model [TK92a, TK92b], was later extended to weak perspective, para-perspective, and projective camera models [MK94, PK97, ST01, HK00, MHOP01]. The factorization method is preferable to the epipolar approach because of its accuracy, numerical stability, and robustness, and because it avoids computing the epipolar geometry, which is prone to errors when the separation between images is small, a situation in which human intervention is needed to select appropriate images.
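The factorization idea can be illustrated with a minimal sketch of the simpler affine case, in the spirit of the original method of [TK92a]. It is not the projective variant studied in this thesis, which additionally requires estimating projective depths. NumPy is assumed to be available; the function name `factorize` and the synthetic cameras and points are hypothetical.

```python
import numpy as np


def factorize(W, rank=3):
    """Factor a measurement matrix W (2F x P) into motion (2F x 3) and shape (3 x P)."""
    # Subtract the centroid of the tracked points in each image row.
    centroid = W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W - centroid, full_matrices=False)
    root = np.sqrt(s[:rank])
    motion = U[:, :rank] * root          # camera rows, up to a 3x3 ambiguity
    shape = root[:, None] * Vt[:rank]    # 3D points, up to the same ambiguity
    return motion, shape, centroid


# Synthetic, noise-free data: 8 points seen in 3 frames (two rows per frame)
# by arbitrary affine cameras with per-row image offsets.
rng = np.random.default_rng(0)
points = rng.normal(size=(3, 8))
cameras = rng.normal(size=(6, 3))
offsets = rng.normal(size=(6, 1))
W = cameras @ points + offsets

motion, shape, centroid = factorize(W)
residual = float(np.abs(motion @ shape + centroid - W).max())
```

The recovered factors reproduce the measurements but are defined only up to an invertible 3x3 matrix, which is why a further upgrade step is needed to obtain a metric reconstruction.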
1.1.4 Self-Calibration
A projective reconstruction does not preserve the parallelism, length ratios, or angles between lines of real 3D scenes. The process of upgrading a projective reconstruction to a metric one in which those properties are preserved is called self-calibration or auto-calibration. To upgrade a projective reconstruction to a metric reconstruction, both the parameters of the perspective projection that models the image formation process and the camera location must be estimated.
Assuming that all images are taken by the same camera and that some internal camera parameters are known, the Euclidean structure of the scene can be recovered. Furthermore, the camera calibration can be solved.
The first self-calibration method [FLM92, MF92] directly finds the intrinsic camera parameters that are consistent with the underlying projective geometry of a sequence of images using pairwise epipolar geometry.
Hartley et al. [Har92, AZH96] proposed a non-linear least squares method for self-calibration, but it needs a good initial guess for the unknowns. In [PG97] Pollefeys and Van Gool extend the Hartley method: the projective reconstruction is first upgraded to an affine reconstruction and then to a metric reconstruction, assuming a variable focal length while the other camera parameters remain constant. In [HK00] Han et al. proposed a linear algorithm that recovers the intrinsic parameters when the principal point and the focal lengths are unknown and simultaneously converts the projective solution into a Euclidean solution.
The structure of the scene recovered after self-calibration is a sparse set of points. A dense depth map must be estimated to build a realistic 3D model. Two additional steps can accomplish this task: Rectification and Dense Stereo Reconstruction.
1.1.5 Rectification
Given two or more images and the matching points between them, the rectification process exploits the epipolar constraint to align the images in such a way that corresponding points have the same y-coordinate in both images. This transformation reduces the search for feature point correspondences to a thin band of scan-lines rather than a single one: because of image noise the estimated epipolar geometry is prone to small errors, which makes it necessary to extend the search to neighboring scan-lines.
When there is no close-up motion between images, planar rectification can correctly align the images [Har98] by projecting both onto a plane parallel to the baseline. To handle forward/backward camera motion, non-planar rectification algorithms project the images using projective matrices [FTV97, FTV00]. Roy et al. proposed using cylindrical coordinates [RMC97], and Pollefeys et al. used polar coordinates [PKG99] to reduce the computational cost.
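For a perfectly rectified pair the fundamental matrix takes, up to scale, a fixed canonical form, and the epipolar constraint x'ᵀ F x = 0 reduces to equal y-coordinates. The sketch below checks this numerically and lists the neighboring scan-lines to search when the estimated geometry is noisy. The names `F_RECTIFIED`, `epipolar_residual`, and `candidate_rows`, and the tolerance value, are illustrative assumptions.

```python
# Canonical fundamental matrix of a rectified pair (epipole at infinity
# along the x axis), defined up to scale.
F_RECTIFIED = (
    (0.0, 0.0, 0.0),
    (0.0, 0.0, -1.0),
    (0.0, 1.0, 0.0),
)


def epipolar_residual(F, x_left, x_right):
    """Return x_right^T F x_left for homogeneous image points (x, y, 1)."""
    Fx = [sum(F[r][c] * x_left[c] for c in range(3)) for r in range(3)]
    return sum(x_right[r] * Fx[r] for r in range(3))


def candidate_rows(y, tolerance):
    """Scan-lines to search around row y when the geometry is slightly off."""
    return list(range(y - tolerance, y + tolerance + 1))


same_row = epipolar_residual(F_RECTIFIED, (10.0, 20.0, 1.0), (35.0, 20.0, 1.0))
other_row = epipolar_residual(F_RECTIFIED, (10.0, 20.0, 1.0), (35.0, 22.0, 1.0))
rows = candidate_rows(20, 1)
```

With this matrix the residual equals the difference between the two y-coordinates, so it vanishes exactly when both points lie on the same scan-line.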
1.1.6 Dense Stereo Reconstruction
Dense Stereo Reconstruction is the task of establishing a dense correspondence map between points of different calibrated views and recovering three-dimensional information for each pair of matched points. When only two images are available, the problem is called binocular stereo.
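For a rectified binocular pair, the depth of a matched point follows from its horizontal disparity d through the standard relation Z = f B / d, where f is the focal length in pixels and B the baseline. The sketch below applies this relation; the function name `depth_from_disparity` and the numeric values of f and B are hypothetical.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline):
    """Depth of a matched point in a rectified pair, or None if it cannot be computed."""
    disparity = x_left - x_right
    if disparity <= 0:
        return None  # point at infinity or an inconsistent match
    return focal_px * baseline / disparity


# Hypothetical camera: focal length of 800 pixels, baseline of 0.5 m.
near = depth_from_disparity(420.0, 400.0, focal_px=800.0, baseline=0.5)
far = depth_from_disparity(402.0, 400.0, focal_px=800.0, baseline=0.5)
```

Depth is inversely proportional to disparity, so a one-pixel matching error matters far more for distant points than for close ones.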
Combining information from several images makes the process more robust and precise, which has led to multi-view stereo methods such as voxel coloring, space carving, and light field methods, which also benefit from known camera parameters. See [SCMS01b, LZWL04, SCD+06] for recent reviews of multi-ocular stereo reconstruction methods.
1.2 Objectives
1.2.1 Main Objective
The main objective of this thesis is to develop new algorithms for incremental self-calibrated 3D reconstruction from video streams in which, for each captured frame, an estimate of the camera pose and the structure of the scene can be computed on-line.
However, since even the individual stages of the full self-calibrated reconstruction method are open problems, we have identified more specific objectives that address some drawbacks of state of the art algorithms. In particular, since self-calibrated reconstruction methods depend heavily on finding correct matches between points in images, we decided to address this problem to improve the applicability of the reconstruction methods to real scenes captured with unstabilized cameras. In addition, for the projective reconstruction algorithm based on the factorization method, we identify the need for new algorithms that work under time limit constraints.
1.2.2 Particular Objectives
• Propose a robust matching algorithm to find corresponding salient points in different images, even when repetitive patterns exist in local areas of the scene.
• Investigate a collaborative approach between stages of the reconstruction pipeline to improve the robustness of matching algorithms.
• Develop a new projective reconstruction algorithm based on the factorization method with time limit constraints.
1.3 Contributions
In this work new algorithms are proposed to solve specific problems of self-calibrated reconstruction from video. Specifically, we improve on the following issues:
1.3.1 Robust feature matching
A robust feature matching algorithm is proposed [LLAE06a]. A matching metric is introduced to enforce geometric and photometric properties in the matching criterion. Thus, corresponding points are matched within an iterative framework using a local motion descriptor and the similarity between scale invariant region descriptors (SIFT) to avoid mismatch errors between distant points.
1.3.2 Incremental 3D reconstruction by inter-frame selection
A new algorithm is presented for selecting the frames used to recover the camera motion and scene structure for the projective camera model [LLAE06b] in the factorization method. Directly measuring the contribution of each frame to the progressive quality of the 3D model reconstruction reduces the memory requirements and keeps the computational cost approximately constant for every frame of an image sequence.
1.4 Organization of the Thesis
The thesis is structured as follows. This chapter gave a general introduction to the problem of self-calibrated reconstruction and stated the objectives and contributions of this work. Chapter 2 reviews the background on Multiple View Geometry. Chapter 3 presents a review of the state of the art in the correspondence problem.
Chapters 4 and 5 introduce the proposed algorithms, describe their properties, and analyze their advantages with respect to state-of-the-art approaches. Chapter 4 presents a novel method for robust feature matching, and Chapter 5 an incremental projective reconstruction algorithm. Chapter 6 reports experimental results and discusses a performance evaluation of the proposed algorithms in the applications of tracking and 3D reconstruction from video. Finally, Chapter 7 discusses the conclusions and possible improvements in this research area.
1.5 Conclusions
In this chapter we have given a brief introduction to the problem of self-calibrated 3D reconstruction from images. The advantages and disadvantages of self-calibrated methods have been explained. Then the objectives and contributions were presented. Finally, a general overview of the document was given, describing the thesis content.
Multiple View Geometry
2.1 Preliminaries
In this chapter some computer vision basics are introduced. For a more thorough description, a book about geometry and 3D vision, for example [TV98, FL01, Atk01, HZ00a, FP02], is recommended.
2.1.1 Homogeneous Coordinates
In homogeneous (projective) coordinates, the Euclidean 3D vector $(x, y, z)^T$ is represented by $k(x, y, z, 1)^T$ and the 2D vector $(x, y)^T$ by $k(x, y, 1)^T$, where $k \neq 0$ is a real number. In the 2D case, this means that every point is represented by a line through the origin in 3D. Homogeneous points whose last coordinate is zero have no counterpart in Euclidean space. These are points at infinity, and they play an important role in the upgrade from a projective reconstruction to a metric one. The Euclidean point $(x/k, y/k)^T$ is represented in homogeneous coordinates by $(x/k, y/k, 1)^T$, which is the same as $(x, y, k)^T$. As $k$ approaches 0, the point goes to infinity in a certain direction, and the limit $(x, y, 0)^T$ is the vanishing point of that direction.
Using the homogeneous representation, the 2D line $ax + by + c = 0$ is written as $k(a, b, c)^T$. A point $k(x, y, 1)^T$ lies on the line $k(a, b, c)^T$ if and only if $(x, y, 1)(a, b, c)^T = 0$. The intersection of two 2D lines is computed as the cross product of the two line vectors. Lines that are parallel in Euclidean space meet at a point at infinity in projective space.
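As a minimal sketch of the cross-product rule above, the following plain Python fragment intersects two lines and shows that two parallel lines meet at a point whose last coordinate is zero. The helper names `cross` and `to_euclidean` and the example lines are our own illustrative choices.

```python
def cross(p, q):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def to_euclidean(p):
    """Convert a homogeneous 2D point to Euclidean coordinates (None at infinity)."""
    if p[2] == 0:
        return None
    return (p[0] / p[2], p[1] / p[2])


# Lines are stored as (a, b, c) for ax + by + c = 0.
line_x_eq_1 = (1, 0, -1)   # x = 1
line_y_eq_2 = (0, 1, -2)   # y = 2
line_x_eq_3 = (1, 0, -3)   # x = 3, parallel to x = 1

meet = cross(line_x_eq_1, line_y_eq_2)            # finite intersection
parallel_meet = cross(line_x_eq_1, line_x_eq_3)   # point at infinity

print(to_euclidean(meet))
print(parallel_meet)
```

The second result has a zero last coordinate, so it has no Euclidean counterpart; its first two coordinates give the common direction of the two parallel lines.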
Projective geometry is useful for modeling the perspective mapping that occurs during the image formation process, because in this geometry the perspective transformation of a camera is expressed as a linear operation.
2.2 Camera Models
In this section the projection model that maps real 3D scene points to image points is reviewed.
2.2.1 Perspective model
Let us consider the perspective model shown in figure 2.1. Every 3D scene point $X = (X, Y, Z)$ is projected onto the image plane at a point $x = (u, v)$ through the optical center $C$. The optical axis is the line perpendicular to the image plane that passes through the optical center. The principal point $O$, i.e., the center of radial symmetry in the image, is the point of intersection of the optical axis and the image plane. The distance between the optical center $C$ and the image plane is the focal length $f$. We define the camera coordinate system as follows.
The optical center of the camera is the origin of the coordinate system. The image plane is parallel to the XY plane, held at a distance of f from the origin. Using the basic laws of trigonometry the following relations are derived:
$$u = f\,\frac{X}{Z}, \qquad v = f\,\frac{Y}{Z}$$
Once expressed in homogeneous coordinates the above relations transform to the following:
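$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
where the relation $\sim$ stands for 'equal up to scale'.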
Figure 2.1: Projective Camera Model.
Practically all available digital cameras deviate from the perspective model. First, the principal point $(u_0, v_0)$ does not necessarily lie at the geometrical center of the image. Second, the horizontal and vertical axes ($u$ and $v$) of the image are not necessarily perpendicular. Let the angle between the two axes be $\theta$. Finally, pixels are not necessarily perfect squares, and consequently we have $f_u$ and $f_v$ as the two focal lengths, measured in terms of the unit lengths along the $u$ and $v$ directions. By incorporating these deviations in the camera model, the transformation that maps scene points $(X, Y, Z)$ to their image coordinates $(u, v)$ is described as follows:
$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_u & f_v\cot\theta & u_0 & 0 \\ 0 & \sin\theta\, f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
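In practice the 3D point is given in a world coordinate system that differs from the camera coordinate system. The motion between these coordinate systems is given by $(R, t)$:
$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \sin\theta\, f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} R & -Rt \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
$$P = \begin{pmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \sin\theta\, f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} R & -Rt \end{bmatrix}$$
$$K = \begin{pmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \sin\theta\, f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}$$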
The 3 × 4 matrix P that projects a 3D scene point X to the corresponding image point x is called the projection matrix. The 3 × 3 matrix K that contains the internal parameters (u0, v0, θ, fu, fv) is generally referred to as the intrinsic matrix of a camera.
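The projection chain above can be sketched numerically as follows. This is an illustrative example, not part of the original derivation: it assumes perpendicular image axes ($\theta = 90°$, so the skew entry of $K$ vanishes and the entry $\sin\theta f_v$ reduces to $f_v$), and the function names and numeric values are our own.

```python
import numpy as np


def intrinsic_matrix(fu, fv, u0, v0):
    """Intrinsic matrix K, assuming perpendicular image axes (theta = 90 degrees)."""
    return np.array([[fu, 0.0, u0],
                     [0.0, fv, v0],
                     [0.0, 0.0, 1.0]])


def projection_matrix(K, R, t):
    """P = K [R | -R t]; with this form P @ (t, 1) = 0, so t is the camera center."""
    return K @ np.hstack([R, -R @ t.reshape(3, 1)])


def project(P, X):
    """Project a Euclidean 3D point and return Euclidean pixel coordinates (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


K = intrinsic_matrix(fu=800.0, fv=800.0, u0=320.0, v0=240.0)
P = projection_matrix(K, np.eye(3), np.zeros(3))
uv = project(P, np.array([0.5, -0.25, 2.0]))
print(uv)
```

With the camera placed at the world origin, the result agrees with $u = f X / Z + u_0$ and $v = f Y / Z + v_0$.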
In back-projection, given an image point $x$, the goal is to find the set of 3D points that project to it. The back-projection of an image point is a ray in space, which can be computed by identifying two points on it. The first point is the optical center $C$, which lies on every back-projected ray. Since $PC = 0$, $C$ is the right null vector of $P$. The second is the point $P^+x$, where $P^+$ is the pseudoinverse$^1$ of $P$; it lies on the back-projected ray because it projects to the point $x$ in the image. Thus, the back-projection of $x$ can be computed as follows:
$$X(\lambda) = P^+x + \lambda C \tag{2.2.1}$$
The parameter $\lambda$ selects different points on the back-projected ray.
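$^1$ The pseudoinverse $A^+$ of a matrix $A$ is a generalization of the inverse, and it exists for a general $m \times n$ matrix. If $m > n$ and $A$ has full rank $n$, then $A^+ = (A^TA)^{-1}A^T$. If instead $m < n$ and $A$ has full rank $m$, as is the case for a $3 \times 4$ projection matrix, then $A^+ = A^T(AA^T)^{-1}$.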
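A minimal numerical sketch of equation 2.2.1 follows. The function names and the example camera are assumptions made for illustration; the pseudoinverse and the null vector are obtained with NumPy's `pinv` and `svd`.

```python
import numpy as np


def backproject(P, x):
    """Return (X0, C): a homogeneous point on the ray of x, and the camera center."""
    X0 = np.linalg.pinv(P) @ x      # P+ x, which projects back onto x
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]                      # right null vector of P, so P C = 0
    return X0, C


def reproject(P, X):
    """Project a homogeneous 3D point and return Euclidean pixel coordinates."""
    y = P @ X
    return y[:2] / y[2]


P = np.array([[800.0, 0.0, 320.0, 0.0],
              [0.0, 800.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
x = np.array([520.0, 140.0, 1.0])
X0, C = backproject(P, x)

# Every point X(lambda) = X0 + lambda * C projects to the same pixel.
for lam in (0.5, 2.5):
    print(reproject(P, X0 + lam * C))
```

Because $PC = 0$, adding any multiple of $C$ to $P^+x$ leaves the projection unchanged, which is why $\lambda$ parameterizes the ray.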
2.2.2 Orthographic Model
Figure 2.2 shows the orthographic camera model. This is an affine camera model whose projection matrix $P$ has a last row of the form $(0, 0, 0, 1)$. In particular, the orthographic camera model has a projection matrix $P$ of the following form:
$$P = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}$$
Figure 2.2: Orthographic Camera Model.
The projection of a 3D point X into the image point x is given below:
x = PX
Similar to the perspective camera, the back-projected ray is obtained as:
X(λ) = P+x + λC
However, the optical center C, which is the right null space of P, is a point at infinity in an orthographic camera.
Under the orthographic projection model, the projection $(u, v)$ of the $p$-th point $X = (X, Y, Z)^T$ in 3D space into image frame $f$ is given by the following expression:
$$u = X, \qquad v = Y \tag{2.2.2}$$
2.2.3 Lens Distortion
The linear projection equations do not take into account the lens shape, which affects the projection in a non-linear way. The lens shape causes a radial lens distortion $(\delta r_x, \delta r_y)$. Let $(u_x, v_y)^T$ be the projected point without lens distortion and $(u, v)^T$ the observed coordinates, and let $r = \sqrt{(u - u_0)^2 + (v - v_0)^2}$ be the radial distance from the principal point in the projected image, where subtracting $(u_0, v_0)$ shifts the center of the image to $(0, 0)$. The radial lens distortion [Atk01] may then be approximated by the series
$$\delta r_x = u\,(K_1r^3 + K_2r^5 + \ldots), \qquad \delta r_y = v\,(K_1r^3 + K_2r^5 + \ldots) \tag{2.2.3}$$
The effects of radial distortion are shown in figure 2.3.
Figure 2.3: The effect of radial distortion.
2.3 Multiple View Constraints
In this section, we examine the relations that arise when a single scene is imaged by two or more cameras. By analyzing these relations, the location of an image point can be constrained to lie in a restricted image region. In addition, 3D reconstruction is solvable when a minimum of five or four real scene points are observed from two or three camera positions, respectively, with varying viewpoint.
2.3.1 Two view Geometry
Figure 2.4 shows the inherent geometric constraints of two projective cameras imaging the same scene point $X$. Two image points $x$ and $x'$ are in correspondence when they are images of the same world point.
Figure 2.4: Two-view geometry constraints modeled by the fundamental matrix $F$. The two camera centers are indicated by $C$ and $C'$. The camera centers, a 3D point $X$, and its images $x$ and $x'$ lie in a common plane $\Pi$. The ray defined by the first camera center $C$ and the point $X$ is imaged in the second view as a line $l'$. The image $x'$ of the 3D point $X$ that projects to $x$ must lie on $l'$.
$X$ is a common 3D world point, $x$ its image in the first view, $x'$ its image in the second view, and $C$ and $C'$ are the two camera centers. The line segment that connects them is called the baseline. The points $X$, $C$ and $C'$ define a plane, called the epipolar plane $\Pi$. $l$ and $l'$ are the epipolar lines of the two projections of $X$. The projections of each camera center onto the other image, $e$ and $e'$, are called epipoles. The relation among all these elements forms the epipolar constraint.
The fundamental matrix $F$ is a $3 \times 3$ singular matrix describing the relation between two different images of the same scene. For corresponding points in two images the following equation holds: $x'^TFx = 0$, where $x$ is a point in the first image and $x'$ is the corresponding point in the second image.
Assume that the 3D point $X$ is projected to the point $x$ in the first image of a stereo pair. Then $l' = Fx$ defines a line in the second image. This line is called the epipolar line, and it is the projection into the second camera of the line through the first camera center and the 3D point $X$. If the point $X$ is visible in the second camera, its image $x'$ must lie on the epipolar line. In homogeneous coordinates a point $x'$ lies on the line $Fx$ if and only if $x'^TFx = 0$. This means that if the fundamental matrix, i.e. the epipolar geometry, is known for an image pair, stereo matching becomes much easier: to find correspondences between the two images, the search is restricted to the epipolar line.
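To make the epipolar constraint concrete, the following sketch uses an assumed toy configuration that is not taken from the experiments of this thesis: two cameras $P = [I \mid 0]$ and $P' = [I \mid t]$ with identity intrinsics, for which the fundamental matrix reduces to $F = [t]_\times$. The helper `skew` and all numeric values are illustrative.

```python
import numpy as np


def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


# Assumed setup: P = [I | 0], P' = [I | t], identity intrinsics, so F = [t]_x.
t = np.array([1.0, 0.0, 0.0])
F = skew(t)

X = np.array([0.4, -0.6, 2.0])            # 3D point in front of both cameras
x1 = np.append(X[:2] / X[2], 1.0)         # image of X in the first view
Xc2 = X + t                               # X expressed in the second camera frame
x2 = np.append(Xc2[:2] / Xc2[2], 1.0)     # image of X in the second view

epipolar_line = F @ x1                    # l' = F x
residual = x2 @ F @ x1                    # x'^T F x, zero for a true match

print(epipolar_line, residual)
```

For this sideways translation the epipolar line is horizontal ($v' = v$), so the search for the match of $x$ reduces to a single image row.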
2.3.2 Fundamental Matrix estimation
The fundamental matrix can be recovered from only seven correspondences [FLM92, BS03] by means of non-linear methods. However, if eight or more correspondences are known, linear algorithms exist to solve the problem; one of them is the eight-point algorithm proposed by Hartley in [Har92, Har95].
Given two corresponding points $x = (u, v, 1)^T$ and $x' = (u', v', 1)^T$ expressed in homogeneous coordinates, each match gives rise to one linear equation in the unknown entries of $F$:
$$u'u f_{11} + u'v f_{12} + u' f_{13} + v'u f_{21} + v'v f_{22} + v' f_{23} + u f_{31} + v f_{32} + f_{33} = 0 \tag{2.3.1}$$
From a set of $n$ point correspondences, we obtain a set of linear equations of the form
$$Af = \begin{pmatrix} u'_1u_1 & u'_1v_1 & u'_1 & v'_1u_1 & v'_1v_1 & v'_1 & u_1 & v_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ u'_nu_n & u'_nv_n & u'_n & v'_nu_n & v'_nv_n & v'_n & u_n & v_n & 1 \end{pmatrix} f = 0, \tag{2.3.2}$$
where $f$ is the 9-vector containing the entries of the matrix $F$ in row-major order. The least-squares solution for $f$ is the right singular vector of $A$ associated with its smallest singular value (equivalently, the eigenvector of $A^TA$ with the smallest eigenvalue), that is, the last column of $V$ in the SVD $A = UDV^T$. The result is the unconstrained fundamental matrix $\hat{F}$, since the rank-2 constraint has not been enforced.
To obtain a rank-2 fundamental matrix, let $\hat{F} = UDV^T$ now denote the SVD of the unconstrained estimate, with $D = \mathrm{diag}(d_1, d_2, d_3)$ and $d_1 \geq d_2 \geq d_3$. The rank-2 fundamental matrix $F$ is then given by
$$F = U\,\mathrm{diag}(d_1, d_2, 0)\,V^T. \tag{2.3.3}$$
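The linear estimation and the rank-2 correction can be sketched as follows. This is a minimal, unnormalized version written for illustration; the function name `eight_point` and the synthetic, noise-free test data (normalized image coordinates, a small rotation about the vertical axis) are our own assumptions.

```python
import numpy as np


def eight_point(x1, x2):
    """Estimate F from n >= 8 matches; x1, x2 are (n, 2) arrays of image points."""
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    # One row per match, ordered as in equation 2.3.1 (f11, f12, ..., f33).
    A = np.column_stack([up * u, up * v, up, vp * u, vp * v, vp,
                         u, v, np.ones(len(u))])
    _, _, Vt = np.linalg.svd(A)
    F_hat = Vt[-1].reshape(3, 3)                   # unconstrained estimate
    U, d, Vt = np.linalg.svd(F_hat)
    return U @ np.diag([d[0], d[1], 0.0]) @ Vt     # enforce rank 2 (eq. 2.3.3)


# Synthetic, noise-free data: 12 points seen by two cameras.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (12, 3)) + np.array([0.0, 0.0, 5.0])
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.0])
x1 = X[:, :2] / X[:, 2:]
Xc = X @ R.T + t
x2 = Xc[:, :2] / Xc[:, 2:]

F = eight_point(x1, x2)
h1 = np.column_stack([x1, np.ones(len(x1))])
h2 = np.column_stack([x2, np.ones(len(x2))])
residuals = np.einsum('ni,ij,nj->n', h2, F, h1)    # x'^T F x for every match
print(np.abs(residuals).max())
```

With real pixel coordinates the same code is numerically fragile, which motivates the normalization step discussed in the text.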
In general, algebraic computations are unstable when raw image coordinate measurements are used, due to large numerical variations. Hence normalization of the input data is required. Hartley [Har95] proposed to normalize by centering the measurement data at the origin and scaling it so that the mean distance of the measurements from the origin is $\sqrt{2}$. The image coordinates are transformed according to $\hat{x} = Tx$ and $\hat{x}' = T'x'$, where $T$ and $T'$ are normalizing transformations consisting of a translation and a scaling:
$$T = \begin{pmatrix} 1/\sigma_x & 0 & -\mu_x/\sigma_x \\ 0 & 1/\sigma_y & -\mu_y/\sigma_y \\ 0 & 0 & 1 \end{pmatrix}, \tag{2.3.4}$$
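where the means $\mu_x$, $\mu_y$ and the standard deviations $\sigma_x$, $\sigma_y$ are given by:
$$\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)^2}$$
$$\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_y)^2}$$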
Then, after the fundamental matrix has been estimated from the normalized coordinates, the resulting matrix $\tilde{F}$ must be de-normalized by $F = T'^T\tilde{F}\,T$.
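The normalizing transformation of equation 2.3.4 and the de-normalization step can be sketched as below. The function names and sample points are illustrative assumptions; the sketch follows the per-axis mean and standard deviation form written above.

```python
import numpy as np


def normalizing_transform(pts):
    """T of equation 2.3.4 for an (n, 2) array: subtract the mean, divide by the std."""
    mu = pts.mean(axis=0)
    sigma = pts.std(axis=0)          # population standard deviation (1/n)
    return np.array([[1.0 / sigma[0], 0.0, -mu[0] / sigma[0]],
                     [0.0, 1.0 / sigma[1], -mu[1] / sigma[1]],
                     [0.0, 0.0, 1.0]])


def denormalize(F_norm, T1, T2):
    """Map a matrix estimated from normalized points back: F = T'^T F_norm T."""
    return T2.T @ F_norm @ T1


pts = np.array([[100.0, 200.0], [300.0, 250.0], [500.0, 400.0], [700.0, 150.0]])
T = normalizing_transform(pts)
homog = np.column_stack([pts, np.ones(len(pts))])
normalized = (homog @ T.T)[:, :2]

print(normalized.mean(axis=0), normalized.std(axis=0))
```

After the transformation each coordinate has zero mean and unit standard deviation, so the entries of the matrix $A$ have comparable magnitudes.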
2.3.3 Planar Homography
The planar homography is a non-singular linear transformation that maps points between two different planes. The homography between two views plays an important role in the geometry of multiple views [TV98, HZ00b].
When a planar object is imaged from multiple viewpoints, or when a scene is imaged by cameras that share the same optical center, the images are related by a unique homography. For a scene plane $\Pi = [v^T, 1]$ defined by a vector $v$, the ray corresponding to an image point $x$ meets the plane at a point $X_\Pi$, which projects to $x'$ in the other image (see figure 2.5).
Given the projection matrices $P = [I \mid 0]$ and $P' = [A \mid a]$ for the two views, the homography induced by the plane is given by (assuming $\Pi_4 = 1$, since the plane does not pass through the center of the first camera [HZ00a]):
$$x' = Hx \quad \text{with} \quad H = A - av^T \tag{2.3.5}$$
Figure 2.5: A planar homography $H$ maps a point $x$ from the plane $\Pi$ to a point $x'$ in the plane $\Pi'$.
If the two cameras have intrinsic matrices $K$ and $K'$, respectively, the homography induced by the plane is given by [HZ00a]:
$$H = K'(A - av^T)K^{-1} \tag{2.3.6}$$
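Equation 2.3.5 can be checked numerically with a small sketch. The values of $A$, $a$ and $v$ below are arbitrary illustrative choices (the plane is $Z = 4$), and identity intrinsics are assumed so that equation 2.3.6 reduces to equation 2.3.5.

```python
import numpy as np

# Assumed cameras P = [I | 0] and P' = [A | a]; plane (v^T, 1) with v.X + 1 = 0.
A = np.array([[0.9, -0.1, 0.0],
              [0.1, 0.9, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([0.5, -0.2, 0.1])
v = np.array([0.0, 0.0, -0.25])      # the plane Z = 4

H = A - np.outer(a, v)               # plane-induced homography (eq. 2.3.5)

X = np.array([1.0, 2.0, 4.0])        # a scene point on the plane
x1 = X / X[2]                        # its image in the first view
x2 = A @ X + a                       # its image in the second view
x2 = x2 / x2[2]

x2_from_H = H @ x1                   # transfer x through the homography
x2_from_H = x2_from_H / x2_from_H[2]

print(x2, x2_from_H)
```

Both routes give the same image point, because for a point on the plane the unknown depth term is fixed by $v^Tx$.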
2.3.4 Homography estimation
A homography $H$ can be used to transfer feature points on a plane from one view to the other. A point $x$ on the plane can be transferred to its image $x'$ in the other view using:
$$x' = Hx \tag{2.3.7}$$
where $H$ is a $3 \times 3$ matrix defined only up to a scale factor, and hence has only 8 degrees of freedom. $H$ can be estimated by a linear algorithm from a set of at least four point correspondences $(x_i, x'_i)$, with image coordinates $(x_i, y_i)$ and $(x'_i, y'_i)$, as follows.
Expanding equation 2.3.7 for a given point correspondence and normalizing with respect to the homogeneous component yields
$$x'_i = \frac{h_1x_i + h_2y_i + h_3}{h_7x_i + h_8y_i + h_9} \quad \text{and} \quad y'_i = \frac{h_4x_i + h_5y_i + h_6}{h_7x_i + h_8y_i + h_9} \tag{2.3.8}$$
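Rearranging leads to two equations per correspondence that are linear in the elements of the homography $H$, i.e.
$$Ah = \begin{pmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x'_1x_1 & -x'_1y_1 & -x'_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -y'_1x_1 & -y'_1y_1 & -y'_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & 1 & 0 & 0 & 0 & -x'_nx_n & -x'_ny_n & -x'_n \\ 0 & 0 & 0 & x_n & y_n & 1 & -y'_nx_n & -y'_ny_n & -y'_n \end{pmatrix} h = 0, \tag{2.3.9}$$
where $h = (h_1, \ldots, h_9)^T$.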
Hence, one point correspondence yields two equations, and four point correspondences give an $8 \times 9$ matrix $A$ whose one-dimensional null space determines $h$ up to scale [HZ00a]. When more than four point correspondences are known, a least-squares estimate can solve for the unknown parameters $h_i$.
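The linear estimation described above can be sketched as follows. This is an illustrative, unnormalized implementation; the function name `estimate_homography`, the test homography `H_true` and the sample points are our own assumptions.

```python
import numpy as np


def estimate_homography(src, dst):
    """Linear estimate of H with dst ~ H src; src, dst are (n, 2) arrays, n >= 4."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        # Two rows per correspondence, obtained by clearing the denominators
        # of the two ratios in equation 2.3.8.
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -xp * x, -xp * y, -xp])
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -yp * x, -yp * y, -yp])
    _, _, Vt = np.linalg.svd(np.array(rows))
    H = Vt[-1].reshape(3, 3)          # null vector (least squares if n > 4)
    return H / H[2, 2]                # fix the free scale; assumes H[2, 2] != 0


H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.2, 0.9, -0.3],
                   [0.1, 0.2, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.2]])
src_h = np.column_stack([src, np.ones(len(src))])
dst_h = src_h @ H_true.T
dst = dst_h[:, :2] / dst_h[:, 2:]

H = estimate_homography(src, dst)
print(np.round(H, 6))
```

Dividing by the last entry is only one way to fix the scale; it fails when that entry is zero, in which case the unit-norm null vector can be used directly.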
Number of Measurements
The matrix H contains 9 entries, but it is defined only up to scale. Thus, the total number of degrees of freedom in a 2D projective transformation is 8.
Each 2D point or line correspondence between views generates two constraints on $H$ (see Equation 2.3.8), and hence four point or four line correspondences are sufficient to compute $H$. For a planar affine transformation with 6 degrees of freedom, only three corresponding points or lines are required, and so on.
A conic equation provides five constraints on a 2D homography. Hence two matching conics are sufficient to recover the homography.
In practice, the salient points, lines, and conics detected in the image may be too noisy to yield a good solution from the minimum number of them. A large number of features is therefore used to obtain a robust solution [HZ00a].
2.3.5 Projective Reconstruction
There are three principal approaches to recover the structure and motion of a scene from images up to a projective transformation: 1) epipolar geometry based methods that merge partial results; 2) factorization methods, where all the correspondences are treated simultaneously to get an estimate of the camera pose; and 3) robust non-linear methods.
Merging Projective matrices using Epipolar Geometry
The epipoles contain information about the extrinsic camera parameters, namely the position of the camera center $C$ and the orientation of the optical axis. However, this information cannot be directly retrieved.
First, two images are selected and an initial reconstruction frame is set up. Then, the pose of the camera for the other views is determined in this frame, and each time the initial reconstruction is refined and extended. In this way, the pose estimation of views that have no features in common with the reference views also becomes possible.
We define the projection matrices $P$ and $P'$ for the first and second views, respectively, and choose a specific canonical form for the camera matrices, in which the first camera is:
$$P = [I_{3\times3} \mid 0] \tag{2.3.10}$$
Note that it is always possible to make a set of camera matrices canonical by applying a projective transformation that is obtained as follows: augment the first matrix $P$ by an additional row to make it a $4 \times 4$ non-singular matrix $\tilde{P}$. Then apply the homography $H \sim \tilde{P}$ to all the cameras and world points as
$$P_i^c \sim P_iH^{-1}, \qquad X_j^c \sim HX_j \tag{2.3.11}$$
where $P_i$ is the $i$-th camera and $X_j$ the $j$-th world point (the superscript $c$ denotes the transformed entities; note also that at this point we do not yet have world points, nor do we need them). Observe that the set of cameras is still not unique: there is a four-parameter choice for the last row of $\tilde{P}$ (for finite cameras we can use $(0, 0, 0, 1)^T$). Such a canonical pair of cameras has the form $P \sim [I_{3\times3} \mid 0]$ and $P' \sim [M \mid m]$.
As seen above, there is a four-parameter choice in the set of canonical cameras. Without further proof, the general four-parameter formula for a pair of canonical camera matrices corresponding to a fundamental matrix $F$ is given by
$$P \sim [I_{3\times3} \mid 0], \qquad P' \sim [\,[e']_\times F + e'v^T \mid \lambda e'\,] \tag{2.3.12}$$
where $e'$ is the epipole in the second image (so that $e'^TF = 0$), $[e']_\times$ is the skew-symmetric cross-product matrix of $e'$, $v$ is any 3-vector and $\lambda$ a non-zero scalar; together, $v$ and $\lambda$ encode the four unknown parameters.
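As an illustrative sketch of the canonical pair, the code below builds the cameras for the particular choice $v = 0$, $\lambda = 1$ and verifies that they reproduce the given fundamental matrix up to scale. The function names and the example $F$ are assumptions; the check relies on the standard fact that a pair $[I \mid 0]$, $[M \mid m]$ has fundamental matrix $[m]_\times M$.

```python
import numpy as np


def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def canonical_cameras(F):
    """Return P = [I | 0] and P' = [[e']_x F | e'] (choice v = 0, lambda = 1)."""
    U, _, _ = np.linalg.svd(F)
    e2 = U[:, -1]                         # unit left null vector: e'^T F = 0
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew(e2) @ F, e2.reshape(3, 1)])
    return P1, P2


# Example rank-2 matrix F = [t]_x R built from an assumed rotation and translation.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
F = skew(np.array([1.0, 0.2, 0.0])) @ R

P1, P2 = canonical_cameras(F)
F_from_cameras = skew(P2[:, 3]) @ P2[:, :3]   # [m]_x M for the pair [I|0], [M|m]
print(np.round(F_from_cameras + F, 12))
```

Since $e'$ has unit norm and $e'^TF = 0$, the recovered matrix equals $-F$, which represents the same fundamental matrix because $F$ is defined only up to scale.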
The Factorization Method
The factorization method described in [MHOP01] was first proposed for the orthographic camera model. The original method assumes that $n$ feature points are observed by $m$ orthographic cameras. Then, by stacking the $n$ corresponding points of the $m$ frames, a registered measurement matrix $W$ with dimensions $3m \times n$ is formed, as follows:
$$W = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} = \begin{pmatrix} P_1 \\ \vdots \\ P_m \end{pmatrix} \begin{pmatrix} X_1 & \cdots & X_n \end{pmatrix} \tag{2.3.13}$$
where $X_j = (x_j, y_j, z_j, 1)^T$ $(j = 1, \ldots, n)$ are the unknown homogeneous 3D point vectors, $P_i$ $(i = 1, \ldots, m)$ are the unknown $3 \times 4$ image projection matrices associated with camera $i$, and $x_{ij} = (u_{ij}, v_{ij}, 1)^T$ are the measured homogeneous image point vectors.
Then, using Singular Value Decomposition (SVD), two matrices are computed, which represent object shape and camera motion respectively.
Mahamud et al. [MHOP01] proposed a bilinear iterative algorithm that adds new constraints to the error function minimized by the Sturm-Triggs method [ST01]. In their method, initial projective depth values are obtained using the Kanade orthographic method [TK92b], which avoids the need to estimate the projective depths (projective scale factors that represent the depth information lost during image projection) from epipolar geometry. They showed that their minimization algorithm is guaranteed to converge to a local minimum of its error function. Their implementation results showed that the method converges in fewer than 20 iterations and yields errors comparable to those of the Sturm-Triggs method [ST01].
The full original iterative projective factorization algorithm [MHOP01] is described as follows:
1. Compute the current scaled measurement matrix $W$ by equation (1);
2. Normalize $W$ by subtracting the mean of each frame from every point;
3. Perform the rank-3 factorization of $W$ by SVD, $W = USV^T$, to generate an estimate of the projective matrix $P$ and the shape matrix $X$: $P = U_3$ and $X = S_3V_3^T$, where $U_3$, $S_3$ and $V_3$ are the sub-matrices obtained from $U$, $S$ and $V$ using only the first 3 columns (the ones associated with the 3 largest singular values), and $S$ is a diagonal matrix whose elements $\sigma$ are the singular values of $W$.
Algorithm 1 - Original projective factorization algorithm.
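Steps 2 and 3 of the algorithm above can be sketched with a truncated SVD. This is only an illustration of the factorization step, not the full iterative method of [MHOP01]: the depth re-estimation of step 1 is omitted, and the synthetic affine cameras, the function name and the random data are our own assumptions.

```python
import numpy as np


def rank3_factorization(W):
    """Register W by row means, then split it as P = U_3 and X = S_3 V_3^T."""
    W0 = W - W.mean(axis=1, keepdims=True)        # step 2: subtract each row's mean
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    P = U[:, :3]                                  # motion estimate
    X = np.diag(s[:3]) @ Vt[:3]                   # shape estimate
    return W0, P, X


# Synthetic 3m x n measurement matrix: m = 4 affine cameras, n = 10 points,
# each frame contributing rows (u, v, 1).
rng = np.random.default_rng(1)
shape = rng.normal(size=(3, 10))
frames = []
for _ in range(4):
    M = rng.normal(size=(2, 3))                   # assumed 2x3 affine motion
    offset = rng.normal(size=(2, 1))
    frames.append(np.vstack([M @ shape + offset, np.ones((1, 10))]))
W = np.vstack(frames)                             # shape (12, 10)

W0, P, X = rank3_factorization(W)
print(P.shape, X.shape, np.abs(P @ X - W0).max())
```

For noise-free affine data the registered matrix has rank 3, so the product of the two factors reproduces it exactly; with noisy data the truncation gives the closest rank-3 approximation in the least-squares sense.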
Non-linear Bundle Adjustment
Ideally, to solve the structure from motion problem, the mean-squared distance between the observed image points and the point positions predicted from the parameters $\lambda_{ij}$, $P_i$ and $X_j$ should be minimized, i.e.:
$$E = \min \sum_{i,j} \left\| x_{ij} - \frac{1}{\lambda_{ij}}P_iX_j \right\|^2. \tag{2.3.14}$$
However, the corresponding problem is difficult since the error is highly non-linear in the unknowns λij, Pi, and Xj.
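The cost inside equation 2.3.14 can be evaluated for given parameters as in the following sketch, which computes the sum of squared reprojection distances. It does not perform the minimization; the function name, array layout and toy data are illustrative assumptions, and $\lambda_{ij}$ is taken as the third coordinate of $P_iX_j$.

```python
import numpy as np


def reprojection_error(cameras, points, observations):
    """Sum of squared distances between observations and P_i X_j / lambda_ij.

    cameras: (m, 3, 4); points: (n, 4) homogeneous; observations: (m, n, 2).
    """
    proj = np.einsum('ikl,jl->ijk', cameras, points)   # P_i X_j, shape (m, n, 3)
    depth = proj[..., 2:3]                             # lambda_ij
    predicted = proj[..., :2] / depth
    return float(np.sum((observations - predicted) ** 2))


cameras = np.array([np.hstack([np.eye(3), np.zeros((3, 1))])])   # one camera [I | 0]
points = np.array([[0.0, 0.0, 2.0, 1.0],
                   [2.0, 4.0, 2.0, 1.0]])
observations = np.array([[[0.0, 0.0],
                          [1.0, 2.5]]])   # second observation is off by 0.5 in v

error = reprojection_error(cameras, points, observations)
print(error)
```

The division by the depth is what makes the cost non-linear in the unknowns, as noted in the text.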
Similar to the linear vs non-linear calibration algorithms, the main disadvantage of non-linear projective reconstruction methods is that they are iterative methods and they need an initial solution that has to be close enough to the real solution to avoid local minima. This is why linear methods are still useful in order to be used as an initial solution.
2.3.6 Incremental Projective Reconstruction
Time-critical applications can accept sub-optimal reconstruction estimates obtained by incremental approaches. There are two categories of incremental techniques. The first is based on probabilistic methods such as extended Kalman filter theory [BCC90, MRM94, SPFP96, Dav05, DRMS07], which is able to model the non-linearity between structure and motion estimates. Other probabilistic approaches use the particle filter [GTS+07, KRD07, TM06].
The second category relies on the subdivision of a video stream into sub-sequences and is based on the concatenation of successive views related by the epipolar geometry. This problem has been investigated by Repko and Pollefeys in [MP05, PGV+02] by counting the number of feature points tracked in successive pairs of images and analyzing the reprojection errors for pairs and triplets of views under epipolar geometry and homographies, using the Geometric Robust Information Criterion (GRIC) proposed in [TFZ98]. Their keyframe criterion selects two or three views where the GRIC score of the epipolar model is lower than the score of the homography model. A similar idea was presented in [GCH+02].
Recently Martinec and Pajdla [MP05, MP06] have proposed incremental methods using triplets of images that can cope with missing data.
2.4 3D Scene Reconstruction
2.4.1 Camera Calibration
Geometric Camera calibration is a necessary step for recovering the 3D position of a scene point when only its projections in two images with different viewpoint are known. Each projection defines a ray in space; the intersection of both rays is the 3D point location.
By calibration we mean the determination of the intrinsic matrix ($K$) and the external pose parameters ($R$, $t$) of the image formation model. Sometimes radial or other distortions are modeled using additional parameters. The computed geometric model relates the 3D coordinates of a scene point to the 2D coordinates of its projection in the image.
2.4.2 Triangulation
Once camera calibration is known, it is possible to compute the 3D positions of the image points observed by multiple cameras through a process called triangulation.
Figure 2.6: Calibrated Reconstruction by Triangulation.
Triangulation in its simplest form is illustrated in Fig. 2.6: the direction towards a target position X in space is determined from two different locations.
Given a three-dimensional point X, the first step consists in finding the points xl and xr projected in the left and right image planes Il and Ir respectively. Then, point X lies on line Ll joining xl and the left optical center CL and, similarly, we know that X lies along a line Lr joining xr and CR. Assuming that the camera parameters (intrinsic and extrinsic) are known, the parameters of Ll and Lr can be explicitly computed. Therefore, the point X is at the intersection of the two lines. This procedure is called triangulation.
Given the projection equations for each camera view, $x_l = PX$ and $x_r = P'X$, these equations can be combined into the form $AX = 0$, which is linear in $X$ and can be solved to find the three-dimensional location of the point $X$.
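One common way to build the system $AX = 0$ is sketched below: each view contributes two rows obtained by eliminating the unknown scale from $x \sim PX$. The function name and the two example cameras are illustrative assumptions, and noise-free measurements are used.

```python
import numpy as np


def triangulate(P1, P2, x1, x2):
    """Linear triangulation: solve A X = 0 built from x ~ P X in both views."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null vector of A (least squares with noise)
    return X[:3] / X[3]            # back to Euclidean coordinates


# Assumed cameras: the second one is shifted by one unit along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])
x1 = X_true[:2] / X_true[2]                                   # left image point
x2 = (X_true - np.array([1.0, 0.0, 0.0]))[:2] / X_true[2]     # right image point

X_est = triangulate(P1, P2, x1, x2)
print(X_est)
```

With noisy measurements the two rays no longer intersect exactly, and the singular vector gives the algebraic least-squares solution instead of an exact intersection.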
Thus the triangulation process can be used to reconstruct a scene from point correspondences, but only after the set of camera views have been calibrated.
Finding the calibration matrices $P$ and $P'$ of a set of camera views for a shared coordinate system means reconstructing the camera views. If only a projective reconstruction of the camera views can be determined, then only a projective reconstruction of the scene can be created (see Section 2.3.5).
2.4.3 Survey of Camera Calibration
Geometric calibration methods can be classified into two main groups depending on the nature of the information used:
Photogrammetric calibration
Photogrammetric methods use a calibration pattern with known geometry to calibrate the cameras. The input to the calibration algorithm is the set of 3D points of the pattern and their corresponding 2D projections. To recover the 11 parameters that define the camera projection matrix $P$, the optimization of a cost criterion is computed. Linear methods for camera calibration known as DLT (Direct Linear Transform) were the first to appear [AAK71], formulating the calibration problem as the solution of a system of linear equations (see [HZ00a] or [FL01] for a detailed description and theory).
Figure 2.7: Examples of calibration patterns.