Incremental Self-calibrated Reconstruction from Video

by

Rafael Lemuz López

Thesis submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF SCIENCE WITH A SPECIALTY IN COMPUTER SCIENCE

at the

Instituto Nacional de Astrofísica, Óptica y Electrónica

April 2008, Tonantzintla, Puebla

Supervised by:

Dr. Miguel Octavio Arias Estrada, INAOE

© INAOE 2008

The author grants INAOE permission to reproduce and distribute copies of this thesis in whole or in part

Summary

Self-calibrated 3D reconstruction algorithms deal with the problem of recovering the three-dimensional structure of a scene and the camera motion from 2D images. A distinctive property of self-calibrated reconstruction methods is that camera calibration (the estimation of the intrinsic camera parameters: focal length, principal point, and radial lens distortion; and of the extrinsic parameters: orientation and position) is computed using geometric information contained in the projective images of real scenes. Algorithms that solve 3D reconstruction problems rely heavily on finding correct matches between salient features that correspond to the same scene elements in different images. Using this correspondence data, a projective estimate of the 3D scene structure and camera motion is computed. Finally, using geometric constraints, the camera parameters are estimated and the projective model is upgraded to a metric one.

This thesis proposes new algorithms to solve problems involved in self-calibrated reconstruction methods, including salient point detection, robust feature matching, and projective reconstruction. An improved salient point detection algorithm is proposed that ranks interest points more closely in accordance with the intuitive notion of a corner by directly computing the angular difference between dominant edges. A robust feature matching algorithm is also proposed that merges spatial and appearance properties of putative match candidates, which increases the number of correct matches and discards false match pairs. In addition, a projective reconstruction algorithm is proposed that selects online the frames that contribute most to the projective reconstruction process; this addresses key frame selection in the self-calibrated 3D pipeline, one of the intrinsic limitations of factorization-like algorithms. A full 3D reconstruction pipeline is developed with the proposed algorithms. Promising results are shown, and the contributions and limitations of this work are discussed.

Resumen (English translation)

Self-calibrated 3D reconstruction algorithms deal with the problem of recovering the 3D information of a scene and the motion of the camera from images. A distinctive property of self-calibrated reconstruction methods is that the intrinsic camera parameters (focal length, principal point, and even radial distortion) as well as the extrinsic parameters (the orientation and position of the camera relative to the scene) are computed using geometric information intrinsically contained in the images of a real static scene. That is, these methods do not use additional tools such as feedback motors for computing the focal length or prefabricated calibration patterns.

However, the self-calibrated reconstruction process depends strongly on having identified corresponding points between image regions that represent the same scene element captured from different viewpoints. Thus, using only point correspondences, a first estimate of the scene structure and the camera motion is obtained that does not preserve distances and angles, called a projective reconstruction. Subsequently, by making some assumptions and imposing constraints on some camera parameters, the projective model is upgraded to a Euclidean model that differs from the real scene by a scale factor and by the original orientation.

This thesis proposes new algorithms for the self-calibrated reconstruction problem, in particular for the problems of interest point detection, correspondence search, and projective reconstruction.

An interest point detection algorithm is proposed that ranks the detected points better according to the intuitive notion of a corner by directly computing the angular difference between the dominant edges. A new correspondence search algorithm is proposed that integrates spatial and appearance properties into a similarity metric between candidate corresponding points. The new algorithm increases the number of correspondence pairs and at the same time reduces matching errors. In addition, a projective reconstruction algorithm is proposed that selects at run time the images that contribute most during the reconstruction process, in order to overcome one of the inherent limitations of projective reconstruction algorithms based on the factorization method: the selection of the most important frames during the complete self-calibrated reconstruction process. Finally, promising results are shown, and the contributions and limitations of this work are discussed.

Acknowledgements

There are many people who have provided guidance and support throughout the years and whom I wish to thank. First, my advisor, Miguel Octavio Arias Estrada, who has guided me through these years and has taught me what it means to be a researcher. Second, Patrick Hebert, who showed me the significance of clear and precise communication of research results. I want to thank Professors Leopoldo Altamirano Robles, Olac Fuentes Chaves, and Aurelio López López, who had a great impact on my academic and professional skills by giving me the opportunity to interact with them during my stay at INAOE.

Then Eliezer Jara, for teaching me systematic analysis in laboratory practice and for sharing his invaluable experience in building prototypes for diverse computer vision applications, which has had an enormous impact on my professional training. I also want to thank the interesting people I have met along the way, with whom I had the opportunity to interact through informal discussions and some of whom provided support and encouragement: Blanca, Rita, Irene, Luis, Jorge, and Marco Aurelio. I especially want to express my gratitude to Carlos Guillen for the hours invested in clarifying some mathematical concepts during the last year.

I also thank the members of the LVSN lab at Laval University, in particular Jean-Daniel Deschênes and Jean-Nicolas Ouellet, for making the visit to Quebec so pleasant.

Finally, I also want to acknowledge the support given by the technical staff of INAOE, in particular the people of the computer science department.

This research was done with the financial support of CONACYT scholarship grant 184921.

Dedication

To my parents and brothers ....

Contents

1 Introduction
    1.1 Overview of 3D reconstruction from video
        1.1.1 Interest point detector
        1.1.2 Matching correspondence
        1.1.3 Projective reconstruction
        1.1.4 Self-Calibration
        1.1.5 Rectification
        1.1.6 Dense Stereo Reconstruction
    1.2 Objectives
        1.2.1 Main Objective
        1.2.2 Particular Objectives
    1.3 Contributions
        1.3.1 Robust feature matching
        1.3.2 Incremental 3D reconstruction by inter-frame selection
    1.4 Organization of the Thesis
    1.5 Conclusions

2 Multiple View Geometry
    2.1 Preliminaries
        2.1.1 Homogeneous Coordinates
    2.2 Camera Models
        2.2.1 Perspective model
        2.2.2 Orthographic Model
        2.2.3 Lens Distortion
    2.3 Multiple View Constraints
        2.3.1 Two view Geometry
        2.3.2 Fundamental Matrix estimation
        2.3.3 Planar Homography
        2.3.4 Homography estimation
            Number of Measurements
        2.3.5 Projective Reconstruction
            Merging Projective matrices using Epipolar Geometry
            The Factorization Method
            Non-linear Bundle Adjustment
        2.3.6 Incremental Projective Reconstruction
    2.4 3D Scene Reconstruction
        2.4.1 Camera Calibration
        2.4.2 Triangulation
        2.4.3 Survey of Camera Calibration
            Photogrammetric calibration
            Self-calibration
        2.4.4 Absolute Conic
    2.5 Stratified Self-calibration
        2.5.1 Affine Stratification
    2.6 RANSAC computation
    2.7 Conclusions

3 The Correspondence Problem
    3.1 Introduction
    3.2 Feature Correspondence Overview
    3.3 Salient point detection
        3.3.1 Pioneer Feature Detectors
            First Derivative Methods
            Second derivative methods
            Local energy methods
            Detectors of junction regions
        3.3.2 Invariant Feature Detectors
    3.4 Salient point Descriptor
        3.4.1 SIFT descriptor
    3.5 Matching salient points
    3.6 Geometric Constraints for Matching
    3.7 The importance of Gaussian Integration Scale and Derivative filters
    3.8 Cov-Harris: Improved Harris corner Detection
        3.8.1 Segmentation of Partial Derivatives
        3.8.2 Edge direction estimation by Covariance Matrix
        3.8.3 Ranking Corner Points by the Angular difference between dominant edges
    3.9 Discussion

4 IC-SIFT: Robust Feature Matching Algorithm
    4.1 Introduction
    4.2 Related Work
        4.2.1 Scale Invariant Feature Transform
        4.2.2 Iterative Closest Point ICP
    4.3 IC-SIFT: Iterative Closest SIFT
        4.3.1 Finding Initial Matching Pairs
        4.3.2 Matching SIFT features: adding a weighted distance factor
        4.3.3 Differencing Registration Error
    4.4 Robust feature Matching Experimental Results
    4.5 Discussion

5 A new Incremental Projective Factorization Algorithm
    5.1 Introduction
    5.2 Related Work
    5.3 Projective Factorization
    5.4 Proposed Incremental Projective Reconstruction Algorithm
        5.4.1 Domain Reduction by inter-frame Selection
        5.4.2 Incremental Projective Reconstruction Algorithm
    5.5 Incremental Projective Reconstruction Experimental Results
        5.5.1 Incremental Projective Reconstruction Accuracy
        5.5.2 Processing Time
        5.5.3 Real Image Sequence experiments
        5.5.4 Conclusions

6 Implementation and Experimental Results
    6.1 Self-calibrated reconstruction from video experiments
    6.2 Salient Point detection
    6.3 Salient point detection by Harris algorithm
    6.4 Matching restricted list to estimate geometric constraints
        6.4.1 Robust fundamental matrix estimation
        6.4.2 Enforcing Epipolar Constraint for semi-dense matching
    6.5 Projective and Euclidean Reconstruction
    6.6 Discussion

7 Conclusions
    7.1 Summary of contributions
        7.1.1 Robust feature matching for wide separated views
        7.1.2 Incremental 3D reconstruction by inter-frame selection
        7.1.3 Robust feature matching on video sequences
    7.2 Future work
        7.2.1 Tracking algorithm with motion blur
        7.2.2 Inter-frame selection removing critical configurations
        7.2.3 Collaborative structure from motion
        7.2.4 Real-time processing

1 Introduction

Recovering three-dimensional information about a scene from multiple images captured with a camera is one of the fundamental problems of computer vision.

There are numerous methods to deal with this problem. They can be classified into different taxonomies according to their intrinsic properties, for example by the kind of sensor (sonar, range laser, fringe projectors, and inertial measurement units), by whether the scene is changed by modifying the lighting conditions (active methods) or not (passive methods), and by the source of information analyzed to extract depth (shadows, texture, contour, geometry, focus, defocus, symmetry, disparity, reciprocity, light fields, and photometry). A further distinction is made according to whether the scene remains static or is dynamic while the information is processed. When video cameras are used to recover depth information, reconstruction methods are called pre-calibrated if the parameters of the camera's image formation mapping are known, and self-calibrated if they are unknown.

The choice of method depends on the requirements of the specific problem, including accuracy, precision, processing speed, mobility, accessibility to information sources, modification of natural ambient light, dimension constraints, and budget, to mention just a few. The ideal method for each application is a trade-off between these and other, less obvious constraints, for example the need for portability, whether human user interaction is allowed, the need for a specific 3D model representation (depth map, voxels, mesh, level sets, or vector fields), and the amount and quality of the generated information; i.e., some applications require a special model representation and a full description of the scene, while for others a sparse model representation can be enough. A notable work that highlights the importance of using the same data representation throughout the whole process, from 3D reconstruction and partial view registration to rendering and visualization, is presented in [THL02, THL03, THDL04], where a common framework based on vector fields allows real-time reconstruction from range curves with a hand-held scanner.

An in-depth description of 3D reconstruction methods is beyond the scope of this thesis; we refer the interested reader to excellent recent surveys in different computer vision domains [Cur, SCMS01a, Héb01, SCD⁺06].

This thesis deals with the estimation of the structure of a scene from images by self-calibrated methods. This approach has attracted the attention of numerous research groups in recent years because it can extract three-dimensional information from a set of images without prior knowledge of the camera. The problem is also called Structure from Motion (SfM), and is known as simultaneous localization and mapping (SLAM) in the robotics literature. In the last few years important progress has been made in this research area, but the problem is still hard to solve, and there is no method that can be applied to general scenes and that fulfills most of the requirements expressed in the previous paragraphs. Assuming a static scene viewed by a camera undergoing rigid motion, the problem has been formulated in several ways, and state-of-the-art research has focused on individual image processing stages and on the development of robust high-level stages that recover the unknown camera parameters for different camera models using only a set of images as input data.

Some properties of the self-calibrated reconstruction method that highlight its advantages over more sophisticated methods with expensive set-ups (for example laser range finders, pattern projectors, lighting arrays, Global Positioning Systems, and Inertial Measurement Units) are described here. Some of them derive from the fact that self-calibrated reconstruction can recover structure and motion using only one moving camera:

• Automatic recovery of the camera location and orientation with respect to the scene, up to a Euclidean transformation.

• The possibility of computing an estimate of 3D models from a set of images taken with the same camera, without further information.

• Low cost, since the widespread use of video cameras in recent years has reduced their price.

• Three-dimensional reconstruction of both close and distant scenes (indoor and outdoor model generation).

• Portability, mobility, and lower energy consumption.

The resolution and image quality of low-cost cameras, such as those used in cellular phones, keep increasing, which makes them feasible for self-calibrated 3D model reconstruction.

However, self-calibration methods have some drawbacks when compared with methods that use specialized set-ups:

• Self-calibration requires textured surfaces to model a scene, so it cannot cope with scenes of homogeneous texture.

• The model is recovered as a sparse set of 3D points instead of dense depth maps.

• The recovered 3D models are of lower quality than those obtained by methods that use more complex hardware components, such as structured-light methods.

• Strong dependence on establishing correspondences between salient points in images that represent the same scene element, which is difficult for widely separated views.

1.1 Overview of 3D reconstruction from video

The methods to recover 3D models from images taken with an uncalibrated camera [MHOP01, ST01, RP05, MP05, HZ00b, GSV01] presume that a sequence of images is available. They rely on the assumption that the scene remains static while the camera moves around the scene to be modeled. An important requirement of self-calibrated reconstruction methods is that the scene contains a sufficiently large set of distinctive image regions that can be distinguished in different views. The main processing steps involved in self-calibrated reconstruction are shown in Figure 1.1.

The first step is to identify salient points in the images. Pioneering approaches to self-calibrated reconstruction used standard corner detection algorithms, but more recently the need for affine-invariant point detection algorithms has emerged, and important progress has been made in this area. Invariant salient point detection is needed because even small viewpoint changes during image capture modify the appearance of salient points, owing to varying lighting conditions and projective deformation during the image formation process.

After a set of salient points has been identified, the next step is to find, for each salient point in the first image, the feature points in subsequent images that correspond to the same scene element. This is called the correspondence problem. An important assumption made during this step is that consecutive frames do not differ too much. This makes it possible to restrict the search space for corresponding features in different images and to match them using cross-correlation methods. However, since camera motion is unconstrained and unknown, more sophisticated approaches have appeared that use invariant feature point descriptors, reduce the search space by means of geometric constraints, and compute robust estimates to select only the best match candidates.
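As a rough illustration of the cross-correlation matching mentioned above, the following Python sketch searches a small window in the second image for the patch that best matches the patch around a point in the first image. It is a minimal sketch and not the matching method of this thesis; the function names `ncc`, `extract_patch`, and `best_match`, the list-of-rows image representation, and the default window sizes are assumptions made for the example.

```python
import math


def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches (flat lists)."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return sum(x * y for x, y in zip(da, db)) / denom


def extract_patch(image, row, col, half):
    """Flatten the (2*half+1) x (2*half+1) window centered at (row, col)."""
    return [image[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def best_match(img1, img2, point, half=1, radius=2):
    """Search img2 within `radius` pixels of `point` for the best NCC score.

    `point` is (row, col) in img1 and must lie at least `half` pixels
    away from the image border.
    """
    rows, cols = len(img2), len(img2[0])
    r0, c0 = point
    ref = extract_patch(img1, r0, c0, half)
    best_score, best_pos = -1.0, None
    for r in range(max(half, r0 - radius), min(rows - half, r0 + radius + 1)):
        for c in range(max(half, c0 - radius), min(cols - half, c0 + radius + 1)):
            score = ncc(ref, extract_patch(img2, r, c, half))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

The restricted search radius encodes the assumption that consecutive frames differ little; for unconstrained camera motion this assumption fails, which is why descriptor-based and geometrically constrained matching is needed.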

The third stage assumes that the correspondence problem has been solved and that a measurement matrix containing the true salient point matches across all images has been built. Then, an estimate of camera motion and scene structure is recovered. However, since the camera parameters are unknown, this estimate is only a projective representation of the real metric scene.

Thus, the next step is to find a projective mapping that transforms the projective reconstruction into a metric reconstruction. There are two main approaches to solve this problem. The first, stratified approach explicitly estimates the affine transformation by finding the plane at infinity and then uses the absolute conic (a special conic that lies in the plane at infinity) to find the camera parameters by imposing restrictions on them (e.g. rectangular or square pixels, constant aspect ratio, and principal point in the middle of the image). The second upgrades the projective reconstruction to a metric one directly, without the intermediate affine step.

Up to this stage, the recovered model consists of a sparse set of points that differs from the real scene by a scale factor, which can be resolved if one real distance between two salient points of the scene is known. Moreover, once the camera parameters are known, it is possible to compute a dense reconstruction of the scene by using standard calibrated stereo reconstruction frameworks.
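Resolving the scale factor from one known distance can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `rescale_points`, the representation of the reconstruction as a list of 3D tuples, and the index-based choice of the two reference points are hypothetical.

```python
import math


def rescale_points(points, idx_a, idx_b, true_distance):
    """Scale a metric reconstruction so that |P_a - P_b| equals a known distance."""
    current = math.dist(points[idx_a], points[idx_b])
    if current == 0.0:
        raise ValueError("reference points coincide; scale is undetermined")
    s = true_distance / current  # global scale factor
    return [tuple(s * v for v in p) for p in points]
```

In a complete pipeline the same factor would also have to be applied to the camera translations so that cameras and points stay consistent.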

If a dense map is needed, the camera parameters of a pair of images can be used to compute a rectification that aligns the images in such a way that corresponding points can be found by searching along a line. Then, robust dense stereo matching algorithms can establish correspondences for almost every pixel between the images. However, even with this geometric restriction the problem is difficult because of textureless regions and occluded image areas.

Figure 1.1, taken from [PGV⁺04], illustrates the steps to achieve self-calibrated 3D modeling from video. Different state-of-the-art methods have their own specific components but follow a similar pipeline.

The following subsections give an overview of the stages of the multiple view reconstruction method and of the algorithms commonly used.

Figure 1.1: The steps to achieve self-calibration from multiple images taken from [PGV⁺04].

1.1.1 Interest point detector

The first step consists of automatically detecting 'interest points' in the images, i.e., points that are sufficiently different from their neighboring pixels.

Numerous algorithms have been proposed to extract interest points from images. Different properties of the region around a point are used to define which points in an image are 'interesting'. Some detectors find points of highly varying texture, while others locate corner points. Corner points are formed where two or more non-parallel edges meet. An edge in an image is a sharp variation of the intensity function. Edges usually define the boundary between two different objects or between parts of the same object.

In general, interest point detectors find image areas with high variance in at least two directions. The variances along different directions, computed using all pixels in a window centered on a point, are good measures of distinctness. The Harris and Stephens corner detector [HS88] is usually selected for this task, since the corner responses that the Harris operator estimates through eigenvalue analysis can be made invariant to scale by using pyramidal processing, as in [Lin98, MS02]. Although there are alternatives such as Sojka [Soj03], SUSAN [SB97], and KLT [KT91, ST94], a recent study of the corner stability and corner localization properties of the features extracted by different algorithms suggests that the KLT and Harris corner detectors are the most suitable for tracking features in long sequences [TS04]. State-of-the-art algorithms have extended the Harris algorithm to make it stable under affine image transformations [MS02, TG04, MS05a] and applicable in scenarios where small viewpoint changes modify the local appearance of salient points [Low99].
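The idea of measuring intensity variation in two directions can be made concrete with a small sketch of the standard Harris and Stephens response, det(M) - k trace(M)^2, where M accumulates products of image derivatives over a window. The sketch below is a simplification and not the Cov-Harris detector proposed in this thesis: it assumes central-difference derivatives, a uniform (box) window instead of a Gaussian one, and the commonly used value k = 0.04; the function name `harris_response` is hypothetical.

```python
def harris_response(image, row, col, half=1, k=0.04):
    """Harris corner response at (row, col) from a (2*half+1)^2 box window.

    `image` is a list of rows of gray values; (row, col) must lie at least
    half + 1 pixels away from the image border.
    """
    sxx = sxy = syy = 0.0
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            ix = (image[r][c + 1] - image[r][c - 1]) / 2.0  # horizontal derivative
            iy = (image[r + 1][c] - image[r - 1][c]) / 2.0  # vertical derivative
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    det = sxx * syy - sxy * sxy   # product of the two eigenvalues of M
    trace = sxx + syy             # sum of the two eigenvalues of M
    return det - k * trace * trace
```

The response is positive where both eigenvalues of M are large (a corner), negative where only one is large (an edge), and close to zero in flat regions, so the eigenvalues never need to be computed explicitly.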

1.1.2 Matching correspondence

After detecting interest points, the next step is to track those features across different images in a video sequence. The goal is to find for every interest point in the first image the corresponding point in subsequent images associated with the same scene element.

The correspondence problem has been studied in depth in two different setups. The first is the 'stereo' correspondence problem, where the camera motion is restricted to be mainly translational and the images of the same scene are pre-aligned, which limits the search for corresponding points to the same image row; see [SS02, BBH03] for recent reviews. Even under these constraints the problem is difficult to solve because of image noise, object occlusions, varying lighting conditions, specular highlights, shadows, and motion blur.

The second, harder setup is 'wide baseline matching', where the images are captured under large and unknown camera motion and the difficulty increases because of perspective effects, varying scale, and stronger variations in lighting conditions.

The problem of finding correspondences in video streams is known in the literature as multi-feature tracking. Although many tracking algorithms exist [SPFP96, HB96, FTTR99, SHF01], the Kanade-Lucas-Tomasi tracking algorithm [KT91] is commonly used. When only a few images of the object or scene are available, wide-baseline matching methods can be used [ZDFL95a, FTG03]. These methods match images using affine-invariant regions, which are robust but computationally more expensive [MS05a].

1.1.3 Projective reconstruction

Projective reconstruction is the best that can be done without camera calibration or additional metric information about the scene [Tri97]. Thus, when only feature correspondences are known, the recovered camera poses and scene structure differ from the metric reconstruction by a projective transformation.
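The projective ambiguity can be written compactly. In the notation assumed here for illustration (these symbols are not introduced at this point of the thesis), $P_i$ is the $3 \times 4$ camera matrix of view $i$, $X_j$ is a homogeneous scene point, $x_{ij}$ is its homogeneous image point, $\lambda_{ij}$ is a nonzero scale, and $H$ is any invertible $4 \times 4$ matrix:

```latex
\lambda_{ij}\, x_{ij} \;=\; P_i X_j \;=\; \left(P_i H^{-1}\right)\left(H X_j\right),
\qquad H \in \mathbb{R}^{4 \times 4},\quad \det H \neq 0 .
```

Both $(P_i, X_j)$ and $(P_i H^{-1}, H X_j)$ reproduce exactly the same image points, so correspondences alone cannot distinguish between them; additional constraints on the cameras or the scene are needed to select the metric solution.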

There are two kinds of methods (with many variants) for the projective reconstruction step: those based on epipolar geometry and those based on factorization. In methods based on epipolar geometry [FLM92, GSV01, RP05, MP05], the first two images are used to initialize a reference frame. The world frame is aligned with the first camera, and from the third image onward each new view is aligned with the reconstruction through the fundamental matrix that relates it to the previous image. In this way the method estimates camera motion and 3D structure for each view. When the last image has been processed, a nonlinear optimization algorithm can refine the camera matrices and the 3D structure.

Factorization methods solve the projective reconstruction problem using a data matrix (the image coordinates of the corresponding points in all the images). The data matrix is factorized using singular value decomposition (SVD) into two matrices, which represent object shape and camera motion respectively. The factorization method, first developed for the orthographic projection model [TK92a, TK92b], was later extended to weak perspective, para-perspective, and projective camera models [MK94, PK97, ST01, HK00, MHOP01]. The factorization method is preferable to the epipolar approach because of its accuracy, numerical stability, and robustness, and because it avoids computing the epipolar geometry, which is prone to errors when the separation between images is small, in which case human intervention is implicitly needed to select appropriate images.
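The factorization step can be summarized as follows. This is a hedged sketch in notation assumed for the example (the symbols $W$, $\hat{M}$, $\hat{S}$, $A$, and the rank $r$ are not taken from this chapter): $W$ stacks the image coordinates $x_{ij}$ of $P$ points seen in $F$ images, with $r = 3$ for the orthographic model (coordinates centered per image) and $r = 4$ for the projective model, where each homogeneous $x_{ij}$ must first be scaled by an estimated projective depth.

```latex
W =
\begin{bmatrix}
x_{11} & \cdots & x_{1P} \\
\vdots &        & \vdots \\
x_{F1} & \cdots & x_{FP}
\end{bmatrix}
= U \Sigma V^{\top}
\;\approx\;
\underbrace{U_r \Sigma_r^{1/2}}_{\hat{M}\ \text{(motion)}}\;
\underbrace{\Sigma_r^{1/2} V_r^{\top}}_{\hat{S}\ \text{(shape)}},
\qquad
\hat{M}\hat{S} = \left(\hat{M} A\right)\left(A^{-1} \hat{S}\right).
```

Here $U_r$, $\Sigma_r$, and $V_r$ keep only the $r$ largest singular values and their singular vectors. The invertible $r \times r$ matrix $A$ expresses the remaining ambiguity: the factors are determined only up to $A$, which is an affine ambiguity in the orthographic case and a projective one in the projective case.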

1.1.4 Self-Calibration

A projective reconstruction does not preserve the parallelism, length ratios, and angles between lines of the real 3D scene. The process of upgrading a projective reconstruction to a metric one, in which those properties are preserved, is called self-calibration or auto-calibration. To perform this upgrade, both the parameters of the perspective projection that models the image formation process and the camera location must be estimated.

Assuming that all images are taken by the same camera and that some internal camera parameters are known, the Euclidean structure of the scene can be recovered and, furthermore, the camera calibration can be solved.

The first self-calibration method [FLM92, MF92] directly finds the intrinsic camera parameters that are consistent with the underlying projective geometry of a sequence of images using pairwise epipolar geometry.

Hartley et al. [Har92, AZH96] proposed a non-linear least squares method to solve self-calibration, but it needs a good initial guess for the unknowns. In [PG97], Pollefeys and Van Gool extend the Hartley method: the projective reconstruction is first upgraded to an affine reconstruction and then to a metric reconstruction, assuming a variable focal length while the other camera parameters remain constant. In [HK00], Han et al. proposed a linear algorithm that recovers the intrinsic parameters when the principal point and the focal lengths are unknown and simultaneously converts the projective solution to a Euclidean one.

The structure of the scene recovered after self-calibration is a sparse set of points. A dense depth map must be estimated to build a realistic 3D model.

Two additional steps can accomplish this task: Rectification and Dense Stereo Reconstruction.

1.1.5 Rectification

Given two or more images and the matching points between them, the rectification process exploits the epipolar constraint to align the images in such a way that all corresponding points have the same y-coordinate in two images. This image transformation greatly reduces the search for feature point correspondences to a thin band of scan-lines rather than a single one: because of image noise, the estimated epipolar geometry is prone to small errors, which makes it necessary to extend the search for corresponding points to neighboring scan-lines.

When there is no close-up motion between images, planar rectification can correctly align the images [Har98] by projecting both images onto a plane that is parallel to the baseline. To handle forward/backward camera motion, non-planar rectification algorithms project the images using projective matrices [FTV97, FTV00]. Roy et al. proposed the use of cylindrical coordinates [RMC97], and Pollefeys et al. used polar coordinates [PKG99] to reduce the computational cost.
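The effect of rectification on the epipolar constraint can be checked numerically. The sketch below assumes the standard fundamental matrix of an ideally rectified pair (pure horizontal translation between the two views); the names `epipolar_residual` and `F_RECTIFIED` are hypothetical and chosen for this example.

```python
def epipolar_residual(F, x1, x2):
    """Algebraic residual x2^T F x1 for homogeneous image points x1 and x2."""
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))


# Fundamental matrix of an ideally rectified pair (assumed for illustration).
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]
```

With this matrix the residual reduces to the difference of the two y-coordinates, so it vanishes exactly when both points lie on the same scan-line. In practice a small nonzero tolerance would be used, which corresponds to searching a thin band of neighboring scan-lines.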

1.1.6 Dense Stereo Reconstruction

Dense stereo reconstruction is the task of establishing a dense correspondence map between points of different calibrated views and recovering three-dimensional information for each pair of matched points. When only two images are available, the problem is called binocular stereo.

Combining information from several images makes the process more robust and precise, and has led to multi-view stereo methods such as voxel coloring, space carving, and light field methods, which also benefit from known camera parameters. See [SCMS01b, LZWL04, SCD⁺06] for recent reviews of multi-ocular stereo reconstruction methods.

1.2 Objectives

1.2.1 Main Objective

The main objective of this thesis is to develop new algorithms for incremental self-calibrated 3D reconstruction from video streams, in which an estimate of the camera pose and the structure of the scene can be computed online for each captured frame.

However, since even the individual stages of the full self-calibrated reconstruction method are open problems, we have identified more specific objectives that address some drawbacks of state-of-the-art algorithms. In particular, since self-calibrated reconstruction methods depend heavily on finding correct matches between points in images, we decided to address this problem in order to improve the applicability of reconstruction methods to real scenes captured with unstabilized cameras. In addition, for projective reconstruction based on the factorization method, we identify the need for new algorithms that operate under time constraints.

1.2.2 Particular Objectives

• Propose a robust matching algorithm to find corresponding salient points on different images, even when repetitive patterns exist on local areas of the scenes.

• Investigate a collaborative approach between stages of the reconstruction pipeline to improve the robustness of matching algorithms.

• Develop a new projective reconstruction algorithm based on the factorization method with time limit constraints.

1.3 Contributions

In this work new algorithms are proposed to solve specific problems of self-calibrated reconstruction from video. Specifically, we improve on the following issues:

1.3.1 Robust feature matching

A robust feature matching algorithm is proposed [LLAE06a]. A matching metric is introduced to enforce geometric and photometric properties in the matching criterion. Thus, corresponding points are matched within an iterative framework using a local motion descriptor and the similarity between scale invariant region descriptors (SIFT) to avoid mismatch errors between distant points.

1.3.2 Incremental 3D reconstruction by inter-frame selection

A new algorithm is presented for selecting the frames used to recover the camera motion and scene structure under the projective camera model [LLAE06b] in the factorization method. Directly measuring the contribution of each frame to the progressive quality of the reconstructed 3D model reduces the memory requirements and keeps the computational cost approximately constant for every frame of an image sequence.


1.4 Organization of the Thesis

The thesis is structured as follows. This chapter gives a general introduction to the problem of self-calibrated reconstruction and states the objectives and contributions of this work. Chapter 2 reviews the background on Multiple View Geometry. Chapter 3 presents a review of the state of the art in the correspondence problem.

Chapters 4 and 5 introduce the proposed algorithms, describe their properties, and analyze their advantages with respect to state-of-the-art approaches. Chapter 4 presents a novel method for robust feature matching, and Chapter 5 an incremental projective reconstruction algorithm. Chapter 6 reports experimental results and discusses a performance evaluation of the proposed algorithms in the applications of tracking and 3D reconstruction from video. Finally, Chapter 7 discusses the conclusions and possible improvements in this research area.

1.5 Conclusions

In this chapter we have given a brief introduction to the problem of self-calibrated 3D reconstruction from images. The advantages and disadvantages of self-calibrated methods have been explained. Then the objectives and contributions were presented. Finally a general overview of the document was introduced describing the thesis content.


Multiple View Geometry

2.1 Preliminaries

In this chapter some computer vision basics are introduced. For a more thorough description, a book on geometry and 3D vision, for example [TV98, FL01, Atk01, HZ00a, FP02], is recommended.

2.1.1 Homogeneous Coordinates

In homogeneous or projective coordinates, the Euclidean 3D vector (x, y, z)^T is represented by k(x, y, z, 1)^T and the 2D vector (x, y)^T is written as k(x, y, 1)^T, where k ≠ 0 is a real number. In the 2D case, this means that every point is represented by a line through the origin in 3D. Homogeneous points with the last coordinate equal to zero do not have any counterpart in Euclidean space. These points are points at infinity and have an important role in the upgrade from a projective reconstruction to a metric one. The Euclidean point (x/k, y/k)^T is represented in homogeneous coordinates by (x/k, y/k, 1)^T, which is the same as (x, y, k)^T. As k approaches 0, the point goes to infinity in a certain direction.

The limit point (x, y, 0)^T is the vanishing point. Using the homogeneous representation, the 2D line ax + by + c = 0 is written as k(a, b, c)^T. This means that a point k(x, y, 1)^T lies on the line k(a, b, c)^T if and only if (x, y, 1)(a, b, c)^T = 0. The intersection of two 2D lines is computed as the cross product of the two lines.

Lines that are parallel in Euclidean space meet at infinity in projective space.
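The incidence and intersection rules above can be checked numerically. The following is a minimal NumPy sketch; the function names and the example lines are illustrative choices, not taken from the thesis.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 to a Euclidean 2D point: (x, y) -> (x, y, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def intersect(l1, l2):
    """Homogeneous intersection point of two homogeneous lines (a, b, c)."""
    return np.cross(l1, l2)

# Two parallel lines, y = 0 and y = 1, written as (a, b, c) with ax + by + c = 0.
l1 = np.array([0.0, 1.0, 0.0])
l2 = np.array([0.0, 1.0, -1.0])
p_inf = intersect(l1, l2)      # last coordinate is 0: a point at infinity

# Two non-parallel lines, x = 2 and y = 3.
l3 = np.array([1.0, 0.0, -2.0])
l4 = np.array([0.0, 1.0, -3.0])
p = intersect(l3, l4)
p_euclid = p[:2] / p[2]        # back to Euclidean coordinates

# Incidence test: the intersection point lies on both lines.
on_l3 = to_homogeneous(p_euclid) @ l3
on_l4 = to_homogeneous(p_euclid) @ l4
```

The parallel pair yields a point whose last coordinate is zero, which is the homogeneous way of saying that the lines meet at infinity.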

The projective geometry is useful to model the perspective mapping that occurs during the image formation process because using this geometry the perspective transformation of a camera is expressed as a linear operation.

2.2 Camera Models

In this section the projection model that maps real 3D scene points to image points is reviewed.

2.2.1 Perspective model

Let us consider the perspective model that is shown in figure 2.1. Every 3D scene point X(X, Y, Z) is projected on the image plane to a point x(u, v) through the optical center C. The optical axis is a perpendicular line to the image plane passing through the optical center. The center of radial symmetry in the image or principal point, (i.e., the point of intersection of the optical axis and the image plane) is given by O. The distance between C (the optical center) and the image plane is the focal length f . We define the camera coordinate system as follows.

The optical center of the camera is the origin of the coordinate system. The image plane is parallel to the XY plane, held at a distance of f from the origin. Using the basic laws of trigonometry the following relations are derived:

\[ u = f\,\frac{X}{Z}, \qquad v = f\,\frac{Y}{Z} \]

Once expressed in homogeneous coordinates the above relations transform to the following:

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where the relationship ∼ stands for ’equal up to a scale’.

Figure 2.1: Projective Camera Model.

Practically all available digital cameras deviate from the perspective model. First, the principal point (u₀, v₀) does not necessarily lie at the geometrical center of the image. Second, the horizontal and vertical axes (u and v) of the image are not necessarily perpendicular. Let the angle between the two axes be θ. Finally, the pixels are not necessarily perfect squares, and consequently we have f_u and f_v as the two focal lengths, measured in terms of the unit lengths along the u and v directions. By incorporating these deviations in the camera model, the transformation that maps scene points (X, Y, Z) to their image coordinates (u, v) is described as follows:

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \begin{bmatrix} f_u & f_v\cot\theta & u_0 & 0 \\ 0 & \frac{f_v}{\sin\theta} & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

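In practice the 3D point is available in the world coordinate system that is different from the camera coordinate system. The motion between these coordinate systems is given by (R, t):

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \begin{bmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \frac{f_v}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & -Rt \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

\[ P = \begin{bmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \frac{f_v}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & -Rt \end{bmatrix} \]

\[ K = \begin{bmatrix} f_u & f_v\cot\theta & u_0 \\ 0 & \frac{f_v}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \]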

The 3 × 4 matrix P that projects a 3D scene point X to the corresponding image point x is called the projection matrix. The 3 × 3 matrix K that contains the internal parameters (u₀, v₀, θ, f_u, f_v) is generally referred to as the intrinsic matrix of a camera.
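As an illustration, the projection matrix P = K [R | −Rt] can be assembled and applied as in the following NumPy sketch. The function names and the numerical values (focal lengths, principal point, scene point) are assumptions made for the example only.

```python
import numpy as np

def intrinsic_matrix(f_u, f_v, theta, u0, v0):
    """Intrinsic matrix K with the entries used in this section."""
    return np.array([
        [f_u, f_v / np.tan(theta), u0],
        [0.0, f_v / np.sin(theta), v0],
        [0.0, 0.0, 1.0],
    ])

def projection_matrix(K, R, t):
    """P = K [R | -R t]; with this form t is the camera center in world coordinates."""
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return K @ np.hstack([R, -R @ t])

def project(P, X):
    """Project a Euclidean 3D point and return Euclidean image coordinates (u, v)."""
    x = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return x[:2] / x[2]

# Example values (assumed): square pixels, perpendicular axes, camera at the origin.
K = intrinsic_matrix(f_u=800.0, f_v=800.0, theta=np.pi / 2, u0=320.0, v0=240.0)
P = projection_matrix(K, np.eye(3), np.zeros(3))
uv = project(P, [0.5, -0.25, 2.0])

# A camera placed at t projects its own center to the zero vector.
P_shifted = projection_matrix(K, np.eye(3), [1.0, 2.0, 3.0])
center_image = P_shifted @ np.array([1.0, 2.0, 3.0, 1.0])
```

With θ = π/2 the skew term f_v cot θ vanishes up to rounding, so the model reduces to the ideal perspective camera with a shifted principal point.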

In back-projection, given an image point x, the goal is to find the set of 3D points that project to it. The back-projection of an image point is a ray in space. We can compute this ray by identifying two points on it. The first point can be the optical center C, since it lies on this ray. Since PC = 0, C is the right null vector of P. The second point is P⁺x, where P⁺ is the pseudoinverse¹ of P; it lies on the back-projected ray because it projects to the point x on the image.

Thus, the back-projection of x can be computed as follows.

(2.2.1) X(λ) = P⁺x + λC

Varying the parameter λ yields the different points on the back-projected ray.
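Equation 2.2.1 can be checked numerically: every point X(λ) should reproject to the same image point. The sketch below assumes NumPy; the camera matrix and the image point are made-up example values.

```python
import numpy as np

def camera_center(P):
    """Right null vector of P, i.e. the homogeneous 4-vector C with P C = 0."""
    _, _, Vt = np.linalg.svd(P)
    return Vt[-1]

def back_project(P, x, lam):
    """Point X(lam) = P^+ x + lam * C on the ray through the image point x."""
    return np.linalg.pinv(P) @ x + lam * camera_center(P)

# Example camera (assumed values) and a homogeneous image point.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.5]])])
x = np.array([400.0, 260.0, 1.0])

# Reproject a few points of the ray; each row should equal x.
reprojections = np.array([P @ back_project(P, x, lam) for lam in (0.0, 1.0, -2.5)])
```

Because P has full row rank, P P⁺ is the identity and P C = 0, so the reprojection equals x for every λ.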

¹ The pseudoinverse A⁺ of a matrix A is a generalization of the inverse and it exists for a general m × n matrix. If m > n and A has full rank (n), then A⁺ = (A^TA)⁻¹A^T. If m < n and A has full rank (m), as for a 3 × 4 projection matrix, then A⁺ = A^T(AA^T)⁻¹.


2.2.2 Orthographic Model

Figure 2.2 shows the orthographic camera model. This is an affine camera model, that is, one whose projection matrix P has a last row of the form (0, 0, 0, 1). In particular, the orthographic camera model has a projection matrix P of the following form:

\[ P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & t \\ \mathbf{0}^T & 1 \end{bmatrix} \]





Figure 2.2: Orthographic Camera Model.

The projection of a 3D point X into the image point x is given below:

x = PX

Similar to the perspective camera, the back-projected ray is obtained as:

X(λ) = P⁺x + λC

However, the optical center C, which is the right null space of P, is a point at infinity in an orthographic camera.

Under the orthographic projection model, the projection (u, v) of the p-th point X = (X, Y, Z)^T in 3D space onto image frame f is given by the following expression:

(2.2.2) u = X, v = Y

2.2.3 Lens Distortion

The linear projection equations do not take into account the lens shape, which affects the projection in a non-linear way. The lens shape causes a radial lens distortion, (δ_rx, δ_ry). Let (u_x, v_y)^T be the projected point without lens distortion while (u, v)^T represents the observed coordinates, and let r = √((u − u₀)² + (v − v₀)²) be the radial distance from the principal point in the projected image, where (u₀, v₀) shift the center of the image to (0, 0). The radial lens distortion [Atk01] may then be approximated by the series

(2.2.3) δ_rx = u(K₁r³ + K₂r⁵ + ...), δ_ry = v(K₁r³ + K₂r⁵ + ...)

The effects of radial distortion are shown in figure (2.3).

Figure 2.3: The effect of radial distortion.

2.3 Multiple View Constraints

In this section, we examine the relations that arise when a single scene is imaged by two or more cameras. By analyzing these relationships, the location of an image point can be constrained to lie in a restricted region of the image. In addition, 3D reconstruction is solvable when a minimum of five or four real scene points are observed from two or three camera positions, respectively, with varying viewpoint.

2.3.1 Two view Geometry

Figure 2.4 shows the inherent geometric constraints of two projective cameras imaging the same scene point X. Two image points x and x' are in correspondence when they are images of the same world point.

Figure 2.4: Two view geometry constraints modeled by the fundamental matrix F. The two camera centers are indicated by C and C'. The camera centers, a 3D-space point X, and its images x and x' lie in a common plane Π. The ray defined by the first camera center, C, and the point X is imaged as a line l'. The image x' of the 3D-space point X that projects to x must lie on l'.

X is a common 3D world point, x its image in the first view, x' its image in the second view, and C and C' are the two camera centers. The line segment that connects the camera centers is called the baseline. The points X, C, and C' define a plane, called the epipolar plane Π. l and l' are the epipolar lines of the two projections of X. The projections of each camera center onto the other image, e and e', are called epipoles. The relation among all these elements forms the epipolar constraint.

The fundamental matrix F is a 3 × 3 singular matrix describing the relation between two different images of the same scene. For corresponding points in two images the following equation holds: x'^T F x = 0, where x is a point in the first image and x' is the corresponding point in the second image.

Assume that the 3D point X is projected to the point x in the first image of a stereo pair. Then l' = Fx defines a line in the second image. This line is called the epipolar line, and it is the projection in the second camera of the line going through the first camera center and the 3D point X. If the point X is visible in the second camera, its image, x', must lie on the epipolar line. In homogeneous coordinates a point x' lies on a line Fx if and only if x'^T F x = 0. This means that if we know the fundamental matrix, i.e. the epipolar geometry, for an image pair, stereo matching becomes much easier: to find correspondences between two images, the search is restricted to the epipolar line.
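The epipolar constraint can be verified on synthetic data. The sketch below builds F from two known projection matrices using the standard construction F = [e']_× P' P⁺ with e' = P'C, which is not stated in this section and is assumed here; the cameras and the scene point are example values.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that [v]_x w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(P1, P2):
    """F = [e']_x P2 P1^+, where e' = P2 C is the epipole in the second image."""
    _, _, Vt = np.linalg.svd(P1)
    C = Vt[-1]                      # center of the first camera, P1 C = 0
    e2 = P2 @ C
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

# Two example cameras: the second is rotated about the Y axis and translated.
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])
F = fundamental_from_cameras(P1, P2)

X = np.array([0.3, -0.2, 4.0, 1.0])   # homogeneous scene point
x1, x2 = P1 @ X, P2 @ X               # its two images
epipolar_line = F @ x1                # line l' = F x in the second image
residual = x2 @ epipolar_line         # x'^T F x, zero up to rounding
```

The residual vanishes for any scene point, while a wrongly matched x' would in general give a non-zero value, which is what makes the constraint useful for pruning candidate matches.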

2.3.2 Fundamental Matrix estimation

The fundamental matrix can be recovered from only seven correspondences [FLM92, BS03] by means of non-linear methods. However, if eight correspondences are known, linear algorithms exist to solve the problem; one of them is the eight-point algorithm proposed by Hartley in [Har92, Har95].

Given two corresponding points x = (u, v, 1)^T and x' = (u', v', 1)^T expressed in homogeneous coordinates, each matched pair gives rise to one linear equation in the unknown entries of F:

(2.3.1) u'u f₁₁ + u'v f₁₂ + u' f₁₃ + v'u f₂₁ + v'v f₂₂ + v' f₂₃ + u f₃₁ + v f₃₂ + f₃₃ = 0

From a set of n point correspondences, we obtain a set of linear equations of the form:

(2.3.2) \[ A f = \begin{bmatrix} u'_1 u_1 & u'_1 v_1 & u'_1 & v'_1 u_1 & v'_1 v_1 & v'_1 & u_1 & v_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ u'_n u_n & u'_n v_n & u'_n & v'_n u_n & v'_n v_n & v'_n & u_n & v_n & 1 \end{bmatrix} f = 0, \]

where f is a 9-vector containing the entries of the matrix F. The least-squares solution for f is the right singular vector of A corresponding to its smallest singular value (equivalently, the eigenvector of A^TA with the smallest eigenvalue), that is, the last column of V in the SVD A = UDV^T. This gives the unconstrained fundamental matrix F̂, since the rank 2 constraint has not been enforced.

To obtain the correct rank 2 fundamental matrix, let F̂ = UDV^T be the SVD of the unconstrained estimate, with D = diag(d₁, d₂, d₃) and d₁ ≥ d₂ ≥ d₃. The rank 2 fundamental matrix F is then given by

(2.3.3) F = U diag(d₁, d₂, 0) V^T.

In general, algebraic computations are unstable when using real image coordinate measurements due to large numerical variations. Hence normalization of the input data is required. Hartley [Har95] proposed to normalize the measurement data by centering it at the origin and scaling it so that the mean distance of the measurements from the origin is √2. The image coordinates are transformed according to x̂ = Tx and x̂' = T'x', where T and T' are normalizing transformations consisting of translation and scaling:

(2.3.4) \[ T = \begin{bmatrix} 1/\sigma_x & 0 & -\mu_x/\sigma_x \\ 0 & 1/\sigma_y & -\mu_y/\sigma_y \\ 0 & 0 & 1 \end{bmatrix}, \]

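where the means µ_x, µ_y and the standard deviations σ_x, σ_y are given by:

\[ \mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)^2} \]

\[ \mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_y)^2} \]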

Then, after the fundamental matrix estimation, the matrix F̃ obtained from the normalized coordinates must be de-normalized by F = T'^T F̃ T.
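The steps of this subsection (normalization, linear solution by SVD, rank 2 enforcement, de-normalization) can be put together as in the following NumPy sketch. It is an illustration on synthetic, noise-free data; the function names, cameras and scene points are assumptions, and the normalization follows equation 2.3.4 as written in the text.

```python
import numpy as np

def normalizing_transform(pts):
    """T of equation 2.3.4: shift by the mean and scale by 1/std on each axis."""
    mu = pts.mean(axis=0)
    sigma = pts.std(axis=0)          # population standard deviation (1/n)
    return np.array([[1.0 / sigma[0], 0.0, -mu[0] / sigma[0]],
                     [0.0, 1.0 / sigma[1], -mu[1] / sigma[1]],
                     [0.0, 0.0, 1.0]])

def eight_point(pts1, pts2):
    """Estimate F from n >= 8 correspondences given as (n, 2) arrays."""
    T1, T2 = normalizing_transform(pts1), normalizing_transform(pts2)
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones]) @ T1.T
    x2 = np.hstack([pts2, ones]) @ T2.T
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    # One row per correspondence, in the column order of equation 2.3.2.
    A = np.column_stack([up * u, up * v, up, vp * u, vp * v, vp,
                         u, v, np.ones_like(u)])
    _, _, Vt = np.linalg.svd(A)
    F_unconstrained = Vt[-1].reshape(3, 3)
    U, d, Vt = np.linalg.svd(F_unconstrained)
    F_rank2 = U @ np.diag([d[0], d[1], 0.0]) @ Vt   # equation 2.3.3
    return T2.T @ F_rank2 @ T1                      # undo the normalization

def project(P, X):
    """Project (n, 4) homogeneous scene points to (n, 2) image points."""
    x = X @ P.T
    return x[:, :2] / x[:, 2:]

# Synthetic scene and cameras (assumed values).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20),
                     rng.uniform(4, 8, 20), np.ones(20)])
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])

pts1, pts2 = project(P1, X), project(P2, X)
F = eight_point(pts1, pts2)
h1 = np.hstack([pts1, np.ones((20, 1))])
h2 = np.hstack([pts2, np.ones((20, 1))])
residuals = np.einsum('ni,ij,nj->n', h2, F, h1)     # x'^T F x for every match
```

With noisy or mismatched correspondences the residuals no longer vanish, and the linear estimate is usually embedded in a robust scheme.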


2.3.3 Planar Homography

The planar homography is a non-singular linear transformation that maps points between two different planes. The homography between two views plays an important role in the geometry of multiple views [TV98, HZ00b].

When a planar object is imaged from multiple viewpoints, or when a scene is imaged by cameras having the same optical center, the images are related by a unique homography. For a scene plane Π = [v^T, 1], where v is a 3-vector, the ray corresponding to a point x in the first image meets the plane at a point X_Π, which projects to x' in the other image (see figure 2.5).

Given the projection matrices P = [I | 0] and P' = [A | a] for the two views, the homography induced by the plane is given by (assuming Π₄ = 1, since the plane does not pass through the center of the first camera [HZ00a]):

(2.3.5) x' = Hx with H = A − av^T

Figure 2.5: A planar homography H maps a point x from the plane Π to a point x' in the plane Π'.

If the cameras have different intrinsic matrices K' and K, respectively, the homography due to the plane is given by [HZ00a]:

(2.3.6) H = K'(A − av^T)K⁻¹
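Equation 2.3.5 can be checked with a point on a known plane. In the NumPy sketch below the second camera [A | a], the plane Z = 5 (so v = (0, 0, −0.2)^T and v^T X + 1 = 0 on the plane) and the scene point are example values chosen for the illustration.

```python
import numpy as np

def plane_homography(A, a, v):
    """H = A - a v^T for cameras P = [I | 0], P' = [A | a] and plane (v^T, 1)."""
    return A - np.outer(a, v)

# Second camera: rotation about the Y axis plus a translation (assumed values).
c, s = np.cos(0.1), np.sin(0.1)
A = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
a = np.array([-1.0, 0.0, 0.2])

# Plane Z = 5 written as v^T X + 1 = 0.
v = np.array([0.0, 0.0, -0.2])
H = plane_homography(A, a, v)

X = np.array([0.7, -0.4, 5.0])    # a scene point on the plane
x = X                             # image in the first view, since P = [I | 0]
x_prime = A @ X + a               # image in the second view
x_mapped = H @ x                  # transfer through the homography
```

Both homogeneous vectors agree, so the homography transfers the image of any point of the plane from the first view to the second.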

2.3.4 Homography estimation

A homography H can be used to transfer feature points on a plane from one view to the other. A point x on the plane can be transferred to its image x' in the other view using:

(2.3.7) x' = Hx

where H is a 3 × 3 matrix known up to a scale factor, and hence has only 8 degrees of freedom. H can be estimated by a linear algorithm given a set of four point correspondences (x_i, x'_i), as follows.

Writing x_i = (u_i, v_i, 1)^T and x'_i = (u'_i, v'_i, 1)^T, expanding equation 2.3.7 for a given point correspondence, and normalizing with respect to the homogeneous component yields

(2.3.8) \[ u'_i = \frac{h_1 u_i + h_2 v_i + h_3}{h_7 u_i + h_8 v_i + h_9} \quad \text{and} \quad v'_i = \frac{h_4 u_i + h_5 v_i + h_6}{h_7 u_i + h_8 v_i + h_9} \]

Multiplying by the denominator and rearranging leads to two equations that are linear in the elements of the homography H, i.e.

(2.3.9) \[ A h = \begin{bmatrix} u_1 & v_1 & 1 & 0 & 0 & 0 & -u'_1 u_1 & -u'_1 v_1 & -u'_1 \\ 0 & 0 & 0 & u_1 & v_1 & 1 & -v'_1 u_1 & -v'_1 v_1 & -v'_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ u_n & v_n & 1 & 0 & 0 & 0 & -u'_n u_n & -u'_n v_n & -u'_n \\ 0 & 0 & 0 & u_n & v_n & 1 & -v'_n u_n & -v'_n v_n & -v'_n \end{bmatrix} h = 0, \]

where h = (h₁, ..., h₉)^T. Hence, one point correspondence yields two equations, and at least four point correspondences are required to obtain the rank-deficient 8 × 9 matrix whose null vector is h [HZ00a]. When more than four point correspondences are known, a least-squares estimate can solve for the unknown parameters h_i.
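The linear estimate can be sketched in a few lines of NumPy. Each correspondence contributes the two rows obtained by clearing the denominators in equation 2.3.8, and h is taken as the right singular vector with the smallest singular value. The function name, the test homography and the points are illustrative, and the final division assumes h₉ ≠ 0.

```python
import numpy as np

def estimate_homography(src, dst):
    """Linear estimate of H from n >= 4 correspondences src[i] -> dst[i]."""
    rows = []
    for (u, v), (up, vp) in zip(src, dst):
        rows.append([u, v, 1.0, 0.0, 0.0, 0.0, -up * u, -up * v, -up])
        rows.append([0.0, 0.0, 0.0, u, v, 1.0, -vp * u, -vp * v, -vp])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)       # null vector (or least-squares solution)
    return H / H[2, 2]             # fix the free scale, assuming h9 != 0

def apply_homography(H, pts):
    """Map (n, 2) points through H and return (n, 2) points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:]

# Ground-truth homography and five points in general position (assumed values).
H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.05, 0.9, -0.3],
                   [0.1, 0.2, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.2]])
dst = apply_homography(H_true, src)
H_est = estimate_homography(src, dst)
```

With image coordinates in pixels, the same data normalization used for the fundamental matrix is commonly applied before building A.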

Number of Measurements

The matrix H contains 9 entries, but it is defined only up to scale. Thus, the total number of degrees of freedom in a 2D projective transformation is 8.

Each 2D point or line correspondence between views generates two constraints on H by Equation 2.3.8, and hence the correspondence of four points or four lines is sufficient to compute H. For a planar affine transformation with 6 degrees of freedom, only three corresponding points or lines are required, and so on.

A conic equation provides five constraints on a 2D homography. Hence two matching conics are sufficient to recover the homography.

In practice, the salient points, lines, and conics detected in the image may be too noisy to give a good solution from the minimum number of features. A larger number of features is therefore used to obtain a robust solution [HZ00a].

2.3.5 Projective Reconstruction

There are three principal approaches to recover the structure and motion of a scene from images up to a projective transformation: 1) Epipolar Geometry based methods that merge partial results; 2) Factorization methods where all the correspondences are treated simultaneously to get an estimate of the camera pose and 3) Robust non linear methods.

Merging Projective matrices using Epipolar Geometry

The epipoles contain information about the extrinsic camera parameters, namely the position of the camera center C and the orientation of the optical axis. However, this information cannot be directly retrieved.

First, two images are selected and an initial reconstruction frame is set up. Then, the pose of the camera for the other views is determined in this frame, and each time the initial reconstruction is refined and extended. In this way, the pose estimation of views that have no features in common with the reference views also becomes possible.

Let P and P' be the projection matrices of the first and second views, respectively. We choose a specific canonical form for the camera matrices, in which the first camera is:

(2.3.10) P = [I_3×3 | 0]

Note that it is always possible to make a set of camera matrices canonical by applying a projective transformation that is obtained as follows: augment the first matrix P by an additional row to make it a 4 × 4 non-singular matrix P̃. Then apply the homography H ∼ P̃ to all the cameras and world points as

(2.3.11) P^ic ∼ PⁱH⁻¹, X_j^c ∼ HX_j

where Pⁱ is the i-th camera and X_j the j-th world point (here the superscript c denotes the transformed entities; note also that at this point we do not yet have world points, nor need them). Observe that the set of cameras is still not unique: we have a four-parameter choice for the last row of P̃ (for finite cameras we can use (0, 0, 0, 1)^T). Such a canonical pair of cameras has the form P ∼ [I_3×3 | 0] and P' ∼ [M | m].

As seen above, there is a four-parameter choice in the set of canonical cameras. Without further proof, the general four-parameter formula for a pair of canonical camera matrices corresponding to a fundamental matrix F is given by

(2.3.12) P ∼ [I_3×3 | 0], P' ∼ [[e']_× F + e'v^T | λe']

where e' is the epipole in the second image, v is any 3-vector and λ a non-zero scalar; together, v and λ encode the four unknown parameters.
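The canonical pair can be built directly from F, as in the NumPy sketch below. It takes e' as the left null vector of F (F^T e' = 0) and checks that the fundamental matrix of the resulting pair, [m]_× M for P' = [M | m], is proportional to the input. The function names, the example F and the values of v and λ are assumptions of the illustration.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that [v]_x w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def canonical_cameras(F, v=None, lam=1.0):
    """P = [I | 0] and P' = [[e']_x F + e' v^T | lam e'] for a rank 2 matrix F."""
    v = np.zeros(3) if v is None else np.asarray(v, dtype=float)
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]                               # epipole e', F^T e' = 0
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    M = skew(e2) @ F + np.outer(e2, v)
    P2 = np.hstack([M, lam * e2.reshape(3, 1)])
    return P1, P2

# Example rank 2 matrix built from a rotation and a translation (assumed values).
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
F_true = skew(np.array([-1.0, 0.0, 0.2])) @ R

P1, P2 = canonical_cameras(F_true, v=[0.3, -0.1, 0.2], lam=2.0)

# Fundamental matrix of the canonical pair and its cosine similarity with F_true.
F_back = skew(P2[:, 3]) @ P2[:, :3]
cosine = np.sum(F_back * F_true) / (np.linalg.norm(F_back) * np.linalg.norm(F_true))

# Any point projected by the pair satisfies the epipolar constraint of F_true.
X = np.array([0.3, -0.2, 4.0, 1.0])
residual = (P2 @ X) @ F_true @ (P1 @ X)
```

The cosine is ±1 whatever v and λ are, which illustrates that the four parameters do not change the epipolar geometry encoded by the pair.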

The Factorization Method

The factorization method described in [MHOP01] was first proposed for the orthographic camera model. The original method assumes that n feature points are observed by m orthographic cameras. Then, by stacking the n corresponding points of the m frames a registered measurement matrix W with dimensions 3m × n is formed, as follows:

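(2.3.13) \[ W = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix} \]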

where X_j = (x_j, y_j, z_j, 1)^T (j = 1, ..., n) are the unknown homogeneous 3D point vectors, P_i (i = 1, ..., m) are the unknown 3 × 4 projection matrices associated with camera i, and x_ij = (u_ij, v_ij, 1)^T are the measured homogeneous image point vectors.

Then, using Singular Value Decomposition (SVD), two matrices are computed, which represent object shape and camera motion respectively.

Mahamud et al. [MHOP01] proposed a bilinear iterative algorithm that adds new constraints to the error function minimized by the Sturm-Triggs method [ST01]. In the method of Mahamud et al., the initial projective depth values are obtained using the Kanade orthographic method [TK92b], which avoids the need to estimate the projective depths (projective scale factors that represent the depth information lost during image projection) from epipolar geometry. They showed that their minimization algorithm is guaranteed to converge to a local minimum of its error function. Their implementation results show that the method converges in fewer than 20 iterations and yields errors comparable to those of the Sturm-Triggs method [ST01].


The full original iterative projective factorization algorithm [MHOP01] is then described as follows:

1. Compute the current scaled measurement matrix W by equation (1);

2. Normalize W by subtracting the mean of each frame from every point;

3. Perform the rank-3 factorization of W by SVD, W = USV^T, to generate an estimate of the projective matrix P and the shape matrix X: P = U₃ and X = S₃V₃^T, where U₃, S₃ and V₃ are the sub-matrices obtained from U, S and V using only the first 3 columns (the ones associated with the 3 largest singular values), and S is a diagonal matrix whose elements σ are the singular values of W.

Algorithm 1 - Original projective factorization algorithm.
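Steps 2 and 3 can be illustrated with the following NumPy sketch. It is a simplified stand-in rather than the algorithm of [MHOP01]: it uses the 2m × n orthographic measurement matrix (one u row and one v row per frame) instead of the scaled 3m × n matrix of equation 2.3.13, and it omits the iteration over projective depths. The synthetic cameras and points are assumed values.

```python
import numpy as np

def rank3_factorization(W):
    """Factor a 2m x n measurement matrix into motion (2m x 3) and shape (3 x n)."""
    # Step 2: subtract from every row the mean over all points.
    W_centered = W - W.mean(axis=1, keepdims=True)
    # Step 3: keep the three largest singular values of the SVD.
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    motion = U[:, :3]                       # P = U_3
    shape = np.diag(s[:3]) @ Vt[:3]         # X = S_3 V_3^T
    return motion, shape, W_centered

# Synthetic orthographic sequence: 12 points seen in 4 frames.
rng = np.random.default_rng(1)
n_points, n_frames = 12, 4
points = rng.normal(size=(3, n_points))
frames = []
for _ in range(n_frames):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthonormal axes
    t = rng.normal(size=(2, 1))                    # image-plane translation
    frames.append(Q[:2] @ points + t)              # u and v rows of this frame
W = np.vstack(frames)                              # 2m x n measurement matrix

motion, shape, W_centered = rank3_factorization(W)
W_rebuilt = motion @ shape
```

For noise-free orthographic data the centered matrix has rank 3, so the product of the two factors reproduces it exactly; the factors themselves are only determined up to an invertible 3 × 3 transformation.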

Non-linear Bundle Adjustment

Ideally, to solve the structure from motion problem the mean-squared distance between the observed image points and the points positions predicted from the parameters λ_ij, P_i and X_j should be minimized, i.e.:

(2.3.14) \[ E = \min \sum_{i,j} \left\| x_{ij} - \frac{1}{\lambda_{ij}} P_i X_j \right\|^2 . \]

However, the corresponding problem is difficult since the error is highly non-linear in the unknowns λ_ij, P_i, and X_j.
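The cost inside equation 2.3.14 is straightforward to evaluate for given estimates, as the NumPy sketch below shows; the minimization itself is not attempted. It assumes that the image points are normalized to have last coordinate 1, so that λ_ij can be read off as the third coordinate of P_i X_j. The array layout, function name and data are illustrative.

```python
import numpy as np

def reprojection_error(x, P, X):
    """Sum over i, j of || x_ij - P_i X_j / lambda_ij ||^2.

    x: (m, n, 3) homogeneous image points with last coordinate 1,
    P: (m, 3, 4) projection matrices,
    X: (n, 4) homogeneous scene points.
    """
    proj = np.einsum('mij,nj->mni', P, X)    # P_i X_j for all i, j
    lam = proj[:, :, 2:3]                    # assumed depth lambda_ij
    return float(np.sum((x - proj / lam) ** 2))

# Two cameras and three scene points (assumed values).
P = np.stack([np.hstack([np.eye(3), np.zeros((3, 1))]),
              np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])])
X = np.array([[0.0, 0.0, 4.0, 1.0],
              [1.0, -1.0, 5.0, 1.0],
              [-0.5, 0.5, 6.0, 1.0]])

proj = np.einsum('mij,nj->mni', P, X)
x_exact = proj / proj[:, :, 2:3]             # perfect measurements
x_noisy = x_exact.copy()
x_noisy[0, 0, 0] += 1.0                      # shift one measurement by one unit

error_exact = reprojection_error(x_exact, P, X)
error_noisy = reprojection_error(x_noisy, P, X)
```

A non-linear optimizer would adjust P_i and X_j to reduce this value, starting from a linear estimate.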

As with linear versus non-linear calibration algorithms, the main disadvantage of non-linear projective reconstruction methods is that they are iterative and need an initial solution close enough to the real solution to avoid local minima. This is why linear methods remain useful: they provide the initial solution.


2.3.6 Incremental Projective Reconstruction

Time-critical applications can accept sub-optimal reconstruction estimates obtained by incremental approaches. There are two categories of incremental techniques. The first is based on probabilistic methods such as extended Kalman filter theory [BCC90, MRM94, SPFP96, Dav05, DRMS07], which is able to model the non-linearity between structure and motion estimates. Other probabilistic approaches use the particle filter [GTS⁺07, KRD07, TM06].

The second category relies on the subdivision of a video stream into sub-sequences and is based on the concatenation of successive views related by the epipolar geometry. This problem has been investigated by Repko and Pollefeys in [MP05, PGV⁺02]. They count the number of feature points tracked in successive pairs of images and analyze the reprojection errors for pairs and triplets of views under epipolar geometry and homographies, computing the Geometric Robust Information Criterion (GRIC) proposed in [TFZ98]. Their keyframe criterion selects two or three views where the GRIC score of the epipolar model is lower than the score of the homography model. A similar idea was presented in [GCH⁺02].

Recently Martinec and Pajdla [MP05, MP06] have proposed incremental methods using triplets of images that can cope with missing data.

2.4 3D Scene Reconstruction

2.4.1 Camera Calibration

Geometric camera calibration is a necessary step for recovering the 3D position of a scene point when only its projections in two images taken from different viewpoints are known. Each projection defines a ray in space; the intersection of both rays is the 3D point location.

By calibration we mean the determination of the intrinsic matrix (K) and the external pose parameters (R, t) of the image formation model. Sometimes radial or other distortions are also modeled using additional parameters. The computed geometric model relates the 3D coordinates of a scene point to the 2D coordinates of its projection in the image.

2.4.2 Triangulation

Once camera calibration is known, it is possible to compute the 3D positions of the image points observed by multiple cameras through a process called triangulation.

Figure 2.6: Calibrated Reconstruction by Triangulation.

Triangulation in its simplest form is illustrated in Fig. 2.6: the direction towards a target position X in space is determined from two different locations.

Given a three-dimensional point X, the first step consists in finding the points x_l and x_r projected in the left and right image planes I_l and I_r respectively. Then, point X lies on line L_l joining x_l and the left optical center C_L and, similarly, we know that X lies along a line L_r joining x_r and C_R. Assuming that the camera parameters (intrinsic and extrinsic) are known, the parameters of L_l and L_r can be explicitly computed. Therefore, the point X is at the intersection of the two lines. This procedure is called triangulation.

Knowing the projection equations for each camera view, x_l = PX and x_r = P'X, these equations can be combined into the form AX = 0, which is linear in X and can be solved for the three-dimensional location of the point X.

Thus the triangulation process can be used to reconstruct a scene from point correspondences, but only after the set of camera views have been calibrated.
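The linear system AX = 0 can be assembled and solved as in the following NumPy sketch. The text does not spell out the rows of A; the sketch assumes one common construction, in which each view contributes the two rows u p₃ − p₁ and v p₃ − p₂, where p_k denotes the k-th row of the projection matrix. The cameras and the scene point are example values.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from its images x1 = (u, v), x2 = (u', v')."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null vector of A (least squares with noise)
    return X[:3] / X[3]

def project(P, X):
    """Project a Euclidean 3D point to Euclidean image coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Calibrated pair of views (assumed values).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])

X_true = np.array([0.3, -0.2, 4.0])
X_est = triangulate(P_left, P_right,
                    project(P_left, X_true), project(P_right, X_true))
```

With noisy image points the two rays no longer intersect, and the singular vector gives the algebraic least-squares compromise.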

Finding the camera matrices P and P' of a set of camera views in a shared coordinate system amounts to reconstructing the camera views. If only a projective reconstruction of the camera views can be determined, then only a projective reconstruction of the scene can be created (see Section 2.3.5).

2.4.3 Survey of Camera Calibration

Geometric calibration methods can be classified into two main groups depending on the nature of the information used:

Photogrammetric calibration

Photogrammetric methods use a calibration pattern with known geometry to calibrate the cameras. The input to the calibration algorithm is the set of 3D points of the pattern and their corresponding 2D projections. To recover the 11 parameters that define the camera projection matrix P, a cost criterion is optimized. Linear methods for camera calibration known as DLT (Direct Linear Transform) were the first to appear [AAK71], formulating the calibration problem as the solution of a system of linear equations (see [HZ00a] or [FL01] for a detailed description and theory).

Figure 2.7: Examples of calibration patterns.
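The linear formulation of calibration mentioned above can be sketched as follows. Each 3D to 2D correspondence gives two homogeneous linear equations in the 12 entries of P, so six or more non-coplanar points determine P up to scale, which leaves the 11 parameters. This NumPy sketch is a generic DLT illustration under assumed data, not the specific procedure of [AAK71]; image coordinates are kept at unit scale to avoid the need for data normalization.

```python
import numpy as np

def dlt_calibrate(X, x):
    """Estimate the 3x4 matrix P (up to scale) from n >= 6 pairs X[i] -> x[i]."""
    rows = []
    for Xw, (u, v) in zip(X, x):
        Xh = np.append(Xw, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)

def project(P, X):
    """Project (n, 3) Euclidean points to (n, 2) image points."""
    x = np.hstack([X, np.ones((len(X), 1))]) @ P.T
    return x[:, :2] / x[:, 2:]

# Ground-truth camera in normalized image units (assumed values).
K = np.array([[1.5, 0.0, 0.1], [0.0, 1.5, -0.05], [0.0, 0.0, 1.0]])
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
P_true = K @ np.hstack([R, np.array([[0.1], [-0.2], [5.0]])])

# Non-coplanar pattern points and their images.
rng = np.random.default_rng(2)
X_pattern = rng.uniform(-1.0, 1.0, size=(10, 3))
x_image = project(P_true, X_pattern)

P_est = dlt_calibrate(X_pattern, x_image)
x_reprojected = project(P_est, X_pattern)
```

The estimated matrix equals the true one only up to a scale factor, so the check is made on the reprojected points rather than on the matrix entries.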