Efficient Model-based 3D Tracking by Using Direct Image Registration


UNIVERSIDAD POLITÉCNICA DE MADRID
FACULTAD DE INFORMÁTICA

DOCTORAL THESIS

Efficient Model-based 3D Tracking by Using Direct Image Registration

Submitted to the Facultad de Informática of the Universidad Politécnica de Madrid for the degree of Doctor in Computer Science (Doctor en Informática).

Author: Enrique Muñoz Corral
Advisor: Luis Baumela Molina

Madrid, 2012.


Acknowledgements

The truth is that the ten years (ten!) it has taken me to write this thesis leave room for a great many things, and if I had to thank everyone who has helped me, I would need a whole chapter.

First of all I would like to thank Luis Baumela, a great thesis advisor and an even better person, for awakening in me the itch for research and, above all, for having enough patience to put up with my stubbornness. Luis, if it were not for you, I would not have joined the University and I would be working in the private sector earning a small fortune (yeah, thank you so much!).

A thousand thanks to Javier de Lope for tireless discussions, technical and not so technical, and above all to José Miguel Buenaposada, who throughout all these years has put up with me, helped me, irritated me, joked with me, and even found me a job.

I cannot forget the good times at lunch with the "girls" from statistics (Maribel, Arminda, Concha and Juan Antonio), who endured my endless rants about the housing bubble and our national politicians. A thought also goes to all the colleagues who have passed through laboratory L-3202 over these years: "Javi's kids" (Javi, Juan, Bea and Yadira), Juan Bekios, the two "Pablos" (Márquez and Herrero), Antonio and Rubén.

I would also like to thank Lourdes Agapito for letting me take part in the project Automated facial expression analysis using computer vision, funded by the Royal Society of the United Kingdom. Thanks to this project I had the privilege of working with Lourdes and with Xavier Lladó, and above all of meeting that singular character called Alessio del Bue. I have no words to thank Alessio for being such a nice guy and for stoically putting up with all the times we have scrounged off him.

Nor can I forget the help given by Professor Thomas Vetter and his group at the University of Basel (especially Brian Amberg and Pascal Paysan); they took the trouble to build a three-dimensional model of my face, including deformations and expressions.

I would not like to close these acknowledgements without mentioning that part of the work in this thesis was carried out under projects TIC2002-00591 of the Ministerio de Ciencia y Tecnología and TIN2008-06815-C02-02 of the Ministerio de Ciencia e Innovación.

And last, but not least, I thank Susana for the patience she has had during all these years (and there have been many) in which I have been tied up with the thesis. This one is for you, Susana!

January 2012.


Contents

Resumen
Summary
Notations

1 Introduction
  1.1 Motivation
  1.2 Applications
  1.3 Contributions of the Thesis

2 Literature Review
  2.1 Image Registration vs. Tracking
  2.2 Image Registration
  2.3 Model-based 3D Tracking
    2.3.1 Modelling assumptions
    2.3.2 Rigid Objects
    2.3.3 Nonrigid Objects
    2.3.4 Facial Motion Capture

3 Efficient Direct Image Registration
  3.1 Introduction
  3.2 Modelling Assumptions
    3.2.1 Imaging Geometry
    3.2.2 Brightness Constancy Constraint
    3.2.3 Image Registration by Optimization
    3.2.4 Additive vs. Compositional
  3.3 Additive approaches
    3.3.1 Lucas-Kanade Algorithm
    3.3.2 Hager-Belhumeur Factorization Algorithm
  3.4 Compositional approaches
    3.4.1 Forward Compositional Algorithm
    3.4.2 Inverse Compositional Algorithm
  3.5 Other Methods
  3.6 Summary

4 Equivalence of Gradients
  4.1 Image Gradients
    4.1.1 Image Gradients in R2
    4.1.2 Image Gradients in P2
    4.1.3 Image Gradients in R3
  4.2 The Gradient Equivalence Equation
    4.2.1 Relevance of the Gradient Equivalence Equation
    4.2.2 General Approach to Gradient Replacement
  4.3 Summary

5 Additive Algorithms
  5.1 Gradient Replacement Requirements
  5.2 Systematic Factorization
  5.3 3D Rigid Motion
    5.3.1 3D Textured Models
    5.3.2 Shape-induced Homography
    5.3.3 Change to the Reference Frame
    5.3.4 Optimization Outline
    5.3.5 Gradient Replacement
    5.3.6 Systematic Factorization
  5.4 3D Nonrigid Motion
    5.4.1 Nonrigid Morphable Models
    5.4.2 Nonrigid Shape-induced Homography
    5.4.3 Change of Variables to the Reference Frame
    5.4.4 Optimization Outline
    5.4.5 Gradient Replacement
    5.4.6 Systematic Factorization
  5.5 Summary

6 Compositional Algorithms
  6.1 Unravelling the Inverse Compositional Algorithm
    6.1.1 Change of Variables in IC
    6.1.2 The Efficient Forward Compositional Algorithm
    6.1.3 Rationale of the Change of Variables in IC
    6.1.4 Differences between IC and EFC
  6.2 Requirements for Compositional Warps
    6.2.1 Requirement on Warp Composition
    6.2.2 Requirement on Gradient Equivalence
  6.3 Other Compositional Algorithms
    6.3.1 Generalized Inverse Compositional Algorithm
  6.4 Summary

7 Computational Complexity
  7.1 Complexity Measures
    7.1.1 Number of Operations
    7.1.2 Complexity of Matrix Operations
    7.1.3 Comparing Algorithm Complexities
  7.2 Algorithm Naming Conventions
    7.2.1 Additive Algorithms
    7.2.2 Compositional Algorithms
  7.3 Complexity of Algorithms
    7.3.1 Additive Algorithms
    7.3.2 Compositional Algorithms
  7.4 Summary

8 Experiments
  8.1 Motivation
  8.2 Features and Measures
    8.2.1 Numerical Ranges for Features
  8.3 Generation of Synthetic Experiments
    8.3.1 Synthetic Datasets and Images
    8.3.2 Generation of Result Plots
  8.4 Implementation Details
    8.4.1 Convergence Criteria
    8.4.2 Visibility Management
    8.4.3 Scale of Homographies
    8.4.4 Minimization of Jacobian Operations
  8.5 Additive Algorithms
    8.5.1 Experimental Hypotheses
    8.5.2 Experiments with Synthetic Rigid data
    8.5.3 Experiments with Synthetic Nonrigid data
    8.5.4 Experiments With Nonrigid Sequence
    8.5.5 Experiments with real Rigid data
    8.5.6 Experiment with real Nonrigid data
  8.6 Compositional Algorithms
    8.6.1 Experimental Hypotheses
    8.6.2 Experiments with Synthetic Rigid data
  8.7 Discussion

9 Conclusions and Future work
  9.1 Summary of Contributions
  9.2 Conclusions
  9.3 Future Work

A Gauss-Newton Optimization

B Plane-induced Homography

C Plane+Parallax-constrained Homography
  C.1 Compositional Form

D Methodical Factorization
  D.1 Basic Definitions
  D.2 Lemmas that Re-organize Product of Matrices
  D.3 Lemmas that Re-organize Kronecker Products
  D.4 Lemmas that Re-organize Sums of Matrices

E Methodical Factorization of f_3DTM

F Methodical Factorization of f_3DMM (Partial case)

G Methodical Factorization of f_3DMM (Full case)

H Detailed Complexity of Algorithms
  H.1 Warp f_3DTM
  H.2 Warp f_3DMM
  H.3 Jacobian of Algorithm HB3DTM
  H.4 Jacobian of Algorithm HB3DTMNF
  H.5 Jacobian of Algorithm HB3DMMNF
  H.6 Jacobian of Algorithm HB3DMMSF

List of Figures

1.1 Example of 3D rigid tracking
1.2 3D Nonrigid Tracking
1.3 Image registration
1.4 Industrial applications of 3D tracking
1.5 Motion capture in the film industry
1.6 Markerless facial motion capture

3.1 Imaging geometry
3.2 Iterative gradient descent image registration
3.3 Generic descent method for image registration
3.4 Lucas-Kanade image registration
3.5 Hager-Belhumeur image registration
3.6 Forward compositional image registration
3.7 Inverse compositional image registration

4.1 Depiction of Image Gradients
4.2 Image Gradient in P2
4.3 Image gradient in R3
4.4 Comparison between BCC and GEE
4.5 Gradients and Convergence
4.6 Open Subsets in Various Domains

5.1 3D Textured Model
5.2 Shape-induced homographies
5.3 Warp defined on the reference frame
5.4 Reference frame advantages
5.5 Nonrigid Morphable Models
5.6 Nonrigid shape-induced homographies
5.7 Deformable warp defined on the reference frame

6.1 Change of variables in IC
6.2 Forward compositional image registration
6.3 Generalized inverse compositional image registration

7.1 Complexity of Additive Algorithms
7.2 Complexities of Compositional Algorithms

8.1 Registration vs. Tracking
8.2 Algorithm initialization
8.3 Accuracy and convergence
8.4 Ground Truth and Noise Variance
8.5 Definition of Datasets
8.6 Example of Synthetic Datasets
8.7 Experimental Evaluation with Synthetic Data
8.8 Visibility management
8.9 Efficiently solving of WLS
8.10 The cube model
8.11 The face model
8.12 The tea box model
8.13 Results from dataset DS1 for cube
8.14 Results from dataset DS2 for cube
8.15 Results from dataset DS3 for cube
8.16 Results from dataset DS4 for cube
8.17 Results from dataset DS5 for cube
8.18 Results from dataset DS6 for cube
8.19 tea box sequence
8.20 Results for the tea box sequence
8.21 Estimated parameters from teabox sequence
8.22 Estimated parameters from face sequence
8.23 Good texture vs. bad texture
8.24 The face-deform model
8.25 Distribution of Synthetic Datasets
8.26 Results from dataset DS1 for face-deform
8.27 Results from dataset DS2 for face-deform
8.28 Results from dataset DS3 for face-deform
8.29 Results from dataset DS4 for face-deform
8.30 Results from dataset DS5 for face-deform
8.31 Results from dataset DS6 for face-deform
8.32 face-deform sequence
8.33 Results from face-deform sequence
8.34 Estimated parameters from face-deform sequence
8.35 The cube-real model
8.36 The cube-real sequence
8.37 Results from cube-real sequence
8.38 Selected facial scans used to build the model
8.39 Unfolded texture model
8.40 The face-real sequence
8.41 Anchor points in the model
8.42 Results for the face-real sequence
8.43 The plane model
8.44 Distribution of Synthetic Datasets
8.45 Results from dataset DS1 for plane
8.46 Results from dataset DS2 for plane
8.47 Results from dataset DS3 for plane
8.48 Results from dataset DS4 for plane
8.49 Results from dataset DS5 for plane
8.50 Results from dataset DS6 for plane
8.51 Average Time per iteration

9.1 Spiderweb Plots for Image Registration Algorithms
9.2 Spherical Harmonics-based Illumination Model
9.3 Tracking by simultaneously using texture and edges information
9.4 Efficient tracking using multiple views

B.1 Plane-induced homography
C.1 Plane+Parallax-constrained homography


List of Tables

4.1 Characteristics of the warps

6.1 Relationship between compositional algorithms and warps
6.2 Requirements for Optimization Algorithms

7.1 Complexity of matrix operations
7.2 Additive testing algorithms
7.3 Additive testing algorithms
7.4 Complexity of Algorithm LK3DTM
7.5 Complexity of Algorithm HB3DTM
7.6 Complexity of Algorithm LK3DMM
7.7 Complexity of Algorithm HB3DMMNF
7.8 Complexity of Algorithm HB3DMM
7.9 Complexity of Algorithm HB3DMMSF
7.10 Complexities of Additive Algorithms
7.11 Complexity of Algorithm LKH8
7.12 Complexity of Algorithm ICH8
7.13 Complexity of Algorithm HBH8
7.14 Complexity of Algorithm GICH8
7.15 Complexities of Compositional Algorithms
7.16 Comparison of Relative Complexities for Additive Algorithms
7.17 Comparison of Relative Complexities for Compositional Algorithms

8.1 Registration vs. tracking in efficient methods
8.2 Features and Measures
8.3 Numerical Ranges for Features
8.4 Evaluated Additive Algorithms
8.5 Ranges of parameters for cube experiments
8.6 Average reprojection error vs. noise for cube
8.7 Ranges of parameters for face-deform experiments
8.8 Average reprojection error vs. noise for face-deform
8.9 Evaluated Compositional Algorithms
8.10 Ranges of motion parameters for each dataset
8.11 Average reprojection error vs. noise for plane

9.1 Classification of Motion Warps

D.1 Lemmas used to re-arrange matrices product
D.2 Lemmas used to re-arrange Kronecker matrix products

List of Algorithms

1 Outline of the basic GN-based descent method for image registration
2 Outline of the Lucas-Kanade algorithm
3 Outline of the Hager-Belhumeur algorithm
4 Outline of the Forward Compositional algorithm
5 Outline of the Inverse Compositional algorithm
6 Iterative factorization of the Jacobian matrix
7 Outline of the HB3DTM algorithm
8 Outline of the full-factorized HB3DMM algorithm
9 Outline of the HB3DMMSF algorithm
10 Outline of the Efficient Forward Compositional algorithm
11 Outline of the Generalized Inverse Compositional algorithm
12 Creating the synthetic datasets
13 Outline of the GN algorithm


Resumen

This thesis addresses the problem of efficiently tracking 3D objects in image sequences. We approach 3D tracking through direct image registration, a technique that aligns two images by using their intensity levels. Image registration is usually solved with iterative optimization methods, where the function to be minimized depends on the error in the intensity levels. In this thesis we examine the most common image registration methods, with emphasis on those that use efficient optimization algorithms.

We investigate two forms of efficient registration. The first comprises the additive registration methods: the motion parameters are computed incrementally by means of a linear approximation of the error function. Within this class of algorithms, we focus on the factorization method of Hager and Belhumeur. We introduce a necessary requirement that the factorization algorithm must satisfy to achieve good convergence. In addition, we propose an automatic factorization procedure that allows us to track both rigid and deformable 3D objects.

The second type comprises the so-called compositional registration methods, where the error norm is rewritten by using function composition. We study the most common compositional methods, with emphasis on the fastest registration method, the inverse compositional algorithm. We introduce a new compositional registration method, the Efficient Forward Compositional algorithm, which lets us interpret the working mechanisms of the inverse compositional algorithm. Thanks to this novel interpretation, we state two fundamental requirements for efficient compositional algorithms.

Finally, we carry out a series of experiments with real and synthetic data to verify the theoretical claims. In addition, we distinguish between the registration and tracking problems for efficient algorithms: those algorithms that satisfy their requirement(s) can be used for image registration, but not for tracking.


Abstract

This thesis deals with the problem of efficiently tracking 3D objects in sequences of images. We tackle the efficient 3D tracking problem by using direct image registration. This problem is posed as an iterative optimization procedure that minimizes a brightness error norm. We review the most popular iterative methods for image registration in the literature, turning our attention to those algorithms that use efficient optimization techniques.

Two forms of efficient registration algorithms are investigated. The first type comprises the additive registration algorithms: these algorithms incrementally compute the motion parameters by linearly approximating the brightness error function. We centre our attention on Hager and Belhumeur's factorization-based algorithm for image registration. We propose a fundamental requirement that factorization-based algorithms must satisfy to guarantee good convergence, and introduce a systematic procedure that automatically computes the factorization. Finally, we also introduce two warp functions that satisfy the requirement and register rigid and nonrigid 3D targets.

The second type comprises the compositional registration algorithms, where the brightness error function is written by using function composition. We study the current approaches to compositional image alignment, and we emphasize the importance of the Inverse Compositional method, which is known to be the most efficient image registration algorithm. We introduce a new algorithm, the Efficient Forward Compositional image registration: this algorithm avoids the need to invert the warping function, and provides a new interpretation of the working mechanisms of the inverse compositional alignment. By using this information, we propose two fundamental requirements that guarantee the convergence of compositional image registration methods.

Finally, we support our claims with extensive experimental testing on synthetic and real-world data. We propose a distinction between image registration and tracking when using efficient algorithms. We show that, depending on whether the fundamental requirements hold, some efficient algorithms are eligible for image registration but not for tracking.


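Notations

Specific Sets and Constants

$\mathcal{X}$: Set of target points or target region.
$\Omega$: Set of target points currently visible.
$N$: Number of points in the target region, i.e. $N = \|\mathcal{X}\|$.
$N_\Omega$: Number of visible target points, i.e. $N_\Omega = \|\Omega\|$.
$P$: Dimension of the parameter space.
$C$: Number of image channels.
$K$: Dimension of the deformations space.
$F$: Number of frames in the image sequence.

Vectors and Matrices

$\mathbf{a}$: Lowercase bold letters denote vectors.
$\mathtt{A}_{m \times n}$: Monospace uppercase letters denote $m \times n$ matrices.
$\mathrm{vec}(\mathtt{A})$: Vectorization of matrix $\mathtt{A}$: if $\mathtt{A}$ is an $m \times n$ matrix, $\mathrm{vec}(\mathtt{A})$ is an $mn \times 1$ vector.
$\mathtt{I}_k \in \mathcal{M}_{k \times k}$: $k \times k$ identity matrix.
$\mathtt{I}$: $3 \times 3$ identity matrix.
$\mathbf{0}_k \in \mathbb{R}^k$: $k \times 1$ vector filled with zeros.
$\mathtt{0}_{m \times n} \in \mathcal{M}_{m \times n}$: $m \times n$ matrix filled with zeros.

Camera Model Notations

$\mathbf{x} \in \mathbb{R}^2$: Pixel location in the image.
$\hat{\mathbf{x}} \in \mathbb{P}^2$: Location in the projective space.
$\mathbf{X} \in \mathbb{R}^3$: Point in Cartesian coordinates.
$\mathbf{X}_c \in \mathbb{R}^3$: Point expressed in the camera reference system.
$\mathtt{K} \in \mathcal{M}_{3 \times 3}$: $3 \times 3$ camera intrinsics matrix.
$\mathtt{P} \in \mathcal{M}_{3 \times 4}$: $3 \times 4$ camera projection matrix.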
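As a rough illustration of how the camera-model symbols above fit together, the following sketch maps a point $\mathbf{X}_c$ in the camera reference system to a pixel location $\mathbf{x}$. It assumes a standard pinhole camera, which this notation list does not spell out; the function name `project` and the numeric intrinsics are hypothetical and chosen only for the example.

```python
def project(K, Xc):
    """Map a 3D point in camera coordinates to a pixel (assumed pinhole model)."""
    # Homogeneous image point x_hat = K * Xc (3x3 matrix times 3-vector).
    x_hat = [sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3)]
    # Go from the projective plane to the Cartesian plane:
    # divide by the third homogeneous coordinate.
    return (x_hat[0] / x_hat[2], x_hat[1] / x_hat[2])


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]

# Hypothetical point two units in front of the camera.
x = project(K, [0.5, -0.25, 2.0])
print(x)
```

Points with a zero third coordinate have no Cartesian image and would raise a division error here; a fuller implementation would treat them separately.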

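Imaging Notations

$T(\mathbf{x}) \in \mathbb{R}^c$: Brightness value of the template image at pixel $\mathbf{x}$.
$I(\mathbf{x}, t) \in \mathbb{R}^c$: Brightness value of the current image for pixel $\mathbf{x}$ at instant $t$.
$I_t(\mathbf{x})$: Another notation for $I(\mathbf{x}, t)$.
$\mathbf{T}, \mathbf{I}_t$: Vector forms of functions $T$ and $I_t$.
$[\,\cdot\,]$: Composite function $I \circ p$, that is, $I[\mathbf{x}] = I(p(\mathbf{x}))$.

Optimization Notations

$\boldsymbol{\mu} \in \mathbb{R}^P$: Column vector of motion parameters.
$\boldsymbol{\mu}_0 \in \mathbb{R}^P$: Initial guess of the optimization.
$\boldsymbol{\mu}_i \in \mathbb{R}^P$: Parameters at the $i$-th iteration of the optimization.
$\boldsymbol{\mu}^* \in \mathbb{R}^P$: Actual optimum of the optimization.
$\boldsymbol{\mu}_t \in \mathbb{R}^P$: Parameters at image $t$.
$\boldsymbol{\mu}_J \in \mathbb{R}^P$: Parameters where the Jacobian is computed for efficient algorithms.
$\delta\boldsymbol{\mu} \in \mathbb{R}^P$: Incremental step at the current state of the optimization.
$\ell(\delta\boldsymbol{\mu})$: Linear model for the incremental step $\delta\boldsymbol{\mu}$.
$L(\delta\boldsymbol{\mu})$: Local minimizer for the incremental step $\delta\boldsymbol{\mu}$.
$\mathbf{r}(\boldsymbol{\mu}) \in \mathbb{R}^N$: $N \times 1$ vector-valued residuals function at parameters $\boldsymbol{\mu}$.
$\nabla_{\hat{\mathbf{x}}} f(\mathbf{x})$: Derivatives of function $f$ with respect to variables x, instantiated at x.
$\mathtt{J}(\boldsymbol{\mu}) \in \mathcal{M}_{N \times P}$: Jacobian matrix of the brightness dissimilarity at $\boldsymbol{\mu}$, i.e. $\mathtt{J}(\boldsymbol{\mu}) = \nabla_{\hat{\boldsymbol{\mu}}} D(\mathcal{X}; \boldsymbol{\mu})$.
$\mathtt{H}(\boldsymbol{\mu}) \in \mathcal{M}_{P \times P}$: Hessian matrix of the brightness dissimilarity at $\boldsymbol{\mu}$, i.e. $\mathtt{H}(\boldsymbol{\mu}) = \nabla^2_{\hat{\boldsymbol{\mu}}} D(\mathcal{X}; \boldsymbol{\mu})$.

Warp Function Notations

$f(\mathbf{x}; \boldsymbol{\mu}) : \mathbb{R}^n \times \mathbb{R}^P \mapsto \mathbb{R}^n$: Motion model or warp.
$p : \mathbb{R}^n \mapsto \mathbb{R}^2$: Projection into the Cartesian plane.
$\mathtt{R} \in \mathcal{M}_{3 \times 3}$: $3 \times 3$ rotation matrix.
$\mathbf{r}_i \in \mathbb{R}^3$: Columns of the rotation matrix $\mathtt{R}$, i.e. $\mathtt{R} = (\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3)$.
$\mathbf{t} \in \mathbb{R}^3$: Translation vector in Euclidean space.
$D : \mathbb{R}^2 \times \mathbb{R}^p \mapsto \mathbb{R}$: Dissimilarity function.
$U : \mathbb{R}^p \times \mathbb{R}^p \mapsto \mathbb{R}^p$: Parameters update function.
$\psi : \mathbb{R}^p \times \mathbb{R}^p \mapsto \mathbb{R}^p$: Jacobian update function for algorithm GIC.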

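Factorization Notations

$\otimes$: Kronecker product.
$\odot$: Row-wise Kronecker product.
$\mathtt{S}(\mathbf{x})$: Constant matrix in the factorization method that is computed from the target structure and camera calibration.
$\mathtt{M}(\boldsymbol{\mu})$: Variable matrix in the factorization methods that is computed from motion parameters.
$\mathtt{W} \in \mathcal{M}_{p \times p}$: Weighting matrix for Weighted Least-Squares.
$\pi : \mathbb{R}^n \mapsto \mathbb{R}^n$: Permutation of the set $\{1, \ldots, n\}$.
$\mathtt{P}_{\pi(n)} \in \mathcal{M}_{n \times n}$: Permutation matrix of the set $\{1, \ldots, n\}$.
$\pi(n, q)$: Permutation of the set $\{1, \ldots, n\}$ with ratio $q$.

3D Models Notations

$\mathcal{F} \subset \mathbb{R}^2$: Reference frame for algorithm HB.
$S : \mathcal{F} \mapsto \mathbb{R}^3$: Target shape function.
$T : \mathcal{F} \mapsto \mathbb{R}^C$: Target texture function.
$\mathbf{u} \in \mathcal{F}$: Target coordinates in the reference frame.
$\mathtt{S} \in \mathcal{M}_{3 \times N_v}$: Target 3D shape.
$\mathbf{s} \in \mathbb{R}^3$: Shape coordinates in the Euclidean space.
$\mathbf{s}_0 \in \mathbb{R}^3$: Mean shape of the target generative model.
$\mathbf{s}_i \in \mathbb{R}^3$: $i$-th basis of deformation of the target generative model.
$\mathbf{n}^\top \in \mathbb{R}^3$: Normal vector to a given triangle. $\mathbf{n}^\top$ is normalized with the triangle depth (i.e., if $\mathbf{x}$ belongs to the triangle, then $\mathbf{n}^\top \mathbf{x} = 1$).
$\mathtt{B}_s \in \mathcal{M}_{3 \times K}$: Basis of deformations.
$\mathbf{c} \in \mathbb{R}^K$: Vector containing $K$ deformation coefficients.
$\mathtt{H}_A \in \mathcal{M}_{3 \times 3}$: Affine warp between the image reference frame and $\mathcal{F}$.
$\dot{\mathtt{R}}_\Delta$: Derivatives of the rotation matrix $\mathtt{R}$ with respect to the Euler angle $\Delta = \{\alpha, \beta, \gamma\}$.
$\lambda \in \mathbb{R}$: Homogeneous scale factor.
$\mathbf{v} \in \mathbb{R}^3$: Change of variables defined as $\mathbf{v} = \mathtt{K}^{-1} \mathtt{H}_A \hat{\mathbf{u}}$.

Function Naming Conventions

$f_{\mathrm{H82D}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: 8-dof homography.
$f_{\mathrm{H6P}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: Plane-induced homography.
$f_{\mathrm{H6S}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: Shape-induced homography.
$f_{\mathrm{3DTM}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: 3D Textured Model motion model.
$f_{\mathrm{H6D}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: Deformable shape-induced homography.
$f_{\mathrm{3DMM}} : \mathbb{P}^2 \mapsto \mathbb{P}^2$: 3D Textured Morphable Model motion model.
$\varepsilon : \mathbb{R}^p \mapsto \mathbb{R}$: Reprojection error function.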
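The two matrix products listed under Factorization Notations can be illustrated with a small sketch. The Kronecker product below follows the usual block definition. For the row-wise Kronecker product, the sketch assumes the common definition in which row $i$ of the result is the Kronecker product of row $i$ of each operand; the notation list does not state the definition, so treat this as an assumption. The function names `kron` and `row_kron` are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    # Rows are indexed by (row of A, row of B) and columns by
    # (column of A, column of B); each entry is the product a_ij * b_kl.
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def row_kron(A, B):
    """Row-wise Kronecker product (assumed definition): row i is kron(A[i], B[i])."""
    if len(A) != len(B):
        raise ValueError("A and B need the same number of rows")
    return [[a * b for a in row_a for b in row_b]
            for row_a, row_b in zip(A, B)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(kron(A, B))      # 4 x 4 matrix
print(row_kron(A, B))  # 2 x 4 matrix
```

Under this assumed definition, the row-wise product keeps one row per input row, so it suits quantities that are evaluated independently at every target point.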

Algorithms Naming Conventions

LK: Lucas-Kanade algorithm [Lucas and Kanade, 1981].
HB: Hager-Belhumeur factorization algorithm [Hager and Belhumeur, 1998].
IC: Inverse Compositional algorithm [Baker and Matthews, 2004].
FC: Forward Compositional algorithm [Baker and Matthews, 2004].
GIC: Generalized Inverse Compositional algorithm [Brooks and Arbel, 2010].
EFC: Efficient Forward Compositional algorithm.
LKH8: Lucas-Kanade algorithm for homographies.
LKH6: Lucas-Kanade algorithm for plane-induced homographies.
LK3DTM: Lucas-Kanade algorithm for 3D Textured Models (rigid).
LK3DMM: Lucas-Kanade algorithm for 3D Morphable Models (deformable).
HB3DTR: Full-factorized HB algorithm for 6-dof motion in R3 [Sepp, 2006].
HB3DTM: Full-factorized HB algorithm for 3D Textured Models (rigid).
HB3DMM: Full-factorized HB algorithm for 3D Morphable Models (deformable).
HB3DMMSF: Semi-factorized HB algorithm for 3D Morphable Models.
HB3DMMNF: HB algorithm for 3D Morphable Models without the factorization stage.
ICH8: IC algorithm for homographies.
ICH6: IC algorithm for plane-induced homographies.
GICH8: GIC algorithm for homographies.
GICH6: GIC algorithm for plane-induced homographies.
IC3DRT: IC algorithm for 6-dof motion in R3 [Muñoz et al., 2005].
FCH6PP: FC algorithm for plane+parallax homographies.

Note: only the most relevant citation is shown for each algorithm.

Chapter 1. Introduction

This thesis deals with the problems of registration and tracking in sequences of images. Both problems are classical topics in Computer Vision and Image Processing that have been widely studied in the past. We summarize the subjects of this thesis in the dissertation title: Efficient Model-based 3D Tracking by Using Direct Image Registration.

What is 3D Tracking? Let the target be a part of the scene (e.g. the cube in Figure 1.1). We define tracking as the process of repeatedly computing the target state in a sequence of images. When we describe this state as the relative 3D orientation and location of the target with respect to the coordinate system of the camera (or another arbitrary reference system), we refer to this process as 3D rigid tracking (see Figure 1.1). If we also include state parameters that describe the possible deformation of the object, we have 3D nonrigid or deformable tracking (see Figure 1.2). We use 3D tracking to refer to both the rigid and the nonrigid cases.

What is Direct Image Registration? When the target is imaged by two cameras with different points of view, the resulting images are different although they represent the same portion of the scene (see Figure 1.3). Image Registration or Image Alignment computes the geometric transformation that best aligns the coordinate systems of both images such that their pixel-wise differences are minimal (cf. Figure 1.3). We say that image registration is a direct method when we register the coordinate systems by using only the brightness differences of the images.

What is Model-based? We say that a technique is model-based when we restrict the information from the real world by using certain assumptions: on the target dynamics, on the target structure, on the camera sensing process, etc. For example, in Figure 1.1 we model the target with a cube structure and rigid body dynamics.

Figure 1.1: Example of 3D rigid tracking. (Left) Selected frames of a scene containing a textured cube. We track the object and overlay its state in blue. (Right) The relative position of the camera (represented by a coloured pyramid) and the cube is computed from the estimated 3D parameters.

Figure 1.2: 3D nonrigid tracking. Selected frames from a sequence of a cushion under a bending motion. We track some landmarks on the cushion through the sequence, and we plot the resulting triangular mesh for the selected frames. The motion of the landmarks is both global (translation of the mesh) and local (changes in the relative position of the mesh vertices due to the deformation). Source: Alessio del Bue.

And Finally, What does Efficient mean? We say that a method is efficient if it substantially improves the computation time with respect to gold-standard techniques. In practical terms, efficient is equivalent to real-time, i.e. the tracking procedure operates at 25 frames per second.

Figure 1.3: Image registration. (Top row) Images of a portion of the scene from two distinct points of view. We have outlined the target in blue (Top-left) and green (Top-right). (Bottom) The left image is warped such that the coordinates of the target match up in both images. Source: Graffiti sequence, from Oxford Visual Geometry Group.

1.1. Motivation

In less than thirty years, video tracking has moved from being largely confined to academic and military environments to gaining widespread recognition, mainly thanks to the media.

Thus, video tracking is now a staple in sci-fi shows and films where futuristic Head-up Displays (hud) work in a show-and-tell fashion, a camera surveillance system can locate an object or a person, or a robot can address people and even recognize their mood. However, tv is, sadly, years ahead of reality. Current video tracking systems are still at a primitive stage: they are inaccurate, sloppy, slow, and usually work in laboratory conditions only. Nevertheless, video tracking is progressing by leaps and bounds, and it will probably match some sci-fi standards soon.

We investigate the problem of efficiently tracking an object in a video sequence. Nowadays there exist several efficient optimization algorithms for video tracking or image registration. We study two of the fastest algorithms available: the Hager-Belhumeur factorization algorithm and the Baker-Matthews inverse compositional algorithm. Both algorithms, although very efficient for planar registration, present diverse problems for 3D tracking. This thesis studies which assumptions can be made with these algorithms whilst underlining their limitations through extensive testing. Ultimately, the objective is to provide a detailed description of each algorithm, pointing out pros and cons, leading to a kind of Quick Guide to Efficient Tracking Algorithms.

1.2. Applications

Typical applications for 3D tracking include target localization for military operations; security and surveillance tasks such as person counting, face identification, people detection, determining people's activity, or detecting left objects; and human-computer interaction for computer security, aids for disabled people, or even controlling video games. Tracking is also used for augmenting video sequences with additional information such as advertisements, expanding information about the scene, or adding or removing objects of the scene. We show some examples of actual industrial applications in Figure 1.4.
A tracking process that is widely used in the film industry is Motion Capture: we track the motion of the different parts of an actor's body using a suit equipped with reflective markers; then, we transfer the estimated motion to a computer-generated character (see Figure 1.5). Using this technique, we can animate a synthetic 3D character in a movie, such as Gollum in the Lord of the Rings trilogy (2001), or Jar-Jar Binks in the new Star Wars trilogy (1999). Other relevant movies that employ these techniques are Polar Express (2004), King Kong (2005), Beowulf (2007), A Christmas Carol (2009), and Avatar (2009). Furthermore, we can generate a complete computer-generated movie populated with characters animated through motion capture. Facial motion capture is of special interest for us: we animate a computer-generated facial expression by facial expression tracking (see Figure 1.5). We turn our attention to markerless facial motion capture, that is, the process of recovering the face expression and orientation without using fiducial markers. Markerless motion capture does not require special equipment (such as close-up cameras) or a complicated set-up on the actor's face (such as special reflective make-up or facial stickers). In this thesis we propose a technique that captures the motion of facial expressions by using only brightness information and prior knowledge of the deformation of the target (see Figure 1.6).

Figure 1.4: Industrial applications of 3D tracking. (Top-left) Augmented reality inserts virtual objects into the scene. (Top-middle) Augmented reality shows additional information about tracked objects in the scene. Source: Hawk-eye, Hawk-Eye Innovations Ltd., copyright © 2008. (Top-right) Tracking a pedestrian for video surveillance. Source: Martin Communications, copyright © 1998-2007. (Bottom-left) People flow counter by tracking. Source: EasyCount, by Keeneo, copyright © 2010. (Bottom-middle) Car tracking detects possible traffic infractions or estimates car speed. Source: Fibridge, copyright ©. (Bottom-right) Body tracking is used for interactive control of video games. Source: Kinect, Microsoft, copyright © 2010.

1.3. Contributions of the Thesis

We outline the remaining chapters of the thesis and their principal contributions as follows:

Chapter 2: Literature Review. We provide a detailed survey of the literature on techniques for both image registration and tracking.

Chapter 3: Efficient Image Registration. We review the state of the art on efficient methods. We introduce the taxonomy for efficient registration algorithms: an algorithm is classified as either additive or compositional.

Chapter 4: Equivalence of Gradients. We introduce the gradient equivalence equation constraint: we show that satisfying this assumption has positive effects on the performance of the algorithms.

Chapter 5: Additive Algorithms. We review which constraints determine the convergence of additive registration algorithms, especially the factorization approach. We provide a methodical procedure to factorize an algorithm in general form; we state a basic set of theorems and lemmas that enable us to systematize the factorization. We introduce two tracking algorithms using factorization: one for rigid 3D objects, and another for deformable 3D objects.

Figure 1.5: Motion capture in the film industry. Facial and body motion capture from Avatar TM (Top row) and Polar Express TM (Bottom row). (Left column) The body motion and head pose are computed using reflective fiducial markers (the grey spheres of the motion capture jumpsuit). For facial expression capture they use many smaller markers and even close-up cameras. (Right column) They use the estimated motion to animate characters in the movie. Source: Avatar, 20th Century Fox, copyright © 2009; Polar Express, Warner Bros. Pictures, copyright © 2004.

Figure 1.6: Markerless facial motion capture. (Top) Several frames where the face modifies both its orientation (due to a rotation) and its shape structure (due to changes in facial expression). (Bottom) The tracking state vector includes both pose and deformation. Legend: (Blue) Actual projection of the target shape using the estimated parameters; (Pink) Highlighted projections corresponding to profiles of the jaw, eyebrows, lips and nasolabial wrinkles.

Chapter 6: Compositional Algorithms. We review the basic inverse compositional algorithm. We introduce an alternative efficient compositional algorithm that is equivalent to the inverse compositional algorithm under certain assumptions. We show that if the gradient equivalence equation holds then both efficient compositional methods shall converge.

Chapter 7: Computational Complexity. We study the resources used by the registration algorithms in terms of their computational complexity. We compare the theoretical complexities of efficient and nonefficient algorithms.

Chapter 8: Experiments. We devise a set of experimental tests to confirm our assumptions on the registration algorithms: (1) we verify the dependence of the convergence on the algorithm constraint, and (2) we evaluate the theoretical complexities with actual data.

Chapter 9: Conclusions and Future Work. Finally, we draw conclusions about where each technique is more suitable to be used, and we provide insight into future work to improve the proposed methods.


Chapter 2. Literature Review

In this chapter we review the basic literature on tracking and image registration. First we introduce the basic similarities and differences between image registration and tracking. Then, we review the usual methods for both tracking and image registration.

2.1. Image Registration vs. Tracking

The frontier between image registration and tracking is a bit fuzzy: tracking identifies the location of an object in a sequence of images, whereas registration finds the pixel-to-pixel correspondence between a pair of images. Note that in both cases we compute a geometric and photometric transformation between images: pairwise in the context of image registration and among multiple images for the tracking case. Although we may use the terms registration and tracking interchangeably, we define the following subtle semantic differences between them:

• Image registration finds the best alignment between two images of the same scene. We use a geometric transformation to align the images of both cameras. We consider that image registration emphasizes finding the best alignment between two images in visual terms, not accurately recovering the parameters of the transformation; this is usually the case in, e.g., medical applications.

• Tracking finds the location of a target object in each frame of a sequence. We assume that the difference of object position between two consecutive frames is small. In tracking we are typically interested in recovering the parameters describing the state of the object rather than the coordinates of the location: we can describe an object using richer information than just its position (e.g. 3D orientation, modes of deformation, lighting changes, etc.). This is usually the case in robotics [Benhimane and Malis, 2007; Cobzas et al., 2009; Nick Molton, 2004], or augmented reality [Pilet et al., 2008; Simon et al., 2000; Zhu et al., 2006].

Also, image registration involves two images with arbitrary baseline whereas tracking usually operates in a sequence with a small inter-frame baseline. We assume that tracking is a higher level problem than image registration. Furthermore, we propose a tracking-by-registration approach: we track an object through a sequence by iteratively registering pairs of consecutive images [Baker and Matthews, 2004]; however, we can perform tracking without any registration at all (e.g. tracking-by-detection [Viola and Jones, 2004], or tracking-by-classification [Vacchetti et al., 2004]).

2.2. Image Registration

Image registration is a classic topic in computer vision and numerous approaches have been proposed in the literature; two good surveys on the subject are [Brown, 1992] and [Zitova, 2003]. The process involves computing the pixel-to-pixel correspondence between the two images: that is, for each pixel in one image we find the corresponding pixel in the other image so that both pixels project from the same actual point in the scene (cf. Figure 1.1). Applications include image mosaicing [Capel, 2004; Irani and Anandan, 1999; Shum and Szeliski, 2000], video stitching [Caspi and Irani, 2002], super-resolution [Capel, 2004; Irani and Peleg, 1991], region tracking [Baker and Matthews, 2004; Hager and Belhumeur, 1998; Lucas and Kanade, 1981], recovering scene/camera motion [Bartoli et al., 2003; Irani et al., 2002], or medical image analysis [Lester and Arridge, 1999]. Image registration methods commonly fall into one of the two following groups [Bartoli, 2008; Capel, 2004; Irani and Anandan, 1999]:

Direct methods. A direct image registration method aligns two images by using only the colour values (or intensity values in greyscale data) of the pixels that are common to both images (namely, the region of support). Direct methods minimize an error measure based on image brightness from the region of support.
Typical error measures include the L2-norm of the brightness difference [Irani and Anandan, 1999; Lucas and Kanade, 1981], normalized cross-correlation [Brooks and Arbel, 2010; Lewis, 1995], or mutual information [Dowson and Bowden, 2008; Viola and Wells, 1997].

Feature-based methods. In feature-based methods, we align two images by computing the geometric transformation between a set of salient features that we detect in each image. The idea is to abstract distinct geometric image features that are more reliable than the raw intensity values; typically these features show invariance with respect to modifications of the camera point of view, illumination conditions, scale, or orientation of the scene [Schmid et al., 2000]. Corners or interest points [Bay et al., 2008; Harris and Stephens, 1988; Lowe, 2004; Torr and Zisserman, 1999] are classical features in the literature, although we can use other features such as edges [Bartoli et al., 2003], or extremal image regions [Matas et al., 2002].
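As a small illustration of two of the direct error measures named above, the following Python sketch compares the squared L2-norm of the brightness difference with a zero-mean normalized cross-correlation on a synthetic patch. The function names `ssd` and `ncc`, the random patch, and the gain and bias values are assumptions made for this example; they are not taken from the cited works.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared brightness differences (squared L2 norm)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sum(d * d))

def ncc(a, b):
    """Zero-mean normalized cross-correlation, a value in [-1, 1]."""
    a0 = np.asarray(a, dtype=float) - np.mean(a)
    b0 = np.asarray(b, dtype=float) - np.mean(b)
    return float(np.sum(a0 * b0) / (np.linalg.norm(a0) * np.linalg.norm(b0)))

# Hypothetical data: a random 8x8 patch and the same patch under an
# affine brightness change (gain 2, bias 10).
rng = np.random.default_rng(0)
patch = rng.random((8, 8))
brighter = 2.0 * patch + 10.0

ssd_value = ssd(patch, brighter)   # large: SSD penalizes the brightness change
ncc_value = ncc(patch, brighter)   # close to 1: NCC ignores gain and bias
```

In this toy setting the squared difference grows with the brightness change, whereas the normalized cross-correlation stays close to one, which is one reason it may be preferred when illumination varies between the two images.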

Direct or feature-based methods? Choosing between direct and feature-based methods is not an easy task: we have to know the strong points of each method and the applications for which it is more suitable. A good comparison between the two types of methods is [Capel, 2004]. Feature-based methods typically show strong invariance to a wide range of photometric and geometric transformations of the image, and they are more robust to partial occlusions of the scene than their direct counterparts [Capel, 2004; Torr and Zisserman, 1999]. On the other hand, direct methods can align images with sub-pixel accuracy, estimate the dominant motion even when multiple motions are present, and provide a dense motion field in the case of 3D estimation [Irani and Anandan, 1999]. Moreover, direct methods do not require high-frequency textured surfaces (corners) to operate, but have optimal performance with smooth graylevel transitions [Benhimane et al., 2007].

2.3. Model-based 3D Tracking

In this section we define model-based tracking, and we review the previous literature on 3D tracking of rigid and nonrigid objects. A special case of interest for nonrigid objects is the 3D tracking of human faces or facial motion capture. We can recover the 3D orientation and position of the target with respect to the camera (or an arbitrary reference system), or the relative displacement and orientation of the camera with respect to the target (or another arbitrary reference system in the scene) [Sepp, 2008]. A good survey on the subject is [Lepetit and Fua, 2005].

2.3.1. Modelling assumptions

In model-based techniques we use a priori knowledge about the scene, the target, or the sensing device, as a basis for the tracking procedure. We classify these assumptions on the real-world information as follows:

Target model. The target model specifies how to represent the information about the structure of the scene in our algorithms.
Template tracking or template matching simply represents the target as the pixel intensity values inside a region defined on one image: we call this region (or the image itself) the reference image or template. One of the first proposed techniques for template matching was [Lucas and Kanade, 1981], although it was initially devised for solving optical flow problems. The literature proposes numerous extensions to this technique [Baker and Matthews, 2004; Benhimane and Malis, 2007; Brooks and Arbel, 2010; Hager and Belhumeur, 1998; Jurie and Dhome, 2002a]. We may also allow the target to deform its shape: this deformation induces changes in the target projected appearance. We model these changes in target texture by using generative models such as eigenimages [Black and Jepson, 1998; Buenaposada et al., 2009], Active Appearance Models (aam) [Cootes et al., 2001], active blobs [Sclaroff and Isidoro, 2003], or subspace representation [Ross et al., 2004]. Instead of modelling brightness variations we may represent target shape deformation by using a linear model representing the location of a set of feature points [Blanz and Vetter, 2003; Bregler et al., 2000; Del Bue et al., 2004], or Finite Element Meshes [Pilet et al., 2005; Zhu et al., 2006]. Alternative approaches model non-rigid motion of the target by using anthropometric data [Decarlo and Metaxas, 2000], or by using a probability distribution of the intensity values of the target region [Comaniciu et al., 2000; Zimmermann et al., 2009]. These techniques are suitable to track planar objects of the scene. If we add further knowledge about the scene, we can track more complex objects: with a proper model we are able to recover 3D information. Typically, we use a wireframe 3D model of the target, and tracking consists of finding the best alignment between the sensed image and the 3D model [Cipolla and Drummond, 1999; Kollnig and Nagel, 1997; Marchand et al., 1999]. We can augment this model by adding further texture priors either from the image stream [Cobzas et al., 2009; Muñoz et al., 2005; Sepp and Hirzinger, 2003; Vacchetti et al., 2004; Xiao et al., 2004a; Zimmermann et al., 2006], or from an external source (e.g. a 3D scanner or a texture mosaic) [Hong and Chung, 2007; La Cascia et al., 2000; Masson et al., 2004, 2005; Pressigout and Marchand, 2007; Romdhani and Vetter, 2003].

Motion model. The motion model describes the target kinematics (i.e. how the object modifies its position in the image/scene). The motion model is tightly coupled to the target model: it is usually represented by a geometric transformation that maps the coordinates of the target model into a different set of coordinates.
For a planar target, these geometric transformations are typically affine [Hager and Belhumeur, 1998], homographic [Baker and Matthews, 2004; Buenaposada and Baumela, 1999], or spline-based warps [Bartoli and Zisserman, 2004; Brunet et al., 2009; Lester and Arridge, 1999; Masson et al., 2005]. For actual 3D targets, the geometric warps account for computing the rotation and translation of the object using a 6 degree-of-freedom (dof) rigid body transformation [Cipolla and Drummond, 1999; La Cascia et al., 2000; Marchand et al., 1999; Sepp and Hirzinger, 2003].

Camera model. The camera model specifies how the images are sensed by the camera. The pinhole camera models the imaging device as a projector of the coordinates of the scene [Hartley and Zisserman, 2004]. For tracking zoomed objects located far away, we may use orthographic projection [Brand and R.Bhotika, 2001; Del Bue et al., 2004; Tomasi and Kanade, 1992; Torresani et al., 2002]. The perspective projection accounts for perspective distortion, and it is more suitable for close-up views [Muñoz et al., 2005, 2009]. The camera model may also account for model deviations such as lens distortion [Claus and Fitzgibbon, 2005; Tsai, 1987].

Other model assumptions. We can also model prior photometric knowledge about the target/scene such as illumination cues [La Cascia et al., 2000; Lagger et al., 2008; Romdhani and Vetter, 2003], or global colour [Bartoli, 2008].

2.3.2. Rigid Objects

We can follow two strategies to recover the 3D parameters of a rigid object:

2D Tracking. The first group of methods involves a two-step process: first we compute the 2D motion of the object as a displacement of the target projection on the image; second, we recover the actual 3D parameters from the computed 2D displacements by using the scene geometry. A natural choice is to use optical flow: [Irani et al., 1997] computes the dominant 2D parametric motion between two frames to register the images; the residual displacement (the image regions that cannot be registered) is used to recover the 3D motion. When the object is a 3D plane, we can use a homographic transformation to compute plane-to-plane correspondences between two images; then we recover the actual 3D motion of the plane using the camera geometry [Buenaposada and Baumela, 2002; Lourakis and Argyros, 2006; Simon et al., 2000]. We can also compute the inter-frame displacements by using linear regressors or predictors, and then robustly fit the projections to a target model (using RANSAC) to compute the 3D parameters [Zimmermann et al., 2009]. An alternative method is to compute pixel-to-pixel correspondences by using a classifier [Lepetit and Fua, 2006], and then recover the target 3D pose using POSIT [Dementhon and Davis, 1995], or equivalent methods [Lepetit et al., 2009].

3D Tracking. These methods directly compute the actual 3D motion of the object from the image stream. They mainly use a 3D model of the target to compute the motion parameters; the 3D model contains a priori knowledge of the target that improves the estimation of motion parameters (e.g. to get rid of projective ambiguities).
The simplest way to represent a 3D target is to use a texture model (a set of image patches sensed from one or several reference images) as in [Cobzas et al., 2009; Devernay et al., 2006; Jurie and Dhome, 2002b; Masson et al., 2004; Sepp and Hirzinger, 2003; Xu and Roy-Chowdhury, 2008]. The main drawback of these methods is the lack of robustness against changes in scene illumination and specular reflections. We can alternatively fit the projection of a 3D wireframe model (e.g. a cad model) to the edges of the image [Drummond and Cipolla, 2002]. However, these methods also have problems with cluttered backgrounds [Lepetit and Fua, 2005]. To gain robustness, we can use hybrid models of texture and contours such as [Marchand et al., 1999; Masson et al., 2003; Vacchetti et al., 2004], or simply use an additional model to deal with illumination [Romdhani and Vetter, 2003].

2.3.3. Nonrigid Objects

Tracking methods for nonrigid objects fall into the same categories that we used for rigid ones. Point-to-point correspondences of the deformable target can recover the pose and/or deformation parameters using subspace methods [Del Bue, 2010; Torresani et al., 2008], or by fitting a deformable triangle mesh [Pilet et al., 2008; Salzmann et al., 2007]. We can alternatively fit the 2D silhouette of the target to a 3D skeletal deformable model of the object [Bowden et al., 2000]. Direct estimation of the 3D parameters unifies the processes of matching pixel correspondences and estimating the pose and deformation of the target. [Brand, 2001; Brand and R.Bhotika, 2001] constrain the optical flow by using a linear generative model to represent the deformation of the object. [Gay-Bellile et al., 2010] models the object 3D deformations, including self-occlusions, by using a set of Radial Basis Functions (rbf).

2.3.4. Facial Motion Capture

Estimation of facial motion parameters is a challenging task; head 3D orientation was typically estimated by using fiducial markers to overcome the inherent difficulty of the problem [Bickel et al., 2007]. However, markerless methods have also been developed in recent years. Facial motion capture involves recovering head 3D orientation and/or face deformation due to changes in expression. We first review techniques for recovering head 3D pose, then we review techniques for recovering both pose and expression.

Head pose estimation. There are numerous techniques to compute head pose or 3D orientation. In the following, we review a number of them; a recent detailed survey on the subject is [Murphy-Chutorian and Trivedi, 2009]. The main difficulty of estimating head pose lies in the nonconvex structure of the human head.
Classic 2D approaches such as [Black and Yacoob, 1997; Hager and Belhumeur, 1998] are only suitable to track motions of the head parallel to the image plane: the reason is that these methods only use information from a single reference image. To fully recover the 3D rotation parameters of the head we need additional information. [La Cascia et al., 2000] uses a texture map that was computed by cylindrical projection of images of the head taken from different points of view; [Baker et al., 2004a; Jang and Kanade, 2008] also use an analogous cylindrical model. In a similar fashion, we can use a 3D ellipsoid shape [An and Chung, 2008; Basu et al., 1996; Choi and Kim, 2008; Malciu and Prêteux, 2000]. Instead of using a cylinder or an ellipsoid, we can have a detailed model of the head such as a 3D Morphable Model (3dmm) [Blanz and Vetter, 2003; Muñoz et al., 2009; Xu and Roy-Chowdhury, 2008], an aam coupled together with a 3dmm [Faggian et al., 2006], or a triangular mesh model of the face [Vacchetti et al., 2004]. The latter is robustly tracked in [Strom et al., 1999] using an Extended Kalman Filter. We can also have a head model with reduced complexity as in [B. Tordoff et al., 2002].

Face expression estimation. A change of facial expression induces a deformation in the 3D structure of the face. The estimation of this deformation can be used for face expression recognition, expression detection, or facial motion transfer. Classic 2D approaches such as aams [Cootes et al., 2001; Matthews and Baker, 2004] are only suitable to recover expressions from a frontal face. 3D aams are the three-dimensional extension of these 2D methods: they fit a statistical model of 3D shapes and texture (typically a PCA model) to the pixel intensities of the image [Chen and Wang, 2008; Dornaika and Ahlberg, 2006]. Hybrid methods that combine 2D and 3D aams show both real-time performance and actual 3D head pose estimation: we can use the 3D aams to simultaneously constrain the 2D aams motion and compute the 3D pose [Xiao et al., 2004b], or directly compute the facial motion from the 2D aams parameters [Zhu et al., 2006]. In contrast to pure 2D aams, 3D aams can recover actual 3D pose and expression from faces that are not frontal to the camera. However, the out-of-plane rotations that can be recovered by these methods are typically smaller than those recovered using a pure 3D model (e.g. a 3dmm). [Blanz and Vetter, 2003; Romdhani and Vetter, 2003] search for the best configuration of a 3dmm such that the differences between the rendered model and the image are minimal; both methods also show great performance recovering strong facial deformations. Real-time alternatives using 3dmm include [Hiwada et al., 2003; Muñoz et al., 2009]. [Pighin et al., 1999] uses a linear combination of 3D face models fitted to match the images to estimate realistic facial expressions. Finally, [Decarlo and Metaxas, 2000] derives an anthropometric physically-based face model that may be adjusted to each individual face target; in addition, they solve a dynamic system for the face pose and expression parameters by using optical flow constrained by the edges of the face.


Chapter 3. Efficient Direct Image Registration

3.1. Introduction

This chapter reviews the problem of efficiently registering two images. We define the Direct Image Alignment (dia) problem as the process that computes the transformation between two frames using only image brightness information. We organize the chapter as follows: Section 3.2 introduces basic registration notions; Section 3.3 reviews additive registration algorithms such as Lucas-Kanade or Hager-Belhumeur; Section 3.4 reviews compositional registration algorithms such as Baker and Matthews' Forward Compositional and Inverse Compositional; finally, other methods are reviewed in Section 3.5.

3.2. Modelling Assumptions

This section reviews those assumptions on the real world that we use to mathematically model the registration procedure. We introduce the notation on the imaging process through a pinhole camera. We establish the Brightness Constancy Assumption or Brightness Constancy Constraint (bcc) as the cornerstone of the direct image registration techniques. We also pose the registration problem as an iterative optimization problem. Finally, we provide a classification of the existing direct registration algorithms.

3.2.1. Imaging Geometry

We represent points of the scene using Cartesian coordinates in $\mathbb{R}^3$ (e.g. $\mathbf{X} = (X, Y, Z)^\top$). We represent points on the image with homogeneous coordinates, so that the pixel position $\mathbf{x} = (i, j)^\top$ is represented using the notation for augmented points as $\tilde{\mathbf{x}} = (i, j, 1)^\top$. Conversely, the homogeneous point $\tilde{\mathbf{x}} = (x_1, x_2, x_3)^\top$ is represented in Cartesian coordinates using the mapping $p : \mathbb{P}^2 \to \mathbb{R}^2$, such that $p(\tilde{\mathbf{x}}) = \mathbf{x} = (x_1/x_3, x_2/x_3)^\top$. The scene is imaged through a perfect pin-hole camera [Hartley and Zisserman, 2004]; by abuse of notation, we define the perspective projection $p : \mathbb{R}^3 \to \mathbb{R}^2$ that maps scene coordinates onto image points,

$$\mathbf{x} = p(\mathbf{X}_c) = \left( \frac{\mathbf{k}_1^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c},\; \frac{\mathbf{k}_2^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c} \right)^\top,$$

where $\mathsf{K} = (\mathbf{k}_1^\top, \mathbf{k}_2^\top, \mathbf{k}_3^\top)^\top$ is the 3 × 3 matrix that contains the camera intrinsics (cf. [Hartley and Zisserman, 2004]), and $\mathbf{X}_c = (X_c, Y_c, Z_c)^\top$. We implicitly assume that $\mathbf{X}_c$ represents a point in the camera reference system. If the points to project are expressed in an arbitrary reference system of the scene we need an additional mapping; hence, the perspective projection for a point $\mathbf{X}$ in the scene is

$$\tilde{\mathbf{x}} = \mathsf{K} \begin{bmatrix} \mathsf{R} & -\mathsf{R}\mathbf{t} \end{bmatrix} \begin{pmatrix} \mathbf{X} \\ 1 \end{pmatrix},$$

where $\mathsf{R}$ and $\mathbf{t}$ are the rotation and translation between the scene and the camera coordinate system (see Figure 3.1).

Figure 3.1: Imaging geometry. An object of the scene is imaged through camera centres $C_1$ and $C_2$ onto two distinct images $I_1$ and $I_2$ (related by a rotation $\mathsf{R}$ and a translation $\mathbf{t}$). The point $\mathbf{X}$ is projected to the points $\mathbf{x}_1 = p(\mathsf{K} [\mathsf{I} \mid \mathbf{0}] \tilde{\mathbf{X}})$ and $\mathbf{x}_2 = p(\mathsf{K} [\mathsf{R} \mid -\mathsf{R}\mathbf{t}] \tilde{\mathbf{X}})$ in the two images.

Our input is a smooth sequence of images (i.e. inter-frame differences are small) where $I_t$ is the t-th frame of the sequence. We denote the reference image or template as $T$. Images are discrete matrices of brightness values, although we represent them as functions from $\mathbb{R}^2$ to $\mathbb{R}^C$, where $C$ is the number of image channels (i.e. $C = 3$ for colour, and $C = 1$ for gray-scale images): $I_t(\mathbf{x})$ is the brightness value at pixel $\mathbf{x}$. For non-discrete pixel coordinates, we use bilinear interpolation. If $\mathcal{X}$ is a set of pixels, we collect the brightness values of $I(\mathbf{x}), \forall \mathbf{x} \in \mathcal{X}$, in a single column vector $I(\mathcal{X})$, i.e., $I(\mathcal{X}) = (I(\mathbf{x}_1), \ldots, I(\mathbf{x}_N))^\top$, where $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.
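The projection of a scene point described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dehomogenize` and `project`, and the numerical values chosen for the intrinsics (focal length of 500 pixels, principal point at (320, 240)), are assumptions made for the example and do not come from the thesis.

```python
import numpy as np

def dehomogenize(x_tilde):
    """Mapping p: homogeneous (x1, x2, x3) -> Cartesian (x1/x3, x2/x3)."""
    x_tilde = np.asarray(x_tilde, dtype=float)
    return x_tilde[:2] / x_tilde[2]

def project(K, R, t, X):
    """Project the scene point X using x~ = K [R | -R t] (X, 1)^T."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    P = K @ np.hstack([R, (-R @ t).reshape(3, 1)])    # 3x4 projection matrix
    X_tilde = np.append(np.asarray(X, dtype=float), 1.0)  # augmented point
    return dehomogenize(P @ X_tilde)

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)        # scene and camera axes aligned
t = np.zeros(3)      # camera centre at the scene origin

x = project(K, R, t, [0.2, -0.1, 2.0])
```

With the identity rotation and a zero translation, the scene point is already expressed in the camera reference system, so the result reduces to dividing the first two rows of K X by the third one.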

(45) 3.2.2. Brightness Constancy Constraint. The bcc relates brightness information between two frames of a sequence [Hager and Belhumeur, 1998; Irani and Anandan, 1999]. The reference image $T$ is an arbitrary image of the sequence. We define the target region $\mathcal{X}$ as a set of pixel coordinates $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ defined on $T$ (see Figure 3.2). We define the template as the image values of the target region, that is, $T(\mathcal{X})$. Let us assume we know the transformation of the target region between $T$ and another arbitrary image of the sequence, $I_t$. The motion model $f$ defines this transformation as $\mathcal{X}_t = f(\mathcal{X}; \mu_t)$, where the set of coordinates $\mathcal{X}_t$ is the target region on $I_t$ and $\mu_t$ are the motion parameters. The bcc states that the brightness values of the template $T$ and of the input image $I_t$ warped by $f$ with parameters $\mu_t$ should be equal,

$$T(\mathcal{X}) = I_t(f(\mathcal{X}; \mu_t)). \tag{3.1}$$

The direct consequence of Equation 3.1 is that the brightness of the target does not depend on its motion; i.e., the relative position and orientation of the camera with respect to the target do not affect the brightness of the latter. However, we may augment the bcc to include appearance changes [Black and Jepson, 1998; Buenaposada et al., 2009; Matthews and Baker, 2004], and changes in illumination conditions due to ambient [Bartoli, 2008; Basri and Jacobs, 2003] or specular lighting [Blanz and Vetter, 2003].

3.2.3. Image Registration by Optimization. Direct image registration is usually posed as an optimization problem. We minimize an error function based on the pixel-wise brightness difference, parameterized by the motion variables:

$$\mu^* = \arg\min_{\mu} \left\{ \lVert \mathcal{D}(\mathcal{X}; \mu) \rVert^2 \right\}, \tag{3.2}$$

where

$$\mathcal{D}(\mathcal{X}; \mu) = T(\mathcal{X}) - I_t(f(\mathcal{X}; \mu)) \tag{3.3}$$

is a dissimilarity measure based on the bcc (Equation 3.1).

Descent Methods. Recovering these parameters is typically a non-linear problem because it depends on image brightness, which is usually non-linearly related to the motion parameters.
The usual approach is iterative gradient-based descent (GD): from a starting point $\mu_0$ in the search space, the method iteratively computes a series of partial solutions $\mu_1, \mu_2, \ldots, \mu_k$ that, under certain conditions, converge to a local minimizer $\mu^*$ [Madsen et al., 2004] (see Figure 3.2). We typically use Gauss-Newton (GN) methods for efficient registration because they provide good convergence without computing second derivatives (see Appendix A). The basic GN-based algorithm for image registration operates as outlined in Algorithm 1 and depicted in Figure 3.3. We describe the four stages of the algorithm below:

(46) Figure 3.2: Iterative gradient descent image registration. Top-left: Template image for the registration. We highlight the target region as a green quadrangle. Top-right: Image that we register against the template. We generate it by rotating the template image around its centre and translating it along the X-axis. We highlight the corresponding target region in yellow. We also display the initial guess for the optimization as a green quadrangle; notice that it corresponds exactly to the position of the target region in the template. Bottom-left: Contour plot of the image brightness dissimilarity. The axes show the values of the search space: image rotation and translation. We show the successive iterations in the search space: we reach the solution in four steps ($\mu_0$ to $\mu_4$). Bottom-right: Target region that corresponds to the parameters of each iteration. The colour of each quadrangle matches the colour of the parameters that generated it, as seen in the Bottom-left figure.
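A dissimilarity surface of the kind plotted in Figure 3.2 can be reproduced in a few lines. The following Python sketch is a toy illustration under explicit assumptions, not the experiment of the figure: the brightness pattern `scene` is a hypothetical analytic function (a sum of Gaussian blobs) so that no interpolation is required, the motion model is a rotation about the origin followed by a translation along the X-axis, and the names `warp`, `warp_inverse` and `dissimilarity` are ours. The sketch evaluates the squared norm of Equation 3.3 on a grid of parameters and locates its minimum.

```python
import numpy as np

def scene(P):
    """Hypothetical smooth brightness pattern; P has shape (N, 2)."""
    centres = np.array([[-3.0, 1.0], [4.0, -2.0], [1.0, 5.0]])
    d2 = ((P[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / 18.0).sum(axis=1)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def warp(X, mu):
    """Motion model f(X; mu): rotate by mu[0], then shift by mu[1] along X."""
    theta, tx = mu
    return X @ rotation(theta).T + np.array([tx, 0.0])

def warp_inverse(Y, mu):
    """Inverse of warp: undo the shift, then rotate back."""
    theta, tx = mu
    return (Y - np.array([tx, 0.0])) @ rotation(theta)

def dissimilarity(X, template, image, mu):
    """D(X; mu) = T(X) - I_t(f(X; mu)), as in Equation 3.3."""
    return template - image(warp(X, mu))

# Target region: an 11 x 11 block of pixel coordinates centred at the origin.
grid = np.arange(-5.0, 6.0)
X = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
template = scene(X)

# Synthetic input image: the template content moved by the true parameters.
mu_true = (0.1, 0.5)
image = lambda Y: scene(warp_inverse(Y, mu_true))

# Squared dissimilarity over a grid of (rotation, translation) values.
thetas = np.linspace(-0.3, 0.3, 13)
shifts = np.linspace(-2.0, 2.0, 17)
surface = np.array([[np.sum(dissimilarity(X, template, image, (th, tx)) ** 2)
                     for tx in shifts] for th in thetas])

i, j = np.unravel_index(np.argmin(surface), surface.shape)
mu_best = (thetas[i], shifts[j])
print(mu_best, surface[i, j])
```

Exhaustive evaluation of the surface is only practical for very few parameters; the descent methods of this section avoid it by following the local shape of the dissimilarity.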

(47) Dissimilarity measure. The dissimilarity measure is a function of the image brightness error between two images. The usual measure for image registration is the Sum of Squared Differences (ssd), that is, the $L_2$-norm of the difference of pixel brightness (Equation 3.3) [Brooks and Arbel, 2010; Hager and Belhumeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. However, we can use other measures such as normalized cross-correlation [Brooks and Arbel, 2010; Lewis, 1995] or mutual information [Brooks and Arbel, 2010; Dowson and Bowden, 2008; Viola and Wells, 1997].

Linearize the dissimilarity. The next stage linearizes the brightness function about the current search parameters $\mu$; this linearization transforms the problem into a system of linear equations in the search variables. We typically approximate the function using a Taylor series expansion; depending on how many terms (derivatives) we compute, we obtain optimization methods such as Gradient Descent [Amberg and Vetter, 2009], Newton-Raphson [Lucas and Kanade, 1981; Shi and Tomasi, 1994], Gauss-Newton [Baker and Matthews, 2004; Brooks and Arbel, 2010; Hager and Belhumeur, 1998] or even higher-order methods [Benhimane and Malis, 2007; Keller and Averbuch, 2004, 2008; Megret et al., 2008]. The linearization is theoretically a good approximation when the dissimilarity is small [Irani and Anandan, 1999], although the estimation can be improved by using coarse-to-fine iterative methods [Irani and Anandan, 1999] or by selecting appropriate pixels [Benhimane et al., 2007]. Although the Taylor series expansion is the usual approach to compute the coefficients of the system, other approaches such as linear regression [Cootes et al., 2001; Jurie and Dhome, 2002a] or numeric differentiation [Gleicher, 1997] may be used.

Compute the descent direction. The descent direction is a vector $\delta\mu$ in the search space such that $\mathcal{D}(\mu + \delta\mu) < \mathcal{D}(\mu)$.
In a GN-based algorithm, we solve the linear system of equations from the previous stage using least squares [Baker and Matthews, 2004; Madsen et al., 2004]. Note that we do not perform a line search; i.e., we implicitly assume that the step size is $\alpha = 1$ (cf. Appendix A).

Update the search parameters. Once we have determined the search direction $\delta\mu$, we compute the next point in the series using the update function $U : \mathbb{R}^P \times \mathbb{R}^P \to \mathbb{R}^P$: $\mu_1 = U(\mu_0, \delta\mu)$. We compute the dissimilarity value at $\mu_1$ to check convergence: if the dissimilarity is below a given threshold, then $\mu_1$ is the minimizer, i.e., $\mu^* = \mu_1$; otherwise, we repeat the whole process (i.e. $\mu_1$ becomes the current parameters $\mu$) until we find a suitable minimizer.

3.2.4. Additive vs. Compositional. We turn our attention to step 4 of Algorithm 1: how to compute the new estimate of the optimization parameters. In a GN optimization scheme, the new

(48) Algorithm 1 Outline of the basic GN-based descent method for image registration
On-line: Let $\mu_i = \mu_0$ be the initial guess.
1: while no convergence do
2:   Compute the dissimilarity $\mathcal{D}(\mu_i)$.
3:   Compute the search direction: linearize the dissimilarity and compute the descent direction, $\delta\mu_i$.
4:   Update the optimization parameters: $\mu_{i+1} = U(\mu_i, \delta\mu_i)$.
5: end while

Figure 3.3: Generic descent method for image registration. We initialize the current parameter estimate at frame $I_{t+1}$ ($\mu = \mu_0$) using the local minimizer at the previous frame $I_t$ ($\mu_0 = \mu^*_t$). We compute the Dissimilarity Measure between the Image and the Template using $\mu$ (Equation 3.3). We linearize the dissimilarity measure to compute the descent direction of the search parameters ($\delta\mu$). We update the search parameters using the search direction and obtain an approximation to the minimum ($\mu_1$). We check whether $\mu_1$ is a local minimizer by using the brightness dissimilarity: if $\mathcal{D}$ is small enough, then $\mu_1$ is the local minimizer ($\mu^* = \mu_1$); otherwise, we repeat the process using $\mu_1$ as the current parameter estimate ($\mu = \mu_1$).

(49) parameters are typically computed by adding the search direction vector to the former optimization parameters: $\mu_{t+1} = \mu_t + \delta\mu_t$ (cf. Appendix A); this summation is a direct consequence of the definition of the Taylor series [Madsen et al., 2004]. We call additive approaches those methods that update the parameters by addition [Hager and Belhumeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. Nonetheless, Baker and Matthews [Baker and Matthews, 2004] subsequently proposed a GN-based method that updates the parameters using composition, i.e., $\mu_{t+1} = \mu_t \circ \delta\mu_t$. We call these methods compositional approaches [Baker and Matthews, 2004; Cobzas et al., 2009; Muñoz et al., 2005; Romdhani and Vetter, 2003; Xu and Roy-Chowdhury, 2008].

3.3. Additive approaches. In this section we review some works that use the additive update. We introduce the Lucas-Kanade algorithm, the fundamental work on direct image registration. We show the basic algorithm as well as the common problems of the method. We also introduce the Hager-Belhumeur approach to image registration and point out its highlights.

3.3.1. Lucas-Kanade Algorithm. The Lucas-Kanade (LK) algorithm [Lucas and Kanade, 1981] solves the registration problem using a GN optimization scheme. The algorithm defines the residuals $\mathbf{r}$ of Equation 3.3 as

$$\mathbf{r}(\mu) \equiv T(\mathbf{x}) - I(f(\mathbf{x}; \mu)). \tag{3.4}$$

The corresponding linear model for these residuals is

$$\mathbf{r}(\mu + \delta\mu) \simeq \ell(\delta\mu) \equiv \mathbf{r}(\mu) + \mathbf{r}'(\mu)\,\delta\mu = \mathbf{r}(\mu) + \mathbf{J}(\mu)\,\delta\mu, \tag{3.5}$$

where

$$\mathbf{r}(\mu) \equiv T(\mathbf{x}) - I(f(\mathbf{x}; \mu)), \quad \text{and} \quad \mathbf{J}(\mu) \equiv \left. \frac{\partial \mathbf{r}(\hat{\mu})}{\partial \hat{\mu}} \right|_{\hat{\mu} = \mu} = - \left. \frac{\partial I(f(\mathbf{x}; \hat{\mu}))}{\partial \hat{\mu}} \right|_{\hat{\mu} = \mu}. \tag{3.6}$$

Hence, our optimization process now amounts to minimizing

$$\delta\mu^* = \arg\min_{\delta\mu} \left\{ \ell(\delta\mu)^\top \ell(\delta\mu) \right\} = \arg\min_{\delta\mu} \left\{ L(\delta\mu) \right\}. \tag{3.7}$$

We compute the local minimizer of $L(\delta\mu)$ as follows:

$$0 = L'(\delta\mu) = \nabla_{\delta\mu} \left( \mathbf{r}(\mu)^\top \mathbf{r}(\mu) + 2\,\delta\mu^\top \mathbf{J}(\mu)^\top \mathbf{r}(\mu) + \delta\mu^\top \mathbf{J}(\mu)^\top \mathbf{J}(\mu)\,\delta\mu \right) = 2 \left( \mathbf{J}(\mu)^\top \mathbf{r}(\mu) + \mathbf{J}(\mu)^\top \mathbf{J}(\mu)\,\delta\mu \right). \tag{3.8}$$

We thus obtain an approximation to the local minimum at

$$\delta\mu = - \left( \mathbf{J}(\mu)^\top \mathbf{J}(\mu) \right)^{-1} \mathbf{J}(\mu)^\top \mathbf{r}(\mu), \tag{3.9}$$

which we iteratively refine until we find a suitable solution.
We summarize the optimization process in Algorithm 2 and Figure 3.4.
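The Gauss-Newton iteration of Equations 3.4 to 3.9 with the additive update can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Algorithm 2: the motion model is a pure 2D translation, the image is a hypothetical analytic pattern (`scene`) so that no interpolation is needed, and the Jacobian of the residuals is approximated by central finite differences instead of image gradients. All function names are ours.

```python
import numpy as np

def scene(P):
    """Hypothetical smooth brightness pattern; P has shape (N, 2)."""
    centres = np.array([[-3.0, 1.0], [4.0, -2.0], [1.0, 5.0]])
    d2 = ((P[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / 18.0).sum(axis=1)

def residual(X, template, image, mu):
    """r(mu) = T(x) - I(f(x; mu)) with f(x; mu) = x + mu (Equation 3.4)."""
    return template - image(X + mu)

def jacobian(res, mu, h=1e-5):
    """Central-difference approximation of J(mu) = dr/dmu (assumption)."""
    cols = []
    for k in range(mu.size):
        e = np.zeros_like(mu)
        e[k] = h
        cols.append((res(mu + e) - res(mu - e)) / (2.0 * h))
    return np.stack(cols, axis=1)

def gauss_newton_additive(res, mu0, max_iter=50, tol=1e-10):
    mu = np.asarray(mu0, dtype=float)
    for _ in range(max_iter):
        r = res(mu)
        J = jacobian(res, mu)
        # Equation 3.9: delta = -(J^T J)^{-1} J^T r
        delta = -np.linalg.solve(J.T @ J, J.T @ r)
        mu = mu + delta  # additive update, step size alpha = 1
        if np.linalg.norm(delta) < tol:
            break
    return mu

# Target region and template.
grid = np.arange(-5.0, 6.0)
X = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
template = scene(X)

# Synthetic input image: the template content shifted by d_true.
d_true = np.array([0.8, -0.5])
image = lambda Y: scene(Y - d_true)

mu_hat = gauss_newton_additive(lambda mu: residual(X, template, image, mu),
                               [0.0, 0.0])
print(mu_hat)
```

Solving the normal equations mirrors Equation 3.9 literally; a least-squares solver applied to $\mathbf{J}\,\delta\mu = -\mathbf{r}$ would give the same step and is usually better conditioned.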