Face Image Synthesis and Interpretation Using
3D Illumination-Based Active Appearance
Models
By
Salvador Eugenio Ayala-Raggi M.Eng., UNAM
Advisor:
Dr. Leopoldo Altamirano-Robles Computer Science Department, INAOE
A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF COMPUTER SCIENCE
AT
NATIONAL INSTITUTE OF ASTROPHYSICS, OPTICS, AND ELECTRONICS
TONANTZINTLA, PUEBLA, MEXICO FEBRUARY 2010
©INAOE 2010
All rights reserved.
The author hereby grants to INAOE the permission to reproduce and distribute copies of this thesis document.
Abstract
This work presents an innovative and fast approach for face interpretation that is invariant to lighting and pose. The presented approach, called 3D-IAAM (3D Illumination-Based Active Appearance Model), performs interpretation by fitting a parametric 3D face model to an input image using an optimization algorithm. The parameters obtained after the fitting process describe the appearance of the face. The fitting process is automatic and only requires a 2D position and a scale factor as initialization. The proposed model is a natural 3D extension of active appearance models and is based on modeling, separately and simultaneously, 3D pose, 3D shape, albedo, and lighting. 3D-IAAM is capable of synthesizing faces with arbitrary 3D shape, 3D pose, albedo and lighting. In order to fit the model to an input image, a fast optimization algorithm able to fit face images with non-uniform lighting and arbitrary pose is proposed in this thesis. The proposed fitting algorithm, based on a gradient descent approach, executes a fast update of the Jacobian by using the lighting parameters estimated in each iteration of the fitting process. The optimization method is able to accurately estimate the parameters of 3D shape and albedo, which are strongly related to identity. Experimental results suggest that our model can be extended to face recognition under non-uniform lighting and variable pose.
The main contribution of this thesis is the novel method for face interpretation, 3D-IAAM, based on analysis by synthesis. The particular contributions derived from this work are:
1. A method for constructing 3D face models from surface meshes estimated by photometric stereo.
2. A deformable model capable of synthesizing face images with arbitrary pose, shape, albedo and lighting. Our face synthesis algorithm can create face images with multiple identities, 3D poses and lightings by varying the values of a compact set of parameters.
3. A novel way to normalize the albedo in terms of illumination parameters. The albedo normalization is applied to an image normalized in pose and shape which has been sampled from the original test image. This normalized face image is used during the fitting process in order to be compared with a reference mean face image whose lighting evolves according to the illumination parameters estimated in each iteration.
4. A novel iterative optimization algorithm based on a gradient descent scheme, where the gradient, in this case a Jacobian, is quickly recalculated as a function of the illumination parameters estimated in the last iteration.
Acknowledgements
It is a pleasure to thank those who made this thesis possible. Particularly, I would like to show my gratitude to:
My wife Patricia for her invaluable love, support and patience.
My daughter Fanny for being my inspiration.
My parents, because I owe my academic education to them.
My sister Patricia for her unconditional support.
My advisor Dr. Leopoldo Altamirano Robles, whose encouragement, guidance and support, from the initial to the final level, enabled me to develop and conclude this work.
The members of my PhD committee: Dr. Jesús Ariel Carrasco Ochoa, Dr. José Francisco Martínez Trinidad, Dr. Gustavo Rodríguez Gómez and Dr. Luis Enrique Sucar Succar from the National Institute of Astrophysics, Optics, and Electronics for the effort in reviewing this thesis document and their valuable recommendations to improve this work.
My external reviewer Dr. Timothy Francis Cootes from The University of Manchester, in Manchester UK, for his detailed and analytical revision and multiple recommendations.
The National Institute of Astrophysics, Optics and Electronics for the support in the completion of this work.
The National Council of Science and Technology (CONACYT) for the scholarship No. 67398 granted to this research.
My friend Dr. Janeth Cruz for her guidance and support throughout this research project.
Lastly, my friends and colleagues for their friendship, valuable comments, and suggestions for this work.
Contents
Abstract
Contents
List of Figures
List of Tables
Notations
1. Introduction
1.1. The problem
1.1.1. Appearance-Based Models: The problem of mixed appearance attributes
1.1.2. Face Deformable Models: The way to a full interpretation of faces
1.1.3. Ideal Characteristics for a Face Interpretation Algorithm
1.2. Objective
1.2.1. Main Objective
1.2.2. Particular Objectives
1.3. The Solution
1.4. Contributions
1.5. Document Organization
2. Active Appearance Models
2.1. Introduction
2.2. Statistical Appearance Models: Eigenfaces
2.3. Active Shape Models: ASMs
2.3.1. Appearance Model for Shape Alignment
2.4. Active Appearance Models
2.4.1. AAMs with Independent Parameters of Shape and Appearance
2.5. Alignment of an AAM
2.5.1. Minimizing Residuals
2.5.2. Iterative Alignment Algorithm
2.6. Conclusion
3. Techniques for Face Interpretation
3.1. Introduction
3.2. Approaches Based on 3DMM: Morphable Models
3.3. Approaches Based on 3DMM and Lighting Models
3.4. Approaches Based on Active Appearance Models
3.5. Approaches based on 3D Active Appearance Models
3.6. Approaches Which Consider Lighting
3.7. Conclusions
4. 3D Illumination-Based Active Appearance Models
4.1. Introduction
4.2. Modeling Lighting
4.2.1. Forcing the lighting Model to be Positive
4.3. Piecewise Affine Warping
4.4. Constructing a 3D Illumination-Based Active Appearance Model (3D-IAAM)
4.4.1. Construction of a Bootstrap set of Surfaces and Albedo Maps
4.4.2. Constructing the Models of Shape and Albedo
4.5. Synthesizing Faces with Novel Appearances
4.6. Face Alignment Using the 3D-IAAM Model
4.6.1. Introduction
4.6.2. Overview of the Iterative Fitting Process
4.6.3. Pose and shape normalization
4.6.4. Albedo normalization
4.6.5. Modeling the Residuals
4.6.6. Iterative Fitting Algorithm
4.7. Time Complexity Analysis of the Fitting Algorithm
4.8. Summary
5. Experimental Results
5.1. Construction of a Bootstrap Set of Face Surfaces
5.1.1. Individuals Used
5.1.2. Constructing a Set of Face Surfaces
5.2. Some Details about the Construction of the Model
5.3. Face Alignment with a Constant and an Adaptive Jacobian
5.3.1. Construction of the set of test images
5.3.3. Testing the fitting algorithm for illumination number 1 (illuminating the left side of face)
5.3.4. Testing the fitting algorithm for illumination number 2 (illuminating the right side of face)
5.3.5. Testing with random faces
5.4. Convergence and Recovery of Shape and Albedo From Real Images
5.4.1. Introduction
5.4.2. Setup
5.4.3. Measuring the Convergence of the fitting algorithm
5.4.4. Recovery of 3D Shape and Albedo and a Measure of Quality Through Identification
5.4.5. Novel Views of Fitted Faces
5.4.6. Discussion About this Section
5.5. Face Alignment Using a Training Set of 20 Individuals (Set B)
5.5.1. Mean execution time for computations of the adaptive Jacobian
5.5.2. Discussion
5.6. Face Alignment of Faces not Included into the Training Set: Fitting Novel Faces
5.6.1. Discussion
5.7. Conclusion
6. Conclusions and Future Work
6.1. Future Work
A. Glossary of Terms
List of Figures
1.1. Analysis by Synthesis
1.2. Coding identity and pose within a single linear subspace
1.3. Coding identity and pose within linear subspaces
1.4. A few instances with different illuminations only for a single pose
2.1. Eigenfaces
2.2. ASM training process
2.3. Shape models alignment
2.4. Effect of varying the first three shape parameters for a hand model (Reproduced from Cootes et al. [11]).
2.5. ASM fitting algorithm
2.6. Fitting an ASM to a test image (Reproduced from [9]).
2.7. ASM profiles correlation
2.8. ASM landmark is moved to the best location
2.9. ASM landmarks are moved to the best location
2.10. AAM linear shape model
2.11. AAM linear appearance model
2.12. AAM instantiation
2.13. Fitting an AAM (Reproduced from Cootes et al. [11]).
3.1. 3DMM morphable model
4.1. Spherical coordinates. Radius is infinity.
4.2. 9PL light points distributions
4.3. Affine warp
4.4. Computing individual texture mappings for all triangles over the face
4.5. The same point on the surface but illuminated by three different incident light sources.
4.6. 11 different images of the same individual but under different illuminations can be used to estimate the face surface by using the photometric stereo method.
4.7. Three examples of surfaces obtained with our surface estimator. This method of surface estimation produces realistic reconstructions that are similar to the actual surfaces.
4.8. Shapes alignment
4.9. Shape-normalization of albedo images
4.10. Placing a new albedo over a new shape
4.11. The mean map of surface normals is deformed to fit within the new shape
4.12. Lighting synthesis
4.13. Orthographic projection.
4.14. Giving a new pose to the new synthetic face
4.15. Face synthesis.
4.16. Residual estimation during a step of the fitting process
4.17. Fitting algorithm
4.18. Two consecutive iterations of the fitting process
4.19. Evolution of the synthetic face produced by the model during the fitting process, from initialization to convergence.
5.1. Set A composed of 10 identities from the Yale B database.
5.2. Set B composed of 20 individuals selected from Yale B and extended Yale B Databases.
5.3. Surfaces 1 to 10 of the set A
5.4. Surfaces 11 to 20 of the set B
5.5. Face image showing the distribution of the 50 landmarks used in shape modeling
5.6. Reconstructions using an adaptive and a constant Jacobian
5.7. Measurements performed using a constant Jacobian and illumination number 1
5.8. Measurements performed using an adaptive Jacobian and illumination number 1
5.9. Reconstructions using an adaptive and a constant Jacobian
5.10. Measurements performed using a constant Jacobian and illumination number 2
5.11. Measurements performed using an adaptive Jacobian and illumination number 2
5.12. Identity number 1 (from Yale B database) illuminated by each one of the nine basis light sources. These images were synthesized by using our model.
5.13. Left: 5 aligned faces with different illuminations and poses using an adaptive Jacobian matrix. Right: estimated poses.
5.14. Face alignments over face images of identities 1 to 5 with the illuminations L1 and L6
5.15. Face alignments over face images of identities 6 to 10 with the illuminations L1 and L6
5.16. Alignments for identity 7 with each one of the 6 different illuminations
5.17. Evolution of RMS error in intensity difference
5.18. Evolution of the fitting for identity 7 and lighting 6 in 12 iterations of the algorithm
5.19. Average (over the 10 identities) of the identity likelihood measured between estimated and ideal parameters
5.20. Identification rates for each one of the six illuminations
5.21. Average (over the 6 illuminations) of the identity likelihood measured between estimated parameters and ideal parameters of each one of the 10 identities. The fitting process was performed over six images with different illumination of each identity (1 to 10)
5.22. Novel views of aligned faces by modifying pose and illumination parameters. The alignments were performed over identities 1 to 5
5.23. Novel views of aligned faces by modifying pose and illumination parameters. The alignments were performed over identities 6 to 10
5.24. Face alignments over face images of identities 1 to 5 with the illuminations L1 and L6
5.25. Face alignments over face images of identities 6 to 10 with the illuminations L1 and L6
5.26. Face alignments over face images of identities 11 to 15 with the illuminations L1 and L6
5.27. Face alignments over face images of identities 16 to 20 with the illuminations L1 and L6
5.28. Identification rates for each one of the six illuminations
5.29. Reconstructions of two individuals from set B under two different lightings
5.30. Mean execution times for different quantities of computations of the adaptive Jacobian. This experiment was performed on a 2.4 GHz Pentium IV computer with 1 GB of RAM
5.31. Alignments for the novel face number 18 from the extended Yale B database
5.32. Alignments for the novel face number 25 from the extended Yale B database
5.33. Alignments for the novel face number 35 from the extended Yale B database
5.34. Alignments for the novel face number 36 from the extended Yale B database
List of Tables
5.1. Mean values of measurements performed for the alignments of faces under illumination number 1
5.2. Mean values of measurements performed for the alignments of faces under illumination number 2
5.3. Illuminations used for experiments
Notations
Common Notations
a : Matrices and vectors are denoted by bold letters.
A^T : Transpose of the matrix A.
I(x, y) or I(x) : An image expressed as an intensity function of each pixel. The image intensity is a quantity specified in grey levels.
I : An image expressed as a column vector with each element being a grey level value.
I_λ : Albedo image expressed as a function of pixel x (x = (x, y)^T).
λ : Albedo image expressed as a column vector.
3D IAAM Notations
c : Column vector whose elements are shape parameters.
a : Column vector whose elements are albedo parameters.
L : Column vector whose elements are lighting parameters.
T : Column vector whose elements are translation parameters (t_x, t_y)^T.
R : Column vector whose elements are rotation angles (β, α, θ)^T.
s : Scalar value denoting a scale factor.
W(x; c) : Warping function. Returns a new pixel location x' = (x', y')^T for the original location x = (x, y)^T. In practice, individual warping functions are used for each triangle defined within the shape. For notational simplicity, we use a single warping function W which covers all the triangles defined over the face. The inclusion of c among the function parameters reflects the fact that the warping function depends on the shape parameters.
I_target(x) ← I_source(W(x; c)) : Represents a texture mapping from a source image I_source to a target image I_target. The mapping is determined by the warping function W. In practice, the pixels within the target image I_target are scanned one at a time (represented by x) and a texture value is picked up from the source image according to the pixel location
Chapter 1
Introduction
Automatic, fast and full interpretation of face images under variable conditions of lighting and pose is one of the most exciting unsolved problems in computer vision. Interpretation is the inference of knowledge from an image. This knowledge covers relevant information such as 3D shape and albedo, both related to identity, as well as information about physical factors that affect the appearance of faces, such as pose and lighting. Interpretation of faces should not be limited to retrieving the aforementioned pieces of information; it should also be capable of synthesizing novel facial images in which some of these pieces of information have been modified. This kind of interpretation can be achieved by using the paradigm known as analysis by synthesis (see Figure 1.1). Ideally, an approach based on analysis by synthesis should consist of a generative parametric face model that codes all the sources of appearance variation separately and independently, and an optimization algorithm that systematically varies the model parameters until the synthetic image produced by the model is as similar as possible to the test image, also called the input image. A full interpretation approach should include the recovery of 3D shape, 3D pose, albedo and lighting from a single face image which exhibits any possible combination of these sources of variation.
Active appearance models, or simply AAMs, are a fast alternative to other approaches for performing face interpretation under the analysis by synthesis paradigm.
Texture and shape are the attributes modeled by AAMs, using statistical tools such as principal component analysis (PCA). However, the apparent texture of a face is an implicit combination of lighting and albedo. Separating these two attributes is not an easy task in the context of sparse models like AAMs. AAMs use a sparse set of vertices to define the shape, and texture is interpolated over that shape. In fact, a detailed dense set of surface normals, which is not present in AAMs, is required to separate lighting from albedo.
On the other hand, texture and shape variation among human faces is relatively small when uniform lighting is considered. AAMs take advantage of this fact by assuming a constant relationship between changes of appearance and the variation of the model parameters producing those changes. This approximately constant relationship takes the form of a constant gradient, which is used to perform fast fitting to input images. However, for most purposes, lighting is not uniform, and a proper separation of albedo and lighting becomes necessary.
Like texture variation under uniform lighting, albedo variation among human faces is small. In contrast to albedo, lighting is not necessarily constrained to a small range of variation. In fact, lighting affects appearance more than identity and pose do, and it has many degrees of freedom.
During a fitting process, an initial model is gradually modified in each iteration until it resembles the input image. Therefore, if the lighting of the input image is too different from that of the initial model, the ratio of appearance variation to parameter variation cannot remain the same during all the iterations of the fitting process.
For instance, if we have a model with a pronounced left illumination and a model with uniform illumination, the change of appearance caused by an increase in one of the model parameters, for example the scale parameter, is not the same in both cases. This ratio of appearance variation with respect to the model parameters is in fact a Jacobian whose value changes in each iteration.
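The dependence of this Jacobian on lighting can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional toy "face" whose pixels are an albedo profile stretched by a scale parameter and multiplied by a lighting profile. The functions `albedo`, `render` and `jacobian_wrt_scale` and both lighting profiles are hypothetical and chosen only for illustration; they are not the model developed in this thesis.

```python
import math

def albedo(u):
    """Hypothetical 1D albedo profile: a smooth bump centred at u = 0."""
    return math.exp(-u * u)

def render(scale, light, xs):
    """Toy 1D 'face': albedo stretched by `scale`, multiplied by a lighting profile."""
    return [light(x) * albedo(x / scale) for x in xs]

def jacobian_wrt_scale(light, xs, scale=1.0, eps=1e-4):
    """Central finite-difference derivative of every pixel with respect to scale."""
    hi = render(scale + eps, light, xs)
    lo = render(scale - eps, light, xs)
    return [(h - l) / (2 * eps) for h, l in zip(hi, lo)]

xs = [i / 10 for i in range(-20, 21)]          # 41 pixel positions in [-2, 2]
uniform = lambda x: 1.0                        # uniform illumination (assumed)
left_lit = lambda x: max(0.0, 0.5 - 0.5 * x)   # bright on the left, dark on the right (assumed)

j_uniform = jacobian_wrt_scale(uniform, xs)
j_left = jacobian_wrt_scale(left_lit, xs)

# Largest per-pixel disagreement between the two Jacobian columns.
max_gap = max(abs(a - b) for a, b in zip(j_uniform, j_left))
print(f"max |J_uniform - J_left| = {max_gap:.3f}")
```

The same change of scale moves the pixel values by clearly different amounts under the two lightings, so a Jacobian column computed under one lighting is a poor approximation under the other.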
Therefore, if we want to fit an AAM to a face with any kind of lighting, a constant Jacobian is not the solution. On the other hand, recomputing the Jacobian in each iteration is computationally expensive [14], [30].
In this thesis, we present an innovative 3D extension of AAMs based on an illumination model. By using interpolation, we incorporate a dense set of surface normals into our sparse 3D AAM model. In this way, we can model lighting within the process of synthesizing faces, and also within the optimization process used for fitting the face model to an input image.
We propose a fitting method based on an inexpensive way of updating the Jacobian in accordance with the illumination parameters recalculated in each iteration. Our method is able to encode separately four of the most relevant sources of appearance variation: 3D shape, albedo, 3D pose and lighting.
Our approach estimates 3D shape, 3D pose, albedo, and illumination simultaneously during each iteration. Since our model uses analysis by synthesis, it has an inherent ability to adapt to the input image. Adaptation is a desirable characteristic because it offers the possibility of designing person-independent face interpretation systems. Experimental results show not only that the proposed approach can be extended to face recognition, but also that it is able to fit novel faces and perform interpretation.
This kind of interpretation algorithm has many applications, such as face interpretation or recognition that is robust to lighting and pose, face and head tracking in video sequences, human-machine interfaces, and algorithms aimed at obtaining high levels of data compression in video conferencing.
This thesis proposes a novel way to cope with an important source of appearance variation which significantly affects face images: lighting. We anticipate that our approach can be extended to face recognition under difficult lighting conditions and can be generalized to the analysis and recovery of other sources of appearance variation, such as age, gender and expression, where lighting seriously interferes with the analysis process.
Figure 1.1: Schematized flow of the analysis by synthesis approach
1.1. The problem
1.1.1. Appearance-Based Models: The problem of mixed appearance attributes
The earliest approaches to face image interpretation are based on extracting features in order to construct models and perform interpretation or recognition [50]. These features are extracted using edge detectors and several other image processing techniques. The characteristics of the extracted patterns are passed to classifiers in order to perform interpretation. The problem with this kind of method is the need to encode human knowledge about what constitutes a typical face in the form of rules or templates, because it is difficult to translate human knowledge into well-defined rules.
In contrast to this type of method, a more recent set of appearance-based techniques has been proposed. In appearance-based methods, visual patterns are automatically learned from exemplar images. There are no human experts defining patterns. Instead, statistical techniques are used to extract relevant features from a large set of exemplar images which belong to the same class, in this case, faces.
Images of faces can be represented as high-dimensional pixel arrays. Classifying an attribute of face images, such as pose, identity or expression, can be computationally expensive when we work in the original high-dimensional face space. Usually the face space represents a search space whose size is proportional to the desired resolution of the attribute that we want to detect. For instance, if we want to recognize identity, this search space grows with the number of identities to recognize. Fortunately, face images often belong to a manifold of intrinsically low dimension. The high correlation between faces, or between instances of the attribute to be classified (for instance, identity), makes it possible to reduce the dimensionality of the problem by applying statistical dimensionality reduction techniques such as PCA (Principal Component Analysis) [45] or LLE (Locally Linear Embedding) [39]. With a linear method such as PCA, the result is a lower-dimensional linear subspace where the task of classification is simpler. For instance, Turk and Pentland [41] used a single linear subspace to recognize identity in frontal face images with the same size and uniform expression and illumination.
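As a point of reference, the projection onto a linear subspace obtained with PCA is usually written as below. This is the standard textbook form, not an equation taken from this thesis, and the symbols \mathbf{x}, \bar{\mathbf{x}}, \mathbf{P}, \mathbf{b}, N and k are notation assumed here for illustration.

```latex
\mathbf{b} = \mathbf{P}^{T}\,(\mathbf{x} - \bar{\mathbf{x}}),
\qquad
\hat{\mathbf{x}} = \bar{\mathbf{x}} + \mathbf{P}\,\mathbf{b},
\qquad
\mathbf{P} \in \mathbb{R}^{N \times k},\; k \ll N
```

Here \mathbf{x} is an image arranged as a column vector of N grey levels, \bar{\mathbf{x}} is the mean of the training images, the columns of \mathbf{P} are the k leading eigenvectors of the training covariance matrix, \mathbf{b} holds the k subspace coordinates used for classification, and \hat{\mathbf{x}} is the reconstruction of \mathbf{x} from the subspace.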
However, encoding only a single attribute of face images produces inaccurate recognition when other sources of appearance variation, such as pose, illumination and expression, exist simultaneously. For example, Murase and Nayar [32] proposed to encode pose and identity in a single linear subspace. The number of exemplar images increases exponentially with each new source of appearance variation that we add. Within the multidimensional space, these exemplar face images do not lie on a linear manifold. Linear methods for dimensionality reduction create a linear approximation of the original multidimensional space with a lower dimensionality. Some phenomena affecting appearance, such as 3D pose and illumination, produce a non-linear manifold in the multidimensional face space. Therefore, a linear approximation is not always suitable for modeling a non-linear phenomenon. In fact, when several sources of appearance variation are encoded within a single linear subspace, recognition is inaccurate, and the best reconstruction of the test image using this low-dimensional linear subspace is blurry and also inaccurate.
Figure 1.2 a) illustrates reconstructions (right) of the input image (left) using the subspace S. Figure 1.2 b) shows a non-linear manifold in the original face space produced by multiple instances with four different identities and seven poses for each one. Figure 1.3 illustrates the set of exemplar images with different poses and identities. All face images are encoded within a single linear subspace S. The faces in each column of Figure 1.3, which have different identities but the same pose, could be used to encode identity for a specific pose within local linear subspaces S_i, where i represents a particular pose. Partitioning the problem by encoding each pose within local linear subspaces certainly produces more accurate results in recognition and reconstruction [24], but it also increases the size of the search space.
Figure 1.2: a) Reconstructions of the input images (left) with a single subspace (right). b) S represents a single subspace spanning the manifold, and S1, ..., S6 local subspaces for each pose (Reproduced from Shakhnarovich et al. [43]).
Thus, the early appearance-based approaches mentioned above face three important problems when a single linear subspace is used to encode several sources of non-linear appearance variation. The first is that they require too many exemplar images: the required number is the number of all possible combinations of the different sources of appearance variation. Therefore, the number of exemplar images grows exponentially with the number of sources of appearance variation to model. For instance, if illumination is included in addition to pose and identity, the number of possible instances could increase indefinitely because there is no limit to the number of different possible illuminations. Figure 1.4 shows only a few examples of how a frontal face can be illuminated. The second is that non-linear appearance variation, caused by phenomena such as pose, is not well modeled by a low-dimensional linear subspace. Finally, the third problem is the inaccuracy in the interpretation and the reconstruction of the image to be interpreted. A proper reconstruction of the original image using the subspace, and the possibility of synthesizing novel views of the recovered face, would be desirable functions. Unfortunately, these functions are difficult to implement with approaches based only on appearance.
1.1.2. Face Deformable Models: The way to a full interpretation of faces
Modeling the appearance variation of faces can be improved by considering that the global appearance of a face image is in fact a combination of individual sources of appearance variation. The appearance-based methods described above use many exemplar images in which the sources of appearance variation (pose, shape, texture, illumination) are mixed. There is no real separation of these sources.
The more recent approaches to face interpretation have addressed the problem of modeling the non-linear variation of appearance by combining linear subspaces with deformable structures. The sources of appearance variation are separated and encoded within independent linear subspaces. For example, the textures of exemplar images are reshaped and aligned to the same size in order to be encoded within a texture subspace. Similarly, the different shapes of faces are represented as a set of deformations of a flexible grid. This set of different shapes is encoded within a shape subspace.
Figure 1.3: Multiple instances with different poses and identities can be represented by a single linear subspace S. The subspace S can be used for a fast but not accurate reconstruction. On the other hand, a more accurate but slower reconstruction can be achieved by using a different linear subspace S_i for each pose (Reproduced from Moghaddam et al. [31]).
Appearance-based deformable models actively modify their structure in order to adapt, or fit, to face images. In this kind of approach, the sources of appearance variation can be encoded separately and independently in the model.
The literature in this field presents two relevant approaches based on the combination of appearance-based methods and deformable models: 3D morphable models (3DMMs) and active appearance models (AAMs), both aimed at performing face interpretation using the paradigm of analysis by synthesis. In 3DMMs, the face surface is represented with a dense set of points, and reflectance is modeled for each individual point by using its surface normal and the vector representing the direction of lighting. 3DMMs can handle only directed light when they are used to perform face interpretation in the analysis by synthesis modality, and they are computationally expensive because of the huge number of vertices needed for a dense surface representation.
Figure 1.4: A few instances with different illuminations only for a single pose
For fast face interpretation, sparse models like AAMs represent a suitable solution because they only handle the positions of a reduced set of landmarks. The original AAM approach fails to converge during the fitting process when the lighting used in the training set is different from the lighting present in the test image. In fact, lighting is not handled by the original AAM approach. AAMs handle texture and shape statistically according to a training set. However, in contrast to texture, there is no proper way to model lighting statistically. Lighting is not limited to a defined range and has many degrees of freedom.
The problems related to the inclusion of lighting within sparse models like AAMs begin with the face synthesis process. The 3D surface of faces must be densely sampled in order to allow an accurate computation of the surface normals. It is difficult for a sparse 3D shape model to accurately separate the shading from the texture, and it would then be difficult to re-illuminate a facial image with a different lighting configuration. Therefore, modeling lighting with a sparse model is problematic if we do not have the surface normals of all the points over the face surface.
Regarding optimization, AAMs require iterative algorithms based on gradient descent. In order to achieve fast convergence, a constant relationship is assumed between the variation of the residuals (obtained by subtracting the model from the input image) and the variation of the model parameters. This constant relationship is expressed by a mean Jacobian estimated from all possible textures and shapes belonging to the training set. The constant Jacobian works properly because texture and shape have a small and limited variation. However, the range of all possible illuminations is not limited, so we cannot include lighting in the computation of a mean constant Jacobian. Therefore, the inclusion of lighting within the synthesis process and, more importantly, within the optimization algorithm is a challenging problem which this thesis aims to solve.
With regard to these problems, we formulate the following research question:
Is it possible to create a 3D active appearance model able to synthesize face images with arbitrary shape, pose, albedo and lighting? Is it possible to develop a fast and automatic optimization algorithm capable of recovering 3D pose, 3D shape, albedo and lighting from a single face image by using this 3D active appearance model?
1.1.3. Ideal Characteristics for a Face Interpretation Algorithm
In order to establish criteria for comparing the different existing fitting algorithms, and according to [37], an ideal fitting algorithm should fulfill the following characteristics: accuracy, efficiency, robustness, automatic behavior and applicability.
Accuracy: It refers to how closely the face estimate approximates the test image. A common way of measuring accuracy is, for instance, the norm of the residual between both images (root mean square error, or simply RMS error). This measure is easy to implement, but its value does not always indicate the kind of similarity we are interested in. For instance, a pair of dark face images of two different individuals will produce a low RMS value although they can be completely different.
Another measure, meaningful for a human user, is the visual comparison of the input photograph and the synthetic image rendered by using the fitting parameters. If the model is 3D, then it is more informative to render novel views of the same fitted face by varying some of the pose model parameters and to compare them with the same real photographed face. It is important to know how the fitting algorithm estimates the third dimension. Finally, a more suitable and indirect way, which offers an optimal tradeoff, is to measure the quality of the fitting by using a quantitative application such as face identification (as we do with our proposed fitting method in Chapter 5).
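The weakness of the RMS measure for dark images can be illustrated with a minimal sketch. The function name `rms_error`, the synthetic random "faces" and the crude modeling of dim lighting as a global intensity scaling are assumptions made only for this illustration; they are not part of the proposed method.

```python
import numpy as np

def rms_error(image_a, image_b):
    """Root mean square of the pixel-wise residual between two images."""
    residual = np.asarray(image_a, dtype=float) - np.asarray(image_b, dtype=float)
    return float(np.sqrt(np.mean(residual ** 2)))

rng = np.random.default_rng(0)

# Two unrelated synthetic "faces" with intensities in [0, 1] (hypothetical data).
face_a = rng.random((32, 32))
face_b = rng.random((32, 32))

# The same two images under very dim lighting, crudely modeled as a 5% scaling.
dark_a = 0.05 * face_a
dark_b = 0.05 * face_b

bright_rms = rms_error(face_a, face_b)
dark_rms = rms_error(dark_a, dark_b)

# The dark pair yields a much smaller RMS value although the two "identities"
# are exactly as different as in the bright pair.
print(bright_rms, dark_rms)
```

Because the residual scales with the overall brightness, a low RMS value on its own says little about whether two faces are actually similar.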
Efficiency: The computational effort depends on the complexity of the algorithm.
Robustness: A good fitting algorithm should be invariant to non-Gaussian residuals. Usually the norm of the residual is utilized by the optimization procedure to minimize its value in each iteration. However, artifacts in the test image that are not contemplated by the model, such as glasses, hair or specular highlights, can affect the convergence process.
Automatic behavior: The fitting should require as little human intervention as possible.
Applicability: The type of facial images that are suitable to be fitted by the algorithm. There are fitting algorithms that only fit images with a specific pose, for example frontal pose. Similarly, there are fitting algorithms able to model only directed light but not diffuse lighting. Many algorithms are designed to estimate only 2D shape but not 3D structure, etc.
Current methods for face interpretation based on analysis by synthesis do not fulfill all these features at the same time. For instance, the original Active Appearance Models fail in applicability, because they model only the 2D appearance of the third dimension. Recent extensions of AAMs include lighting models, but only for recovering 2D appearance and not 3D shape. On the other hand, 3DMM models fail in efficiency because they have a high computational load. They also fail in automatic behavior, because they require a manual initialization in pose before starting the fitting process. Finally, they do not have a full range of applicability because they are unable to model diffuse lighting.
This thesis proposes a novel way to cope with an important source of appearance variation which significantly affects face images: illumination. We know that illumination affects the appearance of a face to a greater degree than other factors such as pose, shape, expression and identity [5], [33].
1.2. Objective
1.2.1. Main Objective
The work presented in this thesis aims to develop a fast, accurate, automatic and robust algorithm capable of simultaneously estimating the 3D shape, 3D pose, albedo and lighting of a face by fitting a parametric deformable model which works using the approach known as analysis by synthesis. Because active appearance models present an inherently fast and efficient approach for fitting faces, we have based our work on this kind of model, and we have developed a natural 3D extension of 2D AAM models which is able to fit face images with non-uniform lighting conditions and non-frontal poses.
1.2.2. Particular Objectives
The specific goals of this work are:
To find a positive illumination linear subspace able to represent every type of illumination. This subspace should produce a positive image when we take linear combinations of the basis images using positive weights
To develop a parametric 3D shape model of a face based on a sparse set of landmarks. To create a flexible map of surface normals and a flexible map of albedos capable of adapting to the shape model. Surface normals and albedos are necessary for the synthesis of basis reflectance images
To develop a gradient descent optimization algorithm based on improving the gradient estimate by using the illumination parameters estimated in each iteration
To develop a quantitative technique for measuring the performance of the fitting results
Our approach for face interpretation should be capable of using the recovered model parameters for synthesizing novel and unseen views of the same face, but with different conditions of lighting and pose
1.3. The Solution
This work presents an innovative extension of the original active appearance models. Our method is able to encode separately four of the most relevant sources of appearance variation: 3D shape, albedo, 3D pose and lighting. The model presented here is simple and fast because it uses an efficient fitting algorithm based on solving a similar problem in each iteration. This is achieved by normalizing a region sampled from the test image, not only in pose and shape, but also in albedo. This albedo normalization is computed using the illumination parameters estimated in the last iteration. These parameters are also used to relight the reference model and to update the Jacobian. In each iteration, our fitting algorithm computes a residual image from the difference between the sampled normalized region and the reference model. Using the residual and the updated Jacobian, it is possible to calculate suitable increments for the model parameters. Our approach matches 3D shape, 3D pose, albedo, and illumination simultaneously during each iteration. Our model works in an analysis-by-synthesis fashion. This way of performing interpretation has an inherent ability to adapt to face images. Adaptation is a favorable characteristic of our method, because it brings the possibility of designing a generic person-independent face interpretation system by selecting a suitable set of training faces. The experimental results show an improvement in pose estimation when an adaptive Jacobian is used instead of a fixed one. In fact, pose cannot be estimated appropriately using a constant Jacobian because the algorithm does not reach convergence in some cases. On the other hand, in order to measure the ability of our algorithm to align face images and the quality of the recovered parameters, we have designed a set of experiments which consist of comparing the recovered parameters with the stored set of parameters belonging to each one of the identities used for training the model.
We have presented these results by means of identification rates which demonstrate the ability of the algorithm for recovering
parameters under each one of the lightings used. We conclude that our proposed approach, essentially designed for performing 3D face alignment and model parameter recovery, could be extended, in future work, to face recognition with a certain degree of invariance to pose and lighting.
1.4. Contributions
The main contribution of this work is a novel method for face interpretation called 3D-IAAM (3D Illumination-based Active Appearance Models), based on analysis by synthesis. As part of our approach for pose and lighting invariant face interpretation, we have to highlight four important contributions:
1. A method for constructing 3D face models from surface meshes estimated by photometric stereo.
2. A deformable model capable of synthesizing face images with arbitrary pose, shape, albedo and lighting. Our face synthesis algorithm can arbitrarily create face images with multiple identities, 3D pose and lighting by varying the value of a compact set of parameters.
3. A novel way to normalize the albedo in terms of illumination parameters. The albedo normalization is applied over an image normalized in pose and shape which has been sampled from the original test image. This normalized face image is used during the fitting process in order to be compared with a reference mean face image which evolves in lighting according to the illumination parameters estimated in each iteration.
4. A novel iterative optimization algorithm based on a gradient descent scheme, where the gradient, in this case a Jacobian, is quickly recalculated as a function of the illumination parameters estimated in the last iteration.
1.5. Document Organization
This document is organized as follows: Chapter 2 talks about traditional active appearance models. Chapter 3 is a compendium of techniques for face interpretation related to the models proposed in this thesis. Chapter 4 describes the lighting fundamentals on which we base our work, and our proposed method for face synthesis and interpretation. Chapter 5 shows the set of experiments carried out for validating our method, and finally,
Chapter 2
Active Appearance Models
2.1. Introduction
In this chapter, the Active Appearance Models, or AAMs, originally proposed in [14], [12], [13] for synthesis and fitting, are reviewed. AAMs are based on an earlier technique known as active shape models, or ASMs, which are designed for face alignment by adjusting the contour of the initial face model to a test image using the statistics of the pixel intensities around the edges of the main features of faces. Active Appearance Models are generative parametric models capable of synthesizing face images. By estimating a compact set of basic modes of variation from a large training set, it is possible to adjust the model parameters in order to fit an image synthesized by the model to a novel test image, and hence perform image interpretation. Usually, 2D shape and pixel intensities are the properties modeled by the original AAM approach. A large set of different face images is used for training the model. These images are marked manually with landmarks located over the most relevant features common to most faces. Then, principal component analysis, or PCA, is applied separately to the set of different shapes and to the set of different textures in order to find a linear subspace with lower dimensionality than the original spaces of shape and texture. Eigenfaces is the main tool used by AAMs to model texture. Originally, AAMs were 2D and modeled 2D shape and the texture contained inside the shape. Our work goes a step further by modeling the appearance of a face as affected by two factors: identity and physical medium. Identity is composed of two properties: 3D shape and albedo. On the other hand, physical medium is composed of two other properties: lighting and spatial pose (translation, rotation and scaling).
2.2. Statistical Appearance Models: Eigenfaces
Eigenfaces is an important statistical method for face recognition proposed by Kirby and Sirovich [26] and Turk and Pentland [41], who applied PCA to large sets of face images. This approach uses a large set of face images with frontal pose and uniform lighting as the training set. A facial image can be encoded with a low number of coefficients by projecting it onto the orthogonal subspace obtained by PCA. A linear combination of the eigenfaces which form the orthogonal subspace can synthesize any of the training faces, see Figure 2.1. The obtained coefficients can be used for identification. The problem with this approach is that face images are not aligned in shape or physical features; for example, an eye belonging to one individual might not coincide in location with the corresponding eye of other individuals. This lack of correspondence between features produces a blurred mean face image and blurry synthesized faces when the number of eigenfaces used is reduced. Besides, the different sources of variation of faces, such as pose, illumination and expression, are not separated. They are all modeled by one set of parameters that also encodes identity. Therefore, this method suffers from limited generalization capabilities across poses and light directions, and it provides limited recognition performance.
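The encoding and synthesis steps described above can be sketched in a few lines. This is only a minimal illustration on random synthetic data; the function names (`train_eigenfaces`, `encode`, `decode`), the use of an SVD to obtain the principal directions and the image size are assumptions of this sketch, not details taken from [26] or [41].

```python
import numpy as np

def train_eigenfaces(images, num_components):
    """images: (num_images, num_pixels) array, one flattened face per row."""
    mean_face = images.mean(axis=0)
    centered = images - mean_face
    # SVD of the centered data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:num_components].T      # orthonormal columns, (num_pixels, k)
    return mean_face, eigenfaces

def encode(face, mean_face, eigenfaces):
    """Project a face onto the eigenface subspace (a few coefficients)."""
    return eigenfaces.T @ (face - mean_face)

def decode(coefficients, mean_face, eigenfaces):
    """Synthesize a face as the mean plus a linear combination of eigenfaces."""
    return mean_face + eigenfaces @ coefficients

rng = np.random.default_rng(0)
training = rng.random((10, 64))             # 10 synthetic "faces" of 8x8 pixels

# Ten training images span at most nine directions around their mean, so nine
# eigenfaces are enough to reproduce every training image.
mean_face, eigenfaces = train_eigenfaces(training, num_components=9)
coefficients = encode(training[0], mean_face, eigenfaces)
reconstruction = decode(coefficients, mean_face, eigenfaces)
```

With fewer components the reconstruction becomes an approximation, which is where the blurring mentioned above appears for unaligned faces.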
Figure 2.1: From a large set of training face images, a small set of eigenface images, obtained by PCA, can be arranged into a compact matrix $Q_e$. A linear combination of these basis images plus the mean face $\bar{X}$ can reconstruct each one of the original training images (Part of this figure is reproduced from Turk et al. [41]).
2.3. Active Shape Models: ASMs
Active shape models, or ASMs, are the predecessors of the AAMs. In ASMs, the 2D shape of faces is modeled statistically. An ASM iteratively deforms its shape to fit to
requires labeled training images to train the model. Landmarks represent correspondences and are placed manually over preestablished common features of the objects belonging to the same class.
In the ASM model, shape variations are synthesized by the shape model, which is obtained by applying PCA to the training set. Figure 2.2 illustrates the training process. ASMs can be aligned to human face images.
Figure 2.2: The training process of an ASM. Left: A set of training images is labeled with landmarks. Middle: Different shapes for different individuals are represented by the landmarks. Right: As a previous step to making a statistical study of shape variation, shapes have to be aligned (Reproduced from [9]).
Training the ASM: If we have a population of $m$ shape instances, and each instance contains $n$ reference points, we can define a shape vector for each shape $i$ as $x_i = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T$.
We have to align each one of the instances $x_i$ to an initial instance $x_0$, in such a way that the distance $D = \| x_i - x_0 \|$ is minimized.
This task aligns all the instances to a common reference shape, eliminating translation, rotation and scaling. Figure 2.3 shows this process. The mean shape $\bar{x}$ and the covariance matrix $S$ of the shape vector are obtained by using Equations 2.3.1 and 2.3.2, respectively.
$$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (2.3.1)$$
$$S = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (2.3.2)$$
Figure 2.3: Aligning the shape models. We are interested in modeling shape variations caused by differences across individuals. So, before performing the study of shape variation, we have to eliminate the differences caused by rotation, translation and scaling. Left: After being labeled with landmarks, shapes (or shape models) are usually unaligned. Right: We minimize the distance between corresponding landmarks by translating, rotating, and scaling the shapes (Part of this figure is reproduced from [9]).
The first $t$ eigenvectors $\phi_i$, $i = 1, \ldots, t$, and their corresponding eigenvalues $\lambda_i$, $i = 1, \ldots, t$, are estimated from the covariance matrix $S$. By using the matrix $\Phi = [\phi_1, \phi_2, \ldots, \phi_t]$, it is possible to approximate every shape $x$ using the following expression:
$$x \approx \bar{x} + \Phi b \qquad (2.3.3)$$
where $b$ is a $t$-dimensional vector representing the parameter space of the deformable model. The number of eigenvalues $t$ is selected to explain the desired percentage of shape variation. The value of the elements of $b$ is constrained to a specific valid range. For instance, we can restrict the elements of $b$ to $|b_i| \le 3\sqrt{\lambda_i}$, $1 \le i \le t$. In this way, the deviation stays within $\pm 3$ standard deviations along each variation mode, as shown in Figure 2.4.
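The training of the shape model (mean shape, covariance matrix, selection of the first $t$ modes and the $\pm 3\sqrt{\lambda_i}$ restriction) can be sketched as follows. This is a minimal sketch under stated assumptions: the shapes are taken to be already aligned, the covariance uses the usual unbiased sample estimate with the $1/(m-1)$ factor, and the function names, the `variance_to_keep` parameter and the synthetic data are hypothetical.

```python
import numpy as np

def train_shape_model(shapes, variance_to_keep=0.95):
    """shapes: (m, 2n) array of aligned shape vectors (x1, y1, ..., xn, yn)."""
    m = shapes.shape[0]
    mean_shape = shapes.mean(axis=0)                    # mean shape (Equation 2.3.1)
    deviations = shapes - mean_shape
    covariance = deviations.T @ deviations / (m - 1)    # sample covariance matrix S
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]               # largest eigenvalue first
    eigenvalues = np.clip(eigenvalues[order], 0.0, None)
    eigenvectors = eigenvectors[:, order]
    # Keep the smallest t that explains the requested fraction of the variance.
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    t = int(np.searchsorted(explained, variance_to_keep) + 1)
    return mean_shape, eigenvectors[:, :t], eigenvalues[:t]

def synthesize_shape(mean_shape, phi, eigenvalues, b):
    """x = mean + Phi b (Equation 2.3.3), with |b_i| <= 3 sqrt(lambda_i)."""
    limit = 3.0 * np.sqrt(eigenvalues)
    return mean_shape + phi @ np.clip(b, -limit, limit)

# Synthetic training set: 50 shapes of 4 landmarks driven by 2 hidden modes.
rng = np.random.default_rng(0)
mean_true = rng.random(8)
modes_true = rng.normal(size=(8, 2))
shapes = mean_true + rng.normal(size=(50, 2)) @ modes_true.T

mean_shape, phi, eigenvalues = train_shape_model(shapes, variance_to_keep=0.99)
# A deliberately huge parameter vector is clipped to +/- 3 standard deviations.
extreme = synthesize_shape(mean_shape, phi, eigenvalues,
                           np.full(eigenvalues.size, 1e6))
```

Clipping each parameter independently is the simplest way to enforce the restriction; it keeps every synthesized shape inside the box spanned by the training variation.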
Figure 2.4: Effect of varying the first three shape parameters for a hand model (Reproduced from Cootes et al. [11]).
The ASM approach includes a fitting algorithm which fits the model to a new shape according to the test image. To match a shape model to a new image, we require a statistical shape model and a model of the image structure at each point. See Figure 2.5.
Steps 4 and 5 actually correspond to an optimization subroutine which modifies the rigid body and shape parameters to reach the best fit with shape Y. The aim is to minimize
1. Initialize b to zero.
2. Generate the model instance $X = \bar{X} + \Phi b$.
3. Obtain candidate points by displacing each point from X to a better position according to the appearance model. Call this new shape model Y.
4. Optimize the rigid body parameters $\Theta_{rigid} = (X_c, Y_c, s, \theta)$ in order to reach the best fit between the shape model X and the shape model Y. X' will be the transformed model (translated, rotated, scaled).
5. Find the shape parameters b which modify X' to achieve the best fit to the model Y, considering the restrictions imposed on b.
6. If there is no convergence, return to step 2.
Figure 2.5: ASM fitting algorithm
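The two-stage part of the algorithm in Figure 2.5 (alternating between the rigid body parameters and the shape parameters) can be sketched as follows. This is a hedged illustration, not the implementation used in the ASM literature: step 3 is skipped and the candidate shape Y is simply given; the similarity transform is represented by a complex scale-rotation factor `a` and a complex translation `t` instead of $(X_c, Y_c, s, \theta)$; and all function names and the toy shape are assumptions.

```python
import numpy as np

def to_complex(shape):
    """(x1, y1, ..., xn, yn) -> complex landmarks x + iy."""
    return shape[0::2] + 1j * shape[1::2]

def to_vector(points):
    """Complex landmarks -> (x1, y1, ..., xn, yn)."""
    out = np.empty(2 * points.size)
    out[0::2], out[1::2] = points.real, points.imag
    return out

def similarity_to(source, target):
    """Least-squares a (scale and rotation) and t with a * source + t ~ target."""
    sc, tc = source.mean(), target.mean()
    s0, t0 = source - sc, target - tc
    a = np.vdot(s0, t0) / np.vdot(s0, s0)    # vdot conjugates its first argument
    return a, tc - a * sc

def fit_shape(mean_shape, phi, eigenvalues, target, iterations=20):
    b = np.zeros(phi.shape[1])                           # step 1
    y = to_complex(target)
    limit = 3.0 * np.sqrt(eigenvalues)
    for _ in range(iterations):
        x = to_complex(mean_shape + phi @ b)             # step 2
        a, t = similarity_to(x, y)                       # step 4: rigid parameters
        y_model = to_vector((y - t) / a)                 # Y seen in the model frame
        b = np.clip(phi.T @ (y_model - mean_shape), -limit, limit)   # step 5
    a, t = similarity_to(to_complex(mean_shape + phi @ b), y)
    return b, a, t

# Toy model: a square mean shape and one deformation mode chosen to be
# orthogonal to translation, rotation and scaling of that square.
mean_shape = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0])
phi = np.array([[1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]]).T / 2.0
eigenvalues = np.array([1.0])

b_true, a_true, t_true = 0.5, 2.0 * np.exp(0.3j), 3.0 + 4.0j
target = to_vector(a_true * to_complex(mean_shape + phi @ np.array([b_true])) + t_true)
b_fit, a_fit, t_fit = fit_shape(mean_shape, phi, eigenvalues, target)
```

A fixed number of iterations replaces the convergence test of step 6 to keep the sketch short; in this toy case the parameters are recovered after the first pass.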
this expression can be minimized either by putting it into a general optimizer or by using a two-stage iterative approach. Figure 2.6 shows the fitting process for an ASM.
Figure 2.6: Fitting an ASM to a test image (Reproduced from [9]).
2.3.1. Appearance Model for Shape Alignment
The main goal of the ASM approach is to locate the most relevant features of a face in an image. The shape model described above is not enough to fit it to a novel image. In order to fit a shape model, we need certain a priori knowledge about the appearance of the contours of the features that we want our shape model to fit. An ASM uses a statistical model of the appearance of the pixels surrounding each point of the shape model. Sometimes the mean appearance is used to simplify the problem.
Because each point in the model corresponds to a particular feature of the face, the pattern of gray levels surrounding the point should be similar for the different training face images [12, 15]. Each point in the model is associated with a geometrical direction which is used for sampling the gray levels surrounding that point. This direction is a straight line segment normal to the curve or contour crossing the point. This region is known as the profile. In order to achieve robustness to the differences in brightness among the different images, the profile used is not based directly on the gray levels but on the mean derivative of the gray levels. Figure 2.7 illustrates the process of correlating the derivative profile of the test image with the mean profile learned for that landmark from the training set.
Figure 2.7: Correlating the ASM profiles. Selecting the point along the profile which best correlates with the learned pattern. a) Original profile in gray level taken from the test image. b) Derivative of the original profile. c) The derivative profile is correlated with the mean derivative profile learned from the training images. Thus, the original landmark (green point) is moved to the position of maximum correlation (red point) (Reproduced from Cootes et al. [11]).
To fit the model to a new image, we obtain the difference between the profiles computed over the test image and the mean profiles of the model. Using correlation, we can move each point from its original location $X_i$ to a new position $Y_i$. Then, as described before, we have to compute a new shape X' by aligning shape X to shape Y. This alignment is done by varying the shape parameters b and the rigid body parameters $\Theta_{rigid-body} = (X_c, Y_c, s, \theta)$. Figure 2.8 shows a profile associated to a point of the ASM and how that point is moved through the profile until reaching a maximum value of correlation (see Figure 2.8(b)).
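The profile search described above can be sketched for a single landmark. This is a simplified illustration under stated assumptions: the profile is a 1D array of gray levels sampled along the normal, the derivative is normalized by the sum of its absolute values to reduce the influence of brightness and contrast (one common choice, not necessarily the one used in the cited works), and the names `derivative_profile` and `best_offset` are hypothetical.

```python
import numpy as np

def derivative_profile(gray_levels):
    """Normalized first derivative of a 1D gray-level profile."""
    d = np.diff(np.asarray(gray_levels, dtype=float))
    norm = np.sum(np.abs(d))
    return d / norm if norm > 0 else d

def best_offset(sampled_gray, mean_derivative):
    """Slide the learned derivative profile along the sampled profile and
    return the offset with maximum correlation, plus all the scores."""
    sampled = np.diff(np.asarray(sampled_gray, dtype=float))
    k = len(mean_derivative)
    scores = []
    for start in range(len(sampled) - k + 1):
        window = sampled[start:start + k]
        norm = np.sum(np.abs(window))
        if norm > 0:
            window = window / norm
        scores.append(float(window @ mean_derivative))
    return int(np.argmax(scores)), scores

# Learned pattern: a dark-to-bright edge in the middle of a short profile.
learned = derivative_profile([10, 10, 10, 200, 200, 200])

# Longer profile sampled in the test image, with the edge further along it.
sampled_gray = np.array([50.0] * 8 + [120.0] * 6)
offset, scores = best_offset(sampled_gray, learned)

# A global change of gain and brightness does not move the maximum.
offset_relit, _ = best_offset(2.0 * sampled_gray + 30.0, learned)
```

The returned offset indicates how far the landmark should be displaced along its profile direction to reach the position of maximum correlation.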
Figure 2.9 shows the way to obtain the shape model Y which is obtained by displacing
Figure 2.8: Moving a landmark to the best location. a) A landmark (yellow point) associated to a shape model is moved along its profile direction until a maximum correlation value between the mean profile and the sampled profile is reached. b) Correlation curve (Reproduced from [9]).
Figure 2.9: Moving each landmark to a better position by using correlation techniques (Reproduced from Cootes et al. [11]).
2.4. Active Appearance Models
Active Appearance Models, or AAMs [15, 13, 14, 30], originally introduced by Edwards et al. in [17], are generative models with the ability to synthesize faces. Once a compact set of basis modes of variation has been estimated from a large training set, it is possible to adjust the model parameters in order to fit an image produced by the model to a novel image. The recovered parameters carry important information about the image which is useful for interpretation. Instead of modeling the appearance of the pixels surrounding the vertices of the model, as is done in ASMs, an AAM constructs a statistical model of the appearance contained in the whole area of the object. In addition, a 2D shape model similar to that used in ASMs is constructed too.
reference points or landmarks placed over features which are common in all faces.
AAMs are parametric models and they are non-linear in terms of pixel intensities,
but they are linear in both shape and appearance. The result of mapping a synthesized appearance to a deformed shape is clearly a non-linear process.
AAMs are designed not only for face synthesis, but also for fitting face images generated with the model to novel images. The fitting process of an AAM consists of finding the appropriate parameters which maximize the correspondence between the synthetic image and the input image. The original work of Cootes et al. [13] proposed to use the same set of linked parameters for shape and for appearance. However, two different and independent sets of parameters for shape and appearance provide a wider capability for fitting to images.
2.4.1. AAMs with Independent Parameters of Shape and Appearance
Shape model
The shape of an AAM can be visualized as a virtual wire-frame composed of triangular facets. This wire-frame is defined by the positions of a number of nodes or vertices. Formally, we can define a shape $s$ as a set of coordinates of the $v$ vertices which form the wire-frame:
$$s = (x_1, y_1, x_2, y_2, \ldots, x_v, y_v)^T \qquad (2.4.1)$$
The AAM allows linear variation of the shape $s$, which can be expressed as a base shape $\bar{s}$ plus a linear combination of $n$ shape vectors $s_i$, with $i = 1, 2, \ldots, n$. This operation can be expressed in matrix notation as
$$s = \bar{s} + Q_s c \qquad (2.4.2)$$
where c is a column vector whose elements are arbitrary shape parameters, and Qs is a matrix whose columns are orthogonal shape eigenvectors. Usually, expression 2.4.2 can be obtained by applying Principal Component Analysis to a set of training shapes. Each training shape is created by labeling manually a face image with landmarks located in strategic positions. These positions must be common for all the human faces.
The base shape $\bar{s}$ is the mean shape and the columns of $Q_s$ are the eigenvectors which correspond to the largest eigenvalues of the covariance matrix calculated over the set of training shapes. Because we have several shapes, we can consider a generalized shape where each element is a random variable.
Before applying PCA, shapes have to be aligned. This alignment of shapes removes
by using a technique known as Procrustes Analysis which is an iterative method that minimizes the Euclidean distance between pairs of corresponding landmarks, see [40] for details. In this way, we can remove shape variation related to rigid body transformation (translation, rotation and scale). We only want to model shape variation related to local deformation, for instance: identity, expression, pose.
Figure 2.10 shows the mean shape $\bar{s}$ and the first three eigenshapes: $s_1$, $s_2$, and $s_3$.
Figure 2.10: AAM linear shape model. $\bar{S}$ is the mean shape and $S_1$, $S_2$, $S_3$ are shape eigenvectors which represent the principal modes of shape variation (Reproduced from Matthews et al. [30]).
Appearance
Appearance is modeled as pixel intensities within a shape, i.e., the image contained in the mean shape frame formed by $\bar{s}$. Let $x$ be any pixel $(x, y)$ which lies inside the mean shape $\bar{s}$. The appearance of an AAM is, therefore, an image $g(x)$ defined over the pixels $x \in \bar{s}$. The AAM allows linear appearance variation, which means that the appearance $g(x)$ can be expressed as the base appearance $\bar{g}(x)$ plus a linear combination of $m$ images $g_i(x)$, also known as eigenfaces:
$$g(x) = \bar{g}(x) + \sum_{i=1}^{m} d_i g_i(x) \qquad (2.4.3)$$
where the $d_i$ are $m$ arbitrary appearance parameters. This equation can be expressed in matrix notation as
$$g = \bar{g} + Q_g d \qquad (2.4.4)$$
where $d$ is a column vector containing the parameters $d_i$, and the eigenfaces are placed as column vectors into a matrix $Q_g$. As with shape, the mean appearance $\bar{g}$ and the images $g_i$ contained in the matrix $Q_g$ are computed by applying PCA to the set of shape-normalized training images. This leads to orthogonality between the images $g_i$, which form a linear subspace of lower dimensionality than the original face image space containing all the training set.
All the faces are transformed in shape by mapping or warping the original face images to the mean shape, which has frontal pose. Then PCA is applied. The warping is performed by following these steps: the shape of each training image is triangulated, and this triangulation order is used to triangulate the mean shape too. Then, an affine mapping transformation is defined for each pair of corresponding triangles. Once this mapping process has concluded for each training image, PCA can be applied to the set of shape-normalized images.
The base image will be the mean appearance and the images contained in matrix $Q_g$ will be the eigenfaces which correspond to the $m$ largest eigenvalues. See Figure 2.11, where the base appearance is shown on the left and the first three appearance eigenfaces are shown on the right.
Figure 2.11: AAM linear appearance model. $\bar{g}(x)$ is the mean face computed using all the training images, while $g_1(x), \ldots, g_3(x)$ are the eigenvector images computed by using PCA (Reproduced from Matthews et al. [30]).
Instantiation
Equations 2.4.2 and 2.4.4 describe the shape and appearance variation of an AAM model. However, they do not describe how to generate a synthetic instance using the model. Given the shape parameters $c = (c_1, c_2, \ldots, c_n)^T$, we can use Equation 2.4.2 to generate an arbitrary shape. Similarly, with the arbitrary appearance parameters $d = (d_1, d_2, \ldots, d_m)^T$, it is possible to generate the appearance $g$ defined inside the mean shape. In addition to using matrix notation, we can also define $g(x)$ as the same image but expressed as a function of the position vector $x = (x, y)$.
Then, the AAM instance with shape parameters $c$ and appearance parameters $d$ is created by warping the image or appearance from the mean shape to the new shape $s$. We denote this affine warping as $W(x; c)$. Function $W$ transforms the original positions $x$ into a new set of positions according to the affine warping.
The instantiation process is performed by assigning the pixel intensities of the new appearance $g(x)$ to the new set of pixels $W(x; c)$ belonging to a new synthetic instance image $M$. This simple mathematical representation of the warping process produces holes
backwards by taking each pixel in the new shape frame of the instance and computing its respective position in the mean shape frame. Bilinear interpolation may be used when the resulting positions over the mean shape frame are not exact.
Figure 2.12 shows the instantiation process defined as:
$$M(W(x; c)) \leftarrow g(x) \qquad (2.4.5)$$
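Backward warping with bilinear interpolation can be sketched for a single pair of corresponding triangles. This is only a minimal illustration: it handles one triangle instead of the whole mesh, uses barycentric coordinates to express the affine map between the two triangles, and loops over pixels for clarity rather than speed. The function names and the synthetic ramp image are assumptions of this sketch.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Intensity at a non-integer position (x is the column, y is the row)."""
    h, w = image.shape
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

def barycentric(point, triangle):
    """Barycentric coordinates of a 2D point with respect to a (3, 2) triangle."""
    a, b, c = triangle
    beta, gamma = np.linalg.solve(np.column_stack((b - a, c - a)), point - a)
    return np.array([1.0 - beta - gamma, beta, gamma])

def warp_triangle(source, src_tri, dst_tri, out_shape):
    """Fill the destination triangle by mapping each of its pixels backwards
    into the source frame, so that no holes are left in the destination."""
    out = np.zeros(out_shape)
    xmin, ymin = np.floor(dst_tri.min(axis=0)).astype(int)
    xmax, ymax = np.ceil(dst_tri.max(axis=0)).astype(int)
    for y in range(max(ymin, 0), min(ymax, out_shape[0] - 1) + 1):
        for x in range(max(xmin, 0), min(xmax, out_shape[1] - 1) + 1):
            w = barycentric(np.array([x, y], dtype=float), dst_tri)
            if np.all(w >= -1e-9):                 # pixel lies inside the triangle
                sx, sy = w @ src_tri               # same weights in the source frame
                out[y, x] = bilinear_sample(source, sx, sy)
    return out

# Synthetic "appearance" in the mean shape frame: source[y, x] = 10 * y + x.
source = np.add.outer(10.0 * np.arange(6), np.arange(6.0))
src_tri = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # triangle in the mean shape
dst_tri = 2.0 * src_tri                                     # same triangle in the instance
warped = warp_triangle(source, src_tri, dst_tri, out_shape=(9, 9))
```

Repeating this for every triangle of the mesh gives the piecewise affine warp; the same routine run in the opposite direction produces a shape-normalized image from a test image.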
Figure 2.12: AAM instantiation. An instance face image can be synthesized by using the AAM model. A particular face image (appearance), computed as illustrated at the top of the figure, can be warped to a particular shape generated with the shape model (bottom) (Reproduced from Matthews et al. [30]).
Fitting of an AAM
Let $I_t$ be an input test image to be fitted by the model. We need to vary the shape parameters $c$ and the appearance parameters $d$ iteratively until the instance created by the model, $M(W(x; c))$, is similar to that in the input image $I_t$. This process is known as fitting and is illustrated in Figure 2.13.
To define the cost function or criterion to be minimized, we emphasize that we want to minimize the error between $I_t$ and $M(W(x; c))$. This error will be calculated inside the mean shape frame $\bar{s}$. For this purpose, a backwards warping should be computed from the input test image $I_t$ to a reference mean shape frame. This warping creates a new shape-normalized image that we call $g_s(x)$. This backwards warping can be described as
Figure 2.13: Fitting an AAM (Reproduced from Cootes et al. [11]).
$$g_s(x) \leftarrow I_t(W(x; c)) \qquad (2.4.6)$$
We want to compare this shape-normalized image with the shape-normalized appearance created by the AAM model using 2.4.4. Therefore, we want to minimize the sum of squares of the difference between these two quantities:
$$\sum_{x \in \bar{s}} [g_s(x) - g(x)]^2 \qquad (2.4.7)$$
Using matrix notation, if $g_s$ is a column vector containing the image $g_s(x)$, and $g_m$ is the column vector containing the appearance $g(x)$ created by the model, then the expression to minimize is
$$E = \| g_s - g_m \|^2 \qquad (2.4.8)$$
We have to minimize 2.4.8 simultaneously with respect to the shape parameters c,
to the appearance parameters d and to the rigid body transformation parameters t
2.5. Alignment of an AAM
AAMs are aimed at the automatic interpretation of novel images. Thus, it is necessary to perform an alignment or fitting process to the image by means of an optimization algorithm. This optimization algorithm should adjust the model parameters iteratively until a convergence criterion is reached. The final parameters describe the original image. Usually the optimization algorithm is based on iteratively recomputing the model parameters by using the residuals which result from the difference between the model and the original image.
2.5.1. Minimizing Residuals
The model parameters $c$, $b$ and the rigid body parameters $t = (t_x, t_y, \theta, s)$ define the absolute location of the model points in the image. These model points form a virtual flat wireframe which is made of triangles. During the alignment, the pixels inside this area are warped to the mean shape. We denote this warped image as $g_{im}$. Then, a gray-level adjustment is performed over this image by scaling and displacing (offset) the gray levels, $g_s = T_u^{-1}(g_{im})$. The difference between the model and the image is:
$$r(p) = g_s - g_m \qquad (2.5.1)$$
where $p$ denotes all the model parameters, $p^T = (c^T | t^T | u^T)$, and $g_m$ represents the mean model normalized over the mean shape. A simple and well-known measure of the difference between two images is the sum of the squares of the elements of $r$, $E(p) = r^T r$. The first-order Taylor expansion of (2.5.1) is
$$r(p + \delta p) = r(p) + \frac{\delta r}{\delta p} \delta p \qquad (2.5.2)$$
where the $ij$-th element of the matrix $\frac{\delta r}{\delta p}$ is $\frac{\delta r_i}{\delta p_j}$.
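On the other hand, if the current residual is $r$ during an iteration of the alignment process, we would like to choose a $\delta p$ that minimizes $|r(p + \delta p)|^2$. By equating (2.5.2) to zero, we obtain the solution
$$\delta p = -R\,r(p) \quad \text{where} \quad R = \left( \frac{\delta r}{\delta p}^T \frac{\delta r}{\delta p} \right)^{-1} \frac{\delta r}{\delta p}^T \qquad (2.5.3)$$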
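The step from (2.5.2) to (2.5.3) can be read as a linear least-squares problem. The derivation below is a sketch under the assumption that $\frac{\delta r}{\delta p}$ has full column rank, so that the inverse exists; the shorthand $J$ is introduced here only for brevity and is not notation from the original text.

```latex
\begin{align*}
J &= \frac{\delta r}{\delta p}, \qquad
E(\delta p) = \left| r(p) + J\,\delta p \right|^2 \\
\nabla_{\delta p} E &= 2\,J^T \left( r(p) + J\,\delta p \right) = 0 \\
J^T J\,\delta p &= -J^T r(p) \\
\delta p &= -\left( J^T J \right)^{-1} J^T r(p) = -R\,r(p)
\end{align*}
```

Since there are usually many more pixels than parameters, the linearized residual cannot in general be driven exactly to zero; $R$ is the pseudo-inverse that sets it to zero in the least-squares sense.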
Strictly, we should recompute $\frac{\delta r}{\delta p}$ in each iteration, but this is a computationally expensive task. However, the authors of [14] explain that, because $\frac{\delta r}{\delta p}$ is computed inside a normalized reference shape frame, it can be considered approximately fixed. Therefore, this Jacobian can be calculated only once by using the training set. The $j$-th column of the Jacobian is computed by systematically displacing each parameter from a known optimum value on typical images and calculating a weighted average of the residuals.
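The perturbation scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [14]: the names `estimate_jacobian`, `update_matrix` and `residual_fn` are hypothetical, the average uses uniform weights (the text only says "weighted"), and a toy linear residual stands in for the real warp-and-normalize pipeline.

```python
import numpy as np


def estimate_jacobian(residual_fn, optimum_params, displacements):
    """Estimate dr/dp by displacing each parameter around known optima.

    residual_fn(p, k) is a hypothetical callback returning the residual
    vector for training image k at parameters p.
    optimum_params is a list of 1-D float arrays (known best fit per image).
    displacements is a 1-D array of nonzero offsets applied to each parameter.
    """
    n_params = optimum_params[0].size
    n_pixels = residual_fn(optimum_params[0], 0).size
    jacobian = np.zeros((n_pixels, n_params))
    for j in range(n_params):
        column = np.zeros(n_pixels)
        for k, p_opt in enumerate(optimum_params):
            r0 = residual_fn(p_opt, k)
            for d in displacements:
                p = p_opt.copy()
                p[j] += d
                # Finite-difference slope of the residual along parameter j.
                column += (residual_fn(p, k) - r0) / d
        # Assumption: uniform weights over images and displacements.
        jacobian[:, j] = column / (len(optimum_params) * len(displacements))
    return jacobian


def update_matrix(jacobian):
    """R = (J^T J)^{-1} J^T, computed as the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(jacobian)


# Toy check: a linear residual r = A (p - p_opt) has Jacobian A.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
optima = [rng.normal(size=4) for _ in range(3)]


def toy_residual(p, k):
    return A @ (p - optima[k])


J = estimate_jacobian(toy_residual, optima, np.array([-0.5, -0.1, 0.1, 0.5]))
R = update_matrix(J)
```

Because the estimate depends only on the training set, `R` can be computed once offline and reused in every fitting iteration.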
2.5.2. Iterative Alignment Algorithm
Given a current estimate of the model parameters $p$ and a sample of the image at the current estimate $g_{im}$, the steps of the iterative procedure will be:
1. Project the texture sample (appearance) into the texture model frame using $g_s = T_u^{-1}(g_{im})$
2. Evaluate the error vector, $r = g_s - g_m$, and the current error, $E = |r|^2$
3. Compute the predicted displacements, $\delta p = -R\,r(p)$
4. Update the model parameters $p = p + k\,\delta p$, where initially $k = 1$
5. Calculate the new points of the model, $X'$, and the new model frame texture $g'_m$
6. Sample the image at the new points to obtain $g'_{im}$
7. Calculate a new error vector, $r' = T_{u'}^{-1}(g'_{im}) - g'_m$
8. If $|r'|^2 < E$, then accept the new estimate; otherwise, try at $k = 0.5$, $k = 0.25$, etc.
This procedure is repeated until no further improvement is made to the error, at which point we can assume that convergence has been reached.
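The loop above can be sketched as follows. This is an illustrative outline under stated assumptions, not the thesis implementation: `align` and `residual_fn` are hypothetical names, and `residual_fn(p)` is assumed to bundle steps 1 and 5 to 7 (sampling, normalization and subtraction of the model texture) into a single call that returns $r(p)$. The limits `max_iter` and `min_step` are also assumptions added to guarantee termination.

```python
import numpy as np


def align(residual_fn, R, p0, max_iter=50, min_step=1e-3):
    """Iterate p = p + k*dp with dp = -R r(p), halving k when the error grows."""
    p = np.asarray(p0, dtype=float)
    r = residual_fn(p)
    error = r @ r                      # E = |r|^2
    for _ in range(max_iter):
        dp = -R @ r                    # predicted displacement (step 3)
        k = 1.0
        improved = False
        while k >= min_step:
            p_new = p + k * dp         # step 4
            r_new = residual_fn(p_new) # steps 5 to 7, hidden in the callback
            error_new = r_new @ r_new
            if error_new < error:      # step 8: accept the new estimate
                p, r, error = p_new, r_new, error_new
                improved = True
                break
            k *= 0.5                   # otherwise try k = 0.5, 0.25, ...
        if not improved:
            break                      # no improvement: assume convergence
    return p, error


# Toy usage: linear residual, with R deliberately scaled by 3 so that the
# full step (k = 1) overshoots and the halved step is the one accepted.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))
p_true = rng.normal(size=4)
R = 3.0 * np.linalg.pinv(A)
p_fit, final_error = align(lambda p: A @ (p - p_true), R, np.zeros(4))
```

Keeping all image access behind `residual_fn` means the same loop works whether the residual comes from a toy function, as here, or from warping and sampling a real image.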
2.6. Conclusion
Active Appearance Models, as originally proposed, are a flexible tool for fitting a 2D model to the flat projection of a 3D object. This model knows only 2D shape variations, which can originate from 3D pose variations as well as from identity or expression variations. The model does not distinguish between pose and shape variations. Thus, the model can only represent a limited set of appearance possibilities, which is determined by the training set. Poses not considered within the training set cannot be modeled. The problem gets even worse if we try to model an unlearned pose applied to an individual who is not included in the training set either. Therefore, AAMs lack the necessary generality for an appropriate modeling of 3D pose and 3D shape.
On the other hand, the appearance model of an AAM is determined by the combination of the particular textures and lighting which were present during the training phase. So, a new combination of texture and lighting, not included within the training set, cannot be modeled. In fact, texture and lighting are two different properties which should be modeled separately. Therefore, AAMs also lack the necessary generality for an appropriate modeling of texture and lighting.
Our work aims to use the basic fundamentals of AAMs to develop a novel paradigm which provides the same advantages as AAMs, such as their fitting speed and their flexibility for adaptation. However, in contrast to AAMs, our paradigm provides a separate modeling of the physical attributes which affect the appearance of faces, such as 3D pose, 3D shape, albedo and lighting. In this way, our proposed method provides the advantages of AAMs plus the necessary generality in order to represent a wide range
Chapter 3
Techniques for Face Interpretation
3.1. Introduction
Face interpretation is the action of recovering relevant information from a single face image. In contrast to face recognition (a particular case of face interpretation where only identity must be determined), face interpretation can be viewed as a more general concept that includes techniques capable of recovering important aspects such as 3D shape, texture, pose, albedo, illumination, expression, age, gender, etc. The problem of interpreting face images has been addressed in many ways. One of them is the paradigm known as analysis by synthesis, where interpretation is performed by synthesizing a face image as similar as possible to the face image to be interpreted. Recently, interest in this kind of approach has increased due to its natural capability of adaptation to novel images. Interpretation is performed by adapting a deformable model to a face image. This process of adaptation is achieved by iteratively adjusting a set of synthesis parameters. The parameters are used for interpretation when this fitting process concludes.
3.2. Approaches Based on 3DMM: Morphable Models
In [7] and [8], Blanz et al. obtained detailed facial reconstructions by fitting a dense 3D model to a face image. Their model, known as the 3D Morphable Model, or 3DMM, combines a dense 3D shape model and a texture model in order to estimate the 3D shape and texture parameters related to the input image. In order to construct the model, detailed dense 3D scans of different persons are needed. These dense scans are placed in point-to-point correspondence in order to compute a statistical and parametric morphable model. Figure 3.1 illustrates the morphing process in 3DMM models.
3DMMs are prone to falling into local minima when no proper initialization is given. To alleviate this condition, the convergence properties of 3DMM models have been improved