Registration of a three-dimensional model of a human head over a scene using color and depth information


By:

Miguel de Jesús Osorio-Ramos

Thesis submitted as partial fulfillment of the requirement for the degree of:

MASTER OF SCIENCE IN COMPUTER SCIENCE

at

Instituto Nacional de Astrofísica, Óptica y Electrónica

August, 2014

Tonantzintla, Puebla, Mexico

Advisor(s):

Dr. Leopoldo Altamirano-Robles

INAOE


who fought for their rights and who, like many of us, dreamed of a better country.

Acknowledgments

I sincerely thank everyone who directly or indirectly supported me in carrying out this work.

Very especially my parents, Marcos and Margarita, for always encouraging me to pursue my dreams, for being my guides since I was little and for teaching me that great things are achieved with dedication and effort. To them, the two people I admire most and who have always trusted me blindly, a thousand thanks. To God, for giving me a wonderful family.

To Malli, for being by my side at all times, cheering me on and being patient with me, and for filling my life with love, smiles and craziness. For putting up with my often unfounded worries and making the work easier with her brilliant ideas.

To Conchy, Checo, Andy and Sofy, for all the unconditional affection they give me, always supporting me and asking me “When are you coming home? We want to see you already”, and who with a single word can lift my spirits and my will to keep going.

To Dr. Leopoldo Altamirano for his time and advice, for clearing up my doubts and pushing me to reach the goal. To Dr. Felipe Orihuela, Dr. Pilar Gómez and Dr. Jesús González for their comments and observations on this document, which made for a better piece of work. Also to the professors who taught us in the classroom during the first year of the master's program.

To all my family and friends who are not physically with me but are always offering their support and good wishes; especially to Jaime, for his friendship over the years and his advice before and during my studies at INAOE. To all my friends at INAOE, who are many and whom I will simply call Teporingos, who beyond being valuable as players are valuable as friends. For the meals, games, talks and laughs we shared.

To Carlos, Diana, Ricardo and Viviana for kindly giving their time to create the databases of tomographies and scenes used in this work. To all the staff of INAOE for the facilities that made my stay at the institute pleasant, to the staff of the Hospital Naval de Lázaro Cárdenas for their support in obtaining the computed tomography scans, and to CONACYT for scholarship number 280277/345253, without which I could not have completed this graduate program.

I owe all the achievements described in this thesis to them; the errors and omissions are mine.

Abstract

The objective of Augmented Reality (AR) is to enhance a user’s perception of an environment in real life by providing useful additional information. To achieve this objective, virtual objects must be correctly positioned in a real environment to improve the sense of coexistence between real and virtual objects.

In medicine, AR has been well received as it has shown great potential when aiding in diagnosis, surgical planning or surgical procedures because specific regions can be precisely targeted. The observation of spatial relations between the virtual objects and the environment is essential in minimally invasive procedures. Regarding head procedures, a precise registration is of utmost importance. Even though there has been research in this area, further investigation can lead to invaluable discoveries.

When using low-cost computer vision systems, the aligning task becomes difficult because of excess noise. Using the scene’s depth information obtained from low-cost range sensors has shown good results; however, these results are still far from being dependable for the medical field.

This thesis addresses the issue of registering a virtual model with a human head where images are acquired from two different sources. The first one is an X-Ray Computed Tomography (CT) scan of the head which is used to create a three-dimensional (3D) virtual model. The second is the augmented scene, obtained through a low-cost range sensor taking advantage of depth information. The proposed method aims at reducing registration errors between the 3D virtual model of the head and the one in the real world scene.


Our method involves a series of segmentation processes to bound the region of interest, a preliminary head pose estimation which serves as a basis for the next step, and a final pose refinement through the Iterative Closest Point algorithm in order to reach the goal of precise alignment of a human head within a virtual environment.

We tested the registration results of the proposed method against the registration of other approaches found in literature addressing the same problem and using low-cost computer vision systems. For evaluation, we built a new database which was manually labeled with the transformation values to align the model with a scene. The reliability of the database was evaluated using an inter-rater reliability test in order to check the agreement of the raters.

The proposed method obtained better results than the other approaches when aligning the model with the scene. Our method reduces the error in both translation and rotation with statistical significance in most of the evaluated poses.

The results are promising and indicate progress towards a useful application in the near future that could aid physicians in the office or in operating rooms by providing them with additional information in treating patients with head-related illnesses.

Keywords: image-guided interventions, computer-aided medical procedures, medical augmented reality, 3D registration.

Resumen

The objective of Augmented Reality (AR) is to improve the user's perception of the surrounding environment by providing additional information. To achieve this objective, virtual objects must be correctly placed in the scene of the real environment to improve the sense that real and virtual objects coexist.

In medicine, AR is increasingly accepted because it has shown great potential in supporting diagnosis as well as the planning and execution of surgeries, making it easier to select certain regions of interest. The spatial relation between the virtual objects and the environment is essential for minimally invasive medical procedures. For procedures on a patient's head, registration accuracy becomes even more important. Even though there has been research in the field, current results leave room for improvement.

When low-cost computer vision systems are used, the task becomes harder because such systems usually produce a lot of noise. Using depth information of the scene obtained from low-cost sensors has shown good results. However, these results are far from accurate enough for use in the medical field.

This thesis addresses the alignment of a patient's head acquired from two different sources. The first is a computed tomography (CT) scan of the patient's head, which is used to create the three-dimensional virtual model. The second is the scene to be augmented, obtained through a low-cost depth sensor in order to take advantage of the scene's depth information. The proposed method aims to reduce the registration error between the 3D model of the patient's head and the patient in a real-world scene.

Our method involves a series of segmentation processes to isolate the region of interest, a preliminary estimation of the head pose that serves as the basis for the next step, and a final pose refinement through the Iterative Closest Point (ICP) algorithm.

We tested the alignment results of the proposed method against those of other approaches found in the literature that address the same problem and use low-cost computer vision systems. For the evaluation, a new database was created and manually labeled. The reliability of the database was evaluated using a statistical test to measure the agreement among the raters.

The proposed method obtains better results than the other approaches when registering the model with the scene. Our method reduces the error in both rotation and translation, with statistical significance in most of the evaluated poses.

The results are promising and indicate good progress towards a useful future application that could support physicians in the office or in the operating room by providing them with additional information about the patient's head.

Keywords: image-guided interventions, computer-aided medical procedures, medical augmented reality, 3D registration.

Contents

List of Figures
List of Tables

1 Introduction
1.1 Outline
1.1.1 Justification
1.1.2 Work motivation
1.1.3 Research questions
1.1.4 Hypothesis
1.2 Objective
1.2.1 General objective
1.2.2 Specific objectives
1.3 Proposal
1.3.1 Contribution
1.4 Document guide

2 Theoretical background
2.1 Augmented Reality
2.2 Range sensors
2.3 X-Ray Computed Tomography scanner and 3D reconstructions
2.4 Depth maps & point clouds
2.5.1 Viola-Jones object detection framework
2.6 Head pose estimation
2.6.1 Pose estimation by image retrieval
2.6.2 Detector arrays
2.6.3 Geometric methods
2.6.4 Regression methods
2.7 Registration
2.7.1 RANSAC
2.7.2 Normal Distribution Transform
2.7.3 Iterative Closest Point

3 Related work
3.1 Applications of Augmented Reality
3.1.1 Augmented Reality and medicine
3.2 3D matching methods
3.3 Head pose estimation and skin detection
3.3.1 Head pose estimation
3.3.2 Skin detection for segmentation
3.4 Summary

4 Proposed method
4.1 Outline
4.2 Head detection and segmentation
4.3 Preliminary pose estimation
4.4 Iterative Closest Point refinement
4.5 Summary

5 Experiments and results
5.1 Experimental setup
5.1.1 Databases
5.1.2 The ground truth
5.2.1 Reliability of ground truth
5.2.2 Transformation error per pose
5.2.3 Overall transformation error
5.2.4 Comparison with other registration approaches
5.2.5 Qualitative results
5.2.6 Discussion
5.3 Summary

6 Conclusions and future work
6.1 Conclusions
6.2 Future work

A Extension of experiments
A.1 Transformation error per subject and per pose

B Technical features
B.1 Calibration of depth sensors
B.1.1 Internal camera parameters
B.2 PC for testing and versions of programs

List of Figures

2.1 The world’s first head-mounted display
2.2 Reality-Virtuality Continuum
2.3 3D model of pipelines superimposed on the factory floor
2.4 Kinect sensor structure
2.5 Kinect infrared pattern
2.6 Self-occlusion in stereo vision
2.7 SOMATON Emotion CT Scanner and tomograms of the head
2.8 RGB image, depth map and point cloud data of a scene
2.9 Haar-like features for Viola-Jones detection algorithm
2.10 The integral image concept
2.11 Integral images to compute features
2.12 Detection cascade for Viola-Jones algorithm
2.13 Example of Discriminative Random Regression Forests (DRRF)
2.14 Positive and negative samples for DRRF algorithm
2.15 Correspondences in ICP algorithm
3.1 Volkswagen test scenario application with AR
3.2 ARQuake: an Augmented Reality video game
3.3 Simulated visualization in laparoscopic surgery
3.4 Mirracle. A look inside the body
3.5 SHOW-CAS. A Spinal Surgery technique
3.7 Video see-through system for hepatobiliary surgery
3.8 Short-rigid stereo scope for open abdominal surgery
3.9 A system for neurosurgical assistance
3.10 Visuo-haptic AR for open surgery training system
3.11 A face tracking and pose estimation system
4.1 General diagram of the proposed method
4.2 An example of head detection and segmentation
4.3 Discriminative Random Regression Forests voting scheme
4.4 An example of skin segmentation within a scene
4.5 Removing the useless section of the model
4.6 An example of the cropped model
4.7 Example of normal vectors of the model
5.1 OsiriX interface showing a CT scan
5.2 Surface rendering parameters and 3D virtual model in OsiriX
5.3 Subject’s positions for the scene database
5.4 The 25 poses for experiments as if the subject is seeing a wall
5.5 An example of a set of poses collected from the subjects
5.6 Box plot of translation error per pose
5.7 Box plot of rotation error per pose
5.8 Translation error per pose for each subject and all together in our method
5.9 Rotation error per pose for each subject and all together in our method
5.10 Heat map of the errors per pose according to GMSE metric
5.11 Overall translation errors of the three compared methods
5.12 Overall rotation errors of the three compared methods
5.13 Qualitative comparison between methods
5.14 Examples of successful registration using DRRF+ICP
A.1 Translation error per pose for each subject using l₂-norm
A.2 Translation error per pose for each subject using RMSE
A.3 Translation error per pose for each subject using GMSE
A.4 Translation error per pose for each subject using absolute error
A.5 Rotation error per pose for each subject using l₂-norm
A.6 Rotation error per pose for each subject using RMSE
A.7 Rotation error per pose for each subject using GMSE
A.8 Rotation error per pose for each subject using absolute error

List of Tables

3.1 Comparison of characteristics between our method and existing work
5.1 Description of subjects that were integrated into our database
5.2 Krippendorff’s α coefficient results for testing inter-rater reliability
5.3 P-values obtained from the test of normality with Shapiro-Wilk test applied to the error transformation results
5.4 Median and IQR of translation error per pose of our method versus the others using l₂-norm
5.5 Median and IQR of rotation error per pose of our method versus the others using l₂-norm
5.6 Median and IQR of translation error per pose of our method versus the others using RMSE
5.7 Median and IQR of rotation error per pose of our method versus the others using RMSE
5.8 Median and IQR of translation error per pose of our method versus the others using GMSE
5.9 Median and IQR of rotation error per pose of our method versus the others using GMSE
5.10 Median and IQR of translation error per pose of our method versus the others using AE
5.11 Median and IQR of rotation error per pose of our method versus
5.12 Median and IQR of errors of the three compared methods for the complete dataset
5.13 Median and IQR of ICP errors versus other registration algorithms
5.14 Results obtained from using skin segmentation and not using it
5.15 Results obtained from segmenting the model and not doing it
B.1 Internal camera parameters

1 Introduction

1.1 Outline

Augmented Reality (AR) is a technology that combines two worlds, a real one and a virtual one, within an environment, leading to the perception of a coexistence between the two. To achieve this, an AR system captures the real world through cameras and/or other sensors, processes the information, merges it with virtual objects and displays the results on a screen or directly over the physical object. To work, these systems rely on typical computer vision and rendering tasks such as detection, visualization and interaction.

AR is used in several areas of knowledge, and its use in medicine is not surprising: AR has proven useful for enhancing the physician’s environment with additional information that cannot be obtained through traditional means. This should be done in a realistic manner in order to better help in diagnosis, preoperative planning and surgical procedures. Important challenges remain, such as the correct registration of medical virtual objects over a real scene, since medical specialists require increasingly better precision, especially regarding the orientation, depth and size of the virtual objects.

The use of X-ray computed tomography (CT) has become popular in AR because the information provided by the CT scanner can be used in a wide range of applications. An interesting characteristic of a CT scan is that from this medical imagery we can build three-dimensional virtual models of body parts such as bony structures, blood vessels, organs or, as is our interest, a human head. Thus, if we combine CT technology with AR technology, we can align a 3D virtual model within a scene that includes a patient in a real-world setting, in such a way that the object adjusts to the perspective and improves the user’s perception.

Finding newer and better solutions to the registration task provides enormous advantages to physicians as they can identify topologically the location of a specific problem in a more reliable and useful way.

1.1.1 Justification

Augmented Reality (AR) research has made great progress over the last decade, and AR has been increasingly introduced into our lives. It has been shown to be a powerful and effective tool for visualizing enhanced environments that mix real and virtual elements. Improved user perception of and interaction with virtual objects in the same environment makes it possible to provide additional information that could not be perceived otherwise. A wide range of applications make use of this technology because of its benefits, so it is not surprising that AR is influencing medicine and is readily accepted by both physicians and patients.

The constant improvements in medical applications and their scope have been an incentive for this to happen. Dynamic, accurate and reliable systems that let the real and virtual worlds interact allow for a better user experience with AR applications while providing additional benefits.

In response to the great advances of AR in medicine (Navab et al., 2012), new systems demand better characteristics. There is a need to develop new methods that allow the visualization of 3D medical imaging over the patient’s body for diagnosis, preoperative planning and surgical procedures. One of the requirements of physicians is that virtual objects overlaying real ones be correctly positioned in the environment.

1.1.2 Work motivation

Ideally, in AR, virtual elements enhancing a real environment would behave like real objects. This requires a correct and precise estimation of where the virtual objects should be located (Herling and Broll, 2011).

The challenge of correctly registering virtual objects over an environment has been a widely addressed topic of AR research, since many problems require high accuracy for the correct visualization of virtual elements in an augmented environment (Bimber and Raskar, 2005).

In the medical field, when working with the human head, accurately registering real and virtual objects is a must. Because existing work does not fulfill this requirement, the area is still open to better solutions that improve user perception and facilitate advancements in surgical procedures. Given the requirement for high accuracy, the surface features inherent in patients can be used to better align virtual objects with real ones and render them visually aligned in an augmented environment.

1.1.3 Research questions

• What is the impact of head detection (Viola-Jones algorithm) and skin segmentation (Marcial-Basilio’s rules), applied before the pose is refined with ICP, on the registration error between two representations of the same person’s head obtained from different sources (a CT scan and a range sensor)?

• What is the effect of computing a preliminary pose of a human head in a scene captured with a low-cost range sensor (up to 50 times less expensive than high-resolution sensors) on the final registration error with a 3D virtual head obtained from a CT scan of the same person?

1.1.4 Hypothesis

When aligning a 3D virtual model of a human head obtained from a CT scan with a scene taken with a low-cost depth sensor, the use of color information from the scene to detect the person’s head and segment the skin, together with the depth information to estimate an approximate pose of the head, reduces alignment errors in the registration process for both rotation and translation to less than 5 mm.

1.2 Objective

1.2.1 General objective

To develop a new method that registers a 3D model of a human head within a scene containing a subject in a real-world setting, using surface features derived from the subject’s anatomy in an Augmented Reality environment. The registration should be better than that of state-of-the-art approaches addressing the same problem.

1.2.2 Specific objectives

• Selection and evaluation of appropriate model and scene information.

• Location and orientation estimation of a human head in a real world scene.

• Algorithm design and implementation for virtual and real human head alignment.

• Evaluation of the effectiveness of the method in the registration of virtual and real objects.


1.3 Proposal

We propose a method that aligns a 3D virtual model of a human head with the head of a person in real-world scenes using surface features inherent to the subject’s anatomy, in order to enhance the physician’s perception of the environment by providing information that cannot be seen directly because of the opaque nature of a head.

Different approaches have been presented in the state-of-the-art literature, from using simple 2D images (Vatahska et al., 2007; Perreira Da Silva et al., 2008; Grujic et al., 2008) to examining 3D scene information (Fanelli et al., 2011a; Padeleris et al., 2012). This 3D information can provide additional data, which in turn helps to improve alignment in 3D space. We obtain this kind of data using consumer depth sensors which, despite being very sensitive to noise, are often used, mainly because of their low cost.

One of the most recent and effective approaches for head pose estimation is Discriminative Random Regression Forests (DRRF) (Fanelli et al., 2011b), where depth data and random regression forests are used to estimate the pose of a human head in a scene. We aim to improve this method by using not only depth data but also RGB images to detect the head of the subject and segment it within a scene, thus reducing noise and processing time. Moreover, we use the improved DRRF to get a rough pose estimate before applying the Iterative Closest Point (ICP) algorithm to refine the pose and bring the 3D model closer to the real pose, improving the accuracy of the alignment.
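To make the refinement step concrete, the following is a minimal sketch of point-to-point ICP in Python with NumPy. It is an illustration under stated assumptions, not the implementation used in this thesis: the function names, the brute-force nearest-neighbour search and the toy grid data are all hypothetical, and the optional arguments `R0` and `t0` merely stand in for a coarse pose such as the one a DRRF-style estimator would supply.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t is close to dst[i] (SVD-based)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # d = -1 would undo a reflection
    t = dst_c - R @ src_c
    return R, t

def icp(model, scene, R0=None, t0=None, iterations=30):
    """Refine the pose of `model` (N x 3) against `scene` (M x 3).

    R0 and t0 are an optional coarse initial pose (hypothetical parameters).
    """
    R = np.eye(3) if R0 is None else R0
    t = np.zeros(3) if t0 is None else t0
    for _ in range(iterations):
        moved = model @ R.T + t
        # Brute-force nearest neighbour: fine for a sketch, too slow for real clouds.
        d2 = ((moved[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        nearest = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(model, nearest)
    return R, t

# Toy data (assumption): a 3 x 3 x 3 grid stands in for the head model,
# and the "scene" is the same grid after a small known motion.
axis = np.array([-20.0, 0.0, 20.0])
model = np.array([[x, y, z] for x in axis for y in axis for z in axis])
angle = np.radians(2.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -0.5, 1.5])
scene = model @ R_true.T + t_true

R_est, t_est = icp(model, scene)
```

ICP only converges to the right pose when it starts close to it, because the nearest-neighbour correspondences are otherwise wrong. That is the reason a rough pose estimate is useful before refinement; in this toy case the motion is small enough that the identity is already a good starting point.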

1.3.1 Contribution

This thesis contributes a method that identifies the transformation needed to register a 3D head model, built from the X-ray computed tomography imaging of a specific patient (usually called a 3D free-form model), with the head of the patient viewed in scenes taken from real-world settings, using the 3D information of the environment obtained from consumer depth sensors. The 3D virtual model of the head is correctly superimposed over the head of the patient in the environment. According to our experimental analysis, the alignment effectiveness of our method surpasses the results of the tested approaches that also use low-cost vision systems.

Even though there are other issues, the alignment improvement is important in the medical field as this method together with future developments could be the foundation of a useful system aiding physicians in diagnosis, surgical planning or surgical procedures concerning a human head. In addition, its low cost makes it more feasible in medical applications.

Head surgeries (such as neurosurgery, craniotomy, otosurgery, rhinosurgery, maxillectomy or reconstructive surgery) can be enhanced by adding information that cannot be obtained through traditional methods. Accurate head pose estimation helps improve minimally invasive surgery, which in turn may reduce postoperative pain, shorten the hospital stay and enable a quicker recovery. Furthermore, robotic head surgery becomes very attractive, as precision is a must.

In addition, reducing the registration error allows the development of reliable techniques for surgical education as medical students will be able to learn about anatomy and diseases related to the head as well as practice surgical skills.

1.4 Document guide

The rest of this document is organized as follows:

Chapter 2 introduces the theoretical basis that was used for this thesis.

Chapter 3 provides a reference framework of the related work in the area and discusses the limitations of the existing approaches.

Chapter 4 describes the proposed method to solve the head pose estimation and registration problems in the augmented reality environment, while detailing each component and the form in which they are related.

Chapter 5 presents the experiments performed and the results obtained during this phase. We detail the experimental setup and discuss the results.

Chapter 6 states the conclusions of this thesis and presents a proposal for future research.

2 Theoretical background

In this chapter, the theoretical foundations for the development of this research are presented. First, the concept of Augmented Reality in computer vision is introduced. Then we describe how a range sensor works and how the information it provides can be used in research, and we mention the tools used in the development of this research. Finally, the algorithms on which this thesis is based are explained.

2.1 Augmented Reality

The first appearance of Augmented Reality dates back to the 1950s, when Morton Heilig built the Sensorama, a multimedia device that enhanced the user’s experience in cinematography (Carmigniani and Furht, 2011; Nájera Gutiérrez, 2009). In the 1960s, Ivan Sutherland developed an AR prototype (a head-mounted display) with which, within its limitations, the user could see 3D graphics projected over an environment (Figure 2.1). Nevertheless, it was in 1992 that the term Augmented Reality was coined by Caudell and Mizell (1992), who proposed a method to assemble the wires of an aircraft.

While the term Augmented Reality (AR) is just becoming common among people, extensive research in the area has been done. In contrast with Virtual Reality (VR) technology, where the user is completely submerged in a synthetic environment, AR allows the user to view the real world with virtual elements added; that is, AR complements reality but does not replace it (Leal Meléndrez, 2012).

Figure 2.1: The world’s first head-mounted display. Figure reproduced from (Sutherland, 1968).

We can define AR as the real-time direct or indirect view of the real world enhanced with virtual (computer generated) information in a way that both real and virtual information appear to coexist in the same environment (Carmigniani and Furht, 2011; Azuma et al., 2001).

In 1997, Ronald Azuma wrote the first survey on the subject, providing a wide reference for definitions, terms, applications and the future of this technology (Azuma, 1997). According to Azuma et al. (2001), an AR system has the following three properties:

• combines real and virtual objects (information) in a real environment;

• runs interactively, and in real time; and

• registers (aligns) real and virtual objects.

The use of virtual objects over a real environment is where AR starts. Paul Milgram (Milgram and Kishino, 1994; Milgram et al., 1995; Milgram and Colquhoun, 1999) presented the concept of the Reality-Virtuality Continuum (Figure 2.2), which describes AR as a part of the general area of Mixed Reality. In this continuum, AR provides local virtuality, that is, virtual objects that enhance the real environment.

Figure 2.2: Reality-Virtuality Continuum by Milgram et al. (1995). The continuum runs from the Real Environment through Augmented Reality and Augmented Virtuality to the Virtual Environment; everything between the two extremes is Mixed Reality.

In 2005, the Horizon Report (Johnson and Smith, 2005) predicted that AR technologies would emerge fully in the following years. This was confirmed by the advances in technology and development of applications. Figure 2.3 shows an example of AR where a 3D virtual model of pipelines is superimposed on a factory floor (the real environment). As a result, they seem to coexist in the same space.

Figure 2.3: 3D industrial pipe model registered within a view of the factory. Figure reproduced from (Navab et al., 1999).

2.2 Range sensors

The problem of capturing 3D information from a specific scene has been studied for a long time in computer vision. As different methods and technologies have emerged, Brian Curless (Curless, 1999) classified them into two types: passive and active. Passive methods or technologies attempt to take advantage of visual cues (stereo and motion parallax, occlusion, perspective, shading, focus, etc.) present in the human visual system; they are passive in that they assume the sensor records light that already exists in the scene. Such methods are limited by a number of factors that can be resolved by controlling how the scene is illuminated, using patterns such as stripes or dots; in these cases, the methods are said to be active (structured light sensing).

Range sensors have been used for years and share a main objective: to obtain information describing the 3D properties of an environment. Recent range sensors provide a traditional image (commonly in RGB) together with a range image, which is laid out like the traditional image but stores depth information per pixel rather than color.
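Since a range image stores one depth value per pixel, it can be converted into a set of 3D points with the pinhole camera model. The sketch below illustrates this back-projection in Python with NumPy. The intrinsic parameters (`FX`, `FY`, `CX`, `CY`) are made-up values for a 4 x 4 toy image, not the calibration of any sensor used in this work, and treating a depth of 0 as a missing measurement is an assumption of the example.

```python
import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth map (millimetres) into an N x 3 array of 3D points.

    Pixels with depth 0 are assumed to be missing measurements and are dropped.
    """
    depth_mm = np.asarray(depth_mm, dtype=float)
    rows, cols = np.indices(depth_mm.shape)   # pixel coordinates (v, u)
    valid = depth_mm > 0
    z = depth_mm[valid]
    x = (cols[valid] - cx) * z / fx           # pinhole model: X = (u - cx) * Z / fx
    y = (rows[valid] - cy) * z / fy           # pinhole model: Y = (v - cy) * Z / fy
    return np.column_stack((x, y, z))

# Hypothetical intrinsics for a 4 x 4 toy image (illustration only).
FX = FY = 2.0
CX = CY = 1.5
toy_depth = np.full((4, 4), 1000.0)   # a flat wall one metre away
toy_depth[0, 0] = 0.0                 # one pixel without a measurement
cloud = depth_to_points(toy_depth, FX, FY, CX, CY)
```

Each row of `cloud` is one 3D point in the camera frame, in the same units as the depth map.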

One of the most important recent developments in robotic sensors and computer vision is the production of low-cost 3D sensors, which are affordable and easy to obtain; they yield lower quality images but are up to 50 times less expensive than high-resolution range sensors. The most popular such sensor is the Microsoft Kinect.

Since its release, the Kinect has become very popular not only with gamers but within the community of developers. It has been well received because it is an efficient, reliable and fast system at an attractive price.

Other popular consumer depth sensors are the PrimeSense Carmine and the Asus Xtion. Although they are not as popular as the Kinect (whose 3D sensing technology is supplied by PrimeSense), they have been reported to obtain 3D information with better precision. For this thesis we used the Microsoft Kinect for Xbox 360 and the PrimeSense Carmine 1.09, mainly because of their availability and the free availability of OpenNI, the open library needed to manipulate the devices.

In addition to the RGB camera, these low-cost range sensors integrate depth sensors consisting of an infrared (IR) laser projector together with an IR camera. Figure 2.4 shows the elements of the Kinect sensor.

Figure 2.4: Kinect sensor structure. Figure reproduced from (Microsoft Developer Network, 2014).

OpenNI is a framework for managing Natural Interfaces that contains a set of drivers for these devices, as well as applications and middleware that allow access to them.

The availability of open source drivers and freely distributed programming tools for low-cost range sensors such as OpenNI and Point Cloud Library (PCL), apart from the increasing need of 3D information from the environment, have made these devices ideal for a wide range of Augmented Reality applications.

The technology of these sensors is based on the structured light principle. This technology casts infrared light patterns imperceptible to the human eye, so the scene is not visibly altered (see Figure 2.5). The relative geometry between the IR projector and the IR camera, as well as the IR dot pattern, are known. By matching a dot observed in the scene with a dot in the pattern of the projector, we can obtain 3D information using triangulation (Zeng and Zhang, 2012).
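The triangulation step can be illustrated with a minimal sketch. It assumes an idealized, rectified projector-camera pair, for which depth follows the standard relation Z = f·b/d; the function name and the numeric values below are illustrative assumptions, not calibration data of the Kinect or the Carmine.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (in meters) of a matched dot for an idealized rectified
    projector-camera pair: Z = f * b / d.

    disparity_px: horizontal shift between the observed dot and its
                  position in the known projector pattern (pixels)
    focal_px:     focal length of the IR camera (pixels), assumed known
    baseline_m:   distance between projector and camera (meters), assumed known
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Illustrative values only: a 50-pixel shift, f = 500 px, b = 0.1 m
z = depth_from_disparity(50.0, 500.0, 0.1)
```

The relation also shows why precision degrades with distance: far points produce small disparities, so a one-pixel matching error changes Z much more than it does for near points.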

The range sensors have limitations that impose restrictions on this thesis. First, the Kinect sensor ranges from a minimum of 800 mm to a maximum of 4000 mm, and the Carmine 1.09 limits are 350-3000 mm, so the placement of the subject is based on the best area of visibility. Second, there are problems with extreme lighting conditions, which suggests the need for a controlled environment. There are also issues with reflective (mirrors, water, glass, etc.) and absorbing (black) materials.

Another limitation of this technology is the well-known problem with visibility in stereo matching due to the physical position of the cameras (IR projector and IR camera). Figure 2.6 shows a situation in which an object is observed by two cameras: much of the object is not visible in either of the cameras because of self-occlusion, while some points of the object are visible in only one camera or the other (Davies, 2012).

Figure 2.5: Kinect infrared pattern captured with an infrared camera. Figure reproduced from (Röettger, 2014).

Figure 2.6: Visibility of points in stereo views (Camera 1 and Camera 2 observing surface points a, b, c and d). Only points that appear in both views are of value for depth estimation; points in the shaded region, such as d, are eliminated.


2.3 X-Ray Computed Tomography scanner and 3D reconstructions

An X-Ray Computed Tomography scanner (CT scanner) makes use of computer-processed X-rays (measuring attenuation) to produce a series of 2D images called tomograms, each of which represents a cross-section of the body or of a part of the body. CT scanners organize the information in DICOM format (Digital Imaging and Communications in Medicine), a widely adopted standard for handling medical imaging.

Unlike other scanners, CT scanners do not produce point clouds but a set of tomograms, which are then stacked together to produce a 3D representation. Since these are simply images, the slices disappear when viewed on edge because they have no real thickness; however, each tomogram represents a known virtual thickness of averaged material (DICOM data contains the exact virtual thickness of the CT scan). Several methods for reconstructing 3D virtual models from CT scans have been developed. For this research, we used OsiriX Imaging Software, a freely available image processing software dedicated to the DICOM standard; OsiriX performs the task satisfactorily. The DICOM information of the patients used for this thesis was obtained from a SOMATOM Emotion 6 CT scanner (see Figure 2.7), a popular CT system created by Siemens, with a virtual thickness of 1.25 mm for each tomogram.

2.4 Depth maps & point clouds

A depth map image contains information relating to distance from one viewpoint to objects within a scene. Each pixel in the map represents the distance from the camera to the object in the scene. In a depth map, the Z axis represents the central axis of the camera.

Point clouds are sets of data points that represent objects or scenes in a given coordinate system (usually Cartesian). In the 3D world, this kind of data is usually obtained through 3D scanners such as low-cost range sensors, or by converting a 3D mesh model into (x, y, z) points. Point clouds generally represent the surface of the 3D data, not the complete volume. Although they are the most basic data format for 3D perception systems, they provide meaningful information about the environment (Rusu, 2010).

Figure 2.7: SOMATOM Emotion 6 CT scanner and tomograms of the head. Left figure reproduced from (Siemens Healthcare Global, 2014). Right figure reproduced from (Anvekar, 2014).

Figure 2.8: Three representations of an environment: (a) RGB image, (b) depth map and (c) point cloud data.

Figure 2.8a shows an RGB image with its corresponding depth map (Figure 2.8b), where darker areas are regarded as closer to the range sensor. Figure 2.8c shows the same environment represented as a point cloud using the PCL viewer.
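The relation between a depth map and a point cloud can be sketched with the pinhole camera model. The sketch below assumes that the camera intrinsics (focal lengths fx, fy and principal point cx, cy, in pixels) are known and that a depth value of zero marks a pixel without a reading; both are assumptions made for illustration, and the function name is hypothetical.

```python
def depth_map_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows of Z values) into (x, y, z) points.

    The Z axis is the central axis of the camera, so each pixel (u, v) with
    depth z maps to x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # assumed convention: no depth reading at this pixel
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 depth map with two valid pixels and trivial intrinsics
cloud = depth_map_to_point_cloud([[0.0, 2.0], [1.0, 0.0]], 1.0, 1.0, 0.0, 0.0)
```

Only pixels with a valid reading produce a point, which is why the resulting cloud describes visible surfaces rather than a complete volume.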


technology for recording that data have increased the necessity for powerful processing tools to manage the data. The Point Cloud Library (PCL) is a collection of state-of-the-art algorithms and tools that process point clouds. PCL is used in 3D processing, computer vision and robotics (Rusu and Cousins, 2011; Aldoma et al., 2012).

2.5 Face detection

2.5.1 Viola-Jones object detection framework

The Viola-Jones algorithm is an object detection framework that is capable of processing images rapidly while achieving high detection rates. Proposed by Paul Viola and Michael Jones (Viola and Jones, 2001a,b), it can be trained to work with any object; furthermore, the framework has achieved outstanding results in the domain of face detection (Viola and Jones, 2004).

The algorithm essentially combines three elements: the fast-computed features used by the detector, a simple and efficient classifier built with the AdaBoost learning algorithm, and a method for combining classifiers in a cascade.

2.5.1.1 Features

The classification of images in the face detector algorithm is based on the value of simple features, which for this algorithm operate much faster than a pixel-based system. These features are based on Haar basis functions (mostly known as Haar-like features in computer vision), as they involve the sum or difference of image pixels within rectangular areas. Viola and Jones use three kinds of features (Figure 2.9):

• A two-rectangle feature, which computes the difference between the sums of the pixels of two rectangular regions. These regions are vertically or horizontally adjacent and have the same shape and size.

• A three-rectangle feature, which computes the sum of pixels within two outside rectangles subtracted from the sum of pixels within the central rectangle.

• A four-rectangle feature, which computes the difference between the sums of pixels within rectangles positioned diagonally.

Figure 2.9: Examples (a-d) of rectangle features used for detection, shown in a position relative to the detection window. The value of a feature is the sum of the pixels of the white rectangles subtracted from the sum of the pixels of the black ones.

During the detection phase, a sliding window is shifted across the image and the features are calculated for each sub-window. A vast number of features needs to be calculated to describe an object with enough accuracy, which makes the task computationally expensive. Therefore, an intermediate representation of the image, named the Integral Image, was created. This representation helps to compute rectangle features rapidly. The integral image at a point (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive (Figure 2.10), which is very efficient when the sum of the pixels within many regions of interest of an image must be calculated. Equation (2.1) shows the integral image calculation at a point (x, y), where ii(x, y) is the integral image and i(x, y) is the original image (Viola and Jones, 2004).

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \tag{2.1}$$

Figure 2.10: The integral image at a point (x, y) is the sum of all the pixels above and to the left of the point.

Figure 2.11: Sum of pixels within a rectangle. The integral image value at point 1 is the sum of the pixels of rectangle A, at point 2 it is A+B, at point 3 it is A+C, and at point 4 it is A+B+C+D. The sum within D can be computed as 4+1-2-3.
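Equation 2.1 and the four-lookup rule of Figure 2.11 can be sketched in a few lines. This is a minimal illustration, not the implementation used in this thesis: the table is padded with a leading row and column of zeros (an implementation convenience, so that `ii[y + 1][x + 1]` holds the inclusive sum of Equation 2.1 at (x, y)), and the function names and the sign convention of the two-rectangle feature are assumptions.

```python
def integral_image(img):
    """Summed-area table of a 2D list of pixel values, zero-padded on top/left."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels with x0 <= x < x1 and y0 <= y < y1 (four lookups)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def two_rect_feature(ii, x, y, w, h):
    """Horizontally adjacent two-rectangle feature: right sum minus left sum.

    Which rectangle counts as positive is an assumed convention.
    """
    left = rect_sum(ii, x, y, x + w, y + h)
    right = rect_sum(ii, x + w, y, x + 2 * w, y + h)
    return right - left


ii = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
bottom_right_2x2 = rect_sum(ii, 1, 1, 3, 3)  # pixels 5, 6, 8 and 9
```

Once the table is built in a single pass over the image, the cost of any rectangle sum is independent of the rectangle size, which is what makes evaluating a large number of features per sub-window feasible.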

Using the integral image, the sum of the pixels within any rectangle can be computed with four array references, as shown in Figure 2.11. Because the rectangles of a feature are adjacent and share corners, a two-rectangle feature can be computed with six array references, a three-rectangle feature with eight, and a four-rectangle feature with nine. An extension of the Viola and Jones approach is proposed by Lienhart and Maydt


2.5.1.2 Classifier

The speed at which the features are computed does not compensate for their number; it would still be enormously expensive to evaluate all of them. The framework therefore adapts the AdaBoost learning algorithm to select the best features, combine them and train the classifier.

The AdaBoost learning algorithm is used to improve the classification performance of a simple learning algorithm; this is achieved by combining a set of weak classifiers to build a stronger one. After each round of learning, the weights of the examples are updated to emphasize those that were incorrectly classified by the previous weak classifier. The algorithm is efficient in searching for a small number of useful features, since each weak learner is restricted to selecting the single rectangle feature that best classifies the examples.
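The re-weighting idea can be sketched with the textbook discrete AdaBoost procedure, in which every weak classifier is a threshold on a single feature value. This is an illustrative simplification and not the exact Viola-Jones variant: the feature values are plain numbers rather than rectangle features, labels are +1 and -1, and all names are hypothetical.

```python
import math


def train_adaboost(features, labels, rounds):
    """features: list of feature vectors; labels: list of +1 / -1.

    Returns a list of weak classifiers (alpha, feature index, threshold, polarity).
    """
    n = len(labels)
    weights = [1.0 / n] * n
    strong = []
    for _ in range(rounds):
        best = None
        # Weak learner: pick the single feature/threshold with lowest weighted error
        for j in range(len(features[0])):
            for thr in sorted({x[j] for x in features}):
                for polarity in (1, -1):
                    preds = [polarity if x[j] >= thr else -polarity for x in features]
                    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                    if best is None or err < best[0]:
                        best = (err, j, thr, polarity, preds)
        err, j, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0) for perfect stumps
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Increase the weight of misclassified examples, decrease the others
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        strong.append((alpha, j, thr, polarity))
    return strong


def predict(strong, x):
    """Weighted vote of the selected weak classifiers."""
    score = sum(a * (pol if x[j] >= thr else -pol) for a, j, thr, pol in strong)
    return 1 if score >= 0 else -1
```

In this toy form the feature selection is explicit: each round contributes exactly one feature index and threshold, so the number of rounds bounds the number of features the final classifier has to evaluate.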

2.5.1.3 Classifier Cascades

Even though the learned classifier can be evaluated relatively quickly, it is not fast enough to run in real time. To solve this problem, the classifiers are organized in a so-called "classifier cascade" to create a faster detector. The classifiers are arranged in order of complexity, so that simpler classifiers are applied first and reject the majority of sub-windows before more complex classifiers are applied, as seen in Figure 2.12.

Figure 2.12: Description of a detection cascade (all sub-windows enter stage 1; each of the stages 1, 2, 3 either rejects a sub-window or passes it on to further processing). At the beginning, simpler classifiers eliminate a large number of negative examples with very little processing. The number of examples has been drastically reduced by the later stages, where stronger classifiers can be applied.

The cascade is designed by rating each classifier with two rates, a false positive rate and a detection rate. Because the activation of a classifier depends on the behavior of its predecessor, the false positive rate of the cascade is

$$F = \prod_{i=1}^{K} f_i \tag{2.2}$$

where F is the false positive rate of the cascade classifier, K is the number of classifiers, and f_i is the false positive rate of the i-th classifier on the examples that pass through it. The detection rate is

$$D = \prod_{i=1}^{K} d_i \tag{2.3}$$

where D is the detection rate of the cascade classifier, K is the number of classifiers, and d_i is the detection rate of the i-th classifier on the examples that pass through it (Viola and Jones, 2004).
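Equations 2.2 and 2.3 are easy to evaluate numerically. The sketch below uses a hypothetical helper and the illustrative setting discussed by Viola and Jones (2004): ten stages, each with a detection rate of 0.99 and a false positive rate of 0.30.

```python
import math


def cascade_rate(stage_rates):
    """Overall rate of a cascade: the product of the per-stage rates."""
    return math.prod(stage_rates)


K = 10
D = cascade_rate([0.99] * K)  # overall detection rate, roughly 0.9
F = cascade_rate([0.30] * K)  # overall false positive rate, roughly 6e-6
```

The example shows the asymmetry the cascade relies on: stages that individually let through 30% of the negatives still yield a very low overall false positive rate, provided that each stage keeps nearly all of the true faces.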

2.6 Head pose estimation

According to Murphy-Chutorian and Trivedi (2009), head pose estimation is the process of inferring the orientation of a human head from digital imagery. It requires transforming simple pixel-based representations into a high-level concept of direction; in other words, it is the ability to infer the orientation of a human head relative to a viewpoint (camera). It has become a useful tool for a variety of applications with different precision requirements.

Some work has been proposed using 2D images, as in (Huang et al., 2010; Dantone et al., 2012), and similar approaches using data containing depth information (range images) have shown very promising results (Breitenstein et al., 2008; Fanelli et al., 2011a, 2012).

2.6.1 Pose estimation by image retrieval

These methods use a database of images, each labeled with a discrete pose and represented with a set of features describing the image. They compare a new input image with the images in the database in order to find the most similar view (Niyogi and Freeman, 1996; Vacchetti et al., 2004).

Even though these methods are simple in that they do not require negative examples, they have many disadvantages compared with other methods. The main disadvantage is that they assume that similarity between images corresponds to similarity between poses, which leads to high errors in estimating the pose, as two images could be associated erroneously. In addition, only discrete poses are estimated.

As the database grows, the efficiency decreases. Some methods have tackled this problem by dividing the process into stages (Grujic et al., 2008).

2.6.2 Detector arrays

These methods are similar to image retrieval as they use the features of the image, but instead of comparing the image to the whole database, they evaluate the image using multiple head detectors that were trained on different discrete poses with a learning algorithm (Hu et al., 2004).

Each detector must be trained with positive and negative examples, therefore each detector is able to distinguish between a region with a head and a region without one. As these methods use multiple detectors, the image is assigned to the detector with the greatest probability.


2.6.3 Geometric methods

These methods rely on the location of facial landmarks such as the contour of the head, eyes, mouth, nose or ears to estimate a head pose from a relative initial pose (Horprasert et al., 1997; Lao et al., 2000; Vatahska et al., 2007; Wang and Sung, 2007; Perreira Da Silva et al., 2008). They take advantage of known properties of the head and their impact on head pose estimation as they move.

These methods are fast as they consider few landmarks and the estimation of the pose is based on relative positions of features to an initial pose. The challenge in using these methods is to find the landmarks of the head with precision. Because of this, geometric methods are impacted by occlusions and undetected features.

2.6.4 Regression methods

These methods estimate the pose by learning a functional mapping from the image or feature data to a head pose measurement. Regression tools can be used on relatively low-dimensional feature data if the location of the facial features is known (Murphy-Chutorian and Trivedi, 2009). Approaches using Support Vector Regressors have shown success in head pose estimation (Li et al., 2000; Murphy-Chutorian and Trivedi, 2010).

2.6.4.1 Discriminative Random Regression Forests

Some of the most recent and promising approaches for head pose estimation use Random Forests given their capability to manage large data, high generalization power, fast execution, and ease of implementation.

Our method is based on the work of Fanelli et al. (2011b), an approach for estimating the location and orientation of a human head from depth information acquired using low-cost devices. This approach is robust to a poor signal-to-noise ratio and to partial occlusions; it works frame by frame and therefore needs no initialization.

Figure 2.13: Example of Discriminative Random Regression Forests. A patch was analyzed within two trees; the first tree discarded the vote while the second accepted it. The multivariate Gaussian distribution (head pose) stored at the leaf node is extracted. Figure reproduced from (Fanelli et al., 2011b).

This method learns a mapping between depth features on patches of scenes (positive and negative examples) and the real values of the pose in each scene (three-dimensional translation and rotation). When there is a new scene, a window is moved through the scene to classify patches and obtain votes. The votes that identify a head are clustered and the final head pose is estimated.

This method uses an extension of Random Forests called Discriminative Random Regression Forests (DRRF). It distinguishes between patches that belong to a head and those that do not (classification), and uses the patches classified as a head to estimate the head pose (regression). A DRRF contains a set of decision trees that perform two tasks: separating test data according to whether or not they belong to an object of interest and, in positive cases, voting for the real-valued variables. A simple DRRF is shown in Figure 2.13: the patch is classified by each tree and, only when the patch is classified positively, the leaf returns a multivariate Gaussian distribution that was calculated in the training process and stored in the leaf.

Figure 2.14: Positive (blue) and negative (green) samples in a range image indicating the location of the head (red). Range images are manually labeled off-line with the location and orientation of the head.

Training process

In the training process, this method uses depth images, each labeled with the head location and orientation. As shown in Figure 2.14, the method selects patches of a fixed size from inside the region where the head is located, labeling them as positive instances, and from outside the region, labeling them as negative instances.

A tree $T$ in the forest $\mathcal{T} = \{T_t\}$ is built from the set of patches $\{P_i = (I_i, c_i, \theta_i)\}$ sampled from the training images, where $I_i$ is the set of depth patches that have one to four feature channels, $c_i \in \{0, 1\}$ are the class labels, and $\theta_i = \{\theta_x, \theta_y, \theta_z, \theta_{yaw}, \theta_{pitch}, \theta_{roll}\}$ is a vector that contains the offset between the 3D point at the center of the patch and the center of the head in the image, as well as the Euler rotation angles describing the head orientation.

With this data, the trees can be built using the random forests framework. At each non-leaf node, beginning with the root, a test is selected from a large, randomly generated set of possible binary tests. The binary test at a non-leaf node is defined as $t_{f,F_1,F_2,\tau}(I)$:

$$|F_1|^{-1} \sum_{q \in F_1} I(q) \;-\; |F_2|^{-1} \sum_{q \in F_2} I(q) > \tau \tag{2.4}$$

where $F_1$ and $F_2$ are two regions within the patch and $\tau$ is a threshold. This test can be efficiently computed using integral images as in Section 2.5.1.1. If a patch satisfies the test, it is sent to the right child, otherwise to the left child. During the construction of the tree, for each non-leaf node, a set of binary tests $\{t^k\}$ is generated with random values for $F_1$, $F_2$ and $\tau$; each test is then evaluated on the set of patches that reached this node. The selected test is the one that maximizes the optimization function shown in Equation 2.5.

$$\arg\max_{k} \left( U_C + \left(1.0 - e^{-d/\lambda}\right) U_R \right) \tag{2.5}$$

where $d$ is the depth of the node and $\lambda$ is the steepness of the change. $U_C$ is a classification measurement and $U_R$ is a regression measurement; these are used to evaluate the quality of a split. They are defined in Equations 2.6 and 2.7 respectively.

$$U_C(\{P \mid t^k\}) = \frac{|P_L| \sum_{c} p(c \mid P_L) \ln p(c \mid P_L) + |P_R| \sum_{c} p(c \mid P_R) \ln p(c \mid P_R)}{|P_L| + |P_R|}, \tag{2.6}$$

where $p(c \mid P)$ is the ratio of patches belonging to class $c \in \{0, 1\}$ in the set $P$.

$$U_R(\{P \mid t^k\}) = \log\left(|\Sigma^{v}| + |\Sigma^{a}|\right) - \sum_{i \in \{L, R\}} w_i \log\left(|\Sigma_i^{v}| + |\Sigma_i^{a}|\right), \tag{2.7}$$

where $\Sigma^{v}$ and $\Sigma^{a}$ are the covariance matrices of the translation and rotation angles and $w_i$, $i \in \{L, R\}$, is the ratio of patches sent to each child node.
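The binary test of Equation 2.4, the classification measure of Equation 2.6 and the depth-dependent weight of Equation 2.5 can be sketched as follows. This is a simplified illustration under several assumptions: a patch is a single-channel 2D list, regions are rectangles given as (x0, y0, x1, y1) with exclusive upper bounds, region means are computed by direct summation instead of integral images, and the regression measure $U_R$ (which needs covariance determinants) is omitted. All function names are hypothetical.

```python
import math


def region_mean(patch, region):
    """Mean depth value inside a rectangular region of the patch."""
    x0, y0, x1, y1 = region
    values = [patch[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def binary_test(patch, f1, f2, tau):
    """Equation 2.4: True sends the patch to the right child, False to the left."""
    return region_mean(patch, f1) - region_mean(patch, f2) > tau


def class_measure(labels_left, labels_right):
    """Equation 2.6: size-weighted sum of p*ln(p) over both children.

    The value is at most 0 and reaches 0 when both children are pure.
    """
    def sum_p_ln_p(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        total = 0.0
        for c in (0, 1):
            p = labels.count(c) / n
            if p > 0:
                total += p * math.log(p)
        return total

    n_left, n_right = len(labels_left), len(labels_right)
    return (n_left * sum_p_ln_p(labels_left)
            + n_right * sum_p_ln_p(labels_right)) / (n_left + n_right)


def regression_weight(depth, lam):
    """Weight of U_R in Equation 2.5; grows from 0 towards 1 with node depth."""
    return 1.0 - math.exp(-depth / lam)
```

The weight makes the trade-off explicit: near the root the selection is driven almost entirely by $U_C$, so the first splits separate head patches from the rest, while deeper nodes give increasing importance to reducing the pose variance.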

The data is then split using the selected test and the process continues until a leaf is created, which happens when one of the following two conditions is met: the maximum tree depth is reached or fewer than a certain number of instances remain. Each leaf stores two kinds of information: the ratio of positive patches that reached the leaf during training, p(c=1|P), and the multivariate Gaussian distribution calculated from the pose parameters of the positive instances (Fanelli et al., 2011b).


Pose estimation

The pose estimation is done by sliding a fixed-size window (patch) over the depth image and classifying this window with the random forest classifier; a step for moving the window can be defined to control the trade-off between speed and accuracy. The tests lead each patch to a leaf, but not all the leaves are considered for regression: only those where p(c=1|P) = 1 and the total variance of the Gaussian distribution does not exceed an empirical threshold max_v.

Then, bottom-up clustering is computed in order to remove outliers and group votes into big clusters, each cluster representing a head in the range image. Subsequently, a mean-shift clustering using a spherical kernel is performed to localize the centroid of the clusters. If in a cluster there are more votes than a certain threshold, we declare that the cluster belongs to a head, whose multivariate Gaussian distribution is the sum of the remaining Gaussian distributions, such that its mean is the estimated set of output parameters identifying the head pose.
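The mean-shift step with a spherical kernel can be sketched as follows. This is a generic flat-kernel mean shift over 3D vote positions, written for illustration only; the radius, the starting point and the function name are assumptions and do not reproduce the parameters used in this thesis.

```python
import math


def mean_shift_mode(votes, start, radius, max_iter=50, tol=1e-6):
    """Move a sphere of the given radius towards the densest region of the votes.

    At every step the center is replaced by the mean of the votes that fall
    inside the sphere, until the center stops moving.
    """
    center = tuple(start)
    for _ in range(max_iter):
        inside = [v for v in votes if math.dist(v, center) <= radius]
        if not inside:
            break
        new_center = tuple(sum(coord) / len(inside) for coord in zip(*inside))
        shift = math.dist(new_center, center)
        center = new_center
        if shift < tol:
            break
    return center


# Three votes near the origin and one outlier far away
votes = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0), (5.0, 5.0, 5.0)]
mode = mean_shift_mode(votes, start=(0.5, 0.5, 0.5), radius=1.0)
```

Because only the votes inside the sphere contribute to the mean, the distant vote in the example has no influence on the estimated centroid.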

2.7 Registration

Registration is a process that integrates data obtained from different sources into a single coordinate system. It aims at detecting the relative transformation between two or more images (views) in order to generate a globally consistent model.

It consists of finding corresponding points between the images and estimating a transformation that minimizes the distance between corresponding points. The process is repeated until a maximum number of iterations is reached or the error is lower than an established threshold.

The problem of registering a pair of images is called pairwise registration. Through the registration, a transformation matrix that represents the translation and rotation needed to align the two images (source-target) is generated.


2.7.1 RANSAC

The RANSAC (RANdom SAmple Consensus) algorithm was proposed by Fischler and Bolles (1981) in order to determine the point in space where a camera was located, given a set of landmarks with known locations (Location Determination Problem). This algorithm is capable of working with data containing large amounts of outliers.

As opposed to other algorithms that assume that the data consists purely of inliers, RANSAC uses a small number of landmarks (randomly sampled) as the initial dataset and enlarges this dataset with consistent data when possible. Thus, the RANSAC algorithm identifies outliers in the dataset and removes them. If there are enough inliers, RANSAC computes the model transformation with those compatible landmarks.

There have been multiple modifications to the algorithm (Torr and Zisserman, 2000; Buch et al., 2013), but they essentially follow the hypothesize-and-test framework (Zuliani, 2014):

• Hypothesize: randomly select minimal sample sets from the input dataset, with a size sufficient for estimating the model parameters.

• Test: check which elements of the entire dataset are consistent with the parameters estimated in the first step. The set of consistent elements is called a consensus set.
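The hypothesize-and-test loop can be illustrated on a deliberately simple model. The sketch below fits a 2D line instead of a rigid transformation, so that the minimal sample set has only two points; the model, the distance threshold, the fixed seed and the function names are all assumptions made for illustration.

```python
import random


def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0:
        return None  # degenerate sample: both points coincide
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, threshold, iterations=100, seed=0):
    """Return the line with the largest consensus set and that consensus set."""
    rng = random.Random(seed)
    best_model, best_consensus = None, []
    for _ in range(iterations):
        # Hypothesize: estimate the model from a minimal random sample
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        a, b, c = model
        # Test: collect every point consistent with the hypothesized model
        consensus = [p for p in points if abs(a * p[0] + b * p[1] + c) <= threshold]
        if len(consensus) > len(best_consensus):
            best_model, best_consensus = model, consensus
    return best_model, best_consensus


inliers = [(float(x), 2.0 * x + 1.0) for x in range(10)]
outliers = [(3.0, 20.0), (7.0, -5.0), (1.0, 9.0)]
model, consensus = ransac_line(inliers + outliers, threshold=1e-6)
```

For registration the same loop applies with a different model: the minimal sample is a small set of point correspondences, the hypothesis is a rigid transformation, and the consensus set contains the correspondences that the transformation aligns within the threshold.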

2.7.2 Normal Distribution Transform

The Normal Distribution Transform algorithm (NDT) is a registration algorithm proposed by Biber and Straßer (2003) that uses standard optimization techniques (Newton's algorithm) applied to statistical models.

The method subdivides the data into cells and assigns each cell a normal distribution, which models the probability of measuring a point locally. The probability density is then used to match another dataset. This method does not have to establish explicit correspondences.
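The first stage of NDT, building one normal distribution per cell, can be sketched for 2D points. This is only the representation step and not the registration itself; the cell size, the minimum number of points per cell and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def ndt_cells(points, cell_size, min_points=3):
    """Map each occupied grid cell to (mean, covariance) of the points inside it.

    Cells with fewer than min_points points are skipped, since their
    covariance would not be meaningful (assumed cutoff).
    """
    buckets = defaultdict(list)
    for x, y in points:
        buckets[(int(x // cell_size), int(y // cell_size))].append((x, y))

    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        if n < min_points:
            continue
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        cxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
        cyy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
        cxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
        cells[key] = ((mx, my), ((cxx, cxy), (cxy, cyy)))
    return cells


cells = ndt_cells([(0.1, 0.1), (0.3, 0.1), (0.2, 0.4), (1.5, 1.5)], cell_size=1.0)
```

A second scan is then scored by evaluating its points against the Gaussian of the cell they fall into, which is why no point-to-point correspondences are needed.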

As the original approach was proposed for two-dimensional data, Magnusson (2009) extended the original algorithm to three-dimensional data. It exploits the NDT surface representation to create histograms based on local surface orientation and smoothness. The histograms of both source and target data are then matched using optimization techniques.

2.7.3 Iterative Closest Point

The Iterative Closest Point (ICP) registration is a technique that uses geometric information. Proposed by Besl and McKay (1992), it tries to minimize the difference (translation and rotation) between two point clouds: a target, which does not move at all, and a source, which is transformed to match the target.

The algorithm iteratively finds a better transformation for the source in order to align it with the target: at each iteration, a data shape P is moved (transformed) to be in the best alignment with a model shape X. The shapes P and X can be represented in any form, but they must be decomposed into point clouds.

The original strategy is to always use all available points; however, other strategies have been proposed, as stated in (Rusinkiewicz and Levoy, 2001):

• Uniform subsampling of the available points,

• Random sampling (with a different sample of points at each iteration) or

• Selection of points with a high intensity gradient.

The first step for the registration of two point clouds is to determine the correspondences between them; for ICP, those corresponding points are assumed to be the closest points measured by the Euclidean distance from the target to the source points (see Figure 2.15).

Figure 2.15: Closest Euclidean distance points between target and source point clouds (green lines) vs. real correspondences (blue lines).

The minimization problem consists of transforming the source relative to the target so that the sum of all the distances between corresponding points is minimized. Let $T$ be a transformation (rotation + translation), let $x_i^{source}$ represent an arbitrary point in the source point cloud and let $x_i^{target}$ represent its corresponding point in the target point cloud; we can represent the transformation of the source point cloud as

$$T(x_i^{source}) = R\,x_i^{source} + t \tag{2.8}$$

Thus, the Euclidean distance is calculated by subtracting the transformed source point from the target point. Let $n$ be the number of points in the subset chosen in each ICP iteration; we can compute the total error of the cloud alignment under transformation $T$ by summing the distances over the subset considered. The mean squared error cost function $f(R, t)$ to be minimized is shown in Equation 2.9.

$$\arg\min_{R,t} f(R, t) = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i^{target} - T(x_i^{source}) \right\|^2 \tag{2.9}$$

This means that, for each source-target pair of points considered, the Euclidean distance is computed, squared, and added to the running sum, which finally yields the mean squared error as a function of rotation and translation; this is the quantity we want to minimize. The ICP algorithm is summarized in Algorithm 1.

Algorithm 1 ICP Algorithm. This algorithm receives two point clouds: target and source ($PC_{trgt}$ and $PC_{src}$ respectively). It is repeated until a defined number of iterations has been performed (maxIterations) or the calculated mean squared error is below the threshold. When either condition is met, ICP finishes and returns the transformation matrix.

Require: $PC_{trgt}$, $PC_{src}$, threshold
Ensure: transformation

transformation ← IdentityMatrix
repeat
    for all $x_i^{src} \in PC_{src}$ do
        $x_i^{trgt} \leftarrow \arg\min_{x \in PC_{trgt}} \left\| x_i^{src} - x \right\|$
    end for
    $(R, t) \leftarrow \arg\min_{R,t} \frac{1}{n} \sum_{i=1}^{n} \left\| x_i^{trgt} - (R\,x_i^{src} + t) \right\|^2$
    transform($PC_{src}$, $(R, t)$)
    transformation ← $(R, t)$ ∗ transformation
until (error < threshold) or maxIterations reached
return transformation
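The loop of Algorithm 1 can be sketched in a few lines. To keep the example self-contained with the standard library, this sketch works on 2D points, where the least-squares rotation has a closed form; it is an illustrative simplification and not the 3D implementation used in this thesis. All function names are hypothetical, correspondences are found by brute force, and every source point is used at each iteration.

```python
import math


def rotate(point, theta):
    """Rotate a 2D point by theta radians around the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1], s * point[0] + c * point[1])


def best_rigid_2d(src, tgt):
    """Least-squares rotation angle and translation mapping src[i] onto tgt[i]."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    tx, ty = sum(q[0] for q in tgt) / n, sum(q[1] for q in tgt) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, tgt):
        ax, ay, bx, by = px - sx, py - sy, qx - tx, qy - ty
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    rx, ry = rotate((sx, sy), theta)
    return theta, (tx - rx, ty - ry)


def icp_2d(source, target, threshold=1e-10, max_iterations=50):
    """Return the accumulated (theta, translation) and the final mean squared error."""
    current = list(source)
    theta_total, t_total = 0.0, (0.0, 0.0)
    error = float("inf")
    for _ in range(max_iterations):
        # Correspondences: closest target point for every source point
        matches = [min(target, key=lambda q: math.dist(p, q)) for p in current]
        # Transformation that minimizes the mean squared error for these pairs
        theta, t = best_rigid_2d(current, matches)
        current = [tuple(a + b for a, b in zip(rotate(p, theta), t)) for p in current]
        # Compose: the new step is applied after everything accumulated so far
        rt = rotate(t_total, theta)
        theta_total += theta
        t_total = (rt[0] + t[0], rt[1] + t[1])
        error = sum(math.dist(p, q) ** 2 for p, q in zip(current, matches)) / len(current)
        if error < threshold:
            break
    return theta_total, t_total, error
```

As with any ICP variant, the result depends on the initial alignment: when the closest points are not the real correspondences (Figure 2.15), the loop can converge to a wrong local minimum.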

Related work

Augmented Reality is an area of research that has yielded applications in many fields. In this chapter, we present some research and applications that have been developed in recent years, especially for the medical field.

Pose estimation has been studied for a long time and has been addressed in different ways, from geometry to machine learning techniques. Recently, the importance of 3D data in perception has led to efforts in researching new registration techniques. Therefore, we present state-of-the-art methods for registration and recent approaches in head pose estimation. We also present some approaches addressing the issue of skin segmentation.

3.1 Applications of Augmented Reality

AR has been applied in many areas, with many innovative ways to use it (see Figures 3.1 and 3.2). The most common areas (Van Krevelen and Poelman, 2010; Carmigniani and Furht, 2011) are: commercial and industrial (Friedrich et al., 2002; Ong et al., 2011), education and entertainment (edutainment) (Billinghurst et al., 2001; Piekarski and Thomas, 2002; Caarls et al., 2009; Ippoliti et al., 2012),

Figure 3.1: Volkswagen test scenario application. A pre-calculated damage model (red mesh) is superimposed on the tested vehicle. Figure reproduced from (Friedrich et al., 2002).

Figure 3.2: ARQuake: an AR video game. Virtual elements enhance the real world through a half-silvered mirror. Figure reproduced from (Piekarski and Thomas,

3.1.1 Augmented Reality and medicine

In medicine, physicians may gain multiple advantages from using AR systems, from diagnosing patients through surgical planning and into the operating room. Valuable information is displayed in a Head-Mounted Display (HMD), in lenses or on a display. Augmented reality has helped physicians and nurses in diagnosis and in pre- and post-operative care; outside hospitals, AR is used in learning and training, and to help patients with treatments and rehabilitation.

AR has brought helpful tools and solutions to the medical field; however these systems require high levels of precision when aligning virtual objects as well as handling occlusion properly to benefit the user (Leal Meléndrez, 2012).

The ability to improve surgical procedures has increased the use of AR for minimally invasive surgery, reducing postoperative pain and patient recovery time. Surgeons cannot see beyond the exposed surfaces; the visual field lacks spatial information regarding the internal anatomy, which increases the need for image guidance (Gering et al., 2001).

The growing number of applications for diagnosis in medicine has attracted interest in integrating several areas of research in life sciences, as well as computer science. One of the aims for the future of medicine is to create new methods that allow the development of minimally invasive precision diagnosis and therapeutic techniques for treating diseases (Liao, 2011).

Open surgical exploration continues to be the most commonly applied technique, regardless of the existence of less invasive methods. Together with new medical instrumentation, medical imaging can help reduce the invasiveness of surgery by using spatial information and improving outcomes.

In order to perform minimally invasive surgeries, surgeons use image-guided surgery technology (Grimson et al., 1996; Gering et al., 2001; Marescaux et al., 2013). AR researchers take advantage of this technology to develop new methods to improve medical diagnosis and surgeries, allowing for virtual visibility of internal organs with the ability to highlight desired organs, anatomical details or specific problems such as tumors. On the other hand, all the information that is managed by the surgeons could reduce their natural senses and intuition, as they have to deal with additional information; therefore, there must be a compromise between the real and the virtual information to correctly assist the surgeon (Dixon et al., 2013). Combining the real environment with medical images, or with the interpretation of them, provides an enhanced navigation tool that allows for the consideration of fine details (Marescaux et al., 2013).

Figure 3.3: Simulated visualization in laparoscopic surgery proposed by Fuchs et al. (1998).

Before AR became popular in the medical field, early work in the area already showed potential for major breakthroughs. One example is Grimson et al. (1996), in which 3D models were generated from Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) data, while 3D data points of the scene were obtained using a laser range scanner. The model was first aligned manually through a user interface, and this initial alignment was followed by the final registration.

One of the pioneering efforts to apply AR in medicine was that of Fuchs et al. (1998), who presented a prototype implementation of a visualization system for laparoscopic surgery (Figure 3.3). Even though it lacked many of the characteristics of an actual AR system, it drew attention to the subject and inspired new ideas.

Since then, both medicine and AR have made huge advances. Navab et al. (2012) presented three novel systems developed at the Technical University of Munich. The first is Freehand SPECT (single-photon emission computed tomography), a prototype that has become a commercial product and is now used in several hospitals around the globe. It is based on nuclear medicine, in which tracers are injected into the patient to target specific organs or body parts. Freehand SPECT allows for the reconstruction of 3D tomographic images from a handheld detector (Bluemel et al., 2013). It requires retroreflective markers to serve as a reference coordinate system for tracking: one is attached to the handheld detector and the other is placed on the patient's body. After a scan of about two minutes, the object is reconstructed in about a minute; using AR techniques, the 3D object is then overlaid on the patient, allowing for virtual observation.

Another system presented by Navab et al. (2012) is the Camera Augmented Mobile C-Arm (CAMC). This system uses AR technology to project the X-ray images used in orthopedic and trauma surgery. Thanks to a mirror construction, the video camera has the same view of the patient as the X-ray camera, so their images are registered. Even though the system does not need additional tracking technology, it requires attaching fiducial markers to detect misalignment, so that the surgeon is informed when the patient moves. As radiographs are 2D images, slight movements could lead to a misinterpretation of the image alignment.

The third project, described in Navab et al. (2012) and Blum et al. (2012), is Mirracle, an application that extends the concept of the magic mirror (a commercial application in the fashion industry) to visualize the user's anatomical parts. It was developed as a tool for training and learning anatomy. The user looks at a display that behaves like a mirror and overlays organs or internal parts of the body, creating the illusion of seeing inside the body. To calculate the user's pose, the authors took advantage of the Kinect sensor, which also added interaction to the system (see Figure 3.4). Unlike other approaches, their work is not patient-specific, as it uses the Visible Korean Human dataset, which allows for the presentation of different kinds of medical images. System precision is therefore an issue.