Instituto Tecnológico y de Estudios Superiores de Monterrey

Monterrey Campus

School of Engineering and Sciences

Learning Temporal Features of Facial Action Units Using Deep Learning

A thesis presented by

Roberto Sánchez-Pámanes

Submitted to the

School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Dedication

I would like to express my deepest gratitude to all those who have been side by side with me:

my wife, kids, parents, and family. Thank you all for your unconditional confidence, support, patience, and encouragement. You were my main motivation for pushing through this work.

Acknowledgements

This work was possible partly because of the tuition support and computing resources provided by Tecnológico de Monterrey. Thanks to CONACyT for the scholarship provided. I also thank my advisor, Dr. Santiago Conant, who was always there to guide the project through these last two years and who encouraged me to participate in a research stay where I obtained valuable knowledge.

Learning Temporal Features of Facial Action Units Using Deep Learning

by

Roberto Sánchez-Pámanes

Abstract

Facial expressions are an important aspect of human life, and research on this topic has led to real-world technological applications. The task of recognizing facial states is involved in a collection of challenging tasks that include assisting elders and babies, as well as enhancing pedagogical exercises. Unlike categorizing faces into emotions, the Facial Action Coding System encodes ambiguous expressions by analyzing small differences in the face based on muscle movements called action units. By analyzing action unit co-occurrences, human coders can represent virtually any anatomically possible facial scenario in a way that is independent of interpretation and can be used as a tool for higher-level decision processes. The automatic detection of action units in videos has recently become an interesting topic for the deep learning community, since deep models have dramatically improved performance in image-related tasks. The state-of-the-art proposals on the benchmark database FERA17 are currently vanilla implementations of convolutional neural networks that model the occurrence of action units while ignoring their temporal features. However, rather than resembling a single snapshot, the occurrence of independent facial movements changes over time in response to information dynamically gathered from the environment. These deep models therefore cannot completely capture the complex dynamic context involved in the occurrence of action units.

Researchers have engineered other deep learning methods that can learn features across sequences of images. These procedures can be grouped into three categories: 1) methods that extend image-based architectures by using aggregation methods, 2) methods that extend them with recurrent units, and 3) methods that process spatiotemporal features natively. They all offer the possibility of capturing AU dynamics and enhancing AU detection. However, their study has frequently been overlooked by the facial expression recognition community, particularly for AU occurrence detection, and to date it is unclear whether deep learning models that incorporate temporal features can indeed outperform those that do not.

This work analyzes the effects of incrementally adding temporal capabilities to the spatial model ResNet50 on predicting the occurrence of a single action unit of the FERA17 database. The configurations evaluated include inflating the kernels in the model to create a 3-dimensional version of ResNet50, adding a recurrent layer to encode long-term dependencies, and including the dense optical flow representation of two consecutive frames. Results show that adding recurrent units to a spatial model outperforms the other temporal paradigms and the baseline ResNet50 by 7.4% in terms of the F₁ score. The findings presented in this thesis can be used to better define initial deep learning implementations for projects related to facial expression recognition. Knowing the extent to which each temporal paradigm can effectively capture the dynamics inherent to AU occurrence can improve future research projects.

List of Figures

2.1 Samples of databases with different experimental protocols: spontaneous (a), posed (b), and in-the-wild (c).

2.2 Behind the scenes of a CNN. The outputs of each layer in a typical convolutional architecture corresponds to a feature map. Information flows bottom up and a score for each output is computed with an activation function, in this case a rectified linear unit (ReLU). Figure obtained from [38].

2.3 Regardless of the type of image that is processed by a CNN, e.g. grayscale (a) or RGB (b), kernels learn to produce features from single instances at each time step. There is no connection between input image i and i + 1.

2.4 Dense optical flow representation of two sequenced frames of a person in a tennis court.

2.5 Common classification pipeline that integrates a CNN feature extraction phase and a sequence learning phase using LSTM cells.

2.6 Long short-term memory cell. Illustration obtained from Graves [21].

2.7 Output shape of a vanilla convolution operation over a 1-D image (a), a multi-channel image (where channels could be other frames) (b), and a 3D convolution over multiple frames.

2.8 Romero et al. [56] concatenate a dense optical flow representation alongside its RGB image to a VGG-16 backbone encoder for AU occurrence detection.

2.9 Tang et al. [66] propose an ensemble of 10 fine-tuned VGG-Face models.

2.10 He et al. [23] utilize an LSTM network to encode temporal relationships between frames in a video.

2.11 Batista et al. [2] feature a concatenation of region-based learning and a holistic approach.

2.12 Zhao et al. [78] propose region layers to capture more detail about the face.

3.1 Nine different views of a subject. The FERA17 database consists in 2,952, 1,431 and 1,080 videos in the training, validation and test set, respectively. Image retrieved from [66].

3.2 Notched box plot displaying the distribution of the fraction of coded frames in which an AU occurred, or occurrence rate, across all the videos in the train (a) and validation (b) sets. The dashed line in each box indicates the mean value of occurrence rate whereas the dots represent outliers.

3.3 ResNet-50 original architecture (left) and baseline model architecture (right). Weights in the blue blocks are transferred and fixed, otherwise are trained from scratch.

3.4 Various ResNet configurations [24].

3.5 Block diagram of each experiment designed to test the hypothesis. Parameters in blue blocks are fixed, otherwise are updated following the learning rule. Top circle represents a single neuron that classifies the input instance with or without AU12.

3.6 Optical flow representation (c) of two consecutive frames at times t (b) and t − 1 (a). Both (a) and (b) are shown as RGB images, however the optical flow is computed between their grayscale representation. Even though images (a) and (b) seem identical, the minimal normalized differences between them are highlighted in (c). Images in this figure are resized to 224 × 224 × 3 for illustration purposes.

3.7 Optical flow representation of two consecutive frames used in image sequence volumes.

4.1 Behavior of ResNet-50 trained with RGB images using different hyper-parameters in experiment E1.

4.2 Behavior of ResNet-50 trained with OF images using different hyper-parameters in experiment E2.

4.3 Behavior of ResNet-50 connected to an LSTM, both trained with RGB images using different hyper-parameters in experiment E3.

4.4 Behavior of ResNet-50 connected to an LSTM, both trained with OF images using different hyper-parameters in experiment E4.

4.5 Behavior of 3DResNet-50 models trained with RGB images using different hyper-parameters in experiment E5.

4.6 Behavior of 3DResNet-50 models trained with OF images using different hyper-parameters in experiment E6.

4.7 Standard error bars on each epoch for the five replicas done for experiment E1 (a), E2 (b), E3 (c), E4 (d), E5 (e), and E6 (f).

5.1 Information flow in a ConvLSTM. Figure obtained from Shi et al. [60].

List of Tables

2.1 Action Unit index and its corresponding FACS name. This table shows the main AUs that are related to facial emotional responses. There are more AU related to eye or head movements, gross facial behavior like lip bite, and some that indicate whether parts of the face are visible or not. There is no AU3 in FACS, although it is used to refer to a specific brow action in specialized versions of FACS intended to be used in infants [8].

2.2 Emotion predictions based on prototypical AUs activation. While there are more variants for each emotion, these are some of the ones presented in the FACS Investigator’s Guide [12]. Asterisk (*) means that the AU may be at any level of intensity. Intensities are defined from A-E, for minimal-maximal intensity.

2.3 AU-labeled databases statistics (LC: label cardinality, LD: label density, DL: distinct label set, PDL: proportion of distinct label).

2.4 Best results submitted for the Facial Expression Recognition and Analysis Challenge 2017 (FERA17) [47]. Top submissions are based on convolutional neural networks (CNN).

3.1 Number of positive annotations for each action unit in both train and validation set.

3.2 Ranking of occurrences of each AU on the FERA17 database.

3.3 Performance achieved by the best submissions of the FERA17 Challenge on detecting the occurrence of AU12 (NR - not reported).

3.4 Performance obtained by training an LDA classifier on the features extracted by several CNN models. Numbers in parenthesis indicate the rank of the performance on the DISFA database.

3.5 Temporal characteristics of the defined models used to test the hypothesis (Agg is for Aggregation, Rec. for Recurrent units, and ST for Spatiotemporal).

3.6 Parameters selected for computing the optical flow using Gunnar Farneback’s algorithm implementation on OpenCV. For a further description of each element please refer to the python library’s original documentation.

3.7 Different configurations of the original FERA17 database created to be used by the experiments in this work. All configurations were generated in around 1.5 hours.

4.1 Analysis of the independent impact that design variables have on each performance metric defined in the scope of this thesis project. The number in parenthesis indicates the performance increase of the best variable compared to the worst one. Note that in the case of FP and FN, the best performances are given by those variables that showed a lower amount of counts.

4.2 Performance of the best settings analyzed in this thesis research. Each entry describes the epoch at which a particular model obtained its best score in terms of F1 score. Bold letters indicate the model with the best performance of all.

Chapter 1 Introduction

Today we live in a world where the interaction between humans and computers has penetrated so deeply that there are research fields (e.g. pervasive or ambient intelligence) which attempt to move computation into the background of humans’ daily tasks. Technologies will become more and more invisible until the user interface is the only component perceivable by the user [55]. Human-computer interaction (HCI) is defined as a discipline concerned with designing, implementing, and evaluating interfaces and interactive systems [74]. Next-generation human-computer interfaces must be able to recognize the affective states of the user in order to respond more adequately. Because facial expressions are our naturally preeminent means of communication, facial expression analysis is an indispensable part of HCI design.

Facial expression analysis methods have been emerging since the last century [10], and today the topic is relevant because machine understanding of the human state could revolutionize the way we interact with computerized devices. In the last decade, researchers have been developing more intelligent human-computer interfaces capable of understanding human emotions and affect [44], mainly because of the huge amount of data being uploaded as videos and the advancement of technology for sensing, tracking, and analyzing nonverbal communicative signals [49].

The task of recognizing facial states is called Facial Expression Recognition (FER). It is involved in a collection of challenging real-world applications that include assisting elders and babies and enhancing pedagogical tasks. Automatically inferring human facial expressions can also bring substantial benefits in traffic control, behavioral sciences, security agents, neurology, and psychiatry.

There are two main streams in this field: emotion detection and facial muscle action detection. They stem from two psychological approaches [9], message judgment and sign judgment, respectively. On the one hand, the aim of message judgment is to infer what underlies a displayed expression, such as personality, to find a subjective meaning of the shown behavior. On the other hand, sign judgment addresses the appearance of what is displayed by the face, attempting to make an objective description of the shown behavior.

The Facial Action Coding System (FACS), a sign judgment approach for FER first developed by Carl-Herman Hjortsjö [25] and later adopted by Ekman and Friesen [11], is one of the most popular observer-based measurement systems in this domain. FACS fragments the face into Action Units (AUs), which are based on facial muscle movements. With the temporal segments of AUs, human coders can represent virtually any anatomically possible facial scenario in a way that is independent of interpretation and can be used as a tool for higher-level decision processes. Because of its descriptive power, FACS is regarded by many as the standard measure for facial behavior and is used widely in diverse fields [8].

The task of detecting AUs in face images has been tackled with conventional machine learning techniques. However, these methods struggle with image data because of their limited ability to process natural data in its raw form [38]. Therefore, most approaches rely on hand-engineered feature extractors that transform pixel information into a certain representation, a process that requires careful engineering and domain expertise. Learning the representation of a face in terms of AUs can also be achieved using methods that do not require a human to design hand-crafted feature extractors. Deep learning (DL) methods are representation learning methods to which raw data can be fed, because they use general-purpose learning procedures that automatically adjust their internal parameters to detect or classify patterns in the input. DL models have dramatically improved the state-of-the-art performance in speech recognition, visual object recognition, object detection, drug discovery, and genomics [38]. Convolutional neural networks (CNNs) are a particular type of deep learning model designed for data that has a known grid-like topology [19], making them a natural match for image-related tasks.

The IEEE International Conference on Automatic Face & Gesture Recognition is one of the most important forums for research in image and video-based face, gesture, and body movement recognition. Literature shows that there has been an increasing submission rate of DL approaches for facial analysis. In the most relevant AU-labeled databases, analyzed in Section 2.2, the state-of-the-art results are based on convolutional neural networks.

1.1 Problem Statement and Motivation

Rather than resembling a single snapshot, AU occurrence changes over time in response to information dynamically gathered from the environment. According to FACS, an AU can be in one of four phases: a) the onset phase, where muscles are contracting; b) the apex phase, where the expression reaches its maximum; c) the offset phase, where muscles are relaxing; or d) the neutral phase, where there are no signs of activation. Studies using FACS suggest that facial actions unfold sequentially and converge asynchronously towards a global apex [14]. Therefore, the combination of co-occurring AUs, each in its own phase, makes it impossible to confidently deduce someone’s expression by analyzing just a single face image.

One of the major restrictions of traditional CNNs is that they extract only the spatial relations of the input data and ignore its temporal relations when the input is part of a sequence. Given that AUs possess an activation cycle that lasts for several hundred milliseconds, traditional CNNs cannot completely capture the complex dynamic context involved in their occurrence. For example, in a video database, each frame is processed individually to get a prediction for AU occurrence, so information about the evolution of AU activations across frames is not captured. In this way, traditional CNN approaches ignore the inherent temporal structure of action units.
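The frame-by-frame limitation described above can be illustrated with a minimal sketch. It is not the thesis implementation: `frame_score` is a hypothetical stand-in for a spatial-only model, and the toy "video" is an assumed list of four-pixel frames whose intensity rises and falls like an onset, apex, and offset. Because each frame is scored in isolation, shuffling the frames changes nothing about the prediction assigned to any individual frame.

```python
import random


def frame_score(frame):
    """Hypothetical spatial-only model: scores one frame from its pixels alone."""
    return sum(frame) / len(frame)


def predict_per_frame(video, threshold=0.5):
    """Apply the spatial model to every frame independently of its neighbors."""
    return [frame_score(frame) > threshold for frame in video]


# Toy video (assumed data): 6 frames of 4 "pixels" each, intensity rising then falling.
video = [[v] * 4 for v in (0.1, 0.4, 0.9, 0.8, 0.3, 0.1)]
preds = predict_per_frame(video)

# Destroy the temporal order with a fixed-seed shuffle.
order = list(range(len(video)))
random.Random(0).shuffle(order)
shuffled_video = [video[i] for i in order]
shuffled_preds = predict_per_frame(shuffled_video)

# Every frame keeps its prediction wherever it lands in the sequence,
# so the model cannot be using onset/apex/offset ordering.
print(shuffled_preds == [preds[i] for i in order])
```

A model with temporal understanding would, in general, produce different outputs for the original and the shuffled sequence.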

Researchers have carefully engineered CNN-based methods that can interpret the temporal structure of a sequence of images. Deep learning procedures with this ability can be grouped into three categories, determined by the way they interpret time: 1) methods that extend image-based deep learning architectures and capture time by using aggregation methods, 2) methods that extend them with recurrent units, and 3) methods that process spatiotemporal features natively. Applied to the action unit occurrence detection task, those methods offer the possibility of capturing AU dynamics that may give certain cues for improving their detection.

However, their study has frequently been overlooked by the FER community, particularly for AU occurrence detection, and to date it is unclear whether deep learning models that incorporate temporal features can indeed outperform those that do not in this task. No recent paper in the literature analyzes this question in a global way.
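The three ways of interpreting time listed in this section can be contrasted with a small pure-Python sketch. Everything here is an illustrative assumption rather than one of the architectures evaluated in this thesis: `spatial_feature` stands in for a 2D CNN, the recurrent update is a single tanh unit with made-up weights `w_x` and `w_h`, and `conv3d_valid` is a naive 3D convolution without padding or stride.

```python
import math


def spatial_feature(frame):
    """Stand-in for a 2D CNN: reduces one frame to a single feature (mean intensity)."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))


def aggregate(clip):
    """1) Aggregation: pool per-frame features over time (here, a plain average)."""
    feats = [spatial_feature(frame) for frame in clip]
    return sum(feats) / len(feats)


def recurrent(clip, w_x=0.5, w_h=0.5):
    """2) Recurrent unit: a hidden state is carried from one frame to the next."""
    h = 0.0
    for frame in clip:
        h = math.tanh(w_x * spatial_feature(frame) + w_h * h)
    return h


def conv3d_valid(clip, kernel):
    """3) Spatiotemporal: one kernel slides jointly over time, height, and width."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    n_t, n_h, n_w = len(clip), len(clip[0]), len(clip[0][0])
    return [[[sum(kernel[dt][dr][dc] * clip[t + dt][r + dr][c + dc]
                  for dt in range(kt) for dr in range(kh) for dc in range(kw))
              for c in range(n_w - kw + 1)]
             for r in range(n_h - kh + 1)]
            for t in range(n_t - kt + 1)]


# Toy clip (assumed data): 4 frames of 3x3 pixels, brightening over time.
clip = [[[float(t + r + c) for c in range(3)] for r in range(3)] for t in range(4)]
ones = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]  # 2x2x2 kernel

reversed_clip = clip[::-1]
print(aggregate(clip), aggregate(reversed_clip))  # average pooling ignores frame order
print(recurrent(clip), recurrent(reversed_clip))  # the recurrent state depends on order
volume = conv3d_valid(clip, ones)
print(len(volume), len(volume[0]), len(volume[0][0]))  # output keeps a time axis
```

In this sketch, average pooling returns the same value for the clip and its reversal, while the recurrent state and the 3D convolution output both depend on the order of the frames. Other aggregation schemes, such as stacking frames as channels, would behave differently.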

1.2 Hypothesis

Deep learning procedures that can capture temporal features across a sequence of images of human faces can outperform methods that capture only spatial features in the context of action unit occurrence detection. The main research questions are the following:

• How does the inclusion of time features in the learning phase affect the training time of deep learning models?

• How is the ability to learn context information from a sequence of images associated with the performance of deep learning models?

• How can a deep learning model be extended from spatial-only to context-aware?

1.3 Objectives

This research aims to assess whether deep learning architectures with temporal understanding of image sequences can outperform those that lack this ability.

The particular objectives are:

• Select an AU-labeled database with adequate statistical and quality properties so that it can be used to compare deep learning methods.

• Define the spatial-only model to be used as a baseline for testing the hypothesis.

• Identify the different ways in which context in an image sequence can be learned by a deep learning model.

• Define a set of deep learning architectures with the ability of temporal understanding.

• Test the hypothesis under a common performance metric.

1.4 Methodology

This section describes the steps taken to test the hypothesis and accomplish the objectives previously stated. The steps are sorted by priority, starting with the definition of the database.

Define a video-based AU-labeled database

The first step of the research involves the selection of an AU-labeled database with videos recorded in a controlled environment, to avoid adding noise to the learning procedure and to spare the models from having to discount irrelevant aspects of the video. This process involves research on the types of databases that exist and their characteristics. Additionally, the statistical properties of multi-label datasets are studied so that the final candidate databases can be objectively contrasted and a single one can be selected.

Specify the spatial-only baseline model

To test whether the null hypothesis holds, i.e. that spatial-only deep learning procedures show better performance on AU occurrence detection than temporal ones, it is critical to define a representative deep learning architecture that does not capture features across frames while achieving competitive performance among similar spatial models.

Identify the different ways to capture time in image sequences with deep learning

Perform literature research on the three ways in which deep networks can capture time. Evaluate if the scheme or domain in which they are used can be compared with AU occurrence detection.

Define the set of deep learning procedures with the ability of temporal understanding to test the hypothesis

Define a representative deep neural architecture for each of the context learning paradigms identified. The selection strategy is done in a way that makes it possible to tease apart those elements that benefit AU occurrence detection.

Conduct experiments and evaluate results

Train the set of spatial and temporal models using the selected AU-labeled database. Define a performance metric to evaluate them individually and across models. Accept or reject the null hypothesis based on the observed behavior.

1.5 Contributions

The research presented in this thesis makes the following contributions to the scientific community:

• A comparison between the three paradigms of time understanding that have been proposed in the literature. To the best of the author’s knowledge, no prior work directly compares the three temporal paradigms in a global way in the FACS domain. This is valuable for deep learning researchers who are entering the facial expression recognition field because it gives clear guidance on which group of architectures behaves better. In this way, the time spent on the solution design process can be drastically reduced.

• Incorporating the results shown here can potentially increase the performance of the state-of-the-art models proposed in the literature on action unit detection.

• A finer way to evaluate the performance of deep learning models in facial expression recognition is proposed. Standard metrics used in the literature are too coarse to effectively assess the capabilities of a model when temporal features are involved. Rather than comparing only the F1 score between models, this research proposes the use of true positives, true negatives, false positives, and false negatives to infer what path the model takes while learning. This method is shown to be key in defining which deep learning models better learn the generalities of data.
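The reason the F1 score alone can be too coarse is easiest to see with a small worked example. The sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN). The labels and the two sets of predictions are made-up toy data, not results from this thesis, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy ground truth (assumed): the AU occurs in the first four of eight frames.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
cautious = [1, 1, 0, 0, 0, 0, 0, 0]  # misses half of the occurrences
eager = [1, 1, 1, 1, 1, 1, 1, 1]     # flags every frame as positive

for name, y_pred in (("cautious", cautious), ("eager", eager)):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print(name, (tp, tn, fp, fn), round(f1_score(tp, fp, fn), 3))
```

Both toy models reach the same F1 score of about 0.667, yet one makes only false negatives and the other only false positives. The four counts expose that difference while the single score hides it.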

1.6 Document Structure

The rest of the document is organized as follows: Chapter 2 provides the theoretical background on AU-related databases, deep learning, and temporal analysis of action units; Chapter 3 describes the solution model; Chapter 4 presents the experimentation and results; finally, Chapter 5 concludes the thesis project and proposes future work.

Chapter 2 Theoretical Background

This chapter presents the concepts that are relevant to the definition of action units and their properties. It also summarizes the knowledge found in the literature on learning temporal features of AUs with deep learning.

2.1 Facial Expression Recognition

Initially, the problem was approached by focusing on its most basic version: six prototypical emotions, i.e. happiness, sadness, fear, disgust, surprise, and anger, all of which are recognized across cultures [28]. However, recent classification methods have almost reached saturation on benchmark databases for this task. Fortunately, Ekman et al. [11] refined the Facial Action Coding System (FACS), which aims at systematizing the study of facial expressions.

Unlike categorizing faces into emotions, FACS can encode ambiguous expressions by analyzing small differences in the face. This system atomizes the face into local appearance variations called Action Units (AUs) that are produced by facial muscle movements. Table 2.1 shows the list of emotion-related Action Units. Any facial expression can be represented as a combination of AUs. Although Ekman and Friesen proposed that combinations of particular AUs represent prototypical expressions of emotions, these relations are coded in separate systems such as EMFACS [15]. AU intensities are annotated on a discrete five-level scale, and AUs can have modifiers such as ’R’ and ’L’ that distinguish those that occur on the right or left side of the face, respectively. Table 2.2 describes some examples of emotion-related facial actions and their major variants.

FACS is regarded by many as the standard measure for facial behavior [8]. This atomistic way of analyzing the face opens the door to a wide range of applications beyond prototypical expressions:

• Detecting pain [41].

• Determining fatigue while driving.

• Detecting anomalies in ATM money withdrawals.

• Helping patients with autism who struggle to express emotions.

• Aiding the diagnosis of depression [53].

• Adjusting the difficulty of a video game depending on the stress levels expressed.

• Measuring the emotional impact of advertising.

• Distinguishing genuine from fake smiles [17].

2.2 AU-labeled Databases

Databases that include image sequences or videos are good candidates for temporal feature evaluation because they yield information about the evolution of action units through time. Video-based techniques are known to have a higher recognition rate than static image processing because they provide additional temporal information [35]. Several AU-labeled databases present image sequences or videos of subjects that allow models to capture the dynamic behavior of facial muscles.

After reviewing 61 databases, Weber et al. identified three types of FER databases based on their experimental protocol: posed, spontaneous, and in-the-wild. Figure 2.1 shows the differences between the three groups. Below is a brief description of each of them, with special attention to video-based spontaneous AU-labeled databases.

In-the-wild Databases

Most databases are acquired in the laboratory with a plain background, which can cause models to perform poorly under real-world conditions. In-the-wild databases address this problem by offering high variability in lighting and background. In-the-wild conditions refer to a more real-life context, i.e. unconstrained population, experimental conditions, and emotional content. An example is the Affectiva-MIT Facial Expression Dataset (AM-FED) [46], a set of videos recorded from 242 webcams in real-world conditions over the internet, showing people naturally viewing online media.

Posed Databases

In a posed scheme, subjects are asked to display specific facial deformations. Improvising on an emotionally rich artificial scenario helps to obtain more realistic posed expressions. This experimental protocol allows perfect control over the reproducibility of information. An example of a posed database is the laboratory-controlled MMI database [50], which includes 326 sequences from 32 subjects. Each participant was asked to display 79 series of expressions that included single action units or combinations of them. Each image sequence begins with a neutral expression, reaches a peak near the middle, and returns to the neutral expression.

Spontaneous Databases

Spontaneous expressions, on the other hand, are captured by controlling the experimental protocol with relatively standardized tasks that allow the subject to react naturally to them [72]. As opposed to posed databases, spontaneous expressions provide information about genuine emotional states, which is desired for real-world applications. A model trained on posed information will have lower performance when testing on spontaneous expressions [72].

Table 2.1: Action Unit index and its corresponding FACS name. This table shows the main AUs that are related to facial emotional responses. There are more AU related to eye or head movements, gross facial behavior like lip bite, and some that indicate whether parts of the face are visible or not. There is no AU3 in FACS, although it is used to refer to a specific brow action in specialized versions of FACS intended to be used in infants [8].

AU Number    FACS Name
1            Inner brow raiser
2            Outer brow raiser
4            Brow lowerer
5            Upper lid raiser
6            Cheek raiser
7            Lid tightener
8            Lips toward each other
9            Nose wrinkler
10           Upper lip raiser
11           Nasolabial deepener
12           Lip corner puller
13           Sharp lip puller
14           Dimpler
15           Lip corner depressor
16           Lower lip depressor
17           Chin raiser
18           Lip pucker
19           Tongue show
20           Lip stretcher
21           Neck tightener
22           Lip funneler
23           Lip tightener
24           Lip pressor
25           Lips part
26           Jaw drop
27           Mouth stretch

Table 2.2: Emotion predictions based on prototypical AUs activation. While there are more variants for each emotion, these are some of the ones presented in the FACS Investigator’s Guide [12]. Asterisk (*) means that the AU may be at any level of intensity. Intensities are defined from A-E, for minimal-maximal intensity.

Emotion      Prototypical AU
Surprise     1+2+5B+26
             1+2+5B+27
Fear         1+2+4+5*+20*+25
             1+2+4+5+25
Happy        6+12*
             12C or 12D
Sadness      1+4+11+15B
             1+4+15
Disgust      9+16+15
             10*+16+25
Anger        4+5*+7+23
             4+5*+7+24
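The AU combinations in Table 2.2 follow a compact notation: AU numbers joined by "+", each optionally followed by an intensity letter from A to E or an asterisk for any intensity. A minimal parser sketch for that notation is shown below. The `ActionUnit` type and `parse_combination` function are hypothetical names introduced here, and placing the ’R’ or ’L’ side modifier before the AU number (as in "R12C") is an assumption about the notation, not something stated in this chapter.

```python
import re
from typing import List, NamedTuple, Optional


class ActionUnit(NamedTuple):
    number: int               # AU index, e.g. 12 for lip corner puller
    intensity: Optional[str]  # 'A'..'E', '*' for any level, None if unspecified
    side: Optional[str]       # 'R' or 'L' if unilateral, None otherwise


# Optional side prefix (assumed placement), AU number, optional intensity suffix.
_TOKEN = re.compile(r"^([RL])?(\d+)([A-E*])?$")


def parse_combination(text: str) -> List[ActionUnit]:
    """Parse a '+'-separated AU combination such as '1+2+5B+26'."""
    units = []
    for token in text.split("+"):
        match = _TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognized AU token: {token!r}")
        side, number, intensity = match.groups()
        units.append(ActionUnit(int(number), intensity, side))
    return units


surprise = parse_combination("1+2+5B+26")
print([au.number for au in surprise])  # AU numbers in the combination
print(surprise[2].intensity)           # intensity attached to AU5
print(parse_combination("6+12*")[1])   # AU12 at any intensity
```

Entries such as "12C or 12D" list alternatives rather than a single combination, so a caller would need to split them on "or" before parsing each part.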

The desired characteristics that a database must contain for the purposes of this research are defined below.

• Clean background to avoid noise. Being able to generalize to various backgrounds is not a priority; therefore, the less external information is present, the less biased the acquired knowledge is.

• Plenty of samples to train a deep model.

• Selection of AUs that involve coarse facial movements, so that their activation cycles can be captured in several frames. The present work assumes that the ability of a model to capture temporal features is independent of the number or types of action units used for training; however, AUs that go through their activation cycle faster may not be correctly represented when sampling.

• The experimental protocol adopted by the database is assumed to be irrelevant. However, in-the-wild schemes could violate the first characteristic, related to input noise, because of their unconstrained nature.

• Sufficient image resolution so a deep model can detect subtle changes in the face.

Having identified the types of databases that exist in this domain and the desired characteristics that pertain to this research, the following paragraphs describe a set of AU-labeled spontaneous databases that allow researchers to analyze temporal features by offering image sequences or videos.

Extended Cohn-Kanade Dataset

The Extended Cohn-Kanade dataset (CK+) is the union of an original (posed) distribution called the Cohn-Kanade Dataset (CK) [30] and a (spontaneous) augmentation made by the same authors, for a total of 593 image sequences from 123 subjects. Participants were asked to perform a series of facial displays that included single action units and combinations of them. Image sequences are either 8-bit grayscale or 24-bit color and vary in duration from 10 to 60 frames. Each sequence starts with the subject in a neutral position and ends with the peak formation of the facial expression.

Denver Intensity of Spontaneous Facial Action (DISFA) Database The DISFA database [45] contains videos of the spontaneous facial actions of 27 adult subjects of different ethnicities. Each subject was recorded for 4 minutes, and every frame was coded by a FACS coder. Action unit intensities and 66 facial landmark points for each image are also included in the database.

BP4D-Spontaneous The BP4D database [76] is a 3D video database that includes the videos of 41 participants who were recorded during a series of emotion elicitation tasks. The procedures were designed to elicit happiness, sadness, surprise, embarrassment, fear, physical pain, anger, and disgust. The BP4D+ Database [77] is an extension of the same database that includes digital videos of 30 more participants, as well as thermal and physiological data. The databases were annotated frame by frame by a team of experts using the Facial Action Coding System (FACS) [11]. The most facially expressive 20 seconds of each task were selected for coding.

Facial Expression Recognition and Analysis 2017 (FERA17) Database The FERA17 database [47] is divided into two parts: a training set and a test set. The former is derived from the BP4D-Spontaneous database and the latter from a subset of the BP4D+ database. For FERA17, the same recordings found in BP4D and BP4D+ were augmented by rotating each participant’s face using the 3D facial information of the videos. The data is available in the form of videos without audio, annotated frame by frame with 10 different AUs. The database also includes AU intensity annotations, but only for a subset of 7 AUs.

Most facial expressions are made up of combinations of AUs; therefore, the presence of one can help refine the occurrence probability of others. Classifiers that can exploit these dependencies are likely to lead to more reliable results [6]. In machine learning, a classification problem is defined as single-label when the goal is to learn a relationship between a set of instances and a unique class label from a set of disjoint class labels L. Depending on the total number of classes, the problem is either binary (when |L| = 2) or multi-class (when |L| > 2). Multi-label classification problems, on the other hand, allow the instances to be associated with more than one class [64].

Figure 2.1: Samples of databases with different experimental protocols: spontaneous (a), posed (b), and in-the-wild (c).

It is important to notice that multi-label datasets are not equal even if they share the same labels. The following paragraphs describe four multi-label statistics that are later used to make an objective comparison of multi-label databases. After the analysis, it is easier to select the one that best fits the purpose of the present work.

Sorower [64] defines four multi-label dataset statistics (Equations 2.1-2.4) that can cause different algorithms to perform differently on distinct databases, based on the underlying assumptions of the algorithm. Let S be a multi-label dataset composed of n samples of instance-label pairs (x_i, y_i), 1 ≤ i ≤ n, x_i ∈ X, y_i ∈ Y, with a label set L, |L| = k. The following properties can be defined to compare different sets of data.

Distinct Label Set (DL): The number of distinct label combinations observed.

\[ DL = \left|\{ y_i \}\right| \quad \forall y_i \in Y \tag{2.1} \]

Proportion of Distinct Label Set (PDL): The DL normalized by the total number of examples.

\[ PDL = \frac{DL}{|S|} \tag{2.2} \]

Label Cardinality (LC): The average number of labels per example.

\[ LC = \frac{1}{n} \sum_{i=1}^{n} |y_i| \tag{2.3} \]

Label Density (LD): The label cardinality normalized by the number of labels.

\[ LD = \frac{LC}{k} \tag{2.4} \]
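As an illustration of Equations 2.1-2.4, the following minimal sketch computes the four statistics for a small, made-up set of annotations. The function name `multilabel_stats` and the toy label sets are assumptions introduced here for illustration only; they are not taken from any of the databases discussed.

```python
def multilabel_stats(label_sets, k):
    """Return DL, PDL, LC and LD for a list of label sets over k possible labels."""
    n = len(label_sets)
    # DL: number of distinct label combinations observed (Eq. 2.1).
    dl = len({frozenset(y) for y in label_sets})
    # PDL: DL normalized by the number of examples (Eq. 2.2).
    pdl = dl / n
    # LC: average number of labels per example (Eq. 2.3).
    lc = sum(len(y) for y in label_sets) / n
    # LD: label cardinality normalized by the number of labels (Eq. 2.4).
    ld = lc / k
    return {"DL": dl, "PDL": pdl, "LC": lc, "LD": ld}


# Hypothetical toy data: 4 frames annotated over k = 3 action units.
toy = [{"AU1", "AU4"}, {"AU1", "AU4"}, {"AU6"}, set()]
stats = multilabel_stats(toy, k=3)
print(stats)
```

In this toy case three distinct label sets are observed (including the empty one), so DL = 3 and PDL = 0.75, while LC = 1.25 labels per frame.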

Table 2.3: AU-labeled database statistics (LC: label cardinality, LD: label density, DL: distinct label set, PDL: proportion of distinct label set).

Database            # Subjects   # Labels   # Classes   LC        LD        DL      PDL
CK+                 123          593        39          3.73356   0.09573   248     0.41821
FERA17              61           222,312    10          3.41998   0.31091   801     0.00360
DISFA               27           4,845      9           0.43199   0.04800   14      0.00289
BP4D-Spontaneous    41           146,848    28          4.35296   0.15546   4,068   0.02770

Table 2.3 shows the number of labels instead of the number of samples because, for CK+ and FERA17, more than one instance can be related to the same label. Labels in CK+ are assigned to the last frame of the image sequence, allowing the user to decide whether the previous frames are used for other purposes. Labels in FERA17 are defined per frame for the frontal-facing sequence of images and are extended to the eight other head poses, totaling 2,000,808 samples. The CK+ database has the highest number of classes and the lowest number of labels, meaning that deep models would struggle to learn from it unless data augmentation techniques are applied. DISFA appears to be the easiest of the tabulated databases to learn because it has a low number of classes, and the combination of a low LC and a small number of distinct label sets implies a simpler label space. Nonetheless, DISFA has a relatively small number of instances, which could make models struggle to generalize to images not seen in the training phase.

A big drawback that arises from the large number of classes in BP4D-Spontaneous is the huge number of distinct label sets, 4,068. Intuitively, a high DL makes the task harder, but a model that can successfully learn this setup will be more robust and could better capture co-occurrences. In addition, its high LC means that it has, on average, more labels per example; therefore, learning BP4D-Spontaneous is even more complex. Compared to the latter, FERA17 has 51% more samples and 20 more subjects, which is desirable for deep networks. It is also important to note that FERA17 has the highest label density of the four databases analyzed, meaning that the ground truth vectors are less sparse, which is partially explained by the small number of classes.

2.3 Deep Learning

The ability of an intelligent system to recognize patterns and learn from past experience is called machine learning. Machine learning enables humans to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings [19]. Deep learning is a collection of techniques belonging to a family of machine learning methods that aim at learning data representations. LeCun et al. [38] define deep learning as

“... methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.”

Deep learning is about accurately training networks with a large number of stages [59].

These methods have produced promising results in image understanding and language mod-


Figure 2.2: Behind the scenes of a CNN. The output of each layer in a typical convolutional architecture corresponds to a feature map. Information flows bottom-up, and a score for each output is computed with an activation function, in this case a rectified linear unit (ReLU). Figure obtained from [38].

Neural Networks (RNN). In their traditional form, CNNs capture and assemble local conjunctions of features, exploiting the compositional hierarchies of images [38]. On the other hand, RNNs process an input sequence and save information about the history of all past elements of the sequence. Because of training problems like exploding and vanishing gradients, there are several variations of RNNs that augment the network’s long-term learning capabilities when their input data is a sequence of images.

2.3.1 Capturing Static Features

Neural network algorithms provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions [48]. In its basic form, a neural network (NN) is a collection of connected processors called neurons, each producing a sequence of values. Input neurons are activated by the environment and the other ones get activated through weighted connections from previous neurons. Learning is about finding the values of the weights that make the NN behave as desired for a given input. Depending on the problem and the architecture of the network, the desired behavior may require long chains and multiple layers of neurons, where each stage transforms the aggregate activation of the network.

Convolutional Neural Networks are a specialized kind of neural network for processing data that has a known grid-like topology [19]. As their name indicates, CNNs employ a mathematical operation called convolution, a special type of linear operation that learns small patches of features and convolves them across the entire image. Figure 2.2 shows what happens between layers in a CNN. A CNN is composed of one or more convolutional layers and several pooling and fully connected layers.

Convolutional Layer A convolutional layer transforms its input using a set of learnable filters (or kernels) that produce output features. This layer is defined by the number of kernels K and their size k_w × k_h × k_q, where k_w < w, k_h < h, and k_q is usually the same as the number of channels c of the input volume. The number of pixels that a given kernel slides per step in each dimension is called the stride s = (s_h, s_w). Because the convolution operation naturally decreases the size of the volume after each layer, the padding hyperparameter p defines the number of pixels to be added to the borders of a convolutional layer’s input so that the spatial size of the input volume can be maintained. The output spatial dimensions of a convolutional layer are given by

\[ w_o = \frac{w_i - k_w + 2p}{s_w} + 1 \tag{2.5} \]

\[ h_o = \frac{h_i - k_h + 2p}{s_h} + 1 \tag{2.6} \]

where w_i, h_i are the input spatial dimensions. The depth of the output volume d_o is equal to the number of kernels used in the convolution operation.
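The output-size relations of Equations 2.5 and 2.6 can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the use of floor division for strides that do not divide the padded input evenly is an assumption (a common convention) that is not stated in the equations.

```python
def conv_output_size(in_size, kernel, padding=0, stride=1):
    """Spatial output size along one dimension (Eqs. 2.5 and 2.6).

    Floor division is an assumption for strides that do not divide evenly.
    """
    return (in_size - kernel + 2 * padding) // stride + 1


def conv_output_shape(w_i, h_i, k_w, k_h, num_kernels, p=0, s_w=1, s_h=1):
    """Return (w_o, h_o, d_o); the output depth d_o equals the number of kernels."""
    w_o = conv_output_size(w_i, k_w, p, s_w)
    h_o = conv_output_size(h_i, k_h, p, s_h)
    return (w_o, h_o, num_kernels)


# A 5 x 5 input convolved with two 3 x 3 kernels, no padding, stride 1.
print(conv_output_shape(5, 5, 3, 3, num_kernels=2))            # (3, 3, 2)
# Padding p = 1 with a 3 x 3 kernel preserves the spatial size.
print(conv_output_shape(224, 224, 3, 3, num_kernels=64, p=1))  # (224, 224, 64)
```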

Fully-Connected Layer A fully connected (FC) layer takes an input volume and outputs an F-dimensional vector, with F being the number of neurons. FC layers may be connected to a convolutional layer or to another FC layer. Each neuron in an FC layer learns two kinds of parameters: one weight per input and a bias.

Pooling Layers A pooling layer down-samples the input volume. There are no learnable parameters associated with this type of layer because it directly reduces an input volume. There are several types of pooling layers depending on the way they collapse their input: max, average and global. The first two are defined by a kernel with at most 3 dimensions that replaces the values of a specific region of the input volume with the maximum or average activation value found there, respectively. When pooling in a global manner, a maximum or average function is used to reduce an input volume into a 1-D feature vector along its spatial or depth dimensions.
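The three pooling variants can be sketched on a single small feature map. This is a simplified illustration under stated assumptions: the window is two-dimensional, the stride equals the window size (non-overlapping pooling), and the helper names and values are made up for the example.

```python
def pool2d(fmap, size, reduce_fn):
    """Non-overlapping pooling: the stride equals the window size (an assumption)."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def mean(values):
    return sum(values) / len(values)


def global_pool(fmap, reduce_fn):
    """Collapse the whole spatial extent of one feature map into a single value."""
    return reduce_fn([v for row in fmap for v in row])


# Hypothetical 4 x 4 feature map.
fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [2, 0, 6, 2]]

max_pooled = pool2d(fmap, 2, max)     # [[7, 2], [2, 8]]
avg_pooled = pool2d(fmap, 2, mean)    # [[4.0, 1.0], [1.0, 5.0]]
global_avg = global_pool(fmap, mean)  # 2.75
```

Applying `global_pool` to every feature map of a volume yields one value per channel, i.e. the 1-D feature vector mentioned above.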

Modern image recognition techniques using deep learning saw a boost after AlexNet [36] won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [57] with an advantage of roughly 10 percentage points over the second-best participant. In just seven years, the winning accuracy in classifying objects in this database increased from 71.8% to 97.3%, outperforming human abilities and suggesting that more data leads to better models. Winning models of the ILSVRC since 2012 have been implementations of convolutional neural networks (CNN), namely AlexNet (2012), ZF Net [75] (2013), GoogLeNet [65] (2014), ResNet [24] (2015), and SENets [27] (2017).

When the term deep is used to describe a network, it refers to the number of layers in it. In modern architectures, the best performance is obtained by deep neural networks composed of hundreds of layers [1]. While CNNs have recently moved from the 10-layer to the 100-layer regime, new possibilities may emerge with robust training of 1000+ layer networks [80]. Moreover, it is common to use a network pre-trained on a large dataset (e.g. ImageNet) as a fixed feature extractor for a specific task [2, 52, 56]. This process is known as transfer learning and it drastically reduces training times. Using a pre-trained


their designers, and further improve in a specific domain by fine-tuning the weights of the higher-level portion of the network. The use of pre-trained models may decrease the number of parameters to train and the number of epochs required for convergence.

In their traditional form, CNNs capture and assemble local conjunctions of features [38], assuming that there is no relationship between input instances. Figure 2.3 shows the feature maps generated by a conventional convolutional layer using standard kernels. It can be seen that the input is “squashed” into a feature map that ignores the possible relations between instances or channels. Thus, conventional CNNs cannot completely capture the complex dynamic context involved in AU occurrence.

2.3.2 Learning Context using Deep Learning

Traditional approaches for temporal understanding of a sequence of images include the use of HOG3D [34], Local Ternary Patterns [39] and SURF features generalized for videos [73]. However, they fail to identify motion features because they only “represent snapshots of videos in a small period of time and only capture static local patterns” [16]. Also, their hand-crafted nature makes them less flexible to generalize across different video settings and, consequently, other domains. Deep learning has become an attractive alternative in similar domains that require capturing context from a video rather than just from each frame, e.g. abnormality detection from 3D medical images and human action recognition.

Deep learning procedures capable of temporal understanding can be grouped into three categories, determined by the way they interpret time: 1) methods that extend image-based deep learning architectures and capture time by using aggregation methods, 2) methods that extend them with recurrent units, and 3) methods that are able to process spatiotemporal features natively.

2.3.3 Aggregation Methods

Aggregation methods combine predictions from standard video frames and other motion representations. Typically, this type of neural network takes as input the standard [56] or warped optical flow [70], the difference between RGB frames [70], or a combination of several of these representations [61], all of which encode the apparent motion between frames. Figure 2.4 shows the optical flow representation of a woman in a tennis court. Other approaches encode long and short-term dependencies using temporal pooling features [71, 3]. In general, these techniques can successfully perceive the evolution of a given action by characterizing it in terms of a static set of appearance features repeated over time. However, they still operate independently on each input and hence cannot capture temporal smoothness between instances.
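Two of the ideas above, the difference between consecutive frames as a crude motion representation and the temporal pooling of per-frame predictions, can be sketched as follows. The frames are tiny made-up grayscale grids, the per-frame scores are invented numbers, and both function names are hypothetical; the cited methods operate on RGB frames or optical flow with learned networks.

```python
def frame_difference(prev, curr):
    """Pixel-wise difference between two consecutive frames of equal size."""
    return [[c - p for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]


def average_scores(per_frame_scores):
    """Temporal average pooling of per-frame prediction vectors."""
    n = len(per_frame_scores)
    return [sum(col) / n for col in zip(*per_frame_scores)]


# Hypothetical clip of three 2 x 2 grayscale frames.
frames = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
]
# One difference image per pair of consecutive frames.
diffs = [frame_difference(a, b) for a, b in zip(frames, frames[1:])]

# Hypothetical per-frame scores for two AUs, pooled into one clip-level score.
clip_score = average_scores([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

Each frame still receives its own prediction before pooling, which is the reason such schemes cannot enforce smoothness between instances.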

2.3.4 Recurrent Units

Figure 2.3: Regardless of the type of image that is processed by a CNN, e.g. grayscale (a) or RGB (b), kernels learn to produce features from single instances at each time step. There is no connection between input image i and i + 1.

Figure 2.4: Dense optical flow representation of two sequenced frames of a person in a tennis court.

Recurrent units are connectionist models that have the ability to pass information across sequence steps while processing sequential data one element at a time [42]. While feed-forward neural networks can effectively capture important features of a given input, it is impossible for them to reason about previous events in order to make a context-based prediction of a single input. Recurrent neural networks (RNN) address this issue because they do not make the assumption that all inputs are independent of each other. RNNs output a probability distribution over the next element of a sequence given their current state, which takes into account previous states. Using memory, RNNs can hold the essence of what has been seen so far in order to make better predictions. All RNNs have the form of a chain of repeating modules (also called recurrent units) with hidden states whose activation at each step is dependent on that of the previous step. Given a sequence x = (x_1, x_2, ..., x_T), a conventional RNN updates its recurrent hidden state h_t by

\[ h_t = \begin{cases} 0, & t = 0 \\ \phi(h_{t-1}, x_t), & \text{otherwise} \end{cases} \tag{2.7} \]

where φ is generally the composition of simple nonlinear functions such as a logistic sigmoid or hyperbolic tangent function applied element-wise.
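A minimal sketch of the update in Equation 2.7 is shown below. It assumes one common choice for φ, namely φ(h_{t-1}, x_t) = tanh(W x_t + U h_{t-1} + b); the weight names W, U, b, the helper functions and all numeric values are assumptions made for illustration and are not specified in the text.

```python
import math


def matvec(m, v):
    """Matrix-vector product for list-based matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_step(h_prev, x_t, W, U, b):
    """One recurrent update, assuming phi = tanh(W x_t + U h_prev + b)."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]


def run_rnn(xs, W, U, b):
    """Apply Eq. 2.7 over a whole sequence, starting from h_0 = 0."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W, U, b)
        states.append(h)
    return states


# Hypothetical parameters: 2 inputs, 2 hidden units.
W = [[0.5, -0.5], [0.25, 0.75]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = run_rnn(xs, W, U, b)
```

Feeding the last input [1.0, 1.0] alone produces a different hidden state than feeding it after the two earlier elements, which is how the hidden state carries information about the history of the sequence.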

However, as stated by Chung et al. [7], the conventional activation function has been shown to make gradient-based optimization methods struggle to capture long-term dependencies because the back-propagated gradients tend to vanish or explode. To alleviate these phenomena, Hochreiter and Schmidhuber [26] proposed a different recurrent unit called long short-term memory (LSTM). Each LSTM unit maintains information at each step in the memory cell c. A forget gate f modulates the amount of memory that should be forgotten. When a new input arrives, the input gate i defines the degree to which new memory content is added to the memory cell. The output gate o defines whether the latest cell output is propagated to the final state h. An advantage of using the memory cell and gates to control the flow of information is that the gradient gets trapped in the cell, preventing it from vanishing too quickly [60]. Unlike the vanilla RNN model, an LSTM is able to decide whether to keep the existing memory if it detects an important feature in an input sequence at an early stage. In this way, LSTMs can potentially carry information about interesting features over a long period of time.

Following the implementation presented by Graves [21], the output h_t of each LSTM unit at time t is defined by the following composite function:

\[ i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i) \tag{2.8} \]

\[ f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f) \tag{2.9} \]

\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \tag{2.10} \]

\[ o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o) \tag{2.11} \]

\[ h_t = o_t \odot \tanh(c_t) \tag{2.12} \]

iteratively from t = 1 to t = T, where σ denotes the logistic sigmoid function, and i_t, f_t, o_t, c_t and h_t are vectors that represent the values of the input gate, forget gate, output gate, memory cell, and hidden output at time t, respectively. W are the weight matrices between different gates, and b are the corresponding bias vectors. The symbol ⊙ denotes the element-wise product of vectors and the symbol ∗ represents a matrix multiplication. Figure 2.6 shows a single LSTM recurrent unit and Figure 2.5 illustrates how LSTMs can be integrated into a video classification pipeline. Note that each LSTM block represents a single cell whose output can be used to classify.
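The composite function of Equations 2.8-2.12 can be written as a single step function. The sketch below uses plain Python lists; the parameter dictionary layout, the helper names (`affine`, `gate`, `zero_params`) and the all-zero parameters are assumptions chosen so that the result is easy to verify by hand, not values from a trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    """Compute W_x * x + W_h * h + b for list-based matrices and vectors."""
    return [
        sum(w * v for w, v in zip(rx, x)) + sum(w * v for w, v in zip(rh, h)) + bias
        for rx, rh, bias in zip(W_x, W_h, b)
    ]


def gate(params, name, x, h, fn):
    """Apply fn element-wise to the affine map of the gate called `name`."""
    pre = affine(params["W_x" + name], x, params["W_h" + name], h, params["b_" + name])
    return [fn(z) for z in pre]


def lstm_step(x_t, h_prev, c_prev, params):
    i_t = gate(params, "i", x_t, h_prev, sigmoid)                       # Eq. 2.8
    f_t = gate(params, "f", x_t, h_prev, sigmoid)                       # Eq. 2.9
    g_t = gate(params, "c", x_t, h_prev, math.tanh)                     # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # Eq. 2.10
    o_t = gate(params, "o", x_t, h_prev, sigmoid)                       # Eq. 2.11
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # Eq. 2.12
    return h_t, c_t


def zero_params(n_in, n_hid):
    """Hypothetical all-zero parameters, used only to make the example checkable."""
    params = {}
    for name in "ifco":
        params["W_x" + name] = [[0.0] * n_in for _ in range(n_hid)]
        params["W_h" + name] = [[0.0] * n_hid for _ in range(n_hid)]
        params["b_" + name] = [0.0] * n_hid
    return params


h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], zero_params(2, 1))
```

With all parameters at zero every gate equals 0.5 and the candidate memory is 0, so the cell simply halves its previous content: c_t = [1.0] and h_t = [0.5 · tanh(1.0)].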


Figure 2.5: Common classification pipeline that integrates a CNN feature extraction phase and a sequence learning phase using LSTM cells.

Figure 2.7: Output shape of a vanilla convolution operation over a single-channel image (a), a multi-channel image (where channels could be other frames) (b), and a 3D convolution over multiple frames (c).

2.3.5 Spatio-temporal Architectures

In conventional CNNs, convolutions are applied over the two spatial dimensions of the feature maps only. In those cases, all kernels have the same depth as the feature maps. Figures 2.7a and 2.7b show the output of a convolution operation over a 1-channel image and over a multi-channel image, respectively. In 2013, Ji et al. [29] first proposed to use three-dimensional convolutions in the convolutional layers of a CNN so that features are learned along both the spatial and temporal dimensions of the input. The 3D convolution is achieved by convolving a kernel of size k × k × d over a cube formed by a stacked sequence of L images, where d is the temporal depth of the kernel and d < L (see Figure 2.7c). The output of a 3D convolution of a single kernel over a sequence of images is a feature volume. Weights of these kernels are replicated across the entire feature map.
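The following naive sketch illustrates why a single 3D kernel produces a feature volume rather than a feature map. It assumes single-channel frames, stride 1 and no padding, and it implements the sliding-window sum without kernel flipping, as is usual in deep learning libraries; the function name and the toy clip are made up for the example.

```python
def conv3d_single(volume, kernel):
    """Valid 3D convolution (stride 1, no padding) of one kernel over a clip.

    volume: list of L frames, each h x w; kernel: list of d slices, each k x k.
    """
    L, h, w = len(volume), len(volume[0]), len(volume[0][0])
    d, k = len(kernel), len(kernel[0])
    out = []
    for t in range(L - d + 1):            # slide along time
        plane = []
        for r in range(h - k + 1):        # slide along height
            row = []
            for c in range(w - k + 1):    # slide along width
                acc = 0.0
                for dt in range(d):
                    for dr in range(k):
                        for dc in range(k):
                            acc += volume[t + dt][r + dr][c + dc] * kernel[dt][dr][dc]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical clip: L = 4 frames of 5 x 5 pixels, frame t filled with the value t.
frames = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]
# Hypothetical kernel with k = 3 and d = 2, all weights equal to 1.
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

out = conv3d_single(frames, kernel)
shape = (len(out), len(out[0]), len(out[0][0]))  # (3, 3, 3)
```

The temporal extent of the output is L − d + 1 = 3, so the temporal axis is preserved for later layers instead of being collapsed after the first convolution.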

3DCNNs seem like a natural approach to video-based tasks because they process spatiotemporal features natively through the use of 3D convolutions. There is an absence of proposals in the literature that tackle the AU occurrence detection task with 3DCNNs; these algorithms are commonly used for video-based human action recognition (HAR). The latter involves the identification of different actions from video clips using temporal information. Popular databases in this domain include videos annotated with sport labels [31], human-human interactions such as shaking hands, human-object interactions like playing instruments [32], and body motions like a baby crawling [63]. HAR has been tackled with a plethora of traditional and deep learning approaches. The current state-of-the-art models on the HMDB and UCF101 human action databases are deep neural networks [4, 79].

One issue with these models is that they have many more parameters than regular CNNs because of the additional kernel dimension. Consequently, previous proposals defined relatively shallow architectures of, for instance, 1 [67], 3 [29], 5 [69] or 8 [68] convolutional layers with 3D kernels. The databases on which they are trained have been relatively small when compared to image-based databases like ImageNet. Therefore, pre-trained versions of these networks are almost always needed, since training 3DCNNs from scratch for a similar domain is nearly impossible given the large number of parameters. However, Carreira and Zisserman [4] achieved a breakthrough by successfully adding a third dimension to the kernels of a 2D CNN trained on ImageNet so that they can be used in 3D networks. Afterward, Hara et al. [22] studied various inflated networks in the action recognition domain, with results showing that deeper 3DCNNs are more effective.

Converting a 2D network entails inflating all the filters and pooling kernels, i.e. adding a temporal dimension. A filter of size k × k is usually transformed into a cube of shape k × k × k. Besides inflating the architecture, Carreira and Zisserman propose a way to bootstrap parameters from pre-trained networks to take advantage of the knowledge captured from the ImageNet database. To do this, they repeat the weights of the 2D kernels along the temporal dimension and then rescale them by dividing by the size of that temporal dimension. This ensures that if the network receives an input composed of a stack of copies of the same image, the output of the network is equivalent to that of its 2D counterpart. This method of inflating and transferring weights is followed for the experiments that involve spatiotemporal learning.
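The equivalence claimed for a stack of identical images can be verified numerically on a single kernel position. The sketch below is a simplified illustration: it handles one kernel and one image patch rather than a whole network, and the function names, the kernel and the patch values are hypothetical.

```python
def inflate_kernel(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along time and rescale by 1 / depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(patch, kernel_2d):
    """Response of a 2D kernel at one spatial position."""
    return sum(p * w for prow, krow in zip(patch, kernel_2d) for p, w in zip(prow, krow))


def response_3d(patches, kernel_3d):
    """Response of a 3D kernel over a temporal stack of patches."""
    return sum(response_2d(p, k) for p, k in zip(patches, kernel_3d))


# Hypothetical 3 x 3 kernel and image patch.
kernel_2d = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

kernel_3d = inflate_kernel(kernel_2d, depth=3)
static_clip = [patch] * 3  # the same image repeated along time

r2d = response_2d(patch, kernel_2d)
r3d = response_3d(static_clip, kernel_3d)
```

Both responses agree up to floating-point rounding, which is the property that allows the inflated network to start from the behavior of its pre-trained 2D counterpart.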

2.4 Related Work

As more data becomes available, deep learning has flourished as a successful approach to the facial expression recognition problem. The recent success of CNNs in image-related tasks, such as object classification and image recognition, extends to the problem of facial expression recognition. The state-of-the-art models on the BP4D and FERA17 databases are based on CNNs. Table 2.4 shows the performance metrics of the best submissions to the FERA 2017 Challenge and of its baseline classifier. From the pool of proposals that outperformed the paper baseline, only two utilized some type of context learning. Below is an overview of some of the best models proposed for the aforementioned database.

Table 2.4: Best results submitted for the Facial Expression Recognition and Analysis Challenge 2017 (FERA17) [47]. Top submissions are based on convolutional neural networks (CNN).

Authors              F₁ score   Temporal paradigm
Romero et al. [56]   0.577      Aggregation
Tang et al. [66]     0.574      -
He et al. [23]       0.507      Recurrent units
Batista et al. [2]   0.506      -
Paper Baseline       0.452      -

In the AU occurrence detection domain, Romero et al. [56] proposed the context-understanding model that showed the highest performance on the FERA17 challenge, an average F₁ score of 63.0 over the 10 AUs. They tested different configurations to combine optical flow (OF) and color information: 1) adding OF embeddings as three additional channels of an input RGB image, 2) concatenating the RGB and OF embeddings, resulting in a wider image, and 3) considering two different streams and merging them in higher layers. They also added a


Figure 2.8: Romero et al. [56] concatenate a dense optical flow representation alongside its RGB image to a VGG-16 backbone encoder for AU occurrence detection.

As per the original challenge specification, they detect the view orientation by fine-tuning the GoogLeNet [65] architecture and then feed the video into 10 AU detectors (one for each AU) for the selected view. To begin, they selected AlexNet [36], VGG-16 [62] and GoogLeNet and trained each of them on the union of three datasets for emotion classification tasks: CK+ [43], RafD [37] and Bosphorus [58]. The three networks were retrained for 5 epochs. Results showed that using the original CNN weights yielded better results than random initialization. The VGG-16-based emotion encoder obtained the best improvement, so they picked that architecture for their final model. In the end, the best configuration of OF and RGB information was the vertical concatenation method.

In second place, Tang et al. [66] introduced a preprocessing step in which they cropped out facial images by using morphology operations, such as binary segmentation, connected-component labeling and region boundary extraction. Afterward, they trained an expert network by fine-tuning the VGG-Face CNN to detect the 10 AUs as a single-label binary problem. The authors show that their system is robust to misalignment of the face in real-world environments.

In third place of the competition is a successful application of recurrent units in the AU domain by He et al. [23]. Figure 2.10 shows the network architecture. They trained a different model for each of the 10 AUs and each of the 9 possible views, for a total of 90 models. Their classification pipeline consists of a CNN model that first identifies the facial view angle. Next, a different CNN model is trained to create a 500-length feature vector that represents a particular RGB frame. They resized the frames of the database to 48×48. These CNN features are then fed to an LSTM network that learns the sequence of 15 feature vectors provided by the CNN model and provides features to two more FC layers, which finally classify an input as containing one particular AU or not.

In fourth place, Batista et al. [2] proposed a method that jointly learns AU occurrence and intensity from face images. Figure 2.11 shows the complete architecture. The network learns local changes caused by AUs through the top branch by dividing the face into 16 regions. To learn features caused by the co-occurrence of AUs, the bottom branch of the network processes the face in a holistic way. The results of these parallel processes are concatenated to be classified by fully-connected layers.

Figure 2.9: Tang et al. [66] propose an ensemble of 10 fine-tuned VGG-Face models.

Figure 2.10: He et al. [23] utilize an LSTM network to encode temporal relationships between

Figure 2.11: Batista et al. [2] feature a concatenation of region-based learning and a holistic approach.

Figure 2.12: Zhao et al. [78] propose region layers to capture more detail about the face.

Zhao et al. [78] introduced a new region layer that serves as an alternative design between locally connected layers and conventional convolutional layers in a neural network. They offer an end-to-end trainable nonlinear solution that yields good results under complex conditions. The region layer performs the following steps: 1) divide the input into an 8×8 grid of patches, 2) normalize each patch using Batch Normalization, 3) pass each patch through a ReLU activation function, 4) apply a local convolution, and 5) incorporate an identity addition with a skip connection to avoid vanishing gradients. The overall result of adding a region layer is that the model learns more specific and concentrated regions for the corresponding AUs.


Chapter 3 Solution Model

To evaluate whether temporal features can increase the learning capabilities of a neural network on AU-labeled datasets, the procedure described in this chapter is proposed. The methodology permits a relatively fair comparison between the models introduced, on the most adequate AU-labeled database.

3.1 Database Selection

The properties of FERA17 make it suitable for the temporal analysis of AUs in deep learning architectures. The database contains more than 2 million images, but only the 222,565 frontal-facing ones are selected for learning because the objective is to assess the learning capabilities of temporal features and not the generalization capabilities between distinct head poses.

FERA17 is derived from the Binghamton-Pittsburgh 3D Dynamic Spontaneous Facial Expression Database (BP4D) [76] and the Multimodal Spontaneous Emotion database (BP4D+) [77] by taking their 3D data and generating 9 different 2D views showing distinct head angles for each of the 8 emotion-eliciting tasks. Figure 3.1 shows an example of a single subject and her corresponding nine views. The data is in the form of RGB videos without audio, with a computerized black background, annotated frame by frame with 10 different AUs: AU1, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, and AU23. While the BP4D database has annotations for more than 10 AUs, the ones in the FERA17 database were selected based on their higher frequency of occurrence and sufficiently high inter-rater reliability scores [47]. There is a total of 549 videos, considering the 9 views, in the train and validation sets; test partition data is not publicly available. Label annotations are given in the form of comma-separated values (CSV) files that indicate the presence or absence of each AU as a ’1’ or ’0’, respectively. The percentage of frames annotated with a ’0’ for every AU is 8.7% in the training set and 20.5% in the validation set.

The train partition includes the videos of 41 participants (56.1% female, 49.1% white, ages 18-29) [47] recorded in 328 sessions from which the authors extracted a range of emotions and facial expressions. In this set, action units were annotated when they reached the A-level of intensity, i.e. the lowest intensity on the five-level FACS scale (A to E), and offsets when they dropped below it. To illustrate the variability of the occurrences, Figure 3.2 shows the fraction of coded frames in which each AU occurred. Dashed lines in the boxes show the mean values, and solid lines show the median values of the occurrence rate. Outliers are data values beyond the ends of the whiskers and are shown as black dots. This visualization method is selected because it does not make any assumptions about the underlying statistical distribution of the data, which is unknown. The fraction of coded frames in which an AU occurred averaged 35.4% and ranged from 17% to 59%. There seem to be plenty of outliers for the action units with a lower-than-average occurrence rate, namely AU1, AU4, AU15, AU17, and AU23. These observations increase the distance between the mean and median of their occurrence rate and, consequently, the general average across all action units. The AUs with a higher occurrence rate per session, e.g. AU6, AU7, AU10, AU12, and AU14, have larger interquartile ranges, meaning that their rate changes drastically between videos even though their mean is higher.

Figure 3.1: Nine different views of a subject. The FERA17 database consists of 2,952, 1,431 and 1,080 videos in the training, validation and test sets, respectively. Image retrieved from [66].

The validation set incorporates the videos of 20 participants in 159 sessions, with demographics similar to those of the train set. Action unit onsets were annotated when they reached the B-level, i.e. the second lowest intensity on the five-level FACS scale (A to E), and offsets when they dropped below it. Table 3.1 shows the number of occurrences of each AU in the train and validation sets. The difference in the annotation threshold between the train and validation sets could explain the significantly lower appearance of AU1 in the latter. The ranking of all action units based on their occurrence, presented in Table 3.2, shows that AU1 drops four places between the training and the validation set. Applying the same A-level threshold to the validation set would likely have raised its occurrence and reduced class asymmetry. In this data set, AU base rates averaged 26.2%, ranging from 5% to 60%. Figure 3.2 shows a similar distribution between the train and validation sets with regard to the higher occurring AUs.
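The comparison of rankings between partitions can be sketched as follows. The occurrence counts are made-up placeholders for a subset of AUs, not the values of Table 3.1, and the function names are hypothetical.

```python
def rank_by_occurrence(counts):
    """Map each AU to its rank (1 = most frequent) given occurrence counts."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {au: position for position, au in enumerate(ordered, start=1)}


def rank_shift(train_counts, valid_counts):
    """Positive values mean the AU drops places in the validation set."""
    train_rank = rank_by_occurrence(train_counts)
    valid_rank = rank_by_occurrence(valid_counts)
    return {au: valid_rank[au] - train_rank[au] for au in train_counts}


# Made-up counts for illustration only (not the values of Table 3.1).
toy_train = {"AU1": 300, "AU4": 200, "AU12": 900, "AU15": 250, "AU23": 150}
toy_valid = {"AU1": 40, "AU4": 120, "AU12": 800, "AU15": 130, "AU23": 90}
shifts = rank_shift(toy_train, toy_valid)
print(shifts["AU1"])
```

In this toy example AU1 moves from second to fifth place, a drop of three positions.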

A deeper analysis shows that the database contains 250 frames that are annotated with inconsistent labels. Instead of being an array of binary elements, these labels contain ’9’s, e.g. [1, 0, 9, 9, 9, 1, 9, 9, 9]. A label value of 9 can make the training of a deep learning model diverge. Even though these cases are only present in the validation set and are not significant for the previous analysis, every instance with an inconsistent label is eliminated from the validation set to avoid affecting any subsequent database manipulations.
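A minimal sketch of this cleaning step is given below. It assumes that each frame is available as a pair of a frame identifier and a list of integer labels; this representation and the function name are hypothetical and chosen only for illustration.

```python
def split_by_label_validity(frames):
    """Separate frames with strictly binary labels from inconsistent ones.

    `frames` is an iterable of (frame_id, labels) pairs; this layout is an
    assumption of the example, not the format used in the database.
    """
    kept, dropped = [], []
    for frame_id, labels in frames:
        target = kept if all(value in (0, 1) for value in labels) else dropped
        target.append((frame_id, labels))
    return kept, dropped


toy_frames = [
    (1, [1, 0, 0, 1, 0, 1, 0, 0, 0]),
    (2, [1, 0, 9, 9, 9, 1, 9, 9, 9]),  # inconsistent label quoted in the text
    (3, [0, 0, 0, 0, 0, 0, 0, 0, 0]),
]
kept, dropped = split_by_label_validity(toy_frames)
print(len(kept), len(dropped))
```

Returning the dropped frames as well makes it easy to check that the number of removed instances matches the expected count.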

In multilabel classification, the prediction for an instance is a set of labels; therefore, it can be fully incorrect, partially incorrect, or fully correct. To assess performance, a common practice among researchers is to measure how far the predictions are from the actual class labels. To address the notion of a partially correct prediction, one strategy is to compute the average difference between the ground truth and the prediction for each sample and then average over all samples. This method is called example-based evaluation. Another strategy for evaluating data partitions is to define a label-based evaluation, in which each label is evaluated separately and then


Figure 3.2: Notched box plot displaying the distribution of the fraction of coded frames in which an AU occurred, or occurrence rate, across all the videos in the train (a) and validation (b) sets. The dashed line in each box indicates the mean value of occurrence rate whereas the
