Table 6.1 Non-verbal behaviour channels
Channel              References
Head movement        Buckingham et al. (2014); Rothwell et al. (2006); van Amelsvoort et al. (2013); Won et al. (2014)
Posture / Body       Patterson et al. (1980); Won et al. (2014)
Eye gaze             Buckingham et al. (2012, 2014); Doherty-Sneddon and Phelps (2007); Emmorey et al. (2008); Ishii et al. (2013); Khandait et al. (2011); Rothwell et al. (2006)
Eye contact          Patterson et al. (1980); Vrij et al. (2000)
Eyebrow position     Khandait et al. (2011)
Eye openness         Buckingham et al. (2012, 2014); Khandait et al. (2011); Rothwell et al. (2006)
Physiological        Buckingham et al. (2012, 2014); Haapalainen et al. (2010); Rothwell et al. (2006, 2007)
EEG, ECG and GSR channels are not used in this research, as they require body-attached sensors. Equally, high-speed or infra-red cameras will not be used. This choice has been made to ensure that the technology developed and evaluated could be deployed cheaply and conveniently in a real-world classroom environment.
6.3 Face and facial-feature detection
Face and facial-feature detection is an important component of any computational method for the analysis of NVB, as the majority of NVB channels highlighted in Table 6.1 are expressed by the face or head. Surveys of methods for detecting faces in images (Bakshi and Singhal, 2014; Gupt and Sharma, 2014; Hatem et al., 2015; Lu et al., 2012; Yang et al., 2002) show the diverse solutions available, each with benefits and limitations. Two approaches which regularly feature in the literature are artificial neural networks (Haykin, 1994), as used in Buckingham et al. (2012, 2014); Gupt and Sharma (2014); Rothwell et al. (2006, 2007), and Haar cascades (Viola and Jones, 2004), as used in Castrillón et al. (2010); Castrillón-Santana et al. (2008).
110 Modelling and classifying patterns of non-verbal behaviour
Both artificial neural networks and Haar cascades have a role to play in this research. Due to the specific context in which image data is recorded, the literature suggests that a two-step combined process will play to the strengths of each approach while mitigating their weaknesses. This section presents a description and evaluation of both approaches.
6.3.1 Artificial neural networks
The literature (Buckingham et al., 2012, 2014; Gupt and Sharma, 2014; Lu et al., 2012; Rothwell et al., 2006, 2007; Yang et al., 2002) shows that artificial neural networks (ANN) (Chapter 5), specifically multi-layer perceptron (MLP) networks, have commonly been used to classify whether regions of an image contain a face. An advantage of ANNs for face detection is that the classifier can be trained to recognise faces in almost any position (Yang et al., 2002). Unlike other methods, such as Haar cascades (Viola and Jones, 2004), where facial landmarks are important, an ANN can learn arbitrary discriminant patterns to identify a face.
A limitation of MLPs is that they can only accept an input of a pre-defined length. When handling image regions, this input is a vector of pixel values. The fixed input length means that the region presented for classification is expected to be of a fixed height and width (Gupt and Sharma, 2014; Lu et al., 2012; Yang et al., 2002). Whether this aspect of the MLP is problematic depends on the specific application, the context of use and the conditions under which the image data was recorded.
For face detection, the fixed input of an MLP becomes more problematic because the apparent size of facial features varies with distance from the camera. In situations akin to CCTV monitoring, where there is considerable distance between camera and subject, a region can be estimated inside which all faces will fit. In this scenario the height and width of the region of interest (ROI) can be well defined.
However, as the face moves closer to the camera the scale variance becomes more extreme. In the context of a web camera several inches away from the subject, an action such as leaning backwards can cause the face to halve in size.
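The effect of distance on scale can be illustrated with a short calculation. The sketch below assumes an idealised pinhole-camera model, which is not stated in the text, and the focal length, face width and distances are hypothetical values chosen only to make the arithmetic visible.

```python
def apparent_width_px(real_width_cm: float, distance_cm: float, focal_px: float) -> float:
    """Projected width in pixels under an idealised pinhole-camera model."""
    return focal_px * real_width_cm / distance_cm


FOCAL_PX = 600.0        # hypothetical focal length, expressed in pixels
FACE_WIDTH_CM = 15.0    # hypothetical physical face width
LEAN_BACK_CM = 30.0     # hypothetical backwards movement of the subject

# Web camera scenario: the subject starts 30 cm from the lens.
web_near = apparent_width_px(FACE_WIDTH_CM, 30.0, FOCAL_PX)
web_far = apparent_width_px(FACE_WIDTH_CM, 30.0 + LEAN_BACK_CM, FOCAL_PX)

# CCTV-like scenario: the same movement, starting 500 cm from the lens.
cctv_near = apparent_width_px(FACE_WIDTH_CM, 500.0, FOCAL_PX)
cctv_far = apparent_width_px(FACE_WIDTH_CM, 500.0 + LEAN_BACK_CM, FOCAL_PX)

web_ratio = web_far / web_near      # face halves in apparent size
cctv_ratio = cctv_far / cctv_near   # face shrinks by only a few percent

print(f"web camera: {web_near:.0f}px -> {web_far:.0f}px (ratio {web_ratio:.2f})")
print(f"CCTV:       {cctv_near:.1f}px -> {cctv_far:.1f}px (ratio {cctv_ratio:.2f})")
```

Under these assumed values the same physical movement halves the face at web camera range but changes it by roughly six percent at CCTV range, which is why a fixed region size is workable in one setting and not in the other.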
In the literature (Buckingham et al., 2012, 2014; Rothwell et al., 2006, 2007), practical applications have assumed a fixed size of face by controlling the distance and relative positions of camera and subject. Doing so ensures the face fits within known height and width boundaries, meaning the image can be searched efficiently for regions containing the target feature. However, if the size of the feature were not known, the same approach would be inefficient. The image would need to be searched many times, each time with the height and width of the region set to a different size.
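The cost of searching at many region sizes can be sketched by counting the candidate regions a sliding window would visit. The image size, stride, window sizes and scale step below are hypothetical, and square windows are assumed for simplicity; the cited systems may search differently.

```python
def count_windows(img_w: int, img_h: int, win: int, stride: int) -> int:
    """Number of win x win regions visited by a sliding window with the given stride."""
    if win > img_w or win > img_h:
        return 0
    return ((img_w - win) // stride + 1) * ((img_h - win) // stride + 1)


def multiscale_sizes(min_win: int, max_win: int, scale_step: float) -> list:
    """Window sizes from min_win up to max_win, growing geometrically."""
    sizes = []
    size = float(min_win)
    while size <= max_win:
        sizes.append(int(round(size)))
        size *= scale_step
    return sizes


IMG_W, IMG_H = 640, 480   # hypothetical web camera frame
STRIDE = 4                # hypothetical step between neighbouring windows

# Controlled set-up: the face is known to fit a 100 x 100 region.
fixed_search = count_windows(IMG_W, IMG_H, 100, STRIDE)

# Uncontrolled set-up: every size from 50 to 400 pixels must be tried.
sizes = multiscale_sizes(50, 400, 1.25)
multi_search = sum(count_windows(IMG_W, IMG_H, s, STRIDE) for s in sizes)

print(f"fixed size: {fixed_search} regions")
print(f"{len(sizes)} sizes: {multi_search} regions")
```

For an MLP, each of these regions would additionally have to be resampled to the network's fixed input size before classification, which adds to the cost of every extra scale.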
One solution to achieve scale invariance is to use principal component analysis (PCA) (Pearson, 1901). PCA is a technique related to factor analysis which re-expresses the data as a set of uncorrelated linear factors, ordered by the amount of variance each explains, and discards the factors that explain little variance. The effect of PCA is to reduce the overall dimensionality of the input data while retaining the most important factors, the principal components. PCA is commonly used (Bajwa et al., 2009; Cooray and O’Connor, 2004; Kamencay et al., 2013; Xiao, 2010) in image classification problems where the raw pixel data provides a large input vector with many redundant features and the classifier needs to learn meaningful pixel combinations.
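The dimensionality reduction performed by PCA can be illustrated on a toy example. The sketch below is not the implementation used in the cited work: it recovers only the first principal component, uses power iteration in place of a full eigendecomposition to stay dependency-free, and the four-"pixel" sample vectors are invented so that three pixels are perfectly correlated and one is constant.

```python
import math


def centre(rows):
    """Subtract the per-feature mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of mean-centred rows."""
    n, dim = len(centred), len(centred[0])
    return [[sum(row[i] * row[j] for row in centred) / (n - 1)
             for j in range(dim)] for i in range(dim)]


def first_component(cov, iterations=100):
    """Dominant eigenvector of cov, found by power iteration."""
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented data: pixels 0-2 move together, pixel 3 never changes.
samples = [[t, 2.0 * t, -t, 3.0] for t in (0.0, 1.0, 2.0, 3.0, 4.0)]

centred = centre(samples)
cov = covariance(centred)
pc1 = first_component(cov)

# Each four-value sample is replaced by a single score along pc1.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centred]

total_variance = sum(cov[i][i] for i in range(len(cov)))
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
explained = score_variance / total_variance

print("first component:", [round(x, 3) for x in pc1])
print("explained variance fraction:", round(explained, 6))
```

Because the invented pixels are fully redundant, one component captures all of the variance; real image data would need more components, and a library implementation would normally be preferred.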
While PCA provides a solution to the fixed input width problem with MLPs, the extreme scale variation encountered when camera and subject are only inches apart still causes a high degree of error in recognising face patterns.
For this reason it is necessary to split the process of feature detection and feature-behaviour classification between two different technologies: a highly scale-tolerant algorithm to locate the face, eyes, nose and mouth within the image, and a pattern classifier to learn comprehension-indicative patterns of NVB.
6.3.2 Haar cascades
Haar cascades are collections of weak classifiers designed to recognise Haar-like features, simple wavelet-like patterns, within images (Viola and Jones, 2004). Collectively, the ensemble of classifiers forms a meta-algorithm capable of highly robust classification (Hatem et al., 2015). Haar cascades have gained popularity over recent years as they perform exceptionally well at detecting faces at a known angle but of varying scale.
First described by Viola and Jones (2004), these simple but effective object recognition classifiers have become ubiquitous in modern technology. Haar cascades search images for simple patterns of intensity (Figure 6.1), which are learned by training over large sets of specifically designed training images. To train a face-detection Haar classifier, a large set of images containing faces on random backgrounds, often with small random distortions applied to the face, is created.
Figure 6.1 Haar-like features (Hatem et al., 2015)
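A single Haar-like feature of the kind shown in Figure 6.1 can be sketched as the difference between the pixel sums of two adjacent rectangles. The sketch below uses the integral image (summed-area table) that Viola and Jones (2004) introduce to make each rectangle sum a four-lookup operation; the toy images, the function names and the sign convention (left minus right) are assumptions made for illustration.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def vertical_edge_feature(ii, x, y, w, h):
    """Two-rectangle feature: left half minus right half of the window."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy images: a bright-to-dark vertical edge at two scales, and a flat patch.
edge_small = [[10, 10, 0, 0] for _ in range(4)]
edge_large = [[10] * 4 + [0] * 4 for _ in range(8)]
flat = [[10] * 4 for _ in range(4)]

small_response = vertical_edge_feature(integral_image(edge_small), 0, 0, 4, 4)
large_response = vertical_edge_feature(integral_image(edge_large), 0, 0, 8, 8)
flat_response = vertical_edge_feature(integral_image(flat), 0, 0, 4, 4)

# Dividing by window area gives the same response at both scales.
small_normalised = small_response / (4 * 4)
large_normalised = large_response / (8 * 8)

print(small_response, large_response, flat_response)
print(small_normalised, large_normalised)
```

The rectangle sum costs the same four lookups whatever the rectangle size, which is what allows the same feature to be evaluated at any scale without resampling the image.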
A major advantage of Haar cascades over neural networks, when used to locate a face, is the speed of detection for faces of varying scale. The ensemble of weak classifiers in the cascade is able to disregard the background of the image quickly and focus search time on the most face-like regions of the image (Hatem et al., 2015). The patterns of intensity can be applied at any scale in the image, meaning that no strict control of the relative positions of camera and subject needs to be enforced. However, as observed in Gupt and Sharma (2014), Haar cascades perform best on forward-facing faces and, unlike neural networks, they do not perform well on faces viewed at different angles.
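The way a cascade discards background quickly can be sketched as a sequence of stages in which a window is rejected at the first stage it fails. In the sketch below each stage is a single invented intensity test with a hypothetical threshold, standing in for the boosted groups of Haar-like features used in a trained cascade, and the one-dimensional "windows" are toy data.

```python
from typing import Callable, List, Sequence, Tuple

Window = Sequence[float]
Stage = Tuple[Callable[[Window], float], float]  # (feature function, threshold)


def contrast(window: Window) -> float:
    """Range of intensities in the window."""
    return max(window) - min(window)


def left_right_difference(window: Window) -> float:
    """Mean of the right half minus mean of the left half."""
    half = len(window) // 2
    return sum(window[half:]) / half - sum(window[:half]) / half


def centre_mean(window: Window) -> float:
    """Mean of the two central values."""
    mid = len(window) // 2
    return (window[mid - 1] + window[mid]) / 2


# Hypothetical stages, ordered so the cheapest rejection comes first.
STAGES: List[Stage] = [
    (contrast, 50.0),
    (left_right_difference, 100.0),
    (centre_mean, 100.0),
]


def run_cascade(window: Window, stages: List[Stage]) -> Tuple[bool, int]:
    """Return (accepted, stages evaluated), stopping at the first failed stage."""
    for index, (feature, threshold) in enumerate(stages, start=1):
        if feature(window) < threshold:
            return False, index
    return True, len(stages)


background = [100.0] * 8
texture = [0.0, 255.0, 0.0, 255.0, 0.0, 255.0, 0.0, 255.0]
target = [20.0, 20.0, 20.0, 20.0, 200.0, 200.0, 200.0, 200.0]

results = [run_cascade(w, STAGES) for w in (background, texture, target)]
evaluated = sum(count for _, count in results)

print(results)
print(f"{evaluated} stage evaluations instead of {3 * len(STAGES)}")
```

Only the window that passes every stage pays the full cost; flat background is dismissed after one test, which is the behaviour that keeps search time concentrated on the most promising regions.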