
Feature extraction techniques applied to visual speech recognition systems can be classified into the following three categories [14]:

i) Shape based or contour based,

ii) Appearance based or intensity based, and

iii) A combination of appearance and shape based features.

The following sections describe these techniques, together with their advantages and disadvantages.

6.1.1 Shape Based Feature Extraction

Shape based feature extraction techniques represent the mouth by the geometrical shape and dimensions of the lips, such as height and width, as shown in Figure 6.1(a). This information is encoded with respect to a standard set of mouth shapes already stored in the system. In 1984, Petajan [41] proposed the first visual lip-reading system based on shape features. In this system, the speaker's mouth height, width, perimeter and area were extracted from binary images and used as features for the classifier. Shape based feature extraction requires the visual speech articulators to be located accurately. The approach has the advantage of low feature dimensionality; however, the additional localization and tracking it requires may have an undesirable effect on the visual speech recognition system [61].
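The four measurements named above can be illustrated with a minimal sketch. It assumes that the mouth region is already available as a binary mask; the function name `shape_features` and the boundary-pixel definition of the perimeter are illustrative choices, not details taken from Petajan's system.

```python
import numpy as np

def shape_features(mask):
    """Return height, width, area and perimeter (in pixels) of a binary mouth mask."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return {"height": 0, "width": 0, "area": 0, "perimeter": 0}
    # Rows and columns that contain at least one foreground pixel.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    height = int(rows[-1] - rows[0] + 1)
    width = int(cols[-1] - cols[0] + 1)
    area = int(mask.sum())
    # Assumption: a perimeter pixel is a foreground pixel with at least one
    # background pixel among its four direct neighbours.
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())
    return {"height": height, "width": width, "area": area, "perimeter": perimeter}
```

The result is a four-dimensional feature vector per frame, which illustrates the low dimensionality of shape based features.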

Figure 6.1: (a) Shape based features represent the physical shape of the mouth, such as height and width. (b) Appearance based features consider the complete ROI.

Researchers [36, 64, 127, 171, 172] have generally used physical measurements of the visual articulators, such as mouth width, height and area, to represent the speaker's mouth. Kaynak et al. [173] provided a detailed description and a comparative analysis of lip geometric features. To extract geometric features such as the height, width and contour of the lips, artificial markers have been applied to the speaker's lips [64, 173]. Such markers are unsuitable for most scenarios in which today's systems are used.

In recent studies, model based techniques have been considered for AVSR. In these approaches, a geometric model of the lip contour is used. A typical example is the active shape model (ASM) [174], in which the inner and outer contours of the lips are described by a labelled set of points on the mouth. Features extracted by model based approaches can be divided into two categories. In the first category, the parameter values that define the lip contours are used directly as the feature vector. In the second category, measurements such as mouth height, width, perimeter, area and the ratio between width and height form the feature vector [24, 175, 176]. Other examples of model based techniques are deformable templates [177], the active appearance model (AAM) [17], multi-scale spatial analysis (MSA) [24] and smart snakes [178, 179]. The AAM is an extension of the ASM that combines a shape model with a statistical model of the grey level surface of the ROI. The AAM has been demonstrated to outperform the ASM [24]. Wojdel and Rothkrantz [180] introduced a model free feature extraction method to describe the shape of the lips. Known as lip geometry estimation (LGE), it is based on an image segmentation technique that detects the pixels belonging to the lips.

The major drawback of the AAM and other model based approaches is that they require manual annotation of the training samples. The performance of the AAM decreases if samples of the speaker are not included in the training set. Model based techniques are also sensitive to facial skin colour and facial hair, but are insensitive to illumination variations and image noise [75].

6.1.2 Appearance Based Feature Extraction

The major shortcoming of shape based feature extraction techniques [17, 174] is that they consider only the geometrical information of the lips to represent mouth shapes. They analyse the lip contour and ignore the other speech articulators. Appearance based representations, in contrast, work with low level features: they use all the pixel values available in consecutive images of the ROI, as shown in Figure 6.1(b). In addition to the lips, other speech articulators such as the teeth, tongue, jaw and some of the muscles surrounding the mouth, which are informative about visual speech [60], are implicitly included. The raw pixel values in the ROI contain the salient speech information and can be used directly as a feature vector [181]. However, a high dimensional ROI contains a large number of pixels, which can overburden the statistical classifier. For this reason, image processing and transform techniques are applied to the ROI image to extract a compact and meaningful feature vector.

Principal component analysis (PCA) is a well-known dimensionality reduction technique that has been used for lip-reading by various researchers [61, 63, 181-186] with good results. Independent component analysis (ICA), a technique superficially related to PCA, has also been used for lip-reading [162]. However, traditional PCA has outperformed the ICA representation for lip-reading [162].
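As an illustration of how PCA reduces a ROI to a compact feature vector, the following minimal sketch flattens each ROI image into a vector, learns the principal axes from a training set and projects frames onto them. The function names `fit_pca` and `project` and the use of a singular value decomposition are assumptions made for this example; the cited works may differ in detail.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Learn the mean ROI and the leading principal axes from training frames."""
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(frames, mean, components):
    """Map each ROI image to its low dimensional PCA feature vector."""
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    return (X - mean) @ components.T
```

For example, twenty 8 x 8 ROI images (64 pixels each) projected onto five components give a 20 x 5 feature matrix.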

In other approaches, researchers have used linear image transforms such as the discrete wavelet transform (DWT) [185], vector quantization [187] and the discrete cosine transform (DCT) [33, 63, 184, 188, 189] to compute features from the ROI. These data compression techniques remove statistical redundancies in the image by representing it with transform coefficients. Fast algorithms for such transforms, for example the fast Fourier transform (FFT), run faster when the image is square or its size is a power of 2, so they can be used in a real time implementation of a VSR system [190]. Potamianos et al. [35] applied the first and second order derivatives of a DCT on a ROI to capture the visual speech features. The advantage of intensity based feature extraction approaches over shape based approaches is that they do not need an a priori statistical model of the lips. This advantage leads to computationally efficient VSR systems [169].
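The idea of keeping only a few transform coefficients can be sketched as follows. The example computes an orthonormal two-dimensional DCT-II with plain NumPy and keeps the low frequency block in the top-left corner as the feature vector. The block size `keep` and the function names are illustrative assumptions; the cited systems may select coefficients differently.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # the constant (DC) row has a different scale
    return m

def dct_features(roi, keep=4):
    """Return the keep x keep lowest frequency DCT coefficients of a ROI image."""
    roi = np.asarray(roi, dtype=float)
    coeffs = dct_matrix(roi.shape[0]) @ roi @ dct_matrix(roi.shape[1]).T
    return coeffs[:keep, :keep].ravel()
```

With `keep=4`, an ROI of any size is reduced to 16 coefficients, which is the kind of compact representation the paragraph above refers to.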

6.1.3 Hybrid Features

Hybrid feature extraction combines appearance and shape based representations. The approach rests on the hypothesis that the high level features describing the geometric shape of the mouth and the low level features capturing pixel level information are complementary, so that combining them can improve performance. Luettin et al. [191] presented a speech reading system that combined shape based ASM features with intensity based PCA features. A similar combined feature extraction method was used by Chiou and Hwang [182], who used the snake contour model with PCA. Chan [192] built his feature vector from geometric features and PCA features. These approaches concatenate both sets of features into a single feature vector. In addition to the above approaches, the active appearance model (AAM) implicitly extracts both appearance and shape features [17].
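The concatenation step can be sketched as below. The per-feature standardisation applied before concatenation is an assumption added for this example, so that features on different scales (for instance pixel counts and PCA coefficients) contribute comparably; the cited systems do not necessarily normalise in this way.

```python
import numpy as np

def standardise(block):
    """Scale each feature (column) to zero mean and unit variance over time."""
    block = np.asarray(block, dtype=float)
    std = block.std(axis=0)
    std[std == 0] = 1.0  # leave constant features at zero instead of dividing by 0
    return (block - block.mean(axis=0)) / std

def hybrid_features(shape_feats, appearance_feats):
    """Concatenate per-frame shape and appearance features into one vector per frame.

    Both inputs have one row per frame: (T, Ds) and (T, Da).
    """
    return np.hstack([standardise(shape_feats), standardise(appearance_feats)])
```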

6.1.4 Motion Based Feature Representation

In 1994, Goldschen et al. [127] introduced dynamic features to represent the actions of the oral cavity during speech. Besides seven static features of the oral cavity, such as width, height, area and perimeter, they extracted the dynamics of each feature by computing its change between consecutive frames and then calculating the second derivative. Their results indicated that the dynamic features are more discriminative than the static features.
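The frame-to-frame change and its second derivative can be approximated with finite differences, as in the following minimal sketch. The function name `add_dynamics` and the choice to pad the first frame so that every frame keeps a feature vector of the same length are assumptions made for this example.

```python
import numpy as np

def add_dynamics(static):
    """Append first and second temporal differences to static features.

    static: array of shape (T, D), one row of D static features per frame.
    Returns an array of shape (T, 3 * D): [static, delta, delta-delta].
    """
    static = np.asarray(static, dtype=float)
    # Change of each feature between consecutive frames (zero for the first frame).
    delta = np.diff(static, axis=0, prepend=static[:1])
    # Change of the change, a discrete second derivative.
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([static, delta, delta2])
```

For a single feature taking the values 0, 1, 4, 9 over four frames, the first differences are 0, 1, 3, 5 and the second differences are 0, 1, 2, 2.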

Rosenblum and Saldaña (1998) [193] distinguished static from dynamic visual speech information. Features extracted from static mouth images are static (pictorial) features; the appearance based and shape based features belong to this category. Dynamic features directly represent the motion of the mouth through time varying visual information. Recently, Yau et al. [26] computed dynamic features to represent visual speech. In their approach, mouth motion is represented by a motion history image (MHI), computed by image subtraction of consecutive images.
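A motion history image built from differences of consecutive frames can be sketched as follows. This is a simplified time-stamp variant written for illustration: the threshold value, the scaling to the range [0, 1] and the function name are assumptions, and the formulation used by Yau et al. may differ.

```python
import numpy as np

def motion_history_image(frames, threshold=0.1):
    """Collapse a frame sequence (T, H, W), T >= 2, into one motion history image.

    Pixels that moved recently are bright, pixels that moved early are dim,
    and pixels that never moved are zero.
    """
    frames = np.asarray(frames, dtype=float)
    n = len(frames)
    mhi = np.zeros(frames.shape[1:], dtype=float)
    for t in range(1, n):
        # Image subtraction of consecutive frames, thresholded to a motion mask.
        moving = np.abs(frames[t] - frames[t - 1]) > threshold
        mhi[moving] = t  # stamp moving pixels with the current time index
    return mhi / (n - 1)  # scale so the most recent motion has value 1
```

The whole sequence is thereby summarised in a single image, to which static feature extraction techniques can then be applied.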

As described in Chapter 4, the motion of an object in an image sequence provides sufficient information. An optical flow motion tracking approach is adopted in this research as an alternative to the static analysis of the input video of a ROI. The motivation for using motion features is that they provide the temporal characteristics of visual speech information or lip positions [194] and describe the actual movement of the mouth. A further reason for using temporal image data is the finding of Rosenblum and Saldaña [193] that time-varying information is more important for visual speech perception than time independent information. Optical flow has been used for lip-reading since the beginning of this research domain. However, the computational complexity and low performance of optical flow algorithms restricted its use. The availability of powerful CPUs and GPUs and of very sensitive algorithms now allows optical flow to be used in real time applications.
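To make the notion of optical flow concrete, the sketch below estimates a single displacement vector for the whole ROI between two frames, using the Lucas-Kanade least squares formulation of the brightness constancy constraint Ix*u + Iy*v + It = 0. It is a deliberately simplified stand-in chosen for this example: it is not the optical flow algorithm used in this research, which is described in Chapter 4, and practical systems estimate a dense or block-wise flow field rather than one global vector.

```python
import numpy as np

def lucas_kanade_global(prev, curr):
    """Estimate one (u, v) displacement in pixels between two grey level frames.

    u is the motion along the x axis (columns), v along the y axis (rows).
    Assumes small motion and smooth image intensities.
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    # Spatial gradients of the average frame; np.gradient returns (d/dy, d/dx).
    iy, ix = np.gradient((prev + curr) / 2.0)
    it = curr - prev  # temporal gradient
    # Solve Ix*u + Iy*v = -It for (u, v) in the least squares sense.
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    (u, v), *_ = np.linalg.lstsq(a, -it.ravel(), rcond=None)
    return float(u), float(v)
```

Applied to a smooth blob that shifts by one pixel to the right between the two frames, the estimate for u is close to 1 and the estimate for v is close to 0.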

The following section describes the feature extraction and classification techniques applied to the concise images computed in Chapters 4 and 5. The concise images were computed by means of DMHIs and non-overlapping block based approaches.
