
The two fundamental principles of ensemble systems, diversity and vote combination, give ensemble-based approaches considerable flexibility to tackle the most challenging machine learning problems. Because ensemble-based systems intersect with such a wide spectrum of disciplines, from computer science to statistics and data mining, many varieties of strategies have become available to innovate with. Ensemble solutions based on CTF strategies continue to provide effective mechanisms for trading off the computational load of inducing decision boundaries on massive and complex data streams against accuracy. Informed bootstrapping methods used on cascades enable large reductions in the data that an inductive algorithm explicitly needs to learn from, while generating real-time capable classifiers. Meanwhile, learning from imbalanced datasets remains an ongoing issue for accurate classification, one that cascaded classifiers can mitigate to a degree.

Arguably, one of the most relevant and interesting challenges of late has been the need for classifier adaptability to changing environments. Much recent research has involved ensemble-based learning and its natural applicability to this as yet unresolved problem. The capacity of ensemble systems to be partitioned into clusters, whereby each cluster becomes responsible for encoding specific concept drifts, holds promise.

The succeeding chapters explore the various strategies developed in the course of this research that leverage coarse-to-fine learning principles in order to address some of the open questions and challenges outlined.


Chapter 3

Methodology

This chapter describes the general methods that were common to all experiments in this research. It covers the overview of the major datasets, feature types and their extraction methods as well as the types of weak learners employed. Training procedures and classifier evaluation methods are also discussed together with the description of the environment and the hardware used. Since the methodology here is not exhaustive, the additional specifics and details of each experiment are expounded on in the relevant chapters.

3.1 Face Detection Datasets and Feature Types

A total of 15,000 facial images were collected for training face detection classifiers. The images were gathered from various publicly available datasets. The sources were FERET¹, Yale Face Database B [53] and the face database from the Vision Group of Essex University. Where required, the images were cropped to isolate the facial features and exclude other areas of the head as much as possible. All facial images were extracted and vectorized to a reduced size of 24×24 pixels. The 24×24 pixel size has been a standard dimension² for training facial images.

The images were predominantly frontal without rotation or articulation; however, particularly from the Yale dataset, images with some articulation, rotation and strong illumination variations that resulted in feature occlusion were incorporated into the corpus. In addition, the images from the Essex University Vision Group contained facial gestures. The accuracy of a classifier is limited by the quality of the training sets used, and it was understood from the outset that, due to the large in-class variation, the resulting classifiers would not compare to state-of-the-art face detectors; however, this was not the goal. The incorporation of such challenging images was seen as an interesting test of the algorithms' ability to handle outliers. The classifiers were trained on the images without any pre-processing steps to counter the effects of the differing lighting conditions, and no additional synthetic data was generated to fine-tune the classifiers for the test dataset.

¹ URL: http://www.frvt.org/FERET/default.htm

² This was the dimension used by Viola and Jones [173] and claimed to be the standard by Huang et al. [66]. 24×24 is also the default setting for the OpenCV implementation for face detection training. However, some notable face detectors have used different kernel sizes: Rowley [135] 20×20, Poggio and

The negative dataset consisted of a total of 2500 images. It was a mixture of images downloaded from the internet, covering a variety of patterns, and high resolution photographs of different landscapes and backgrounds. Using the sliding window approach at different positions and scales on the original images, hundreds of millions of negative samples were generated from these 2500 images. The bootstrapping approach of Sung and Poggio [155], discussed in Section 2.3.1, was used at each layer: all correctly classified negative samples were removed from the current negative training set and replaced by new negative samples that the current classifier misclassified as faces (false positives). Viola and Jones [173] replace negatives with new subwindows from an image whose positions and scales are randomly selected. The experiments here did not randomly select parts of an image, but instead sequentially scanned the image using a subwindow. Each subwindow representing a negative sample scanned a different negative image for samples. This approach carried lower computational overheads and was also easier to repeat in order to validate the training code.
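The per-layer negative refresh described above can be sketched roughly as follows. This is a minimal illustration and not the training code used in this research: `refresh_negatives`, the `classifier` callable (returning True when a sample is labelled as a face) and `candidate_stream` (an iterator standing in for the sequential subwindow scan) are hypothetical names introduced here.

```python
_EXHAUSTED = object()  # sentinel marking the end of the candidate stream


def refresh_negatives(negatives, classifier, candidate_stream, target_size):
    """Rebuild the negative set for the next layer (illustrative sketch).

    negatives        -- current negative training samples
    classifier       -- assumed callable: True means "classified as a face"
    candidate_stream -- assumed iterator of negatives from a sequential scan
    target_size      -- desired size of the refreshed negative set
    """
    # Correctly rejected negatives are dropped; only false positives remain.
    kept = [sample for sample in negatives if classifier(sample)]

    # Refill sequentially with new samples the classifier still gets wrong.
    while len(kept) < target_size:
        sample = next(candidate_stream, _EXHAUSTED)
        if sample is _EXHAUSTED:
            break  # no more candidates; return a smaller set
        if classifier(sample):
            kept.append(sample)
    return kept
```

Because the stream is consumed in order and never rewound, a repeated run over the same images yields the same negative sets, which matches the repeatability argument made above.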

The test set for non-adaptive classifiers was the standard benchmark CMU MIT dataset [136]. This dataset consists of 130 gray-scale images which contain 507 faces. Using the raster scanning method discussed below, 72,654,174 negative samples can be generated from all the images. This dataset was specifically created for detecting frontal views of human faces and is made up of low resolution and low quality images. Due to the presence of some cartoon-like faces of different quality, some publications have only used subsets of this dataset for testing. This research has considered the entire set.

This research considered adaptive face detection classifiers in changing environments which required an additional domain-specific dataset. Due to a lack of publicly available datasets for this problem, a custom dataset was created. The specifics of it are presented in Chapter 6.

Haar-like Features

In order to combine pattern recognition with face detection, a decision had to be made as to what feature types to use for extracting information from the images. Papageorgiou et al. [113] introduced Haar-like features to machine learning by modifying the original Haar wavelet decomposition to allow them to perform object recognition at a finer resolution. Since the successful application of Haar-like features to face detection by Viola and Jones [173], they have become common to vision detection systems where the target objects display consistent geometrical properties. For this reason, Haar features were applied in this research. The Haar features used in this research are shown in Figure 3.1.

Haar-like features are a mixture of square and rectangular filters, each divided into two main regions. The filter is applied to an image at different scales and the value of each filter is the difference between the two regions, where each region is the sum of all its pixel intensity values. The complete Haar-like feature set is applied to an entire image frame, producing an over-complete feature space. The amount of information that accompanies this type of feature set is much greater than the amount of actual raw image data.
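As a small illustration of the definition above, the following sketch evaluates one two-region filter by summing pixel intensities directly. The left/right layout and the function names are assumptions made for this example; the actual filter shapes used are those of Figure 3.1.

```python
def region_sum(img, x, y, w, h):
    """Sum of intensities in the w-by-h region whose top-left pixel is (x, y)."""
    return sum(img[row][col]
               for row in range(y, y + h)
               for col in range(x, x + w))


def two_region_feature(img, x, y, w, h):
    """Value of an assumed left/right two-region filter: left sum minus right sum."""
    half = w // 2
    left = region_sum(img, x, y, half, h)
    right = region_sum(img, x + half, y, half, h)
    return left - right
```

A uniform patch gives a value of zero, while a patch that is bright on the left and dark on the right gives a large positive value, which is what makes such filters responsive to edges.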


Figure 3.1: Haar-like feature set.

Object Detection

When using feature-based approaches that are combined with Haar wavelets, it is necessary to perform an exhaustive search over the entire image. The object detection phase of the face detectors in this research consisted of several steps. The process involved is as follows:

1. the entire image is converted to gray-scale

2. an intermediate integral image is computed based on pixel intensities

3. a sliding window exhaustively scans the entire integral image at differing positions and scales

4. for each weak classifier, a single Haar feature is extracted from the integral image

5. a classification is made to accept the window as the target object or reject

Many state-of-the-art detectors pre-process an image before a classifier is applied. This usually takes the form of normalization to minimize the effects of large illumination variances. The experiments here, however, omitted this step and performed detection on the raw data.

During the testing of the classifiers on the CMU MIT dataset, each positive detection was compared to the ground truth file. The ground truth file provided four pixel positions that define a rectangle around a positive object. However, there is a large degree of subjectivity involved in determining how large or small the defined rectangle needs to become before a subwindow that encloses a target object is no longer a face. As a result, a rule was formulated which confirmed a positive detection only if each of the four corner pixel positions was displaced from the corresponding ground truth position by no more than 25% of the width along the x-axis and 25% of the height along the y-axis.
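The matching rule can be sketched as a simple predicate. This is one possible reading rather than the evaluation code itself: it assumes rectangles are given as `(x1, y1, x2, y2)` corner coordinates and that the 25% tolerance is measured against the width and height of the ground truth rectangle, which the text does not state explicitly.

```python
def is_true_detection(detection, truth, tolerance=0.25):
    """Check a detection against a ground truth rectangle (illustrative sketch).

    Both rectangles are assumed to be (x1, y1, x2, y2) tuples. The tolerance
    is assumed to be relative to the ground truth width and height.
    """
    dx1, dy1, dx2, dy2 = detection
    tx1, ty1, tx2, ty2 = truth
    max_dx = tolerance * (tx2 - tx1)  # allowed horizontal displacement
    max_dy = tolerance * (ty2 - ty1)  # allowed vertical displacement

    # Bounding both x edges and both y edges bounds all four corners.
    return (abs(dx1 - tx1) <= max_dx and abs(dx2 - tx2) <= max_dx and
            abs(dy1 - ty1) <= max_dy and abs(dy2 - ty2) <= max_dy)
```

For a 40×40 ground truth rectangle, for instance, each corner may move by at most 10 pixels in either axis before the detection is counted as a false positive.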

Since the training images possess a degree of translational variability, the face detector classifier is able to handle small shifts in an image. For this reason, the sliding window did not need to scan every pixel, but could be shifted a predefined ∆ pixels across and down the two axes following each classification. A larger ∆ significantly increases the speed of the detector but can also reduce accuracy. In all the experiments ∆ was set to 2, while the scaling of the sliding window was set to 1.2³. The details of the Haar features and their extraction are summarized in Table 3.1.
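The scan can be sketched as a generator of square subwindows, using the values listed in Table 3.1 (initial size 24×24, scale factor 1.2, ∆ = 2). Two details are assumptions of this sketch and are not stated in the text: the shift stays at ∆ pixels at every scale, and scaled window sizes are truncated to whole pixels.

```python
def sliding_windows(img_w, img_h, base=24, scale=1.2, delta=2):
    """Yield (x, y, side) for every square subwindow visited by the scan.

    Assumptions of this sketch: the step remains `delta` pixels at all
    scales, and the scaled side length is truncated to an integer.
    """
    size = float(base)
    while int(size) <= min(img_w, img_h):
        side = int(size)
        for y in range(0, img_h - side + 1, delta):
            for x in range(0, img_w - side + 1, delta):
                yield x, y, side
        size *= scale  # grow the window for the next pass
```

Counting the windows, for example with `sum(1 for _ in sliding_windows(320, 240))`, gives a feel for how quickly the number of classifications grows with image size and how strongly ∆ affects it.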

Fast feature calculation was performed through an intermediate representation of an image termed the integral image. The value at any point $(x, y)$ of the integral image, defined over the original image $i(x, y)$ as

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'), \tag{3.1}$$

contains the sum of all pixel intensities above and to the left of $(x, y)$, inclusive. The integral image can be computed rapidly in one pass. Since Haar wavelets consist of rectangles, all that is needed to calculate the sum of intensities within a rectangle is the integral image values at its four corners. Beginning with the top left corner pixel, let the four points on an integral image be defined as:

$$
\begin{aligned}
\psi_1 &= ii(x, y) \\
\psi_2 &= ii(x + w, y) \\
\psi_3 &= ii(x, y + l) \\
\psi_4 &= ii(x + w, y + l)
\end{aligned}
$$

where $w$ is the width and $l$ is the length, then the sum of the intensities bounded by the four points can be calculated as $\psi_4 + \psi_1 - (\psi_2 + \psi_3)$. This calculation requires only four array accesses, which enables extremely fast calculation of the Haar feature sets.
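A minimal pure-Python sketch of Equation 3.1 and the four-point lookup is given below. The function names are introduced here for illustration only. Note that, with an inclusive integral image, the expression ψ4 + ψ1 − (ψ2 + ψ3) sums the pixels strictly below and to the right of (x, y), up to and including (x + w, y + l).

```python
def integral_image(img):
    """Single-pass integral image: ii[y][x] = sum of img[y'][x'] for x' <= x, y' <= y."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x, y, w, l):
    """Rectangle sum from four array accesses: psi4 + psi1 - (psi2 + psi3).

    Covers pixels with x < x' <= x + w and y < y' <= y + l.
    """
    psi1 = ii[y][x]
    psi2 = ii[y][x + w]
    psi3 = ii[y + l][x]
    psi4 = ii[y + l][x + w]
    return psi4 + psi1 - (psi2 + psi3)
```

For the 3×3 image `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, `rect_sum(integral_image(img), 0, 0, 2, 2)` returns 28, the sum of the lower-right 2×2 block (5 + 6 + 8 + 9).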

Table 3.1: Training dataset details together with the properties for constructing the static classifier.

| Property | Attribute |
|---|---|
| Number of Haar-like feature types | 8 |
| Maximum Haar-like features per image sub-window | 200,000 |
| Minimum pixel area size per Haar-like feature | 16 |
| Sliding window scale factor | 1.2 |
| Sliding window shift step ∆ | 2 |
| Initial sliding window size | 24×24 |
