Agenda Digital para Europa - ADMINISTRACIONES PúBLICAS

ADMINISTRACIONES PúBLICAS

3. Agenda Digital para Europa

A summary of the hand detection methods reviewed in this section is listed in Table 2.2; methods that additionally track specific hands over time are listed in Table 2.3. It can be noted that raw hand detection methods almost exclusively use depth cameras. Many methods employing colour cameras make assumptions of hand motion to simplify the problem. These algorithm’s therefore explicitly track hands in consecutive frames.

Hand Tracking in Colour Images

As an exception to this statement, Dailey and Bo Bo [47] described a hand detection method using a colour camera. The authors applied Haar-like features to detect hands in crowded scenes, although the results had limited success. In contrast, Valibeik and Yang [56] detected hands with skin colour detection and a motion mask, applied to skin coloured areas. When the

Table 2.2: Select summary of previous hand detection methods.

Method Camera Feature Description Classifier

Dailey and Bo Bo [47] Intensity Temporal Haar-like AdaBoost

Plagemann et al. [1] Depth Image patch around keypoint at

mesh extremities

“Joint Boost”

Shotton and Sharp [2] Depth Per-pixel depth differences Random Forest

Ikemura and Fujiyoshi [48] Depth Depth histogram difference be-

tween two regions

AdaBoost

Li and Kulic [49] Depth Distance from mesh extremities to

nearest edge in radial directions

Nearest Neighbour

Table 2.3: Select summary of previous hand tracking methods.

Method Camera Summary

Yang et al. [50] Colour Detect hands by segmenting moving areas and ap-

plying a skin colour mask. Track using similarity function

Bretzner et al. [51] Colour Detect hands with skin colour. Track with particle

filter

Shan et al. [52] Colour Detect hands by segmenting moving areas and ap-

plying a skin colour mask. Track using particle filter and mean shift

Carbini et al. [53] Depth

& Colour

Detect hands as skin coloured regions within a set distance to a detected face. Track using colour and position probabilities.

Ghobadi et al. [54] Depth Segment hands with per-frame clustering using K-

Means and EM

Nickel and Stiefelhagen [55] Depth

& Colour

Detect hands as connected skin colour regions. Track the highest probability candidate based on position and colour

Valibeik and Yang [56] Colour Detect hands by segmenting moving areas and ap-

plying a skin colour mask. Track with a Kalman filter

Chen et al. [57] Colour Single user’s hands manually initialised. Tracked

2.3. Localising Body Parts 55

resulting hands were tracked with a Kalman filter, the results were more accurate.

A similar approach was adopted by Yang et al. [50] for extracting a single user’s hand motion. Multiscale image segmentation was performed every frame, to isolate contiguous, homogeneous regions. Region matching between consecutive frames was achieved by minimising an error function that considers similarity between expected positions, area and average intensity. Mo- tion between frames could then be estimated by calculating the affine transformation between matched regions. Neighbouring regions with similar motion were then merged, and a skin colour mask was applied. The largest elliptical region was defined as the head, with the proceeding two being classed as hands.

Particle filtering has been used by many methods to track hands in colour images. Chen et al. [57] and Bretzner et al. [51] described methods to track a user’s hand with a single particle filter. Each method employs a different likelihood function: Chen et al. used multiple visual cues, including shape information, motion continuity and colour; Bretzner et al. used skin colour detection and scale-space extrema of blob and ridge features. Chen et al. required a set prior pose to initialise hand tracking. However, Bretzner et al. automatically tracked skin coloured regions of interest.

When tracking hands using particle filters, Shan et al. [52] proposed a solution to tackle the degeneracy problem, where only one sample has a non-negligible weight. Similar to previous methods, a colour probability image was formed by detecting skin colour regions in the image. A difference image of the current and previous frame was also calculated, and thresholded to include only moving regions. Sample weights in the particle filter were based on these two statistics. After samples were propagated with a constant motion model and reweighted, they were shifted to areas of high likelihood in the observation function with the mean shift algorithm [58].

Many of the techniques used by the previous colour image methods, such as continuous motion assumption, are not necessary in depth based methods. For example, Ghobadi et al. [54] proposed an early hand segmentation method using time-of-flight cameras. A per-frame clustering solution was employed, where pixels were initially segmented using K-Means. Using the

generated cluster means, Expectation Maximisation was used to refine the resulting Gaussian mixture model. However, several authors proposed sensor fusion methods that combine colour techniques with depth information.

Carbini et al. [53] focused on combining colour techniques and depth information from computation stereo, to track the hands and heads of two people. Hand detection was performed by identifying skin coloured blobs, which were localised using a depth image. Anatomical con- straints were used to filter unnatural hand-head combinations. Tracking results were then fused with speech to facilitate human-computer interaction with a large visual display.

Using the same sensors, Nickel and Stiefelhagen [55] proposed a method to recognise pointing gestures during HRI. Similar to Carbini et al., skin colour analysis was used to detect heads and hands, with their 3D positions estimated using computational stereo. A user’s head and hand locations were tracked in time by evaluating their movement since the previous frame, anatomic likelihood, and skin colour probability.

Hand Detection in Depth Images

Most depth-based hand detection methods detect a number of features in an image, describe the features using a vector of information called a descriptor, and classify this descriptor using a machine learning classifier. Many feature descriptors have attempt to characterise objects by their shape [59, 60]. One of the most successful is the “shape context”, by Belongie et al. [61], which the authors used to measure the similarity of two shapes.

In the method, keypoints are generated as edges from the Canny edge detector [62]. The shape context descriptor is a coarse histogram of the relative coordinates of the edge locations in the shape. Histogram bins are of log-polar space, so that histogram entries closer the source edge are weighted more heavily. Point correspondences between two shapes can then be calculated using a cost matrix of the histogram differences between the two shapes’ keypoints. The transformation that best aligns the two shapes can then be estimated.

2.3. Localising Body Parts 57

(a) “Laboratory conditions” (b) “A significantly more complex scene”

Figure 2.7: Images from the datasets Plagemann et al. used to evaluate their body part detector [1]. (a) shows the method being evaluated under “laboratory conditions”, whilst (b) shows results from “a significantly more complex scene”. Images are sourced with permission from [1].

akis et al. [63] to segment and label 3D meshes. In total, eight different shape descriptors were used, in conjunction with a conditional random field (CRF) [64], to classify all faces on a particular mesh. Anguelov et al. [65] described a similar method where, instead of a CRF, a Markov Random Field was used to address the problem of segmenting 3D scan data into objects and object classes.

A modified shape context descriptor was developed for head, hand and foot identification, by Li and Kulic [49]. Keypoints were detected at body endpoints using a hierarchical algorithm. The distance to the nearest edge was then calculated, in radial directions around the keypoint. The furthest distance denotes the keypoint direction. Starting at this direction, the rotation independent “local shape context” descriptor is the concatenation of all radial distances. De- scriptors were classified as the nearest neighbour to ideal templates.

Plagemann et al. [1, 66] described a similar method for head, hand and foot detection, with keypoints as body end points. The centre of each mesh in the input point cloud was first found. The geodesic distance of each point in the mesh was then calculated. The “AGEX” (Accumulative Geodesic EXtrema) keypoints are the mesh points with the largest geodesic distances from the centroid, and all other AGEX points. A patch based descriptor was applied to each keypoint, to assign it to one of the three classes. Figure 2.7 shows images of the datasets Plagemann et al. used to evaluate their method.

θ

₁

θ

₂

Figure 2.8: Illustration of the depth image descriptor from Shotton et al. [2]. A yellow cross indicates the pixel, x, being classified. From Equation 2.2, the two offsets from each pixel, x, are denoted in the figure as: θ = (u, v). The offsets are illustrated with red circles.

details a thirty-one class body-part detector, using a per-pixel classification approach. A simple depth descriptor was used:

f (x, u, v) = dI x+ u dI(x) − dI x+ v dI(x) , (2.2)

where x is a pixel in the depth image dI, and u and v are pixel offsets. Many of these per-pixel

depth differences, illustrated in Figure 2.8, are evaluated, with different u and v offsets, using a random forest classifier [67]. This algorithm is very fast, but is unsuitable for human-robot interaction as background subtraction must be performed as a pre-processing step.

Many body part localisation methods have used background subtraction (BGS) to simplify the problem and decrease computation. This does, however, make them unsuitable for HRI on mobile robots. As detailed in a recent survey [68], per-pixel GMMs are a widely used method of colour image BGS [69–71]. St¨ormer et al. [72] and Langmann et al. [73] extended BGS methods to depth cameras by generating GMMs for both the infrared and depth images of a TOF camera.

A similar descriptor to Shotton et al. was proposed by Ikemura and Fujiyoshi [48]. Rather than calculate the depth differences of two pixels, the authors constructed a descriptor that expresses the depth differences over two regions. Normalised histograms of depth values are

2.3. Localising Body Parts 59

computed over squares of 8× 8 pixels. The descriptor is defined as the Bhattacharyya distance

between histograms of two regions, and is termed the “relational depth similarity feature”.

In document Derecho y tecnologías avanzadas (página 163-176)