Provision of services in rural areas


In [124], the authors recognise human actions using a collection of spatio-temporal events that are generated from image sequences and localised at points that are salient in both space and time. Spatio-temporal salient points are extracted by calculating the variance in the data of pixel neighbours in both space and time. The distance between two sets of spatio-temporal salient points is then measured using a method based on the Chamfer distance [24]. Liu et al. [105] propose generating a semantic bag of video words from sample videos using Pointwise Mutual Information and diffusion maps. Spatio-temporal features are extracted from the actions and, after feature quantization, each action is represented by a semantic bag of words. The training videos are converted to bags of semantic words and a Support Vector Machine (SVM) is used to build a classification model from them. An input action then undergoes the same transformation process, so that the unseen video is converted to a histogram of semantic words. The classifier decides to which action the unseen video most likely belongs.
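To make the point-set comparison concrete, the following is a minimal sketch of the plain symmetric Chamfer distance between two sets of (x, y, t) points. It is not the exact measure used in [124], which is only described as being based on the Chamfer distance; the function names and the sample points are hypothetical.

```python
import math


def directed_chamfer(a, b):
    """Mean distance from each point in `a` to its nearest neighbour in `b`."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two sets of (x, y, t) points."""
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))


# Hypothetical salient points: (x, y, frame index).
walk = [(10, 20, 0), (12, 22, 1), (14, 24, 2)]
walk_shifted = [(x + 1, y, t) for x, y, t in walk]

print(chamfer_distance(walk, walk))          # identical sets give 0.0
print(chamfer_distance(walk, walk_shifted))  # every point is 1 pixel away
```

In practice the spatial and temporal axes would probably need to be scaled relative to each other before the distance is computed, since pixels and frames are not comparable units.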

Motion History Images, which were described in the previous chapter, have commonly been used to detect basic human actions; however, they struggle to represent complex human motions accurately. The authors in [21] represent motion as successively layered silhouettes that directly encode system time, in what is called the timed Motion History Image (tMHI). This representation can be used both to determine the current pose of the object and to segment and measure the motions induced by the object in a video scene. These segmented regions are not “motion blobs”, but motion regions naturally connected to the moving parts of the object of interest. The method is used to recognise waving and overhead clapping motions to control a music synthesis program. The authors in [77] propose a method which calculates the histogram of oriented gradients (HOG) of a motion history image (MHI). Their algorithm first generates an MHI from differential images, that is, the result of frame differencing over the image sequence which captures the human action. The second step computes the HOG of the MHI, and an SVM is then used to train a classifier with the HOG features. This approach does not require the human to be extracted as a silhouette, which increases the overall performance.
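The two steps described above (a timestamped MHI built by frame differencing, followed by a gradient-orientation histogram) can be sketched as follows. This is an illustrative simplification rather than the implementation of [21] or [77]: the difference threshold, the `duration` parameter and the toy frames are assumptions, and the histogram is computed over a single cell without the block normalisation of a full HOG descriptor.

```python
import math


def update_mhi(mhi, prev, curr, timestamp, duration, diff_threshold=30):
    """Stamp moving pixels with `timestamp`; clear entries older than `duration`."""
    out = []
    for mhi_row, prev_row, curr_row in zip(mhi, prev, curr):
        row = []
        for m, p, c in zip(mhi_row, prev_row, curr_row):
            if abs(c - p) > diff_threshold:    # frame differencing: pixel changed
                row.append(timestamp)
            elif m < timestamp - duration:     # motion is too old: forget it
                row.append(0.0)
            else:                              # keep the earlier timestamp
                row.append(m)
        out.append(row)
    return out


def orientation_histogram(img, bins=9):
    """Single-cell, unsigned gradient-orientation histogram weighted by magnitude."""
    hist = [0.0] * bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def toy_frame(col, size=6):
    """Hypothetical frame: a bright 2x2 block whose left edge is at `col`."""
    return [[255 if 2 <= y <= 3 and col <= x <= col + 1 else 0
             for x in range(size)] for y in range(size)]


frames = [toy_frame(c) for c in range(4)]      # the block moves to the right
mhi = [[0.0] * 6 for _ in range(6)]
for t in range(1, len(frames)):
    mhi = update_mhi(mhi, frames[t - 1], frames[t], float(t), duration=1.0)

features = orientation_histogram(mhi)          # would be fed to a classifier
print(mhi[2])
print(features)
```

The resulting `features` vector plays the role of the HOG descriptor that is passed to the SVM; no silhouette extraction is needed because only frame differences are used.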

Human actions can be represented as a series of postures over time in a 2D scene, and a commonly used approach for representing posture is to use its boundary shape [73]. A comprehensive review of current approaches to detecting human actions using one or more cameras can also be found in [73]. Since each border point in a digital image is similar to its neighbouring points, it is inefficient to use the whole human contour to describe a human posture. There are, of course, dimensionality reduction techniques such as Principal Component Analysis [151] which can reduce this redundancy, but these approaches are computationally expensive because of the matrix operations involved. At the other extreme, simple measures such as the X/Y variance of the human posture do not provide enough information to recognise a large number of basic human actions. Contour features, however, overcome these issues and have been shown to be accurate in detecting human actions.

Previous approaches which use contour features include that of Fujiyoshi et al. [56], who describe a process for analysing the motion of a human target in a video stream. Moving targets are detected by foreground extraction and their boundaries are then extracted. From the foreground images a skeleton of the human is formed, as shown in Figure 3.4, taken from [56]. Two motion cues are derived from the skeleton: the posture and the repetitive movements of the skeleton. Both cues help to identify human actions such as walking or running. The method has proven useful and does not require an a priori human model. In addition, the computational cost is low, which makes it an appropriate solution for practical deployments. One issue with this approach is that the human actions need to be relatively simple.
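One common way to build such a skeleton from a boundary is the "star" construction: take the centroid of the boundary, compute the distance from the centroid to each boundary point, smooth that signal, and keep the local maxima as extremal points (head, hands, feet). The sketch below assumes the boundary is already available as an ordered list of (x, y) points; the moving-average window and the synthetic four-lobed shape are hypothetical choices for illustration.

```python
import math


def star_skeleton(boundary, window=3):
    """Return the centroid and the boundary points at local maxima of the
    smoothed centroid-to-boundary distance signal."""
    n = len(boundary)
    cx = sum(x for x, _ in boundary) / n
    cy = sum(y for _, y in boundary) / n
    dist = [math.hypot(x - cx, y - cy) for x, y in boundary]

    # Circular moving average acts as a simple low-pass filter.
    half = window // 2
    smooth = [sum(dist[(i + k) % n] for k in range(-half, half + 1)) / window
              for i in range(n)]

    # A tip is a point whose smoothed distance is a local maximum.
    tips = [boundary[i] for i in range(n)
            if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[(i + 1) % n]]
    return (cx, cy), tips


# Hypothetical closed contour with four protrusions (radius 10 + 5*cos(4*theta)).
boundary = []
for i in range(40):
    theta = 2 * math.pi * i / 40
    r = 10 + 5 * math.cos(4 * theta)
    boundary.append((r * math.cos(theta), r * math.sin(theta)))

centre, tips = star_skeleton(boundary)
print(centre)
print([(round(x), round(y)) for x, y in tips])
```

Connecting the centroid to each tip gives the skeleton; the angles of those segments over time are the kind of posture and periodicity cues described above.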

The authors in [27] also use image skeletonisation to recognise basic human actions from a near-view video. This action recognition method extracts features from the human motions using a star skeleton, and these features are then modelled using Hidden Markov Models (HMMs). Each human action is represented by a sequence of temporal images, and each image is transformed to a feature vector using its star skeleton. Each feature vector of the sequence is assigned a symbol which matches a codeword in the codebook using Vector Quantization [86], so that the time-sequential images are converted to a symbol sequence. To train the system, the model parameters of the HMM for each category are optimised to best describe the training symbol sequences of the human actions to be recognised. For human action recognition, the model which best matches the observed symbol sequence is selected as the recognised category.
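The recognition stage of such a pipeline can be sketched as nearest-codeword quantization followed by scoring the symbol sequence under each category's HMM with the forward algorithm. The codebook, the two toy models and their probabilities below are invented for illustration; training the parameters (for example with Baum-Welch) is not shown.

```python
import math


def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
            for v in vectors]


def log_likelihood(symbols, start, trans, emit):
    """Scaled forward algorithm: log P(symbols | HMM)."""
    n = len(start)
    alpha = [start[i] * emit[i][symbols[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(symbols)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n))
                     * emit[j][symbols[t]] for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:                  # sequence impossible under this model
            return float("-inf")
        alpha = [a / scale for a in alpha]
        log_p += math.log(scale)
    return log_p


def classify(symbols, models):
    """Pick the category whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(symbols, *models[name]))


# Hypothetical two-word codebook and two toy HMMs: (start, transition, emission).
codebook = [(0.0, 0.0), (1.0, 1.0)]
models = {
    "walk": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], [[0.9, 0.1], [0.1, 0.9]]),
    "stand": ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.5, 0.5]]),
}

vectors = [(0.1, 0.0), (0.9, 1.1), (0.0, 0.2), (1.0, 0.8)]
symbols = quantize(vectors, codebook)
print(symbols)
print(classify(symbols, models))
```

The scaling inside `log_likelihood` keeps the forward variables from underflowing on long sequences, which matters once sequences span hundreds of frames.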

3.3.1 Discussion

Contour features have been widely used to recognise human actions, as the previous section illustrates. The advantages of this approach are that it is computationally inexpensive, it gives a useful representation of the human silhouette, and different vector ranges can easily be implemented to reduce or expand the feature dimensionality. In this thesis, we evaluate both MHIHOGs and Contour Features. MHIHOGs use Motion History Images, which utilise the motion shape information of a video to recognise actions. The advantage of the MHI is its simplicity and low computational cost compared to, for example, the optical flow method. Moreover, HOGs are known to be a very accurate technique for representing movement in video [44] [22] [77].
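As an illustration of how the dimensionality of a contour descriptor can be adjusted, the sketch below records the largest centroid-to-boundary distance in each of `bins` angular sectors. The exact definition of the Contour Features evaluated in this thesis is not given in this section, so this descriptor, its normalisation and the rectangular test contour are assumptions.

```python
import math


def contour_feature(boundary, bins=8):
    """Largest centroid-to-boundary distance per angular sector, scaled to [0, 1]."""
    n = len(boundary)
    cx = sum(x for x, _ in boundary) / n
    cy = sum(y for _, y in boundary) / n
    feature = [0.0] * bins
    sector = 2 * math.pi / bins
    for x, y in boundary:
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        k = min(int(angle / sector), bins - 1)
        feature[k] = max(feature[k], math.hypot(x - cx, y - cy))
    peak = max(feature)
    return [v / peak for v in feature] if peak else feature


# Hypothetical contour: integer points on the perimeter of an 8 x 4 rectangle.
rect = [(x, y) for x in range(9) for y in range(5)
        if x in (0, 8) or y in (0, 4)]

coarse = contour_feature(rect, bins=4)    # 4-dimensional descriptor
fine = contour_feature(rect, bins=16)     # 16-dimensional descriptor
print(coarse)
print(fine)
```

Changing `bins` is the only step needed to shrink or grow the feature vector, which is the property referred to above as easily adjustable dimensionality.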

(a) The outermost boundary pixels are identified by calculating the distance from the center of the object to the edge.

(b) In a preprocessing step, morphological erosion and dilation are applied and then the border is extracted.
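The preprocessing in (b) can be sketched on a small binary mask. The order of operations (erosion followed by dilation, i.e. a morphological opening), the 3x3 structuring element and the toy mask are assumptions, since the caption does not specify them.

```python
NEIGHBOURHOOD = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def erode(mask):
    """3x3 binary erosion: keep a pixel only if its whole neighbourhood is set."""
    h, w = len(mask), len(mask[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                     for dy, dx in NEIGHBOURHOOD))
             for x in range(w)] for y in range(h)]


def dilate(mask):
    """3x3 binary dilation: set a pixel if any pixel in its neighbourhood is set."""
    h, w = len(mask), len(mask[0])
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                     for dy, dx in NEIGHBOURHOOD))
             for x in range(w)] for y in range(h)]


def border(mask):
    """Foreground pixels removed by one erosion, i.e. the outer boundary."""
    eroded = erode(mask)
    return [[int(m and not e) for m, e in zip(mask_row, eroded_row)]
            for mask_row, eroded_row in zip(mask, eroded)]


# Hypothetical 8x8 foreground mask: a 5x5 square plus one noise pixel at (0, 0).
mask = [[int(2 <= y <= 6 and 2 <= x <= 6) for x in range(8)] for y in range(8)]
mask[0][0] = 1

opened = dilate(erode(mask))     # removes the isolated noise pixel
edge = border(opened)            # one-pixel-wide outline of the square
for row in edge:
    print("".join("#" if v else "." for v in row))
```

The ordered boundary points needed by contour descriptors would then be obtained by tracing the pixels set in `edge`.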

In document VERSION 4, 2 DECEMBER 2010 (pages 130-134)