Local activity representation is the most widespread way to describe an activity today, as it leads to State-of-the-Art (SoA) accuracy rates. Local patches are usually sampled either densely or by using a spatio-temporal detector. Specialized descriptors are built around the sampled interest points and a Bag-of-Visual-Words framework is used in order to aggregate them into a fixed size feature. The spatio-temporal descriptors are usually extensions of image-based 2D appearance histograms. They describe the region around the interest points, expanded into the temporal dimension (i.e. 3D-cuboids) in order to describe activities. The main advantages of local representations are:
they are relatively invariant to scale and shift, they can deal with partial occlusions (e.g. human/object, object/object), and they do not need a preprocessing step (e.g. background subtraction, motion segmentation), which avoids the failures that such a step can introduce.
However, they suffer from their orderless representation, as BoVW methods do not retain spatio-temporal correlations among the features.
Spatio-temporal interest point detectors:
Interest point sampling is the initial step that a local technique requires in order to describe an activity. Spatio-temporal detectors maximize specific saliency functions in order to detect interest points, which are induced by sudden changes in appearance and/or motion.
One of the earliest spatio-temporal detectors was proposed in [9] and then in [56], where a Harris corner detector [57] is extended to the temporal domain. Space-time interest points are chosen as the points whose local neighbourhood, which is automatically selected, has a significant variation in both the spatial and temporal domain. A characteristic example of the detector is depicted in Fig. 2.5.
Figure 2.5: Spatio-Temporal Interest Point (STIP) or Harris3D detector introduced in [9]
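As an illustration of such a saliency function, the following is a minimal sketch of a Harris-style response extended to three dimensions. It assumes the commonly cited form det(mu) - k * trace(mu)^3 computed on a spatio-temporal second-moment matrix mu. The function names, the scale parameters `sigma`, `tau` and `s`, and the constant `k` are illustrative assumptions rather than values taken from [9], and the automatic scale selection mentioned above is not reproduced.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian truncated at about 3 sigma, normalised to sum to 1."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    return kernel / kernel.sum()

def smooth(volume, sigma_space, sigma_time):
    """Separable Gaussian smoothing of a (T, H, W) volume with edge padding."""
    out = np.asarray(volume, dtype=float)
    for axis, sigma in ((0, sigma_time), (1, sigma_space), (2, sigma_space)):
        kernel = gaussian_kernel(sigma)
        radius = len(kernel) // 2
        out = np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, radius, mode="edge"), kernel, mode="valid"),
            axis, out)
    return out

def harris3d_response(video, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Harris-style saliency for a (T, H, W) grey-level video (assumed form)."""
    L = smooth(video, sigma, tau)          # scale-space representation
    Lt, Ly, Lx = np.gradient(L)            # derivatives along t, y, x
    grads = (Lx, Ly, Lt)
    # Second-moment matrix: products of derivatives, averaged at a larger scale.
    mu = np.empty(L.shape + (3, 3))
    for i in range(3):
        for j in range(i, 3):
            entry = smooth(grads[i] * grads[j], s * sigma, s * tau)
            mu[..., i, j] = entry
            mu[..., j, i] = entry
    det = np.linalg.det(mu)
    trace = np.trace(mu, axis1=-2, axis2=-1)
    return det - k * trace**3
```

Interest points would then be taken as local maxima of the returned volume. In this sketch a static spatial edge yields a non-positive response, since the determinant vanishes when the signal varies along one direction only.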
Spatio-temporal action cuboids [39] are another approach to interest point detection found in the literature. They detect local maxima of a combined response that uses Gaussian operators and Gabor filters, in order to increase the sparse number of features that [9] provides. An improvement on [39] is presented in [58], where Gabor filters are combined with a differencing mask and different temporal scales are taken into account in the feature selection in order to address the limitations of [39].
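A combined Gaussian and Gabor response of this kind can be sketched as follows. The sketch assumes one plausible formulation: the video is smoothed spatially with a Gaussian, filtered temporally with a quadrature pair of 1D Gabor filters, and the two responses are squared and summed. The function names and the parameters `sigma`, `tau` and `omega` (in cycles per frame) are hypothetical choices for illustration and are not taken from [39].

```python
import numpy as np

def conv_along(volume, kernel, axis):
    """Convolve a volume with a 1D odd-length kernel along one axis (edge padding)."""
    radius = len(kernel) // 2
    return np.apply_along_axis(
        lambda v: np.convolve(np.pad(v, radius, mode="edge"), kernel, mode="valid"),
        axis, volume)

def periodic_motion_response(video, sigma=1.0, tau=2.0, omega=0.25):
    """Response of a (T, H, W) video to spatial Gaussian + temporal Gabor filtering."""
    video = np.asarray(video, dtype=float)
    # Spatial Gaussian smoothing along y and x.
    rs = max(1, int(3 * sigma + 0.5))
    x = np.arange(-rs, rs + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    smoothed = conv_along(conv_along(video, g, 1), g, 2)
    # Quadrature pair of temporal Gabor filters (even and odd phase).
    rt = int(np.ceil(2 * tau))
    t = np.arange(-rt, rt + 1, dtype=float)
    envelope = np.exp(-t**2 / tau**2)
    h_even = -np.cos(2 * np.pi * omega * t) * envelope
    h_odd = -np.sin(2 * np.pi * omega * t) * envelope
    even = conv_along(smoothed, h_even, 0)
    odd = conv_along(smoothed, h_odd, 0)
    return even**2 + odd**2
```

Under these assumptions, pixels whose intensity changes periodically at a rate close to `omega` respond more strongly than static pixels, and interest points would be taken as local maxima of the response.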
Another work that detects spatio-temporal interest points in videos was proposed in [40].
The authors extended a salient region detector by applying an entropy metric within a cylindrical region around each candidate point. The salient points that are selected are thresholded points that maximise the energy locally.
The Hessian detector was also extended to the temporal domain in [59] for interest point detection. An integral video structure and the determinant of the 3D Hessian matrix are used in order to flag the salient feature locations.
Despite the significant research effort devoted to developing accurate spatio-temporal interest point detectors, recognition accuracy has not increased significantly as a result. The main disadvantage of these techniques is that the extracted interest points are sparse and have been shown to be insufficient for describing actions discriminatively.
Thus, related work [60], inspired by recent methods in image classification [41], [42], has turned its attention to dense sampling approaches.
Spatio-temporal descriptors:
Spatio-temporal descriptors are computed around the resulting spatio-temporal interest points. Similarly to spatio-temporal interest point detectors, these structures extend image-based descriptors to the temporal domain in order to represent activities. As a consequence, extensions of SIFT, SURF and HOG to the 3D domain can be found in [61], [59] and [20]. The temporal concatenation of image patch descriptors retains how a specific point and the region around it (i.e. containing its gradients) change over time in order to adequately represent activities. However, more motion information can also be included in the descriptor if specific motion attributes (e.g. optical flow) are taken into consideration. This has been introduced in [12] and [21], where optical flow and its gradients are incorporated into a patch-based descriptor, leading to the HOF [12] and MBH [21] structures respectively. An evaluation of these descriptors and their detectors is provided in [60].
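The idea of a patch-based histogram descriptor over a cuboid can be sketched as follows. The cuboid is split into a grid of spatio-temporal cells, and each cell contributes a magnitude-weighted histogram of vector orientations. Feeding image gradients gives a HOG-like descriptor, while feeding optical flow components would give a HOF-like one. The function names, grid size, bin count and L2 normalisation are assumptions for illustration and do not reproduce the exact designs of [12], [20] or [21].

```python
import numpy as np

def orientation_histogram(vx, vy, n_bins=8):
    """Magnitude-weighted histogram of the orientations of a 2D vector field."""
    magnitude = np.hypot(vx, vy)
    angle = np.arctan2(vy, vx) % (2 * np.pi)
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def cuboid_descriptor(vx, vy, grid=(2, 2, 2), n_bins=8, eps=1e-8):
    """Concatenate per-cell orientation histograms of a (T, H, W) vector field."""
    T, H, W = vx.shape
    t_edges = np.linspace(0, T, grid[0] + 1).astype(int)
    y_edges = np.linspace(0, H, grid[1] + 1).astype(int)
    x_edges = np.linspace(0, W, grid[2] + 1).astype(int)
    histograms = []
    for ti in range(grid[0]):
        for yi in range(grid[1]):
            for xi in range(grid[2]):
                cell = (slice(t_edges[ti], t_edges[ti + 1]),
                        slice(y_edges[yi], y_edges[yi + 1]),
                        slice(x_edges[xi], x_edges[xi + 1]))
                histograms.append(orientation_histogram(vx[cell], vy[cell], n_bins))
    descriptor = np.concatenate(histograms)
    return descriptor / (np.linalg.norm(descriptor) + eps)  # L2 normalisation

# HOG-like usage: spatial gradients of a grey-level cuboid around an interest point.
cuboid = np.tile(np.arange(8, dtype=float), (8, 8, 1))  # intensity ramp along x
gy, gx = np.gradient(cuboid, axis=(1, 2))
hog_like = cuboid_descriptor(gx, gy)
```

An MBH-like variant could, under the same assumptions, apply `cuboid_descriptor` to the spatial derivatives of each optical flow component separately and concatenate the two results.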
Apart from describing regions around spatio-temporal interest points, tracking these points can also yield robust features for activity representation. For instance, in [11] a KLT tracker is used to form trajectories, while the authors of [10] propose matching SIFT descriptors using a Markov chain model in order to create correspondences among points.
In both cases, trajectories are stored in a log-polar histogram of tracked velocities. In [62] the authors introduce trajectons. Densely sampled interest points are tracked using a simple tracking technique in [21], while an improved version with motion compensation is given in [43]. Trajectories are also used in more recent works such as [44,45]. Characteristic examples of their results are depicted in Fig. 2.6.
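A log-polar histogram of tracked velocities can be sketched as follows for a single trajectory. Frame-to-frame displacements are binned uniformly in direction and logarithmically in speed. The function name, the bin counts and the `max_speed` clipping value are hypothetical and are not taken from [10] or [11].

```python
import numpy as np

def log_polar_velocity_histogram(track, n_mag_bins=4, n_angle_bins=8, max_speed=16.0):
    """Histogram of the velocities of one track, given as (N, 2) (x, y) positions."""
    velocity = np.diff(np.asarray(track, dtype=float), axis=0)
    speed = np.hypot(velocity[:, 0], velocity[:, 1])
    # Uniform bins over direction; a zero displacement falls into angle bin 0.
    angle = np.arctan2(velocity[:, 1], velocity[:, 0]) % (2 * np.pi)
    a_bin = np.minimum((angle / (2 * np.pi) * n_angle_bins).astype(int),
                       n_angle_bins - 1)
    # Logarithmic bins over speed, clipped at max_speed pixels per frame.
    log_speed = np.log1p(np.minimum(speed, max_speed))
    m_bin = np.minimum((log_speed / np.log1p(max_speed) * n_mag_bins).astype(int),
                       n_mag_bins - 1)
    hist = np.zeros((n_mag_bins, n_angle_bins))
    np.add.at(hist, (m_bin, a_bin), 1.0)  # accumulate counts, repeats included
    total = hist.sum()
    return hist / total if total > 0 else hist

# Usage: a point moving by (1, 2) pixels per frame over five frames.
track = np.cumsum(np.tile([[1.0, 2.0]], (5, 1)), axis=0)
hist = log_polar_velocity_histogram(track)
```

The logarithmic speed axis gives finer resolution to slow motion, which is one reason a log-polar layout is a natural fit for velocities; the normalisation makes tracks of different lengths comparable.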
Currently, methods that use 3D local volumes with spatio-temporal information achieve the highest accuracy in activity recognition when used in a BoVW framework (K-Means clustering combined with a Chi-Square distance). However, the main drawback of BoVW is the lack of geometric relations between the features. Earlier work [12], inspired by spatial pyramid matching [63], which met with wide success in image classification, introduced weak geometric relations among descriptors in the BoVW framework, as depicted in Fig. 2.7. Further progress was made in [22], where an advanced hierarchical combination of features was proposed along with a data mining technique for improving recognition. Context information is also introduced in [10], where cuboid trajectory neighbourhoods are represented with a SIFT descriptor and relations among them are captured by a stationary Markov distribution vector at different levels. Recent work with several spatial pyramids was also introduced in [64], but it did not achieve satisfactory improvements when compared with previous methods.
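The BoVW pipeline mentioned above can be sketched as follows: a codebook is learned with K-Means, each video is encoded as a normalised histogram of visual-word assignments, and two videos are compared with a Chi-Square distance. This is a minimal NumPy sketch under simplifying assumptions (plain Lloyd iterations, random initialisation, hard assignment); the function names and parameters are illustrative and do not correspond to a specific cited method.

```python
import numpy as np

def assign(descriptors, centers):
    """Index of the nearest codebook center for each descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's K-Means; returns a (k, D) codebook."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[picks].astype(float)
    for _ in range(n_iter):
        labels = assign(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):               # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Fixed-size, L1-normalised histogram of visual words for one video."""
    labels = assign(descriptors, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-Square distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Because the histogram only counts word occurrences, two videos with the same local descriptors in different spatio-temporal arrangements receive the same representation, which is the orderless property discussed above.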