
with an efficient detection algorithm such as AdaBoost it allows near real-time implementation [Zhu et al. 2006], the run time is still significantly higher than a similar framework using integral arrays [Viola and Jones 2001] to compute wavelet-like feature vectors.

• Another disadvantage of our system is the high dimensionality of its feature vectors. Of course, this results in good discrimination and is thus critical for the overall performance of the detector. However, during learning (with the current batch algorithm) the feature vectors for all training images must be stored in memory. The large size of the vector limits the number of examples (in particular hard negatives) that can be used and thus ultimately limits the detector performance. Hence there is a trade-off based on the available training memory, and very large feature vectors tend to be suboptimal.

• The descriptors have a relatively large number of parameters that need to be optimised. Although Chapter 4 pointed out that most of the parameters are remarkably stable across different object classes, some optimisation does need to be performed for each object class and the parameter space is too large to allow every possible combination to be tested.
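The memory trade-off described in the second point above can be made concrete with a little arithmetic. The following is only a minimal sketch: it assumes densely stored vectors with a fixed number of bytes per value, and the function name, the 2 GiB budget and the dimensions in the demo are hypothetical placeholders, not figures from this thesis.

```python
def max_training_examples(memory_bytes: int, feature_dim: int,
                          bytes_per_value: int = 4) -> int:
    """Return how many dense feature vectors fit in the given memory budget.

    Assumes every training example is stored as one dense vector of
    `feature_dim` values, each occupying `bytes_per_value` bytes.
    """
    if memory_bytes < 0 or feature_dim <= 0 or bytes_per_value <= 0:
        raise ValueError("budget must be non-negative and sizes positive")
    return memory_bytes // (feature_dim * bytes_per_value)


if __name__ == "__main__":
    budget = 2 * 1024 ** 3  # hypothetical 2 GiB training budget
    for dim in (4000, 8000, 16000):  # hypothetical descriptor sizes
        print(dim, max_training_examples(budget, dim))
```

Under these assumptions, doubling the descriptor dimension halves the number of examples (and hence hard negatives) that a batch learner can hold in memory.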

8.3 Future Work

This section provides some pointers to future research and discusses some open issues in visual object detection.

Detection of moving persons in videos.

As motion HOGs are slower to compute and have different characteristics than static HOGs, an object detector for video sequences could use a rejection chain algorithm to build cascades of detectors. As the appearance channel is typically somewhat more informative than the motion one, the chain would probably learn to reject most image windows using appearance only, using motion HOG descriptors only in cases where motion information is present to further reject false positives. This approach might provide good overall performance while significantly speeding up the run time compared to the current method based on a single monolithic SVM classifier. Also, the current motion HOG descriptors use only two consecutive images of a video sequence, whereas good recognition results in humans require temporal information to be integrated over somewhat longer time periods (at least over 3–4 frames¹). Thus another future direction could be to use more frames to compute motion HOG descriptors and to study the impact on the detector performance.
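The idea of rejecting most windows on appearance before computing the costlier motion descriptor can be sketched as a two-stage chain. This is a hand-written illustration under stated assumptions, not a learned rejection chain: the function `cascade_detect`, the two scoring callables and both thresholds are hypothetical, and the second stage is simply applied to every window that survives the first.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

W = TypeVar("W")  # any representation of an image window


def cascade_detect(
    windows: Sequence[W],
    appearance_score: Callable[[W], float],
    motion_score: Callable[[W], float],
    appearance_threshold: float,
    motion_threshold: float,
) -> Tuple[List[W], int]:
    """Run a two-stage rejection chain over candidate windows.

    Returns the accepted windows and the number of windows for which the
    (assumed expensive) motion score actually had to be computed.
    """
    accepted: List[W] = []
    motion_evaluations = 0
    for window in windows:
        # Stage 1: cheap appearance test rejects most windows early.
        if appearance_score(window) < appearance_threshold:
            continue
        # Stage 2: motion test runs only on the survivors of stage 1.
        motion_evaluations += 1
        if motion_score(window) >= motion_threshold:
            accepted.append(window)
    return accepted, motion_evaluations


if __name__ == "__main__":
    # Toy windows encoded as (appearance score, motion score) pairs.
    toy = [(0.9, 0.8), (0.1, 0.9), (0.7, 0.2), (0.2, 0.1)]
    hits, n_motion = cascade_detect(
        toy, lambda w: w[0], lambda w: w[1], 0.5, 0.5
    )
    print(hits, n_motion)
```

The returned evaluation count gives a rough measure of the saving: the fewer windows survive the appearance stage, the less often the motion descriptor has to be computed.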

¹ [Johansson 1973] experiments show that humans need around a fifth of a second – approximately 3-5

The approach to capturing relative dynamics of different body parts that is used in our Internal Motion Histograms is not ideal. It would be interesting to use the part detectors built in Chapter 7 to first detect the various body parts and then try to explicitly encode their relative motions. This raises two issues. Firstly, it seems intuitive that for this approach to perform well, finer-grained part detectors such as upper and lower leg detectors and arm detectors may be needed. But given the current state of the art, the reliable detection of small body parts is very challenging. Secondly, our experiments in Chapter 7 show that it is best to incorporate part detector votes from more than one location (typically using the 3–5 locations with the highest confidence values suffices for good results). This implies that if computing relative motions, one might have to evaluate many possible pairs of combinations. Of these, only those that are exactly on the real body parts would be relevant and the rest would need to be filtered out. It is currently unclear how to achieve this.
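To give a feel for how quickly the number of candidate pairs grows, the sketch below counts and enumerates cross-part location pairs. It assumes that every part keeps the same number of top-scoring locations and that relative motion is computed between locations of two different parts; the function names and the part names in the demo are hypothetical.

```python
import math
from itertools import combinations, product
from typing import Dict, Hashable, Iterator, List, Tuple


def count_candidate_pairs(num_parts: int, locations_per_part: int) -> int:
    """Number of cross-part location pairs: C(num_parts, 2) * k^2."""
    return math.comb(num_parts, 2) * locations_per_part ** 2


def candidate_pairs(
    candidates: Dict[str, List[Hashable]],
) -> Iterator[Tuple[Tuple[str, Hashable], Tuple[str, Hashable]]]:
    """Yield every pair of candidate locations taken from two different parts."""
    for (part_a, locs_a), (part_b, locs_b) in combinations(candidates.items(), 2):
        for loc_a, loc_b in product(locs_a, locs_b):
            yield (part_a, loc_a), (part_b, loc_b)


if __name__ == "__main__":
    # Hypothetical part names; locations are (x, y) window centres.
    demo = {
        "upper_leg": [(10, 40), (12, 42)],
        "lower_leg": [(10, 60), (14, 61)],
        "arm": [(5, 20), (18, 22)],
    }
    print(len(list(candidate_pairs(demo))), count_candidate_pairs(3, 2))
```

With k locations kept per part, the count grows quadratically in k and in the number of parts, which is why filtering out pairs that do not lie on real body parts matters.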

Another application in which motion HOGs may prove useful is activity recognition. Here the task is to classify characteristic movements in videos, and the robust motion encoding of motion HOGs could make them effective features.

Texture and colour invariant features.

It is also worth investigating texture and colour invariant descriptors or feature spaces. In conjunction with the HOG representation of the visual form, such feature vectors would form a more complete representation that should allow the current approach to be extended to many more object classes. The overall system could use AdaBoost to learn the most relevant features for each object class and to perform all of the stages of optimisation at once. This might also allow us to avoid the extensive parameter tuning of the descriptors. The training time for such a system would be long, but careful implementation should be able to maintain good run time when the system is in use.

Fusion of Bottom-Up and Top-Down Approaches.

It would be interesting to explore the fusion of bottom-up and top-down approaches along the lines of Leibe et al. [2005]. However, rather than going from sparse points during the bottom-up stage to dense pixel-wise representations during the top-down stage as in Leibe et al., one could use dense HOG-like features to perform bottom-up object or part detections, and then verify these in a sparse top-down approach, such as one that fits potential part detections to a structural model for the object class.

Another challenging issue relating to top-down information is the exploitation of the general context. Recently several researchers have begun to use context by modelling the relationships between different objects or object classes, surrounding image regions, or scene categories [Kumar and Hebert 2003, Murphy et al. 2003, Sudderth et al. 2005, Kumar and Hebert 2005, Hoiem et al. 2006]. In particular Hoiem et al. [2006] show that by using the interplay of different objects, scene orientation and camera viewpoint, the performance of existing bottom-up object detectors such as those presented in this thesis can be further improved. Figure 8.2 illustrates this. However, current approaches use manually designed encodings for the contextual information. In the future it would be interesting to expand this to a broader range of background cues and to add higher-level intelligent reasoning to support contextual inferences.

Fig. 8.2. Scene context can help improve the performance of bottom-up low-level object detectors. (a) Results of bottom-up static HOG person detector after multi-scale dense scan. (b) Results of static HOG car detector after similar scan. (c) Results obtained using the approach of Hoiem et al. [2006], which takes into consideration the interplay between rough 3-D scene geometry, approximate camera position/orientation and low-level person and car detectors. Images courtesy of Derek Hoiem.

A Data Sets

Any new feature set must be carefully validated on appropriate data sets to show its potential for real-world applications. The data sets should be chosen to be representative of the applications under consideration. It is also crucial that they do not contain selection biases that would perturb the results. This thesis presents feature sets for both static images and video sequences. Our primary goal is person detection, so we proposed two challenging data sets reflecting this application: a static person data set and a moving person data set. This appendix describes these and the other data sets used for our evaluation, and also explains how annotations were performed.
