Development of a visual object tracking system using a compressive sensing framework for modeling the appearance
57
0
0
Texto completo
(2) UNIVERIDAD POLITÉCNICA DE MADRID ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN. TRABAJO FIN DE GRADO Grado en Ingeniería de Tecnologías y Servicios de Telecomunicación. DEVELOPMENT OF A VISUAL OBJECT TRACKING SYSTEM USING COMPRESSIVE SENSIG FRAMEWORK FOR MODELLING THE APPEARANCE DESARROLLO DE UN SISTEMA DE SEGUIMIENTO DE OBJETOS USANDO LA TEORÍA DE “COMPRESSIVE SENSING” PARA MODELAR LA APARIENCIA Ana Mantecón Jené Julio 2015 TRABAJO FIN DE GRADO.
(3) Autor:. ANA MANTECÓN JENÉ. Tutor:. CARLOS ROBERTO DEL BLANCO ADÁN. Departamento:. GRUPO DE TRATAMIENTO DE IMÁGENES. TRIBUNAL Presidente:. LUIS SALGADO ÁLVAREZ DE SOTOMAYOR. Vocal:. FRANCISCO MORÁN BURGOS. Secretario:. CARLOS ROBERTO DEL BLANCO ADÁN. Suplente:. CARLOS CUEVAS RODRÍGUEZ. CALIFICACIÓN:. Madrid, a. de. 2015..
(4) ABSTRACT In the following work, we address the commonly known problem of visual object tracking, which tries to find the trajectory of an object in a video sequence. This task is carried out by developing a tracking-by-detection method that builds up an effective appearance model in the compressed domain via compressive sensing theory. The only information about the sequence that our approach needs to know in advance is the location of the target object in the first frame. Visual tracking is a classical field of computer vision that has been addressed for many years. However, it is still a very challenging task as many variables such as illumination changes, object occlusion, and the complexity of the background, have to be accounted in order to build a robust tracking algorithm. To deal with these challenges, we propose a tracking-by-detection algorithm that trains an SVM classifier in an online manner to separate the object from the background. We generate the appearance model with a robust high dimensional feature descriptor that describes the objects and the surrounding background of the scene. This step is followed by a dimensionality reduction step that enables to train and use the classifier with a much lower dimension of the data. The dimensionality reduction step is based on the compressive sensing theory that states that a sparse signal can be well-represented and recovered from a small sets of measurements. The compressive sensing problem is also addressed and described in this work, and different measurement and recovery techniques are tested to evaluate the resulting reduced vectors. The purpose is to verify that they are able to keep the essence and information of the high dimensional vectors. We also perform a successful update of the object appearance model by using the current tracker state to extract positive and negative samples, which are used to re-train the classifier. This way our tracker is able to adapt to the appearance variation of the object. The proposed tracking algorithm is evaluated with challenging video sequences, and also compared with other state-of-the-art trackers, proving the success of the performance of our algorithm.. KEY WORDS Visual object tracking, compressive sensing, dimensionality reduction, appearance model, classifiers, DSLQP descriptor, online learning..
(5) RESUMEN En el presente trabajo se aborda el problema del seguimiento de objetos, cuyo objetivo es encontrar la trayectoria de un objeto en una secuencia de video. Para ello, se ha desarrollado un método de seguimiento-por-detección que construye un modelo de apariencia en un dominio comprimido usando una nueva e innovadora técnica: “compressive sensing”. La única información necesaria es la situación del objeto a seguir en la primera imagen de la secuencia. El seguimiento de objetos es una aplicación típica del área de visión artificial con un desarrollo de bastantes años. Aun así, sigue siendo una tarea desafiante debido a varios factores: cambios de iluminación, oclusión parcial o total de los objetos y complejidad del fondo de la escena, los cuales deben ser considerados para conseguir un seguimiento robusto. Para lidiar lo más eficazmente posible con estos factores, hemos propuesto un algoritmo de tracking que entrena un clasificador Máquina Vector Soporte (“Support Vector Machine” o SVM en sus siglas en inglés) en modo online para separar los objetos del fondo de la escena. Con este fin, hemos generado nuestro modelo de apariencia por medio de un descriptor de características muy robusto que describe los objetos y el fondo devolviendo un vector de dimensiones muy altas. Por ello, se ha implementado seguidamente un paso para reducir la dimensionalidad de dichos vectores y así poder entrenar nuestro clasificador en un dominio mucho menor, al que denominamos domino comprimido. La reducción de la dimensionalidad de los vectores de características se basa en la teoría de “compressive sensing”, que dice que una señal con poca dispersión (pocos componentes distintos de cero) puede estar bien representada, e incluso puede ser reconstruida, a partir de un conjunto muy pequeño de muestras. La teoría de “compressive sensing” se ha aplicado satisfactoriamente en este trabajo y diferentes técnicas de medida y reconstrucción han sido probadas para evaluar nuestros vectores reducidos, de tal forma que se ha verificado que son capaces de preservar la información de los vectores originales. También incluimos una actualización del modelo de apariencia del objeto a seguir, mediante el re-entrenamiento de nuestro clasificador en cada cuadro de la secuencia con muestras positivas y negativas, las cuales han sido obtenidas a partir de la posición predicha por el algoritmo de seguimiento en cada instante temporal. El algoritmo propuesto ha sido evaluado en distintas secuencias y comparado con otros algoritmos del estado del arte de seguimiento, para así demostrar el éxito de nuestro método.. PALABRAS CLAVE Seguimiento de objetos, compressive sensing, reducción de la dimensionalidad, modelo de apariencia, clasificadores, descriptor DSLQP, aprendizaje en línea..
(6) TABLE OF CONTENTS 1.. 2. 3. INTRODUCTION ................................................................................................................. 1 1.1. Motivation .................................................................................................................... 1. 1.2. Objectives ................................................................................................................... 2. 1.3. Structure ..................................................................................................................... 2. STATE OF THE ART ............................................................................................................ 3 2.1. Introduction.................................................................................................................. 3. 2.2. Tracking challenges ...................................................................................................... 3. 2.3. General object tracking ................................................................................................5. 2.4. State of the art about trackers ..................................................................................... 6. COMPRESSIVE SENSING THEORY .................................................................................. 13 3.1. Introduction................................................................................................................ 13. 3.2. Fundamental concepts of compressive sensing .......................................................... 14. 3.2.1. Basis representations .......................................................................................... 14. 3.2.2. Sparsity and incoherence .................................................................................... 16. 3.3. The compressive sensing problem.............................................................................. 16. 3.4. Measurement process ................................................................................................ 18. 3.5. Reconstruction process .............................................................................................. 21. 3.6. Compressed learning ..................................................................................................22. 3.7. Conclusions ................................................................................................................ 23. 4 DESCRIPTION OF THE DEVELOPED COMPRESSED SENSING BASED TRACKING FRAMEWORK .......................................................................................................................... 24 4.1. Introduction............................................................................................................... 24. 4.2. System overview ....................................................................................................... 24. 4.2.1 5. Feature extraction ............................................................................................... 27. TRACKING RESULTS ........................................................................................................ 31 5.1. Introduction................................................................................................................ 31. 5.2. Evaluation of the reduced vectors .............................................................................. 31. 5.3. Database and metrics ................................................................................................. 32.
(7) 5.4. Experimental results ................................................................................................... 35. 5.4.1 6. Quantitative comparisons ................................................................................... 39. CONCLUSION AND FUTURE WORK ............................................................................... 44. REFERENCES .......................................................................................................................... 46.
(8) TABLE OF FIGURES Figure 1. Background clutter ...................................................................................................... 4 Figure 2. Partial object (face) occlusion ..................................................................................... 4 Figure 3. Illumination changes. (a) and (b) are different frames from the same sequence. ........ 4 Figure 4. General visual object tracking structure .......................................................................5 Figure 5. (a) Original signal with pixel values between [0,255]. (b) Wavelet coefficients of the image. (c) Reconstruction of the image by a small subselection of its wavelet coefficients. ..... 16 Figure 6. Compressive sensing problem. .................................................................................. 17 Figure 7. Compressive sensing process example. ...................................................................... 17 Figure 8. Compressive sensing boundaries for a signal with a sparsity of 98% ..........................20 Figure 9. Compressive sensing measurement process. .............................................................20 Figure 10. The 𝑙1- minimization problem coincides with the sparsest solution ........................22 Figure 11. Training phase at the n-th frame ..............................................................................25 Figure 12. Tracking phase at the (n+1)-th frame .......................................................................25 Figure 13. SVM training............................................................................................................ 26 Figure 14. Feature extraction implemented. ............................................................................. 27 Figure 15: Local Binary Pattern example: (a) Grey scale neighborhood. (b) Computed intensity differences. (c) Binary value. .................................................................................................... 28 Figure 16: DSLQP descriptor. .................................................................................................. 28 Figure 17. Histogram concatenation ........................................................................................ 29 Figure 18. Compressive Sensing reconstruction example. ....................................................... 33 Figure 19. (a) Positive sampling region and (b) Negative sampling region. ............................... 36 Figure 20: Average Success Rate graphics. ............................................................................... 37 Figure 21. Challenging frames in Cliffbar sequence .................................................................. 38 Figure 22. Challenging frames in Trellis sequence .................................................................... 39 Figure 23. Challenging frames in Sylv sequence ....................................................................... 39 Figure 24. Challenging frames in Football sequence ................................................................. 39.
(9) TABLE OF TABLES Table I: Summary of the most relevant state-of-the-art trackers. ............................................ 12 Table II: Different DSLQP configurations. ................................................................................ 31 Table III: MSE computed for each feature vector. ..................................................................... 32 Table IV: Database video sequences. ........................................................................................ 34 Table V: Best Success Rate for each video sequence with each matrix evaluated ..................... 37 Table VI: Best parameters and results for Cliffbar, Trellis, Sylv and Jumping video sequences. 38 Table VII: Table with the Success Rate (%) results of 12-state-of-the-art trackers and our developed. The best results are shown in red, blue and green fonts respectively. .................... 41 Table VIII: Table with the Center Error Rate (%) results of 12-state-of-the-art trackers and our developed tracker. The best results are shown in red, blue and green fonts respectively. ....... 42 Table IX: Table with the Success Rate (%) results in a selection of videos of the considered database. The best results are shown in red, blue and green fonts respectively. ...................... 43.
(10) 1. INTRODUCTION 1.1 MOTIVATION A classical problem within the field of computer vision is visual object tracking, whose main goal is to generate trajectories of objects of interest by means of the analysis of video sequences. In the past decades, object tracking algorithms have become very popular due to the proliferation of high-powered computers, the availability of high quality and inexpensive video cameras, and the increasing need for automated video analysis. Object tracking is being used nowadays in many different fields such as: . Motion-based recognition: human identification based on the way each person moves, automatic object detection, etc. Automated surveillance: monitoring a scene to detect suspicious activities or unlikely events. Human-computer interaction: gesture recognition, face recognition, etc. Traffic monitoring: acquisition of real time traffic statistics. Vehicle navigation: video based path planning and obstacle avoidance.. All these fields require a robust tracking algorithm that is stable and computationally efficient. One fundamental part in object tracking is the modeling of the object appearance (and sometimes the surrounding background) that usually depends on the application domain of each tracker. A common approach is to generate a description of the object appearance in the form of a feature vector that is invariant, or at least robust, to the tracking challenges. However, robust and accurate models imply extremely long feature vectors that are difficult to manage and process, especially under real-time restrictions, or practical memory limitations. In addition, their processing is affected by the so-called “curse of dimensionality” that can drastically reduce the tracking performance. Compressive Sensing is a revolutionary signal processing technique that can be used to greatly reduce the dimensionality of the long feature vectors that encode the object appearance, but preserving almost all the information. Compressive Sensing states that signals that have certain characteristics can be sampled and recovered with much fewer measurements than using the Nyquist-Shannon criterion. Because of this fact, it is becoming very popular in many machine learning problems to transform the data domain to some other appropriate measurement domain. 1.
(11) 1.2 OBJECTIVES The objective of this work is to develop a new appearance model framework for visual object tracking based on Compressive Sensing. This will allow to use highly accurate and robust appearance models via a dimensionality reduction step, which would be impossible to apply in another way for tracking purposes due to memory restrictions and/or the curse of dimensionality phenomenon. Different measurements and recovery techniques related to the Compressive Sensing theory will be tested to evaluate the resulting reduced feature vectors, which will be delivered to a Support Vector Machine (SVM) classifier to distinguish in each frame in which location the target object is. This SVM will be trained in an online fashion way to adapt continuously to the temporal variations of the target object, offering a long, stable, and robust tracking in challenging situations.. 1.3 STRUCTURE This document is organized as follows. In Chapter 2, a review of how the tracking problem has been addressed during the past years is elaborated, together with the description of the most relevant state-of-the-art tracking algorithms. In Chapter 3, an introduction to the compressive sensing theory is presented. The details of our developed algorithm are described in Chapter 4, where we propose an efficient tracking algorithm with an appearance model based on the compressive sensing theory. The results of numerous experiments and performance evaluations are presented in Chapter 5. We conclude this document in Chapter 6, where the conclusions obtained during the development of this work are stated, along with the potential future work that could improve the performance of our tracker.. 2.
(12) 2 STATE OF THE ART 2.1 INTRODUCTION Tracking can be defined as the problem of estimating the trajectory of an object over time as it moves around the scene by locating its position in every frame of the video, starting from the bounding box given in the first frame. Trackers may also provide the complete region of the image that is occupied by the object at every frame. Trackers have to deal with objects and backgrounds that can change over time. A key task of a visual tracking algorithm is to describe the appearance of the objects. In most tracking algorithms, it is assumed that the object motion has no abrupt changes. Many of them also use a priori information, such as the initial position and size of the target objects, to simplify the problem. In the literature, there has been some success in building trackers for specific object categories, such as faces [1], humans [2], and mice [3]. However, tracking generic objects has remained challenging because an object can drastically change its appearance over time. Nowadays, there is no single approach that can successfully handle all scenarios. The state of the art is still far from achieving results comparable to the human performance. In the next sections, different perspectives of the tracking task are addressed. Firstly, the main challenges that the tracking algorithms have to face are presented. Secondly, the general structure of how the trackers usually handle the tracking problem is described, and also the main approaches that have been used are introduced. And finally, the most relevant state-of-the-art trackers will be presented.. 2.2 TRACKING CHALLENGES Visual tracking is a very challenging task as many variables have to be accounted for in order to build a robust algorithm. The main problems arise from: . Loss of information caused by projection of the 3D world on a 2D image. Noise in images. Complex object motion. Partial and full object occlusions. Complex object shapes. Background clutter (complex background). The deformable and/or articulable nature of certain objects Illumination changes of the object and the scene. Real-time processing requirements. 3.
(13) . Abrupt camera or object motion.. In the next images, we can see some examples of background clutter, partial object occlusion, and illumination changes.. Figure 1. Background clutter. Figure 2. Partial object (face) occlusion. Figure 3. Illumination changes. (a) and (b) are different frames from the same sequence.. The object appearance variations can also be classified as intrinsic and extrinsic variations. Considering pose and shape object variations, as intrinsic; and illumination changes, camera motion and perspective, and object occlusion, as extrinsic.. 4.
(14) 2.3 GENERAL OBJECT TRACKING A typical tracking system can be decomposed into five components: . Input Video. Target Region. Appearance Model: Object Feature Description. Motion model. Object detection and tracking. Model update.. Target Region. Appearance model. Motion model. Model update. Object detection and tracking Target Location. Figure 4. General visual object tracking structure. The target region module defines the region that contains the object to be tracked. The most common strategy is to use a rectangle (bounding box), an ellipse, or a circle. Alternatively, points can be used, which are appropriate for tracking small objects. Silhouette can be also used to define the target boundaries, which is effective for specific trackers, such as those that track pedestrians. The appearance model of the object of interest is initially built from the target region, and later is updated along the time by the model update module. It represents the objects according to its appearance properties: colour, texture, and pixel intensities. The model is usually a feature vector that is obtained from a feature descriptor technique that synthesizes the appearance or shape of an image region. In tracking many different feature descriptors have been used. Some of them are simple, but computationally efficient, such as the Haar-like features [17][18]. Others are more complex descriptors, such as SIFT (Scale Invariant Feature Transform) [19], and SURF (Speeded Up Robust Features) [20] descriptors. The motion model predicts the location of the object by defining also a searching strategy used to find the most likely regions where the object in the current frame can be. Most trackers assume that the target is close to its location in the previous frame. However, if the motion of the target is fast, the target can be lost. More sophisticated strategies define other searching mechanisms, such as a uniform based sampling search from previous target location [21], a probabilistic Gaussian search also centered on the previous target location, as in IVT [22], or motion prediction based on a linear model to reduce the search space [23]. 5.
(15) The object detection and tracking module makes the decision of the new location of the target by fusing information from previous modules (object appearance and motion), and other prior information. This way, it generates the trajectory of an object over time by locating its position in every frame of the video. The tasks of detection and tracking can be performed separately or jointly. In the first case, possible object regions in every frame are obtained by an object detection algorithm, and then the tracker estimates the best detection correspondences across frames. In the second case, the object region and its correspondences are estimated jointly by means of template matching techniques and snakes. A common approach to object detection is to use the information obtained from just one frame of the video. However, there are other methods that use the temporal information from a set of frames to build a more robust tracker reducing false detections. The last module, the model update, accounts for appearance variations computing an update of the appearance model.. 2.4 STATE OF THE ART ABOUT TRACKERS In the literature, many visual object trackers have been proposed. However, not all of them provide a robust performance in challenging sequences. In this section, a review of the most relevant works is presented. In section 5, we will make a comparison of our proposed method with these other methods in terms of their ability to track objects and handle both intrinsic and extrinsic appearance object variations. It is of prime importance to build an effective appearance model in order to build a successfully tracking algorithm. Based on the appearance models, tracking algorithms can be categorized as either generative [5][22][41][42][43][46], or discriminative [13][18][45][48]. Generative tracking algorithms learn a model to represent the target object, which is used to find the image regions which are most similar to the target model, considering some minimal reconstruction error. For example, the IVT method [22] incrementally learns a lowdimensional subspace representation of the target to adapt to appearance changes, and the 𝑙1tracker [5] represents the objects model with a sparse combination of target and trivial templates. Discriminative algorithms treat the tracking problem as a detection task. This approach, also known as tracking-by-detection, has become very popular recently due to the great progress of object detection. These algorithms aim to difference the target from the background. The classifier is trained with object features from an associated object class, and with background features associated with a different class. This way, the classifier is trained to distinguish between the 6.
(16) target object and the background, by finding a decision boundary that separates one from the other in the defined feature space. The classifier estimates the object location on a new frame by searching for the maximum classification score in a local region around the location of the target on the previous frame. Given an estimated object location, positive and negative samples are chosen to update the appearance model. Commonly this is done by taking the current tracker location as one positive training example, and sampling the distant neighborhood to get the negative training examples. Others methods, such as the MIL tracker [13] and CT [18], sample more than one positive example by searching in the close neighborhood of the current location. These positive and negative examples are used to update the classifier at every frame. This updating strategy is able to handle appearance variations, short-term occlusions, and cluttered background. However, as the updated examples might be noisy and potentially misaligned, they can also lead the tracking to the drift problem. This problem occurs when small errors are introduced in the appearance model, which are accumulated along the time, producing finally that the tracker decision drifts away from the target object. The selection of features plays a key role in the performance of the classification, since it is very important to use a set of features that discriminate one class from the other (the object from the background). Once the features are selected, different appearances of an object are learned with a supervised learning approach, a classifier. These learning methods separate one object class from the other by learning a hyperplane in a high dimensional space. This procedure usually requires a large collection of samples from each object class. The classifiers can be used an on-line or off-line learning approach. The on-line mode trains the classifier with the new samples in each iteration, whereas the off-line mode only trains the model once using a fix set of positive and false samples. Although the on-line mode is faster, it usually obtains lower classification results. However, it is necessary for adapting to the changing appearance of unknown objects. For example, the MIL tracker [13], Struck [48], and CT [18] propose efficient tracking-bydetection algorithms using online learning approaches: a boosting classifier, a SVM classifier, and a naive Bayes classifier, respectively. In general, when there is less variability in the appearance of the tracked object, generative trackers tend to give more accurate results than discriminative trackers, as they usually use richer features. However, when dealing with more complicated environments and object variations, discriminative tracker tend to be more robust, as they model the background to avoid the drifting problem. Typically, tracking algorithms did not take into account the appearance variability of the targets, having only a good performance over short periods of time. Now, the most relevant. 7.
(17) state-of-the-art trackers manage adapting appearance models to address this limitation. The next list presents a selection of the most representative ones: . Incremental learning visual tracker (IVT) [22]. The IVT is a generative tracking method that learns the dynamic appearance of the target via an incremental principal component analysis (PCA) technique. The appearance of the target is represented by a lowdimensional subspace that provides a compact notion of the object to be tracked, instead of treating the target as an independent set of pixels. The incremental PCA algorithm continually updates the subspace model to account for the appearance changes of the target during the tracking. This method, proposed by Ross et al, is effective to handle appearance changes caused by illumination and pose variation. However, it is not robust to some challenging factors such as partial occlusion and background clutter, mainly due to the fact that it uses new observations to update the appearance model without previously detecting outliers.. . Fragments-based visual tracker (Frag) [41]. In the Frag tracker, the target object is represented by multiple image fragments or patches. The location of the object in the first frame is known, and then multiple rectangular regions (fragments) are selected at each new frame close to the previous object neighbourhood. For every fragment, an integral image histogram is extracted, and a voting map is created representing the comparison of each candidate histogram with the corresponding image patch histogram. Finally they minimize an error function in order to combine all the vote maps of the multiple fragments. The position with the minimum error value is selected as the new target position. The frag tracker is a generative tracker that is able to handle partial occlusions and pose changes due to the multiple votes, and also takes into account the spatial distribution of the pixel intensities. However, patch-based model is not updated, and therefore this tracker is sensitive to large appearance changes.. . Multiple instance learning tracker (MIL) [13]. The MIL tracker is a discriminative trackingby-detection algorithm that uses the previous tracker position to extract positive and negative samples from the current frame. Their image representation consists of a set of Haar-like features that are computed for each image patch (sample). The appearance model is updated by using a new online boosting classifier. This tracker presents a novel approach for tracking, since until then there was no tracker in the literature that was based on multiple instance learning. It is able to handle appearance changes due to their continuous updating of the model. However, it is shown. 8.
(18) that the tracker decision is drifted into the background of the scene when there are severe illumination changes. . Visual tracking decomposition tracker (VTD) [42] . The VTD tracker decomposes the tracking problem into several basic motion and observation models, and allows them to interact. The appearance model is represented by multiple basic observation models that are constructed by a sparse principal component analysis (SPCA). Each observation model covers a specific appearance of the object, such as its pixel intensities, edges, or saturation. The motion model is also represented by the combination of two basic motion models, one represents smooth motions, and the other one represents abrupt motions. Then, they design multiple basic trackers by associating the basic observation models and the basic motion models, so that each specific tracker takes charge of a certain change in the object. Finally, all trackers are combined interactively. This tracker is shown to efficiently address the tracking of an object whose motion and appearance change drastically.. . . . Least Soft-threshold Squares Tracking (LSST) [47]. The LSST tracker proposes a generative tracking approach based on the Least Soft-threshold Squares (LSS) method, and build a dictionary (appearance model) using PCA basis vectors. They calculate a LSS distance to measure the difference between an observation sample and their learned dictionary. They also add an update scheme to account for appearance changes. They show that the LSS distance provides good performances when handling outliers (e.g. dealing with partial occlusion). Structured output tracking with kernels (Struck) [21]. The Struck tracker handles the tracking problem with a discriminative approach that learns online a kernelized structured support vector machine (SVM) to provide an adaptive tracking. To describe the target object, they use Haar-like features that the classifier learns to build up and continuously update the appearance model. This tracker takes advantage of the wellknown SVM classifier, which provides a lot of benefits in terms of robustness to noise. Real Time Compressive tracking (CT) [18]. This tracking algorithm is a tracking-bydetection algorithm that builds a discriminative appearance model from features extracted in the compressed domain, and tries to separate the target from its surrounding background via a simple Bayes online classifier. Positive and negative samples are extracted at each new frame to update the model. To build their appearance model, each sample is described as simple generalized Haar-like features, and compressive sensing is used as a dimensionality reduction technique to obtain features in a much lower dimension, leading to an efficient real-time tracker. However, the simplicity of the Haar-like features makes this method not very robust to heavy occlusion, or abrupt appearance object changes. Other representations 9.
(19) such as local binary patterns have been shown to be more effective in handling this kind of variations. . 𝐿1 - tracker (L1) [5]. The 𝐿1 - tracker is a generative algorithm that assumes that the tracked object can be well represented by a sparse linear combination of object templates, and trivial templates. Object templates are used to describe the object class to be tracked, and the trivial templates are used to handle noise and occlusion. Each trivial template has only one nonzero element, representing a specific feature. A good target candidate should involve fewer trivial templates. A candidate sample is sparsely represented by both target and trivial templates, and its corresponding likelihood is determined by the reconstruction error obtained via 𝑙1 - minimization with respect to previously known target templates. Finally, the tracking result is chosen as the target template with the minimal reconstruction error. This innovative tracking method, developed by Mei et al. has been demonstrated to be very robust against partial occlusions. However, there are several issues to be addressed. First, the reconstruction error obtained via 𝑙1 - minimization requires high computational cost. Secondly, the reconstruction error of templates, representing an occluded target and the background, can be both small, and therefore the final selection of the target template, as de the minimal reconstruction error, may be wrong and cause the tracker to fail. The 𝐿1 - tracker is the first tracking approach to deal with the representation of an object as a sparse representation. Due to its high success, it has been used as a reference for many other trackers.. . . Multi-task sparse learning tracker (MTT) [43]. Zhang et al. developed the MTT tracker inspired by the success of the 𝐿1 - tracker. They proposed a multi-task sparse learning approach for visual tracking in a particle filter framework. They model the particles as a linear combination of dictionary templates that are updated dynamically. The next target state is selected to be the particle that has the highest similarity with a dictionary of target templates. They also exploit the similarities among particles to find a joint representation, which is an improvement from the 𝐿1 - tracker that handled particles independently. They prove that this approach is able to deal with partial occlusion more accurately. Moreover, learning the particles representation jointly makes their approach significantly faster than the 𝐿1 - tracker. Online Robust Non-negative Dictionary Learning for Visual Tracking (ONND) [44]. Based on the 𝐿1 - tracker, Wang et al. proposed an online robust non-negative dictionary learning algorithm for updating the target templates, so that each learned template can capture a distinctive aspect of the tracked object. Their approach combines the past information with the current tracking result. The object representation is formulated 10.
(20) . using the Huber loss function. This way they do not need to use trivial templates, leading to a significant reduction of the computational cost. Moreover, they are able to automatically detect and reject occlusion and clutter background. Visual tracking via adaptive structural local sparse appearance model (ASLA) [46]. As previously described, sparse representations have been exploited by previous trackers [5][43][44] to find the best candidate by looking at the minimal reconstruction error. However, most of these trackers only consider the holistic representation of the objects to discriminate between the target and the background, which can lead to failure when there are similar objects, or occlusions in the scene. The ASLA tracker develops a tracking method based on a structural local sparse appearance model that exploits both partial and global spatial information. The algorithm samples overlapped local image patches near the target region, and obtains a similarity measure of each sample with the previously obtained target region. It also has an adaptive update strategy to account for appearance changes of the target. This tracker is able to handle partial occlusion quite effectively.. . Sparsity-based collaborative tracker (SCM) [45]. The SCM tracker is a hybrid method that combines both generative and discriminative approaches. The representation scheme for object tracking consists of holistic intensity templates and local histograms. They build a collaborative model that integrates a sparse discriminative classifier based on the holistic templates, and a generative model using local representations that consider the spatial information, and add an occlusion handling scheme. The update of the appearance model considers both the latest observations and the current one, so as to deal with the appearance changes. This tracker effectively deals with cluttered background due to its discriminative approach, and with heavy occlusion due their developed occlusion handling module.. In Table I, a summary of the state-of-the-art trackers previously described is presented.. 11.
(21) SCM [46]. LSST [48] Struck [49] CT [18] L1T [5] MTT [44] ONND [45] ASLA [47]. VTD [43]. Frag [42] MIL [13]. IVT [22]. Trackers. holistic image intensity and local histograms. Object representation holistic image intensity incremental local intensity histogram Haar-like features hue, saturation, intensity and edge template holistic image intensity Haar-like features Haar-like features holistic image intensity holistic image intensity holitic image intensity local image patches. structured SVM naive Bayes -. generative generative discriminative discriminative generative generative generative generative. sparse principal component analysis principal component analysis sparse representation multi-task sparse learning sparse representation sparse representation. SDC (Sparsity-based discriminaive classifier). boosting generative discriminative. – online multiple instance learning. hybrid. -. generative. principal component analysis. sparse representation. Classifier. Approach. Adaptive appearance model. Table I: Summary of the most relevant state-of-the-art trackers.. 12.
(22) 3 COMPRESSIVE SENSING THEORY 3.1 INTRODUCTION Compressive sensing (CS) has attracted considerable attention in the areas of applied mathematics, computer science, and electrical engineering by suggesting that it is possible to surpass the traditional limits of sampling theory, the Shannon-Nyquist sampling theorem. Shannon-Nyquist sampling theory states that signals (images, videos, and other data) can be exactly recovered from a set of uniformly spaced samples taken at the so-called Nyquist rate of twice the highest frequency of the signal of interest. Digitalization has enabled the growth of sensing systems that are now able to generate a very large amount of data. Unfortunately, in many applications the number of samples that need to be taken following the Nyquist rate is very high, and we end up with far too many samples that are very difficult to manage and process, especially under real-time restrictions or practical memory limitations. To cope with the logistical and computational challenges involved in dealing with very high-dimensional amount of data, compression is usually used. Compression techniques try to find a mathematical transformation (usually a unitary transformation) that provides a sparse or compressible representation of a signal of interest, as both sparse and compressible signals can be well represented by their largest coefficients. More specifically, the meanings of sparse and compressible in this context are: . Sparse representation: a signal of length 𝑁 that can be represented with 𝐾 << 𝑁 nonzero coefficients. Compressible representation: a signal of length 𝑁 that is well approximated by a signal with only 𝐾 nonzero coefficients.. Compressive Sensing is a new theory that grew out of the work of Candès, Romberg, Tao, and Donoho, which states that a finite-dimensional signal having a sparse representation in a basis 𝛹 can be recovered from a small set of measurements using a measurement matrix Φ. Both the measurement matrix and the basis matrix need to be as much incoherent as possible. Sparsity, incoherence, and basis representations are the fundamental concepts of CS. In the following sections of this chapter, a brief introduction to the compressive sensing theory is presented.. 13.
(23) 3.2 FUNDAMENTAL CONCEPTS OF COMPRESSIVE SENSING 3.2.1 Basis representations Signals can be represented with a basis, called a basis representation of a signal. By definition it means that the signal can be represented by a fixed set of basis signals (vectors), each of them with a coefficient associated. 𝑁. A set of vectors {𝜓𝛾 }𝛾=1 is called a basis for ℝ𝑁 , if the vectors are linearly independent and span all ℝ𝑁 . For any x ∈ ℝ𝑁 , there exists a unique set of coefficients 𝛼(𝛾) such that: 𝑛. 𝑥(𝑡) = ∑ 𝛼(𝛾)𝜓𝛾 (𝑡). (1). 𝛾=1. It is possible to represent different signals with the same basis by changing the coefficients associated with each basis signal. A basis representation of a signal can be considered as a way of discretizing it, so as to make it more manageable to work with. Moreover, it is a linear transformation that makes it possible to reconstruct the original signal exactly, i.e. it is an invertible transformation. Basis representations are used very often, for example, when signals are represented in the Fourier or DCT domain: 𝑥(𝑡) = ∑ 𝛼(𝑘)𝑒 𝑗2𝜋𝑘𝑡 , 𝑤ℎ𝑒𝑟𝑒. 𝜓𝛾 (𝑡) = 𝑒 𝑗2𝜋𝑘𝑡. (2). 𝑘𝜖ℤ. A special case of a basis is an orthonormal basis, also known as orthobasis, which is defined 𝑁. as a set of signals {𝜓𝛾 }𝛾=1 in the space ℝ𝑁 that satisfy the following criteria: . They are orthogonal: ⟨𝜓𝛾 , 𝜓𝛾′ ⟩ = {. 1 𝛾 = 𝛾′ 0 𝛾 ≠ 𝛾′. (3). where ⟨𝑥, 𝑦⟩ is the dot product. . They have unit norm |𝜓𝛾 | = 1, where |x| is the norm of x.. . There is no 𝑥 𝜖 ℝ𝑁 , 𝑥 ≠ 0, such that ⟨𝜓𝛾 , 𝑥⟩ = 0.. In matrix notation, a basis can be represented as a 𝑁𝑥𝑁 matrix 𝛹, whose columns are given by the basis signals 𝜓𝛾 . The orthonormality of a basis means that 𝛹 is orthogonal (its column vectors or row vectors are an orthonormal set of vectors), which implies that 𝛹 𝑇 𝛹 = 𝐼, being 𝐼 the 𝑁𝑥𝑁 identity matrix.. 14.
(24) Orthobasis allow every signal to be decomposed as a linear combination of elements, and gives a simple way of calculating the coefficients by computing the inner product between the signal and each basis signal by means of the dot product: 𝛼𝛾 = ⟨𝑥(𝑡) , 𝜓𝛾 (𝑡)⟩. (4). To sum up, if {𝜓𝛾 }𝛾𝜖 𝛤 is an orthobasis for a generic vector space 𝐻, then every 𝑥 𝜖 𝐻 can be represented with the following reproducing formula: 𝑥(𝑡) = ∑⟨𝑥(𝑡), 𝜓𝛾 (𝑡)⟩𝜓𝛾 (𝑡). (5). 𝛾𝜖 𝛤. An important property of having a basis representation of a signal is that it can be represented in a more compact way as a matrix multiplication. For that, two linear operators are associated to every orthonormal basis: the synthesis operator and the analysis operator. . Synthesis operator 𝜳. This operator takes a sequence of coefficients and builds up a signal with them. (6). 𝑥 = 𝛹𝛼. Where 𝛹 is a 𝑁𝑥𝑁 matrix whose columns are given by the basis signals 𝜓𝛾 , and 𝛼 is the length-𝑛 vector of coefficients 𝛼𝛾 . . Analysis operator 𝜳𝑻 . This operator maps the signal to a sequence of expansion coefficients. 𝛼 = 𝛹 𝑇𝑥. (7). It is very important to know that the Parseval theorem is perfectly preserved when a signal is represented by a complete orthobasis. The Parseval theorem states that if 𝑥(𝑡) is a continuous-time signal, and 𝛼𝛾 = ⟨𝑥(𝑡) , 𝜓𝛾 (𝑡)⟩ are its basis coefficients, then 2. ||𝑥(𝑡)|| = ∫ |𝑥(𝑡)|2 𝑑𝑡 = ∑ |𝛼(𝛾)|2 = ||𝛼||2. (8). 𝛾𝜖 𝛤. This means that the energy between the original signal and its coefficients is preserved. Consequently, the inner product is also preserved. A consequence of the Parseval theorem is that any space of signals with an orthobasis can be discretized. Also, the energy preservation essentially means that it preserves all the distance and geometrical relationships of the original signal in its transformed coefficients. To sum up, every continuous signal, for which we can find an orthobasis to be represented with, can be processed by manipulating discrete sequences of numbers (the coefficients).. 15.
(25) 3.2.2 Sparsity and incoherence A signal 𝑥 is 𝐾-sparse if it is composed only by 𝐾 non-zero coefficients: ||𝑥||0 ≤ 𝐾. (9). where ||𝑥||0 is the 𝑙0 norm which gives the number of non-zero terms in the vector 𝑥. Compressive Sensing exploits the fact that many real signals can be represented as sparse signals when expressed with a proper basis. For example, considering an image where all its pixels have different non-zero values, the wavelet basis represents this image as a sparse signal, where the majority of coefficients are small, and those few principal coefficients preserve almost all the information of the image. Figure 5 shows an example with an image.. (a). (b). (c). Figure 5. (a) Original signal with pixel values between [0,255]. (b) Wavelet coefficients of the image. (c) Reconstruction of the image by a small subselection of its wavelet coefficients.. In conclusion, when dealing with a 𝐾-sparse signal, it is possible to discard a big amount of its coefficients without losing much information about the signal. Other important concept to take into account is the incoherence between two matrices. Two matrix are incoherent if no column or row of each matrix can be obtained as a linear combination of the elements of the other matrix. It is important to state that random matrices are highly incoherence respect to any basis Ψ.. 3.3 THE COMPRESSIVE SENSING PROBLEM In a general case, the data is not sparse in the observed domain. However, there might be a possible unknown basis (𝛹) in which the data 𝑥 can be sparsely represented: 𝑥 = 𝛹 · 𝑠 , being 𝑠 a 𝑘-sparse vector.. (10). Compressive Sensing claims that if a signal x ∈ ℝ𝑁 is 𝐾-sparse in some basis Ψ, then it can be recovered with high probability using only a small amount of measurements of x obtained from a measurement matrix Φ ∈ ℝ𝑀 (i.e. a sampling of x), as long as the measurement 16.
(26) matrix Φ is incoherent with the basis matrix Ψ, and satisfies the Restricted Isometry Property (RIP), which is stated in the next sub-section. Mathematically: 𝑦 = Φx= ΦΨs= θs. (11). where 𝑦 is the sampled/measured vector and θ = Φ Ψ is a 𝑀𝑥𝑁 matrix. Notice that the dimensions of 𝑦 are less than 𝑥 (see Figure 6).. Figure 6. Compressive sensing problem.. The problem of CS can be divided into: a) Measurement process. Design of a 𝑀 × 𝑁 measurement matrix Φ that ensures that the information of a k-sparse signal is nearly preserved by the dimensionality reduction process from x ∈ ℝ𝑁 to y ∈ ℝ𝑀 . b) Reconstruction process. Finding a reconstruction algorithm to recover 𝑥 from only 𝑀 ≪ 𝑁 measurements by solving an optimization problem.. Figure 7. Compressive sensing process example.. 17.
(27) These both processes, which are intuitively represented at Figure 7, will be explained further in the following sections.. 3.4 MEASUREMENT PROCESS The measurement process consists on projecting the original signal into a few set of measurements by means of a measurement matrix Φ. This can be seen as a subsampling of the signal, in this case a discrete vector. In this section, the compressive sensing measurement process is reviewed together with an essential property that needs to be fulfilled in order to be able to successfully represent the high-dimensional signals in the lower dimension. Moreover, four measurement matrices are presented that satisfy this property, and therefore are commonly used in CS. The measurement matrix Φ must allow the reconstruction of the 𝑁 dimensional signal x ∈ ℝ𝑁 from the vector of measurements y ∈ ℝ𝑀 being 𝑀 ≪ 𝑁. In general, the position of the 𝐾 non-zero entries of the sparse signal is not known and variable. On the other hand, the measurement matrix should be fixed in advance for all ℝ𝑁 , i.e. it does not depend on the signal information structure. The measurement matrix should preserve the information and geometrical relations between the original vectors and the reduced ones. For this purpose, a sufficient condition for 𝐾-sparse signals is that Φ satisfies the Restricted Isometry Property (RIP). The Restricted Isometry Property (RIP) was introduced in [29] by Candès and Tao, and states that a matrix Φ satisfies the RIP for every 𝐾 -sparse vector 𝑥 if there exists a 𝛿𝑘 ∈ (0,1) such that 2. (1 − 𝛿𝑘 )||𝑥||22 ≤ ||Φx|| ≤ (1 + 𝛿𝑘 )||𝑥||22 2. (12). where ||𝑥||2 is the Euclidean norm of 𝑥. If a matrix Φ satisfies the RIP, then Φ approximately preserves the distance and geometrical relations between any pair of 𝐾-sparse vectors, which implies that it preserves the data structure and its information. Moreover, this is also sufficient for a variety of algorithms to be able to successfully recover a sparse signal from its measurements. In the literature, it has been proven that random matrices satisfy the RIP with high probability if their entries are chosen independent and identically distributed (i.i.d.) according to a Gaussian, Bernoulli, or Achlioptas distribution. It has also been shown that if the matrix is randomly selected from Fourier samples, it also satisfies the RIP. Next, a definition of these matrices is presented:. 18.
(28) . Gaussian measurement matrices. The entries of the 𝑀𝑥𝑁 measurement matrix Φ are i.i.d. samples of a normal Gaussian distribution 𝑁(0,1).. . Bernoulli measurement matrices. The entries of the 𝑀𝑥𝑁 measurement matrix Φ are i.i.d. samples of the Bernoulli distribution: 1 1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 (13) 2 Φ𝒊𝒋 = √𝑀 · { 1 −1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 2 Achlioptas measurement matrices. The entries of the 𝑀𝑥𝑁 measurement matrix Φ are i.i.d. samples of the distribution defined by [50]:. . 1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦. 1 2𝑠. 1 𝑠 1 { −1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 2𝑠. Φ𝒊𝒋 = √𝑀 · 0 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 1 −. (14). where s=2 or 3 satisfies the RIP. The number of measures that are needed to satisfy the RIP for these three matrices has been delimited by a lower bound proven in [27], which focuses only on the dimensions of the problem: 𝑀 ≥ 𝐶 ∙ 𝐾 ∙ 𝑙𝑜𝑔(𝑁/𝐾). . (15). where 𝑀 is number of measurements taken, 𝑁 is the length of the original signal, 𝐾 is the number of non-zero coefficients of the sparse signal, and 𝐶 is a constant. Partial random Fourier measurement matrices. The entries of the 𝑀𝑥𝑁 measurement matrix Φ are composed of randomly choosing 𝑀 rows of the Fourier matrix, and normalizing its columns. These matrices achieve the RIP if 𝑀𝑓𝑜𝑢𝑟𝑖𝑒𝑟 ≥ 𝐶 ∙ 𝐾 ∙ 𝑙𝑜𝑔(𝑁 4 ). (16). Davenport et al. [35] prove that the constant 𝐶 can be well approximated by 1. 𝐶 = 2 · log(√24 + 1) ≈ 0.28.. (17). 19.
(29) Figure 8 shows a visual representation of the dimensionality reduction of the vectors made by CS using as an example a sparse vector of N=65536 with a 98% of sparsity, ergo it only has around 1000 nonzero components.. Compressive sensing lower boundaries 80000 60000. 65536. 40000 20000. 1161. 12273. M. M_fourier. 0 N. Figure 8. Compressive sensing boundaries for a signal with a sparsity of 98%. Finally, it is very important to state that these random measurement matrices are universal [15] in the sense that the matrix θ = Φ Ψ obeys the RIP with high probability for any arbitrary orthobasis Ψ, if the measurement matrix Φ is designed as a random measurement matrix. As mentioned before, signals are not sparse in general, which is why an orthobasis Ψ is needed to have a sparse representation of our signals. However, having a 𝐾 -sparse high dimensional signal s , the measurement process of compressive sensing can be simply formulated as (Figure 9): 𝑦=Φs. (18). Figure 9. Compressive sensing measurement process.. where Φ is a measurement matrix that satisfies the RIP.. 20.
(30) 3.5 RECONSTRUCTION PROCESS Compressive sensing states that it is possible to reconstruct a 𝐾-sparse signal from a vastly undersampled number of measurements by using efficient recovery algorithms. The problem of recovering a 𝑁 dimensional signal from 𝑀 ≪ 𝑁 measurements is highly undetermined, and therefore it has infinite solutions. However, if it is assumed that the 𝑁-length vector is 𝐾-sparse then the situation changes. Let be a x ∈ ℝ𝑁 𝐾 -sparse vector, from which we take 𝑀 ≪ 𝑁 measurements via a measurement matrix Φ: 𝑦=Φx. (19). then to recover the 𝐾-sparse vector 𝑥 we want to solve the 𝑙0 - minimization problem: min ||𝑥||0. subject to 𝛷𝑥 = 𝑦. (20). where ||𝑥||0 is the 𝑙0 -norm of 𝑥. Which will find the sparsest 𝑥 that is consistent with the measurements y. Unfortunately, this is a combinatorial minimization problem that is computationally intractable. Compressive sensing proposes two different practical and tractable alternatives for this problem, convex programing and greedy algorithms. Convex programing. Convex programing techniques address the problem as an 𝑙1 - minimization problem by simply replacing the 𝑙0 norm with the 𝑙1 norm: min ||𝑥||1. subject to 𝛷𝑥 = 𝑦. (21). where ||𝑥||1 is the 𝑙1-norm of 𝑥. It is shown in [32] that if the matrix 𝛷 satisfies the RIP, the solution to this problem is unique, and the same as the one obtained by solving the 𝑙0 - minimization problem. This is because the 𝑙1 - minimization problem promotes sparse solutions. An intuitive example that shows how the 𝑙1 - minimization leads to the sparsest solution of the problem is presented below. If 𝑁 = 2 and 𝑀 = 1, we are dealing with a line of solutions 𝐹(𝑥) = {𝑥 ∶ 𝛷𝑥 = 𝑦} in ℝ2 , then the solution of the 𝑙1 - minimization problem is unique and is the one with minimal sparsity (only one non-zero entry). This is represented at Figure 10. This is a tractable convex optimization problem that can be posed as a linear program. In the literature, there have been proposed many 𝑙1 - minimization algorithms, such as the 𝑙1 magic [33], the L1LS [26], and Basis Pursuit [31].. 21.
(31) Despite their reconstruction success, in practice convex algorithms require a high computational cost, which make them not viable for certain applications that might require real time performance. Greedy algorithms, on the other hand, are not so reliable about finding the best solution, however they are much less computationally expensive.. Figure 10. The 𝑙1 - minimization problem coincides with the sparsest solution. . Greedy algorithms.. Greedy algorithms attempt to directly solve the problem formulated with the 𝑙0 norm, or its noise extension min ||𝑦 − 𝛷𝑥|| subject to ||𝑥||0 ≤ 𝐾. (22). by iteratively choosing the column of the measurement matrix that reduces the approximation error. The most common algorithms are Matching Pursuit and Orthogonal Matching Pursuit [34].. 3.6 COMPRESSED LEARNING1 One first approach for many machine learning and signal processing problems is to transform the data domain to some appropriate measurement domain, to perform the desires processing task in the measurement domain. A common example of this is the use of a Fourier transformation followed by a low pass filtering in signal processing. In many cases, the data can be represented in a very high dimensional space that is sparse or at least has a sparse representation in some unknown basis. Moreover, if the data is 1 Compressed learning is learning directly in the measurement domain.. 22.
(32) approximately linearly separable in this high dimensional space, then compressive sensing can be used as an efficient one-to-one transform to project the desired data into a lower dimensional domain preserving its linear separability. Thus, the learning of the underlying classifier can be accomplished directly in the low dimensional space (compressed learning1) [25]. Performing the classification task directly in the measurement domain decreases the computational cost required. Moreover, learning in the high dimensional domain can be inefficient due to the so called curse of dimensionality. In the literature, several dimensionality reduction algorithms, such as Principal Component Analysis (PCA), have been used for the same purpose. However these algorithms are computationally less efficient, especially for image and text applications, where random projections have achieved more favorable results. In addition, sparse random matrices provide additional computational savings, as is shown by Bingham and Mannila [40].. 3.7 CONCLUSIONS The traditional Shannon-Nyquist theory states that signals can only be exactly recovered when sampled at least at twice their maximum frequency. However, it is nearly impossible for high dimensional signals to be sampled as such rate as it requires a very big amount of samples that are very difficult or even impossible to process. Compressive sensing is born as a new approach to solve this problems. Compressive sensing states that signals can be recovered with high probability from a very small set of samples (measurements) obtained from its random projection by a random measurement matrix Φ that satisfies the Restricted Isometry Property, provided that the signal is sparse or at least has a sparse representation in some domain Ψ that is incoherent with the measurement matrix Φ. 𝑦 = Φx= ΦΨs= θs. (23). In this chapter, four random measurement matrices that satisfy the RIP according to certain dimensions restrictions have been introduced, which are the Gaussian matrices, the Bernoulli matrices, the Achlioptas matrices, and the partial random Fourier matrices.. 23.
(33) 4 DESCRIPTION COMPRESSED FRAMEWORK. OF THE DEVELOPED SENSING BASED TRACKING. 4.1 INTRODUCTION In this project we have developed a new tracking-by-detection framework for tracking general objects in video sequences. This framework makes the use of a robust appearance object descriptor to build a highly discriminative and efficient appearance-based object model. Robust object appearance descriptors usually produce feature vectors in a very high dimensional domain that make them very difficult to work with due to memory restrictions. In our case, DSLQP descriptor developed in [39] is used, since it is very robust to illumination changes, and highly discriminative. This descriptor generates very long, but also sparse, feature vectors. The high sparsity of these feature vectors made us explore the Compressive Sensing theory to use it as a dimensionality reduction step, which is able to project the high dimensional feature vectors to a much lower dimensional domain, while almost preserving the information of the original feature vectors. This reduction in dimension allows to alleviate in great extent the so called “curse of dimensionality”, which has a direct negative impact in the performance of the classifiers that use feature vectors to determine the object location. We propose an effective tracking algorithm that is inspired on the work of the CT Tracker [18], but using the aforementioned DSLQP descriptor in combination with a compressed sensing stage. Thanks to its high discriminative capacity, it achieves to separate the target object from the surrounding background. In addition, an online Support Vector Machine (SVM) classifier is used, which is able to adapt to the temporal variations of the target object, offering a long, stable, and robust tracking in challenging situations.. 4.2 SYSTEM OVERVIEW In order to track objects from a video sequence, we assume that the tracking bounding box in the first frame is known in advance. Our tracking approach is formulated as a detection task, which can be divided into two phases: the training phase, and the tracking phase (see Figure 11 and Figure 12). The training phase continuously updates the object model at every time step, which in turn is used by the tracking phase to estimate the location of the target object at each time step.. 24.
(34) Negative samples. Feature description step SVM classifier training. Positive samples. Feature description step. Appearance model. Figure 11. Training phase at the n-th frame. Samples. Feature description step. SVM classifier testing. Best sample score. New target location. Figure 12. Tracking phase at the (n+1)-th frame. The design of a model that describes the dynamic appearance of the target is very important, since the object appearance may change drastically due to intrinsic and extrinsic factors, as discussed in Section 2.2. The training phase achieves to adapt the appearance model, in an online fashion, to reflect these changes in order to obtain a robust tracking. This is accomplished as follows. Assume that the location of the target object is known at the step 𝑡, 25.
(35) and represented by a rectangular bounding box. The object appearance model is updated by computing positive and negative feature vectors (or samples) in the neighborhood of the current location (see Figure 11). Positive feature vectors corresponds with regions containing the target object, whereas negative feature vectors contain background regions. The set of positive feature vectors represent possible object appearances, while the negative ones represent the appearance of the near background. The positive samples are taken near the position of the current target, and the negative samples are taken in the surrounding areas of the current target location. The feature extraction step used to characterize the multiple image regions will be described in depth in the next sub-section. The positive samples are manually labelled as 1, and the negative samples are labelled as −1. These vectors are used to train an online SVM classifier that clusters the data into two classes (target object and background) by finding the maximum marginal hyperplane that separates one class from the other. Both classes are continuously updated at each time step with the extracted positive and negative samples obtained at each frame. Therefore, a continuous appearance model update is performed. By training the SVM in an online fashion, it is possible to track objects without requiring an initial database of training images, which most of the times is not available for general objects. The result of the SVM training is a vector 𝑤 that represents a hyperplane that separates the two classes (see Figure 13).. Figure 13. SVM training. After the training phase has updated the appearance model at time step 𝑡, the tracking phase tries to estimate the new location of the object at time step 𝑡 + 1 using the updated appearance model (Figure 12). Assuming that that the target is close to its previous location, a uniform-sampling based search is performed around that location, i.e., the search is accomplished by means of a uniform sampling of coordinates inside the image region where the target object is expected to be. The obtained samples are delivered to the feature extraction step to compute a feature vector per each candidate location, which in turn are used as inputs by the SVM classifier. This computes a classification score for each sample (feature vector) using the updated object appearance model (estimated in the training phase and represented by a 26.
(36) hyperplane). The new target location is determined as the sample with the highest classification score. This process is done iteratively for each pair of consecutive frames: update of the appearance model from the estimated object location, and estimation of the new target locations using the updated appearance model. 4.2.1 Feature extraction The feature extraction technique takes an image region bounded by a bounding box as input, and generates a feature vector that is highly discriminative and, at the same time, low dimensional thanks to a dimensionality reduction step based on compressive sensing (CS). Figure 14 shows the two steps involved in the feature extraction step: computation of the DSLQP descriptor and dimensionality reduction via CS.. Sample. DSLQP descriptor. Feature vector CS step. Final feature vector. Figure 14. Feature extraction implemented.. The DSLQP descriptor [39] is a state-of-the-art feature descriptor that is not only very robust to illumination changes, but also very discriminative in recognizing objects such as hands and faces. The DSLQP descriptor is inspired on the Local Binary Pattern (LBP) descriptor [36], which has become very popular for several tasks, such as face recognition and texture analysis. The main idea of the LBP (see Figure 15) is to calculate the difference between the intensity value of a central pixel and its 3x3 neighbourhood. The computed differences are thresholded using the sign function, and then encoded in an 8-bit binary value. Finally, all the extracted binary values are converted to decimal numbers, which are used to generate a. 27.
(37) histogram of 28 = 256 bins, which is the resultant feature vector that describes the image region.. difference. threshold LBP: 00111111 Decimal: 63. (a). (b. (c). Figure 15: Local Binary Pattern example: (a) Grey scale neighborhood. (b) Computed intensity differences. (c) Binary value.. The most important attributes of the LBP descriptor are its robustness to dramatic illumination changes, and its computational efficiency. In the literature, some variations have also been proved to be robust to rotations and scale changes. The main modification proposed by the DSLQP descriptor is that given a pixel region, instead of computing the difference between the central pixel and its neighbours, the difference among all the pixels on the region is computed. This modification allows to capture more local structure information, improving the discriminative power of the descriptor. Figure 16 shows the steps involved in the DSLQP descriptor, using as reference a neighbourhood of 4 pixels. The computed differences are concatenated into a 10-bit binary value, which are converted to a decimal number that is used to generate a histogram of length 210 = 1024.. Figure 16: DSLQP descriptor.. The original DSLQP descriptor has no global spatial information. To solve this issue, we divide our image region into different blocks of equal size, computing a different DSLQP histogram per block. The final feature vector is obtained by concatenating all the histograms into a single vector (Figure 17).. 28.
(38) Figure 17. Histogram concatenation. The generated feature vectors are very sparse. Moreover, this effect is more significant as the number of spatial divisions increases. On the other hand, as the number of neighbours and spatial division increases, higher is the feature vector dimension. The result is that the use of these vectors could be impractical due to memory requirements. Compressive sensing is the solution adopted to reduce the DSLQP feature vectors with hardly loss of information, guaranteeing that the geometrical relationships among the feature vectors are preserved. As explained in Chapter 3, one key factor in CS is the design of the measurement matrices that successfully achieve the Restricted Isometry Property (RIP). We have implemented three CS measurement matrices Φ: . Gaussian measurement matrix, whose entries are i.i.d. samples of a normal Gaussian distribution 𝑁(0,1).. . Bernoulli measurement matrix, whose entries are i.i.d. samples of the Bernoulli distribution, given by the following equation. 1 2 Φ𝒊𝒋 = √𝑀 · { 1 −1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 2 1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦. (24). where 𝑀 is the number of rows of the matrix.. 29.
(39) . Achlioptas measurement matrix. The entries of the measurement matrix Φ are i.i.d. samples of the following distribution: 1 6 2 Φ𝒊𝒋 = √𝑀 · 0 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 3 1 {−1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 6 1 𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦. (25). The three matrices satisfy the RIP when at least 𝑀 ≥ 𝐶 ∙ 𝐾 ∙ 𝑙𝑜𝑔(𝑁/𝐾) , where the constant 𝐶 ≈ 0.28, 𝑀𝑥𝑁 are the dimensions of the matrix, and 𝐾 is the sparsity of the expected input feature vector. The reduced vectors allow not only to reduce the computational cost and memory requirements of the system, but also to improve the performance of the SVM classifier by training and classifying the samples in a much lower dimensional space.. 30.
(40) 5 TRACKING RESULTS 5.1 INTRODUCTION In this Chapter different experiments of the proposed tracking algorithm are presented. Firstly, the designed feature extraction step (see Section 4.2.1) is evaluated with different configurations of the DSLQP feature descriptor, together with different measurement matrices. Following, a database with very different challenging video sequences is presented, together with the evaluation metrics used to evaluate the performance of the tracker. Then, the results of an exhaustive evaluation of the developed tracker is presented, together with the description of the most important parameters that need to be addressed in order to achieve the best performance. Finally, the optimal results of our tracker are presented against 12 state-of-theart trackers, proving the success of the performance of the developed algorithm.. 5.2 EVALUATION OF THE REDUCED VECTORS The efficiency of the dimensional reduction carried out by the application of the CS framework has been evaluated by quantifying both the reduction and the distortion of the resulting reduced feature vectors. We have tested our feature extraction step using four different variations of the DSLQP descriptor, changing the number of blocks into which we divide our images: 𝐷𝑆𝐿𝑄𝑃1 : no spatial division, 𝐷𝑆𝐿𝑄𝑃2 : 4 blocks, 𝐷𝑆𝐿𝑄𝑃4 : 16 blocks, and 𝐷𝑆𝐿𝑄𝑃8 : 64 blocks. Table II shows examples of different DSLQP configurations, along with the dimension 𝑀 of the reduced vectors obtained for each original vector of dimension 𝑁 and with a sparsity value of 𝐾. As we can see, the resulting reduction is very significant. Table II: Different DSLQP configurations.. N Sparsity (%) K M. 𝑫𝑺𝑳𝑸𝑷𝟏 1024 88.98 113 70. 𝑫𝑺𝑳𝑸𝑷𝟐 4096 90.99 369 249. 𝑫𝑺𝑳𝑸𝑷𝟒 16384 95.06 810 682. 𝑫𝑺𝑳𝑸𝑷𝟖 65536 98.49 988 1161. To measure the distortion, the Mean Squared Error (MSE) between the original feature vector and the reconstruction of the reduced feature vector has been computed. For the reconstruction task, the recovery algorithm L1LS [26] has been used, which tries to minimize the following 𝑙1 - minimization problem: min ||𝑥||1. subject to 𝛷𝑥 = 𝑦,. (26) 31.
Documento similar