5. MARCO TEÓRICO
5.4 Planes de Gestión Ambiental
Occlusions are another problem that characterize video-surveillance systems based on computer vision; for example, in the environments of Fig. 10.9, people can walk behind the columns, and, in such a situations, the system is likely to lose them. To face this problem, we have introduced some rules inside the tracking system. When a track disappears, it is not deleted immediately, but its appearance
(a) Input - frame 1076 (b) Input - fr. 2155 (c) Input - fr. 2156
(d) Output - fr. 1076 (e) Output - fr. 2155 (f) Output - fr. 2156
(g) Background - fr. 1076 (h) Background - fr. 2155 (i) Background - fr. 2156
Figure 10.10: Sensor-guided background update.
is kept unchanged and an estimation of the track position is computed exploiting a constant velocity assumption. If the person returns visible again with a similar appearance and a position near to the predicted one, then the system assigns the same label of the disappeared track. However, if the person changes direction during the occlusion, the system is not able to correctly assign the label anymore.
For this reason, we exploit a PIR sensor node placed behind the column. As above mentioned, these sensors detect not only the presence of a person, but also his direction. Then, we can detect a change of direction capturing couples of op- posite direction events sent in a short temporal window. In such a situation, the direction of the motion applied to the track is inverted in order to estimate the position frame by frame.
In Fig. 10.11 an example of consistent labeling after an occlusion is reported. The person walks behind a column and, during the occlusion, inverts his direction. The computer vision tracking algorithm is not able to solve the consistent labeling because the person appears again too far with respect to the predicted position (computed with a constant velocity assumption). Using PIR sensors, instead, the change of direction is detected and the estimated track position can be properly updated. Then, when the person appears again, the tracker assigns the same label
(a) Before occlusion (b) After occlusion
Figure 10.11: Consistent labelling after an occlusion exploiting a PIR sensor node to detect direction changes
(24) assigned before the occlusion (see Fig. 10.11(b)).
Differently from the previous example, in this case the sensor network detects an event and the manager thread informs the involved cameras of it. Then, if a tracking system has detected an object in the corresponding position, the motion direction is changed accordingly.
10.6
Conclusions
Distributed surveillance is a challenging task for which a robust solution, working on a 7/24 basis, is still missing. This chapter is meant to propose an innovative solution that integrates cameras and PIR (Passive InfraRed) sensors. The proposed multi-modal sensor network exploits simple outputs from the PIR sensor nodes (detecting the presence and the direction of movements of people in the scene) to improve the accuracy of the vision subsystem.
Two case studies are reported. In the first, the vision system, based on back- ground suppression, fails due to a door that is opened. Since background is not immediately updated, the door is detected as a moving object (resolution is not sufficient to enable a correct motion detection). In this case, a PIR sensor is used to discriminate between the opened door and a real moving person. In the second case study, a person changes its direction when it is occluded by a column. The vi- sion tracking algorithm relies on the constancy of the speed during occlusions and thus fails. A pair of PIR sensors are, instead, used to detect the change in direction and alerting the vision system.
The reported results demonstrate that using the integration between PIR sen- sors and cameras the accuracy can significantly be increased.
Semantic Video Transcoding
11.1
Introduction
In this chapter we present a framework for remote surveillance. The proposed sys- tem aims to keep under control an environment using computer vision techniques to generate a compact representation of the scene and virtually reconstructing it on mobile devices as final result.
Remote surveillance on mobile devices is becoming a wide market demand in many contexts. First, centralized control centers for visual surveillance have management costs much higher than a network of distributed and mobile control points. Centralized surveillance also has some limits in terms of efficacy, since people employed for watching tens of monitors and videos coming from hundreds of cameras cannot keep their attention on all the controlled scenes. Therefore, there is a high demand of connections between control centers and distributed mobile platforms to send in real-time surveillance data, images and videos.
In the last years the technology of robust video streaming on mobile devices has been improved but sometimes it is unfeasible or cannot be provided with an accept- able quality due to the unavailability of the connection or the lack of transmission stability. Moreover, the current standards GPRS or UMTS frequently insert a not negligible delay in streaming transmission, so that surveillance images cannot be received in real-time everywhere and every time.
A possible solution is the intra-media transcoding of visual data to textual in- formation with a manual or automatic extraction of surveillance knowledge from the videos. Examples are the detection of people, moving objects, or suspicious abandoned packs in the scene, the counting of how many people are moving, the estimation of their position and their behavior. The textual information can be easily transmitted in real-time to mobile devices, that can show them in a textual form or with a graphical interface, e.g. virtually reconstructing the scene. The goal is the definition of a virtual environment corresponding to the real controlled one where the useful surveillance data are kept. In such a manner the textual surveil- lance knowledge can be transmitted in an affordable and robust way to mobile
devices where the visual information can be reconstructed. Geometric 3D data on the background scene are pre-loaded in the mobile device and only the dynamic information is transmitted in real time.
Figure 11.1: Scheme of the overall architecture.
In the proposed architecture of remote surveillance platforms (see Fig. 11.1), the video streams are processed in real-time by local servers, which extract the surveillance knowledge on the environment, provide a semantic transcoding of the visual data and make the data available for mobile connection.
The semantic transcoding can be provided in intra-media modality as well as the inter-media modality above described. In the first case videos are processed and modified in order to provide compression rates that are acceptable in remote mo- bile connections without loosing important data. Differently from the inter-mdeia modality the output of the transcoding system is still a video stream and not a tex- tual information. As in [218, 219] background and moving objects can be coded at different compression rates, sent separately and reconstructed at the client-side. Instead, in the intra-media modality the scene is virtually reconstructed with com- puter graphic techniques, allowing the transmission of few textual data only. This second modality is less realistic but has several peculiar advantages. Firstly, the 3D scene is reconstructed so that all 3D information is available, new views can be provided, different from the field of view of the camera and a virtual interactive navigation is allowed. Secondly, there is the possibility to filter out some critical data that should not be transmitted for privacy issues. For instance, current laws of many countries do not allow the use of surveillance data in commercial sites such as offices or super-markets. With a virtual reconstruction the privacy protected data (e.g., the face) can be eliminated. Therefore we reconstruct the presence of peo- ple in an environment, their motion, their trajectory, and their possible interactions without any individual or biometric information. In this way, security employers equipped with mobile devices can manage the multimedia information in real-time interacting with the environment and in a total respect of privacy issues. In this
chapter both the inter-media and the intra-media modalities are presented and dis- cussed. We assume to have a computer vision module able to extract a list of objects of interest. As described in Chapter 5, each object Oj detected in the time
t has a set of temporal attributes: position (x(t), y(t)) in the ground plane; the bounding box BB in the image plane, appearance image A(t), that is the color aspect, and so on. The objects recognized as a person inherit some additional at- tributes, that are:
Status(t) ∈ {moving, still}
P osture(t) ∈ {standing, sitting, crawling, laying}. (11.1) This extracted information are annotated in a text file, that is the basis of a standard annotation for stored video server that uses MPEG7 standard to keep informa- tion for historic and content-based search. The textual data can be downloaded, transmitted in a streaming manner over standard http-based connections to mo- bile clients, or used by the transcoding server to generate semantically compressed video streams.
11.2
Inter-media transcoding
11.2.1 3D virtual environment
In this section we directly present an application which consists of video surveil- lance of an indoor environment. First of all, the 3D virtual indoor environment has been built by using the JSR 184 software environments.
Figure 11.2: Input frame from a monitored room (left) and the output of the video surveil- lance module (right) corresponding to the virtual reconstruction of Fig. 11.4. Blobs, posi- tions, and postures of the detected people are super-imposed.
A frame of the reference room used in our experiments is reported in Fig. 11.2. The background model is a 3D model made of mesh objects that refer to static elements presents in the scene: table, chairs, cabinet with TV and pavement. For representing the scene we used two ambient lights sources (instances of the am- bient node class) in order to obtain a photo realistic rendering of the indoor envi- ronment. The human figures are instances of the correspondent scene graph node,
and include simple human textures. In fact a photo realistic representation of the human figures is less crucial (in this application case study) than the actions tak- ing places in the virtual environment. The application takes as input a textual data stream coming from the video surveillance extraction module. These data stream contains the extracted positions and postures of the tracked human figures. The 3D virtual environment reconstruction module takes these data and renders the human figures placing them in the environment according to their position and posture data.
The rendered postures are taken as classes from the extraction module, in the presented system four major classes are considered: standing, crawling, laying and sitting. In order to enhance the system performances the animations are executed at human figures positioning level and not on the models themselves. The system is already capable of implementing animations interpolations as already written in the past paragraph, but it is seems more reasonable to use them for other kind of domains like children surveillance (where the running speed could be a measure of the actual danger and it’s crucial to visually represent that data). The 3D virtual environment also has many advantages, especially the possibility of changing the point of view. We support three different points of view:
- Standard view: this view is basically the same as the calibrated cameras posi- tioned in the real environment. It is very useful in order to have a general view of the scene.
- Bird eye view: this view is very suitable for identifying the distances among the human figures and the objects in the scene. Spatial positioning is an important synthetic information to be used as a visual clue for detecting which actions are taking place in the environment.
- Interactive view: this view, not only presents the scene with a certain angle (usually the same as the standard view) but it is able to navigate the scene by using three different camera motion modalities. These modalities are: move (allow translations with respect to z and x planes), rotate (rotation among all the planes), and float (allow translations with respect to y and x planes).
The 3D virtual environment thus could represent and effective approach for easily visualize and detect people and action in a video-surveilled environment. The use of multiple synthetic points of view clearly helps users in identifying dif- ferent situations and by different angles of view. Since our system is extensible, many different points of view could be implemented, for instance for certain do- mains, a first person view could be useful for understanding the danger or damages occurred in the environment (elderly people surveillance, and care).
11.3
Software components
The software components used in our approach are based on M3G (Mobile 3D Graphics) library. This library consists of represents a high-level API (Application Programming Interface) for the implementation of graphics rendering on mobile
device. This choice leads us to develop a system in which the 3D scene rendering is carried out by on client side (mobile devices).
The mobile 3D graphics libraries used in our system are included in the JSR 184 standard specification, thus enabling the support by a wide set of devices; moreover this is a general approach because of the standard adoption of the M3G file format specification.
The M3G file format represents all the objects present inside a three-dimensional scene through the use of a tree structure called scene graph. Every node of this structure describes and defines any physical or abstract object of three-dimensional worlds (cameras, lights, meshes, animations).
Figure 11.3: Tree structure of the 3D virtual environment
The root of this tree is always a World object. Fig 11.3 shows a reduced version of the tree structure used for building the presented 3D virtual environment.
In order to simplify the diagram, in the above scene graph we have not inserted the Mesh objects that refer to static elements presents in the scene: table, chairs, cabinet with TV and pavement. In our scene model two different classes of lights are defined: spot and ambient. The Background object is represented by a back- ground node with a colour attribute in the format RGBA (Red, Green, Blue, and Alpha for transparencies). The interactive camera is a child of a Group object. This object has the task of managing the camera translation and rotation on the y axis. The camera object manages rotation on the x axis. With this separation all the changes on x axis don’t have effect on camera movements and rotation on y axis.
The more complex objects are the two MorphinMesh that contain all necessary data for the visualization and animation of classes of human figures (e.g., men, children). The MorphingMesh class allows to define models animated through morphing techniques: it requires to define the key models of an animation and an automatic interpolation procedure will compute shapes vertexes transformations needed for a smooth animation. By using the M3G file format and supporting the JSA 184 standards we noticed that the application jar file including models images and textures, testing data and source code does not exceed the 90kb.
More in detail M3G is a new standard file format, with extension m3g, used for storing all the scene graph data and the information for loading them in a custom
application. In this manner the data of a scene, comprised the animations, can be created using common existing three-dimensional modelling programs. However, there is not a standard approach for creating and exporting this kind of files. Thus we decided to use a particular technique to import models in obj (wide spread computer graphics format) data type.
Our approach for creating M3G files is inspired by Andrew Davison work [220], which is based on the Java3D API. The approach consists of converting three-dimensional models in a Shape3D object from which is possible extract the vertexes, normal vectors and coordinates of texture lists. These lists are then op- timized through the use of triangle strip. Finally a class is generated containing all the necessary methods to create and manage (starting from the acquired lists) a three-dimensional object in M3G.
11.4
Experiments
We provide several experiments in both indoor and outdoor environments. The real-time performance of the served-side depends on the number of cameras, the number of people and their size in the image space. In outdoor environments with cameras mounted in a high position tens of people can be captured from 4 cameras at about 10fps. The same performance is achieved indoor with 2 or 3 people as in the previous examples.
Figure 11.4: Example of different views of a3D virtual environment. a) standard view, b) interactive view, c) bird-eye view.
An example of the output of the indoor system is reported in Fig. 11.4, where the three different views (standard view, interactive view, and bird-eye view) cor- responding to the input frame of Fig. 11.2 are shown.
11.5
Intra-media Transcoding
11.5.1 Introduction
In this section a smart video server, implementing new semantic transcoding tech- niques to connect directly with PDA clients, is described. The transcoding model,
using the philosophy of MPEG-4 object-based compression, allows video stream- ing of part of the videos that are semantically valuable to the user: it compresses differently objects and event and operates also a temporal and size downscaling.
11.5.2 Related Works
Semantic transcoding is often employed in streaming severs to adapt the multi- media content and, specifically, reformat video code, in order to cope with user needs and user constraints [221]. Typically, variable compression, size, and color downscaling are useful to save bandwidth and deal with limited display resource of devices such as PDAs. The exploitation of semantics at this level means to use the knowledge of the video content to compress the video differently in the space and in the time. The MPEG-4 standard was the first structured proposal to han- dle differently background and foreground objects in the scene, to compress them and to send to the user for a specific encoding. At the same time, MPEG-4 Core Profile (needed for object functionalities) codec is computationally heavy and a real time decode and encode phase can now be achieved with hardware acceler- ator only. Thus, we proposed some simplified version that can be exploited also with software codec only [222], and in this section we describe a solution spe- cially conceived for PDA clients. The valuable contribute of this proposal is to define a semantic transcoding server operating on-the-fly in cascade with the com- puter vision system. In such a manner the knowledge of segmented people and objects guides the adaptive compression. In [223], a similar approach for traffic surveillance is proposed, where objects are segmented and compressed differently in MPEG-4 standard for an offline annotated video server. Further, in [224] not only an object transcoding is discussed, but also the semantics is exploited at level of both objects and events. This unified framework is here presented in a domotics