Anomalous vehicular traffic detection through spectral techniques and semi&un-supervised learning models


Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Estado de México
School of Engineering and Sciences

Anomalous vehicular traffic detection through spectral techniques and semi&un-supervised learning models

A dissertation by Roberto Carlos Vazquez Nava

Submitted to the School of Engineering and Sciences in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Atizapán de Zaragoza, Estado de México, 23rd September, 2019

Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Estado de México

The committee members hereby certify that they have read the dissertation presented by Roberto Carlos Vazquez Nava and that it is fully adequate in scope and quality as a partial requirement for the degree of Master of Science in Computer Science.

Dr. Miguel González Mendoza, Tecnológico de Monterrey, Campus Estado de México. Advisor.
Dr. Oscar Herrera Alacántara, Universidad Autónoma Metropolitana. Committee chair.
Dr. Miguel Angel Medina Pérez, Tecnológico de Monterrey, Campus Estado de México. Committee secretary.
Dr. Rubén Morales Menéndez, Dean of Graduate Studies, School of Engineering and Sciences, Tecnológico de Monterrey, Campus Monterrey.

Atizapán de Zaragoza, Estado de México, September 23, 2019.

Declaration of Authorship

I, Roberto Carlos Vazquez Nava, declare that this thesis proposal titled 'Anomalous vehicular traffic detection through spectral techniques and semi&un-supervised learning models' and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

©2019 by Roberto Carlos Vazquez Nava. All Rights reserved.

Abstract

Anomalous vehicular traffic detection through spectral techniques and semi&un-supervised learning models
by Roberto Carlos Vazquez Nava

The early development of the Internet of Things has allowed the construction of collaborative systems that respond effectively to events captured by sensors and devices and that share information among themselves. This paradigm has opened up the development of initiatives such as Smart Cities. The main objectives of a Smart City are to improve the standards of people's daily lives and to deal with the various urban problems in order to satisfy the needs of present and future generations; one of these problems is mobility. For example, poor transport management negatively impacts a city by increasing air and noise pollution, trip times for drivers, energy consumption, and vehicular congestion.

Among the range of problems related to mobility is anomalous vehicular traffic (AVT). It can have different causes, for example an accident, an event, road works, or a natural disaster. Detecting it makes it possible to take short-term and long-term decisions, for example alerting drivers to the anomaly so that they can make better decisions during their journey. Previous research on the detection of AVT bases its solutions on inductive loop sensors, video surveillance, and crowdsourcing systems. However, these solutions are limited in the sense that they focus mainly on creating new algorithms and do not pay attention to the underlying information that can be extracted from vehicular traffic. In this work, vehicular traffic is treated as a univariate time series in which only one feature is available. Feature extraction is therefore relevant when information is scarce: new features contribute information that enhances anomaly detection models by facilitating their learning process and improving their performance.

Ramp Metering is used at freeway on-ramps to regulate the vehicular traffic entering the freeway according to current traffic conditions. The device has been shown to reduce vehicular congestion during rush hours, which is when it is active. The problem arises when anomalous vehicular traffic occurs in hours when Ramp Metering is deactivated and the traffic can no longer be managed well. To address both the lack of features in univariate time series and the problem of operating Ramp Metering in extended hours, we propose a feature extraction methodology that enhances anomaly detection models for detecting anomalous vehicular traffic at freeway on-ramps. As a first step, an algorithm for missing values imputation is proposed, followed by the extraction of temporary, spectral, and aggregation features, all of them at different time aggregations. Finally, unsupervised models perform the anomaly detection under semi-supervised and unsupervised learning. The unsupervised models used in this work are Isolation Forest, Local Outlier Factor, One-Class Support Vector Machine, and Angle-Based Outlier Detection.

The methodology was evaluated on real and synthetic vehicular traffic databases. Experimental results show that the spectral, temporary, and aggregation features enhance the detection of anomalous vehicular traffic. The Isolation Forest algorithm was compared with an algorithm from the literature based on Markov-modulated Poisson processes and obtained the best performance.

Acknowledgments

I appreciate the contribution of all the people who contributed directly or indirectly to the development of this thesis work; thank you all for the support provided.

First, to GOD, who enlightened me in difficult situations and, through faith, has allowed me to reach my goals. You will always be my light and support.

My deep gratitude to my thesis advisor, Dr. Miguel González Mendoza, for being my guide in this complicated process. I thank him for his patience, dedication, and motivation; the knowledge that he transmitted to me laid the groundwork for carrying out my thesis work. He made the difficult easy, and it was a privilege to have his guidance and support.

To my mother and family, for supporting me in this journey; their confidence and motivation helped me to move forward with my dreams. Thank you for teaching me to work with discipline and effort; at the end there is a reward.

To my classmates Ismay, Germán, Javi, Miryam, Raúl, and Nicolas, whose moral support, advice, and accurate comments contributed greatly to the continuation of this project. Thank you for your friendship and for sharing a common purpose.

To Ismay, for your unconditional support and for the outlooks, comments, and ideas that you gave me during this journey. You always listened to my concerns and had an answer for them. I thank you not only for the help provided but also for the companionship and, above all, for giving me your friendship. You are a great person and an excellent friend.

To Soto, for pushing me into the adventure of studying a master's degree; without his intervention I would not have embarked on this great project. Thank you, friend.

To my friends from Instituto Politécnico Nacional, Rodrigo, Soto, and Ángel, who have always been with me in good times, in bad times, and in worse ones. Your comments and words of encouragement helped me to continue and to show that the knowledge learned in our house of studies is invaluable.

I would like to express my deepest thanks to the Instituto Tecnológico y de Estudios Superiores de Monterrey for giving me the opportunity to belong to such a distinguished institution. This house of studies opened its doors to me so that I could study my master's degree. I thank all my teachers, who gave me their knowledge to continue with my preparation.

Contents

Declaration of Authorship
Abstract
Acknowledgments
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Hypothesis
  1.4 Research Questions
  1.5 Objectives
  1.6 Contributions
  1.7 Thesis Structure

2 Related Work
  2.1 Infrastructure based on fixed sensors
  2.2 Infrastructure based on surveillance cameras
  2.3 Infrastructure based on Crowdsourcing
  2.4 Limitations of the revised proposals

3 Methodology
  3.1 Overview of the methodology
  3.2 Imputation process
    3.2.1 Missing Data
    3.2.2 Missing values mechanisms
    3.2.3 Univariate time series imputation methods
  3.3 Feature engineering
    3.3.1 Temporary features
    3.3.2 Spectral features
    3.3.3 Aggregation features
    3.3.4 Feature selection
  3.4 Anomaly detection
    3.4.1 Common anomaly detection techniques
    3.4.2 Categories of anomaly detection algorithms
    3.4.3 Selected anomaly detection algorithms

4 Experimentation and Results
  4.1 Databases
  4.2 Experimental setup
  4.3 Methodology implementation
    4.3.1 Missing values imputation
    4.3.2 Features Extraction
      4.3.2.1 Temporary Features
      4.3.2.2 Spectral Features
      4.3.2.3 Aggregation features
    4.3.3 Databases dimensions
    4.3.4 Feature selection
  4.4 Features analysis
    4.4.1 Generating and temporary features
    4.4.2 Most relevant features
    4.4.3 Discarded features
  4.5 Anomaly detection results
    4.5.1 Time and bands analysis
      4.5.1.1 Dodgers: lower and upper limits
      4.5.1.2 Atizapan: lower and upper limits
      4.5.1.3 Analysis of the results: lower and upper limits
      4.5.1.4 Best AUC values from Dodgers databases
      4.5.1.5 Best AUC values from Atizapan databases
    4.5.2 Models analysis
  4.6 Comparison between MMPP and iF

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work

A Digital filters
  A.1 Signals processing
  A.2 Filters
    A.2.1 Systems
    A.2.2 Z-Transform
    A.2.3 Transfer function
    A.2.4 Laplace transform
  A.3 IIR Filter
  A.4 Filter terminology
  A.5 Frequency Domain Responses
  A.6 Filter Design
    A.6.1 Chebyshev Type I Filter
    A.6.2 Frequency Transformation
    A.6.3 Bilinear transformation

B Features

C Simulation with SUMO
  C.1 Map for simulation
  C.2 Parameters to generate vehicular demand
  C.3 Normal and anomalous vehicular traffic setup
  C.4 Simulation

Bibliography

List of Figures

1.1 Anomalous vehicular traffic example.
3.1 The methodology proposed.
3.2 Application example of literature's imputation methods.
3.3 Vehicular traffic scalogram.
3.4 Anomaly detection categories.
4.1 One-week vehicular traffic of the Dodgers database.
4.2 Lechería-Chamapa freeway on-ramp.
4.3 One-week vehicular traffic of Atizapan database.
4.4 Missing values from Dodgers database.
4.5 Imputed missing values example by the proposed algorithm.
4.6 Scalogram from Atizapan database.
4.7 Spectral features extracted from Dodgers database.
4.8 Comparison of AUC values in semi-supervised learning by using all bands and none of them from Dodgers databases.
4.9 Comparison of AUC values in unsupervised learning by using all bands and none of them from Dodgers databases.
4.10 Comparison of AUC values in semi-supervised learning by using all bands and none of them from Atizapan databases.
4.11 Comparison of AUC values in unsupervised learning by using all bands and none of them from Atizapan databases.
4.12 The plot shows the best AUC values obtained in semi-supervised learning with Dodgers databases.
4.13 The plot shows the best AUC values obtained in unsupervised learning with Dodgers databases.
4.14 The plot shows the best AUC values obtained in semi-supervised learning with Atizapan databases.
4.15 The plot shows the best AUC values obtained in unsupervised learning with Atizapan databases.
4.16 CD diagrams results by learning and databases.
4.17 CD diagrams with the statistical comparisons to know which algorithms are better.
4.18 Comparison between MMPP and iF by using Dodgers databases.
4.19 Comparison between MMPP and iF by using Atizapan databases.
A.1 A digital system.
A.2 LTI system with Z-transforms. The input X(z) and output Y(z).
A.3 Response and characteristics of the studied filters.
A.4 Bandpass filter.
C.1 Simulation process flow diagram.
C.2 OpenStreetMap and SUMO maps.
C.3 Urban traffic density.
C.4 Anomalous vehicular traffic simulation example.

List of Tables

2.1 Related works which use inductive loop sensors.
3.1 Parameters of the implemented digital filters.
3.2 Aggregation features used.
4.1 Dimensions of databases.
4.2 Time it took to execute feature extraction and feature selection processes.
4.3 The relevance of Generating and temporary features.
4.4 Most appeared features in databases.
4.5 Aggregation features generated by spectral features.
4.6 The ten most generated aggregation features.
4.7 Discarded features.
4.8 Generating and temporary features from Dodgers databases with which the algorithms obtained the best AUC values trained in semi-supervised learning.
4.9 Generating and temporary features from Dodgers databases with which the algorithms obtained the best AUC values trained in unsupervised learning.
4.10 Features that most appear at Dodgers databases with the best AUC values in semi-supervised learning.
4.11 Aggregation features that most appear at Dodgers databases with the best AUC values in unsupervised learning.
4.12 Generating and temporary features from Atizapan databases with which the algorithms obtained the best AUC values trained in semi-supervised learning.
4.13 Generating and temporary features from Atizapan databases with which the algorithms obtained the best AUC values trained in unsupervised learning.
4.14 Aggregation features that most appear at Atizapan databases with the best AUC values in semi-supervised learning.
4.15 Aggregation features that most appear at Atizapan databases with the best AUC values in unsupervised learning.
4.16 Results obtained by the Wilcoxon test for iFsemi algorithm.
4.17 Results obtained by the Wilcoxon test for ABODsemi algorithm.
4.18 Results obtained from comparison between iF and MMPP.
C.1 Population and vehicular flow parameters.
C.2 Population age parameters.
C.3 Working time parameters.

Abbreviations

AVT    Anomalous Vehicular Traffic
EUTS   Equispaced Univariate Time Series
AUC    Area Under the Curve
IoT    Internet of Things
ICT    Information and Communication Technologies
SC     Smart City
SM     Smart Mobility
RM     Ramp Metering
NOCB   Next Observation Carried Backward
LOCF   Last Observation Carried Forward
FFT    Fast Fourier Transform
OCSVM  One-Class Support Vector Machine
iF     Isolation Forest
LOF    Local Outlier Factor
ABOD   Angle-Based Outlier Detection
MMPP   Markov-modulated Poisson processes
PCA    Principal Component Analysis
WSN    Wireless Sensor Network
ITS    Intelligent Transportation System
GPS    Global Positioning System
CD     Critical Difference
LTI    Linear time-invariant
TF     Transfer function
FIR    Finite Impulse Response
IIR    Infinite Impulse Response

Chapter 1. Introduction

1.1. Motivation

The early development of the Internet of Things (IoT) has revolutionized how systems and businesses operate [11], as it allows a more integrated communication environment where data can be easily shared. It also eases the automation and control of different processes and services [22] and allows building collaborative systems capable of responding effectively to events captured from sensors and devices [3], applications [89], and digital platforms such as social networks [55]. Thanks to the IoT, initiatives such as Smart Cities (SC) have also become relevant [17, 67, 79, 80]. The IoT is therefore a valuable resource for meeting the needs of urban life.

The research community has discussed the definition of an SC, but it is not yet fully settled in the literature. One approach that attempts to encompass the main features of an SC is given by Moustaka [59], who defines it as an innovative city that, through the use of Information and Communication Technologies (ICT), seeks to improve both the quality of life of the people who inhabit it and the efficiency of its operations. It also offers new and better urban services to citizens and environmental protection. Several terms in the literature are related to the concept of SC, such as industry, education, participation, quality of life, natural resources preservation, technical infrastructure, and social and human capital [33]. These concepts involve the different fields that make up a city. Researchers have therefore segmented the SC into six dimensions in order to define indices that measure urban intelligence and to conduct studies in a distributed way. The dimensions are [59]: Smart Economy, Smart Mobility (SM), Smart Environment, Smart People, Smart Living, and Smart Governance.

Urban acceleration is the backbone of the problems faced by the dimensions of an SC. It is estimated that by 2050, 70% of the population will live in cities [17], which will affect citizens' quality of life and place new, intense pressures on city resources and infrastructure. Population growth intensifies existing problems in cities and introduces new ones. In mobility, for example, poorly managed transport has a negative impact on an urban area because it increases air and noise pollution, trip times for drivers, energy consumption, and vehicular congestion. In many cities, current mobility systems are already inadequate, and urbanization and a growing population will increase demand still further. All these problems unbalance the daily life of the city's inhabitants and its economic and environmental wellbeing. Mobility is therefore, and will remain, one of the toughest challenges faced by city governments around the globe [24].

Nowadays, mobility is involved in most of the activities that take place within a city, and it has been valuable for the development of society in all aspects since the beginning of urbanization. SM therefore gains relevance in overcoming mobility problems: it drives the development and adoption of sustainable strategies that respond to current and future challenges, and it seeks to meet urban requirements for safer, more efficient, and more sustainable mobility [59]. That is why SM is identified as one of the most promising dimensions of an SC [7], as it contributes to reducing environmental and social impacts and to greater efficiency in the transport system, benefiting travelers, transport operators, urban planners, and city governments.
One of SM's objectives is traffic management [54], which guides and controls both stationary and moving traffic. This objective is key nowadays because the number of cars has increased in cities such as Moscow, Istanbul, Bogota, Mexico City, and Sao Paulo [39]. Governments, companies, and researchers have proposed many solutions to the problems caused by an excess of vehicles [57]. They range from policies that restrict traffic in certain zones during certain periods of the day [20, 84], through the expansion and construction of road infrastructure, to algorithms that improve circulation in an urban area, supported by devices such as inductive loop sensors, video surveillance, and crowdsourcing systems.

Among the problems related to the excess of vehicles, a particular one is anomalous vehicular traffic (AVT), an unexpected change in the day-to-day vehicular traffic, which SM has to solve. To provide a solution for detecting AVT, we must bear in mind that the distribution of traffic differs across the roadways of a city and that previous research focuses its solutions on a specific dataset. This dissertation therefore proposes a solution for the detection of AVT at freeway on-ramps and thus contributes to SM solutions. The next section details the problem studied in this work.

1.2. Problem Statement

AVT is vehicular traffic that deviates significantly from the flow established as normal. Figure 1.1 shows an example of this concept. An accident, a celebration, an event, road works, or a natural disaster can cause AVT [92]. Its detection supports both short-term and long-term decisions. In the short term, alerting drivers in advance about the road status helps to minimize the chance of traffic congestion and allows road users to make better decisions during their journey, thus decreasing travel time. In the long term, as Zhou, Meerkamp, and Volinsky [98] mention, quantifications of anomalous traffic can help urban planners to monitor, analyze, and modify traffic systems, for example to identify high-impact infrastructure projects by removing bottlenecks in the road system, evaluating public transportation capacity and routes, minimizing the impact of road work, and optimizing traffic light and Ramp Metering scheduling.

[Figure 1.1: plot of cars count against time (hours).]
Figure 1.1: AVT example pointed out in red between 21:00 and 23:00 hrs. The vehicular traffic pattern deviates significantly from the gray one, considered as normal vehicular traffic. The challenge is to detect when AVT starts as soon as possible.

It is necessary to know the characteristics of the different roadways, the areas where they are located, and the type of transport that circulates through them in order to understand the vehicular traffic phenomenon.
For example, the distribution of vehicular traffic throughout the day in a residential area will not be the same as in an industrial or a commercial area. The type of activities carried out in the different zones also influences vehicular traffic behavior. The Federal Highway Administration (FHWA) classifies roadways by their degree of mobility and access as follows [63]:

- Arterials: high-capacity urban roads that provide a high degree of mobility for the longest uninterrupted distance, with some degree of access control.
- Collectors: roads that provide a less highly developed level of service at a lower speed for shorter distances, collecting traffic from local roadways and distributing it to arterials.
- Locals: roads that should be accessible for public use throughout the year and that carry low volumes of traffic at a lower speed for shorter distances.

Previous research performs the detection of AVT on any of the roads mentioned. Different infrastructure technologies are used to collect vehicular traffic data, such as inductive loop sensors, surveillance cameras, Global Positioning System (GPS), and social networks. However, these solutions are limited in the sense that they focus mainly on creating new algorithms. They also do not pay attention to the underlying information that can be extracted from vehicular traffic, especially when the devices used cannot collect features such as the number of lanes, the access points in the roadways, the type of transport that circulates (bicycles, cars, buses, trucks, even pedestrians), and the current weather in the zone [86]. All of these affect vehicular traffic and are useful for detecting AVT. In this work, vehicular traffic is treated as a univariate time series whose only feature is the number of cars counted, together with, implicitly, the time at which each observation is recorded. Because vehicular traffic is a discrete signal, spectral techniques such as digital filters can be applied to extract underlying information from this univariate time series.
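To make the idea of extracting spectral information from a car-count series concrete, the following is a minimal sketch only. It does not reproduce the digital filters used in this work; it swaps in a plain discrete Fourier transform and measures the energy in a chosen frequency band. The function names `dft` and `band_energy`, the hourly sampling, and the synthetic counts are all assumptions made for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def band_energy(signal, low_bin, high_bin):
    """Energy of the DFT coefficients whose index lies in [low_bin, high_bin]."""
    spectrum = dft(signal)
    return sum(abs(spectrum[k]) ** 2 for k in range(low_bin, high_bin + 1))


# Hypothetical car counts: one sample per hour over two days (48 samples),
# built from a 24-hour cycle plus a weaker 12-hour cycle.
counts = [
    300 + 200 * math.sin(2 * math.pi * t / 24) + 50 * math.sin(2 * math.pi * t / 12)
    for t in range(48)
]

daily = band_energy(counts, 2, 2)       # bin 2 of 48 hourly samples: 24-hour period
half_daily = band_energy(counts, 4, 4)  # bin 4 of 48 hourly samples: 12-hour period
```

In this synthetic series the daily band carries far more energy than the half-daily band, and such per-band values could be attached to each observation window as additional features next to the raw count.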
Feature extraction is therefore relevant when information is scarce: new features contribute information to anomaly detection models, facilitating their learning process and improving their performance.

When vehicular traffic is analyzed, relevant information comes not only from arterials and collectors but also from interchanges, which are grade-separated intersections of roads that use structures to separate conflicting streams of traffic. The main problem at interchanges arises when a car enters the mainline immediately before another takes an exit, creating a conflict known as weaving [61] that affects vehicular traffic. One solution is Ramp Metering (RM) [77], a ramp management strategy that controls the number of vehicles entering a freeway by using a traffic signal. This solution has shown good results, reducing vehicular congestion and the number of accidents at these interchanges [48].

A problem found at California freeway on-ramps concerns RM operation, which is typically activated on a time-of-day basis, for example during peak traffic periods, regardless of traffic conditions. As Lu et al. [51] mention, some Caltrans (https://dot.ca.gov) districts operate RM for extended hours beyond the peak periods, but there is no guideline for RM activation based on freeway conditions. They also emphasize the need to systematically evaluate the potential benefits of extending the current peak-period RM operation. To address this problem, they propose CRRM, a series of recommendations to extend the use of RM beyond peak hours. It therefore becomes relevant to automate the operation of RM by detecting anomalous traffic at freeway on-ramps outside its operating cycle, and thus to enhance the CRRM.

1.3. Hypothesis

Given the lack of data preprocessing when detecting AVT, coupled with the importance of having RM active at a freeway on-ramp, the following hypothesis is considered: a methodology that extracts spectral, temporary, and aggregation features of the vehicular traffic at a freeway on-ramp, combined with unsupervised learning algorithms, will detect anomalous vehicular traffic with a statistically significant difference in both semi-supervised and unsupervised learning.

1.4. Research Questions

The research questions that arise from our hypothesis are:

- What features provide the most information for the detection of anomalous vehicular traffic?
- What is the time aggregation interval at which the detection of anomalous vehicular traffic is substantial and gives acceptable results?

- Which type of learning is better for the proposed unsupervised algorithms to enhance the detection of anomalous vehicular traffic?

1.5. Objectives

The general objective of this dissertation is to develop a methodology that extracts spectral, temporal, and aggregation features from the vehicular traffic at a freeway on-ramp and uses unsupervised outlier algorithms to detect anomalous vehicular traffic with a statistically significant difference in both semi-supervised and unsupervised learning. From this general objective, the following particular objectives are derived:

- Propose an algorithm for preprocessing the vehicular traffic dataset when there are missing values, to boost the anomaly detection algorithms.
- Determine the features that best describe the datasets used and that improve anomalous vehicular traffic detection.
- Evaluate the results of the unsupervised algorithms and select those with the best results according to a statistical test.

1.6. Contributions

This work is distinguished by the proposal of a methodology that includes:

- Extraction of temporal, spectral, and aggregation features from vehicular traffic to enrich the univariate time series dataset at different time aggregations. To the best of our knowledge, this is the first time that spectral techniques are used to detect AVT.
- An imputation algorithm based on the history of vehicular traffic values that match the day and time at which the missing values were recorded. It was developed to avoid losing information, to preserve the continuity of the signal when calculating the aggregation features at different time aggregations, and to enhance the anomaly detection models used.

- A study of well-known anomaly detection models applied to the detection of AVT using two different types of learning: semi-supervised and unsupervised.

1.7. Thesis Structure

The rest of this document is organized as follows. Chapter 2 explains the related work. Chapter 3 describes the proposed methodology, and Chapter 4 presents the experiments carried out and their results. Finally, Chapter 5 presents the conclusions and future work.

Chapter 2. Related Work

To detect AVT, an Intelligent Transportation System (ITS) typically deploys various types of infrastructure-based technologies in vehicles, people, and roadways to monitor vehicular traffic [85]. The technologies used to acquire traffic information are inductive loop sensors, surveillance cameras, social media, and devices such as mobile phones, which carry multiple sensors and deliver real-time information on position, speed, and audio. Researchers have proposed several solutions to detect AVT. The following sections summarize the main solutions proposed in the literature, organized by the main types of infrastructure-based technologies.

2.1. Infrastructure based on fixed sensors

Sensors are a key element for acquiring environmental information, e.g., vehicular traffic. This section focuses on works whose proposals are based on sensors placed at a fixed location, such as inductive loop sensors [60]. The information collected by each sensor can be analyzed individually or jointly to understand the phenomenon under study more deeply, which in this work is the detection of AVT. An inductive loop sensor counts cars, and analyzing this information allows the detection of AVT. One example is the work of Ihler, Hutchins, and Smyth [37], who perform anomaly detection using Markov-modulated Poisson processes (MMPP), while Jawad, Kersting, and Andrienko [40] perform anomaly detection using biological sequence and profile hidden Markov models. Another related work is that of Saarinen et al. [73], who study LODA, an ensemble for anomaly detection; they use several temporal databases, one of them related to the detection of AVT. Guo, Wei, and Billy [35] detected anomalies using short-term traffic flow forecasting based on time-varying conditional variance modeling of the traffic flow series, namely seasonal ARIMA combined with GARCH. On the other hand, Thomas and Van Berkum [83] predict recurrent events and detect incidents based on link flow data collected at urban intersections in the Dutch city of Almelo.

2.2. Infrastructure based on surveillance cameras

Nowadays, computer vision has gained popularity in the research community [45]. Visual data contains rich information and can play a vital role in detecting AVT. For example, Babaei [5] proposed a method that uses a video surveillance system at an intersection and analyzes vehicle trajectories with a Support Vector Machine (SVM), while Singh and Krishna [78] use deep autoencoders. Works that detect anomalous vehicular traffic by applying outlier algorithms to a video dataset include k-Nearest Neighbors by Dang, Ngan, and Liu [19], LOF by Ma, Ngan, and Liu [52], and Bayes and GMM methods by Lam et al. [46]. All these works convert the video into spatial-temporal traffic signals, transform these signals into a two-dimensional (2D) space using Principal Component Analysis (PCA), and then apply one of the aforementioned outlier detection algorithms to detect AVT. The work proposed by Farooq, Khan, and Ali [27] explores multidimensional road traffic data to detect anomalies.
In that work, features such as velocity, lane changes, and vehicle trajectories are used to track vehicles efficiently with a single camera node, and anomalies are detected in the resulting patterns using DBSCAN clustering. Another work that uses features such as relative speed, inter-vehicle time gap, and lane changing is proposed by Barria and Thajchayapong [6], whose algorithm is based on spatio-temporal changes in the variability of these features. Roy and Bilodeau [72] proposed a work based on deep learning: a deep autoencoder model coupled with a data augmentation method, which encodes information about normal trajectories while removing irrelevant information.

2.3. Infrastructure based on Crowdsourcing

With the advent of GPS and the extensive use of smartphones, vehicle trajectories can be shared. People also tend to post messages about vehicular traffic problems on social media platforms. By monitoring and analyzing this rich and continuous content from vehicles and people, AVT can be detected. For instance, Liu et al. [50] analyzed GPS trajectories generated by taxis in Beijing to detect anomalies in traffic flow, comparing the current traffic flow of a road not only with its historical data but also with that of neighboring roads that are close in geographic distance and in traffic patterns. The following works also base their solutions on taxi GPS trajectories. Kuang, An, and Jiang [44] proposed a solution to detect traffic anomalies based on PCA combined with the wavelet transform. Similarly, Li et al. [47] seek to detect outlier behavior in sets of road segments rather than in individual moving objects; they use agglomerated temporal information, and at each time step historical similarity values are updated using a reward rule. Zhang et al. [94] detect social events: they extract human flow patterns in urban areas from the number of taxi pick-ups and drop-offs and propose a method to evaluate the social activeness of city-scale regions. Xing et al. [90] proposed a model that constructs an anomalous directed acyclic graph based on spatial-temporal density to detect outliers. Pan, Zheng, Wilkie, and Shahabi [64] first detect the spatio-temporal scope of a traffic anomaly based on taxi GPS trajectories and then describe the anomaly using social media content generated within that spatial and temporal scope. Using social media such as Twitter, D'Andrea, Ducange, Lazzerini, and Marcelloni [18] present a real-time monitoring system for traffic event detection.
The system fetches tweets, processes them by applying text mining techniques, and finally assigns the appropriate class label to each tweet, indicating whether or not it is related to AVT. Vij and Aggarwal [85] based their solution on crowdsourced acoustic data captured from multiple smartphones. Zhou, Meerkamp, and Volinsky [98] focus their solution on anonymized cellular data to infer traffic flow and detect anomalies.

2.4. Limitations of the reviewed proposals

Solutions based on mobile devices and GPS involve energy problems because they depend on battery life. Privacy is also an issue: without the correct licenses, it is difficult to access the information. Another problem with these devices is that, unless information is obtained from all the actors in vehicular traffic, we cannot be sure that the data is representative and describes all the possible trajectories. The same problem is present in social media: if no users publish anomalous events, it is not possible to detect AVT.

Surveillance cameras, although they can extract features from the environment in their field of view, consume huge computing resources. Their infrastructure can only monitor a limited area, and their installation and maintenance require lane closure. The performance of these devices is sensitive to bad weather, vehicle shadows, and dust on the camera lens. They also require a specific camera mounting height for the best vehicle presence detection and speed measurement [60].

Inductive loop sensors are the most commonly used sensors in traffic applications. However, their installation and maintenance require pavement cuts and lane closure. Many loop sensors are required to cover a location, and the detection accuracy drops with vehicle classes. Also, this type of sensor cannot detect other characteristics of the environment, only the count of cars. An advantage of inductive loop sensors over GPS is that they are reliable data sources that capture all passing vehicles [70]. They are also insensitive to bad weather [60]. Besides, the infrastructure of these sensors is already deployed at various points of the roadway system, such as the California freeways, where the sensor network has 25,000 inductive loop sensors.
The sensor network is virtually the only source of data for traffic operations, performance measurement, planning, and traveler information used to make freeway operational decisions [70], which makes it a robust and information-rich system. In this dissertation, we focus on detecting AVT through the information acquired by inductive loop sensors. Table 2.1 describes the works based on this type of sensor. The reviewed works focus more on the development of an algorithm that detects AVT than on the preprocessing of the database used and the underlying information that can be extracted from the vehicular traffic signal.

Table 2.1: Related works that use inductive loop sensors to detect anomalous vehicular traffic. The type of technique used and the type of data preprocessing performed are detailed.

Reference            Technique       Roadway        Feature extraction  Imputation method
Ihler et al. [37]    Probabilistic   on-ramp        No                  Yes
Jawad et al. [40]    Statistical     on-ramp        No                  No
Guo et al. [35]      Autoregressive  highway        No                  Yes
Thomas et al. [83]   Statistical     intersections  No                  No

None of the works in Table 2.1 has extracted features from vehicular traffic. Regarding imputation methods, Ihler et al. [37] proposed one based on the Poisson distribution; however, the method imputes days in which most of the observations are missing values. Meanwhile, Guo et al. [35] based their solution on the SARIMA model, whose main drawback is that it cannot produce any output when there is no previous information. Jawad et al. [40] based their solution on learning a profile HMM from the event data (AVT), which follows neither semi-supervised nor unsupervised learning; these concepts are detailed in the next chapter. Thomas et al. [83] base their work on spatio-temporal vehicular traffic to detect anomalies. The main disadvantage of this work is its strong dependence on other vehicular traffic flows.

In general, our proposal is based on the imputation of the missing values present in the databases under study, which are generated by an inductive loop sensor. We impute the missing values if they satisfy a criterion explained in the next chapter, so as not to lose information when making different time aggregations. Subsequently, the temporal, spectral, and aggregation features of vehicular traffic are extracted to enrich the information. Finally, the detection of AVT is carried out, and the best results are compared with those obtained by the MMPP algorithm.
This proposal can be seen as a general solution for any type of univariate time series with anomalies. However, the objective of this work focuses solely on the detection of AVT.

Chapter 3. Methodology

3.1. Overview of the methodology

An equispaced univariate time series (EUTS) consists of simple observations $o_1, o_2, o_3, \dots, o_n$ that are sequentially recorded at equal time increments $t_1, t_2, t_3, \dots, t_n$, so that the following condition holds [58]:

$$|t_1 - t_2| = |t_2 - t_3| = \dots = |t_{n-1} - t_n|$$

Although the observations of a univariate time series are generally given as a single column of numbers, time is an implicit variable in the time series [16]. Some phenomena that belong to this type of time series are macroeconomic series, the monthly concentration of CO2, the monthly number of passengers traveling on an airline, and the annual sales of a product. Because these phenomena are equispaced univariate time series, spectral techniques can be applied to them. Among these spectral techniques is the application of digital filters. Another phenomenon that can be added to the list is vehicular traffic. Thus, filtering techniques can be applied to it to obtain new signals, which serve as features to improve the performance of anomaly detection algorithms, although an improvement in prediction is not guaranteed [97].

We propose the methodology for feature extraction to detect AVT depicted in Figure 3.1. The following sections detail the missing values imputation algorithm, followed by the extraction of temporal, spectral, and aggregation features. Then, the feature selection process is explained, and finally, the detection of AVT is carried out using algorithms trained in semi-supervised and unsupervised learning.

Figure 3.1: Flowchart of the methodology proposed for the feature extraction to detect AVT.

3.2. Imputation process

3.2.1. Missing data

Missing values are relatively common in almost all research and can have a significant effect on the conclusions drawn from the data [41]. They can also drastically affect the quality of the machine learning models used. One alternative is to omit the cases with missing values and analyze the remaining data; however, this risks losing valuable information. Therefore, missing values must be replaced with reasonable values. In statistics, this process is called imputation [58].

3.2.2. Missing values mechanisms

Missing values occur for reasons beyond our control. When information is collected, data analysts are sometimes unaware of the reasons why data may have been lost. However, for purposes of analysis, assumptions are made about why data is missing [38]. Depending on what causes the missing data, the gaps will have a specific distribution. Understanding this distribution is useful because it serves as background knowledge for selecting an appropriate imputation algorithm. Three missing data mechanisms are distinguished: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). Diagnosing MAR and MNAR requires manual analysis of the patterns in the data and knowledge of the domain. The three mechanisms are explained as follows [34]:

- MCAR: The absence of the information is not caused by any variable present. This means that the information about the whole database can be estimated from any of the missing values patterns.
- MAR: The presence of missing values is independent of the values of the same variable but depends on the values of other variables in the database.
- MNAR: The presence of missing values depends on the values of the variable itself.

For the EUTS, the missing data mechanisms look slightly different. At first glance, there is only one variable in the data. However, as mentioned above, time has to be treated as a variable when determining the missing data mechanism of a database. For the MCAR mechanism, there is no dependence between the time and a missing observation; for instance, a sensor sends data to a system and, for some unknown and random reason, the transmission fails at times. For the MAR mechanism, the probability of missing an observation depends on the time at which the observation was recorded; this can be seen when a sensor presents missing values at a particular hour of the day. In MNAR, the probability can (but does not necessarily) depend on other variables [58].

3.2.3. Univariate time series imputation methods

The vehicular traffic data studied in this work is acquired from an inductive loop sensor and presents faults at random. There is no dependence on time or on the vehicular traffic itself, so the missing values can be classified under the MCAR mechanism. Some EUTS imputation methods for this mechanism are explained below [58]:

1. Mean, mode, and median: Replaces the missing values with the mean, the mode, or the median; however, any of these three imputation methods reduces the variance.
2. Next Observation Carried Backward (NOCB): Replaces each missing value with the nearest subsequent value that is not missing. Its main disadvantage arises when there are large differences between the observation at time $t_n$ and its successor at $t_{n+i}$.
3. Last Observation Carried Forward (LOCF): Replaces each missing value with the most recent previous value that is not missing. Its main disadvantage is the same as that of NOCB, here with respect to the predecessor at $t_{n-i}$.
4. Kalman filter: Computes the imputation using the Kalman filter. Its disadvantage is that it does not produce a reasonable imputation when there are many contiguous missing values.
5. Interpolation: Computes the missing value from two points, the nearest previous and the nearest subsequent values that are not missing. However, it does not give acceptable results when there are contiguous missing values.

Figure 3.2 shows the application of the imputation methods to one day of vehicular traffic (September 14, 2005), depicted in Figure 3.2 (a). This example is extracted from the Dodgers database used by Hutchins et al. [9] and contains contiguous missing values colored in blue. Figure 3.2 (b) depicts the pattern of normal data from the Dodgers database. The EUTS methods explained above are not useful for this configuration of missing values because they do not follow the pattern of normal vehicular traffic.

Figure 3.2: Application example of the imputation methods from the literature; every panel plots the car count against time. (a) Original signal with its missing values colored in blue. (b) The normal data pattern from the Dodgers database. The remaining panels (c-f) show the EUTS methods with the imputed values colored in red: (c) Kalman, (d) LOCF, (e) mean, and (f) interpolation. The values imputed by the EUTS methods do not follow the normal vehicular traffic pattern.
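To make the behavior of the simpler methods concrete, the following is a minimal Python sketch of LOCF, NOCB, and linear interpolation on a list in which `None` marks a missing value. The function names and the toy series are illustrative assumptions, not code from this work. On a contiguous gap, the first two produce a flat segment and the third a straight line, which is why none of them can follow the daily traffic pattern.

```python
from typing import List, Optional

Series = List[Optional[float]]

def locf(values: Series) -> Series:
    """Last Observation Carried Forward: fill with the most recent previous value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None while no earlier value exists
    return filled

def nocb(values: Series) -> Series:
    """Next Observation Carried Backward: fill with the nearest subsequent value."""
    return locf(values[::-1])[::-1]

def linear_interpolation(values: Series) -> Series:
    """Fill each gap on the straight line between its two known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # leading and trailing gaps are left as None

# Hypothetical car counts with a contiguous gap of three observations.
cars = [10.0, 12.0, None, None, None, 20.0, 18.0]
print(locf(cars))                  # gap filled with 12.0
print(nocb(cars))                  # gap filled with 20.0
print(linear_interpolation(cars))  # gap filled with 14.0, 16.0, 18.0
```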

To solve the problem presented by the previous techniques, we propose Algorithm 1. This algorithm imputes the days of the database under study that have missing values, but only those days whose number of missing values is less than or equal to a threshold. If the number of missing values of a day is greater than the threshold, the observations of that day are removed from the database. The proposed threshold is defined as follows:

$$thr = n\,(1 - Pr_{\sigma}) \qquad (3.1)$$

where $n$ is the number of observations in a day and $Pr_{\sigma} = 0.6827$ is the fraction of the data that lies within one standard deviation of the mean in a normal distribution. This value tolerates approximately 32% of missing values per day; for a signal sampled every 5 minutes ($n = 288$), this means fewer than 92 missing values. If a given day satisfies the threshold, each of its missing values is imputed with the average of the non-missing car counts from other weeks that coincide with the day of the week and the time at which the missing value was registered. The algorithm fulfills its function because vehicular traffic has a strong seasonal pattern, i.e., its behavior is very similar from week to week, and likewise from weekend to weekend, so the imputation method is reliable.

3.3. Feature engineering

The only feature available in the vehicular traffic database is the vehicle count, which from now on we refer to as cars. Thus, to detect AVT, we need to create more features that help improve the anomaly detection algorithms, so the vehicular traffic feature extraction is performed in three phases. The first phase extracts the temporal features, which are related to the hour, day, and month in which the observations were registered. The second phase computes the spectral features by digital filtering. In the last phase, the aggregation features are extracted. These features aggregate values over a time window and apply a function to compute a representative value.
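Returning to the imputation step of Section 3.2, the threshold of Equation 3.1 and the averaging over the same weekday and time of day can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the data layout (a dictionary that maps each date to a list of counts of equal length, with `None` for a missing value) and the names `threshold` and `impute_days` are hypothetical, and the history is taken from the original values of the days that pass the threshold.

```python
import datetime as dt
from statistics import mean
from typing import Dict, List, Optional

Day = List[Optional[float]]

PR_SIGMA = 0.6827  # fraction of a normal distribution within one standard deviation

def threshold(n: int) -> float:
    """Equation 3.1: maximum number of missing values tolerated in one day."""
    return n * (1 - PR_SIGMA)

def impute_days(db: Dict[dt.date, Day]) -> Dict[dt.date, Day]:
    """Drop the days with too many gaps and impute the gaps of the remaining days."""
    # Days whose number of missing values exceeds the threshold are removed.
    kept = {day: counts for day, counts in db.items()
            if sum(v is None for v in counts) <= threshold(len(counts))}
    result = {}
    for day, counts in kept.items():
        filled = list(counts)
        for slot, value in enumerate(counts):
            if value is None:
                # Non-missing counts at the same time slot on the same weekday.
                history = [other[slot] for d, other in kept.items()
                           if d.weekday() == day.weekday()
                           and other[slot] is not None]
                if history:  # assumption: the gap is kept when no history exists
                    filled[slot] = mean(history)
        result[day] = filled
    return result

# A 5-minute sampling time gives 288 observations per day.
print(round(threshold(288), 2))  # 91.38
```

With 288 observations per day, the threshold evaluates to about 91.38, which matches the limit of fewer than 92 missing values stated in the text.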

Algorithm 1: Imputation algorithm

Input: db: database grouped by days; thr: maximum number of missing values allowed.
Output: db: database with imputed values.

foreach day ∈ db do
    qty_mv ← CountMissingValues(day);
    if qty_mv = 0 then NextDay;
    if qty_mv > thr then
        DeleteDayDatabase(day);
    else
        foreach hour ∈ day do
            if cars count at hour is a missing value then
                days ← SelectDays from db ⇐⇒ days = day;
                cars_count ← SelectCarsCount from days at hour ⇐⇒ cars count are not missing values at hour;
                imp_val ← Average(cars_count);
                day at hour ← ImputateValue(imp_val);
            end
        end
        db ← UpdateValues(day);
    end
end

3.3.1. Temporal features

These features are based on the date and time recorded for each observation and are implicit in the database. The temporal features are listed below:

- Hour of the day (H): the time (hh:mm) at which the observation was detected, H ∈ [0, 24).
- Month number ∈ [1, 12].
- Day number of the year ∈ [1, 365].
- Day number of the week ∈ [1, 7].

3.3.2. Spectral features

As is well known, the purpose of filters is to separate information from interference, noise, and unwanted distortion [15]. In this dissertation, however, digital filters are used to extract signals from vehicular traffic that serve as features.
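As an illustration of the temporal features of Section 3.3.1, the sketch below derives them from a timestamp with Python's standard `datetime` module. Expressing the hour of the day as a decimal number in [0, 24) and numbering the weekdays from Monday = 1 are assumptions made here; the text does not fix either convention, and the function name is hypothetical.

```python
import datetime as dt
from typing import Dict

def temporal_features(timestamp: dt.datetime) -> Dict[str, float]:
    """Derive the four temporal features from the timestamp of one observation."""
    return {
        # Hour of the day as a decimal value in [0, 24), e.g. 16:30 -> 16.5.
        "hour": timestamp.hour + timestamp.minute / 60,
        "month": timestamp.month,                      # 1..12
        "day_of_year": timestamp.timetuple().tm_yday,  # 1..365 (366 in leap years)
        "day_of_week": timestamp.isoweekday(),         # 1 (Monday) .. 7 (Sunday)
    }

# One observation from the day shown in Figure 3.2.
features = temporal_features(dt.datetime(2005, 9, 14, 16, 30))
print(features)
```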

Figure 3.3: Vehicular traffic scalogram with a sampling time of 5 minutes. The frequency bands, bounded by dotted lines, are shown. In total, there are five bands: Band 1 corresponds to the lowest band, and so on up to Band 5, which corresponds to the highest band.

For the feature extraction, three types of filters are used: lowpass, bandpass, and highpass, all of them Chebyshev Type I. To better understand the signal and thus determine the cutoff frequencies of the different filters implemented, a scalogram with the Morlet wavelet was used [21]. Figure 3.3 shows a scalogram of the vehicular traffic signal sampled every 5 minutes; the purpose of plotting a scalogram is to separate the signals in which the vehicular traffic has higher energy. The signals are bounded by dotted lines, which represent the cutoff frequencies. From now on, the filtered signals are called bands. The following filters are used to extract the bands: one lowpass, one highpass, and three bandpass filters. The five proposed bands were selected so that they cover the whole frequency range of the signal, $f \in (0, f_N]$, where $f_N$ is the Nyquist frequency.

Digital filters are commonly designed with analog filter methodologies, as was done in this dissertation. The bilinear transformation was used to convert an analog prototype filter into a digital filter. The parameters used in the design of the five filters are the passband ripple ($R_p$), the stopband attenuation ($R_s$), the passband corner frequencies ($f_p$), and the stopband corner frequencies ($f_s$). Their values are summarized in Table 3.1, together with the minimum order calculated for each filter. We follow the suggestions for digital filter design proposed in [87]. All the proposed parameters satisfy the design process described in Appendix A.
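As a worked check of the minimum order, the sketch below applies the standard order formula for a Chebyshev Type I lowpass filter after pre-warping the corner frequencies for the bilinear transformation. It assumes that the lowpass filter of Band 1 uses a passband corner of 2^-7 mHz and a stopband corner of 2^-6 mHz (our reading of Table 3.1), with the 5-minute sampling time mentioned above; the function name is illustrative, and the design procedure actually followed is the one in Appendix A.

```python
import math

def cheby1_lowpass_order(fp: float, fs: float, rp_db: float, rs_db: float,
                         f_sampling: float) -> int:
    """Minimum order of a digital Chebyshev Type I lowpass filter.

    fp, fs: passband and stopband corner frequencies (same unit as f_sampling).
    rp_db, rs_db: passband ripple and stopband attenuation in dB.
    """
    # Pre-warp the digital frequencies to the analog prototype (bilinear transform).
    omega_p = math.tan(math.pi * fp / f_sampling)
    omega_s = math.tan(math.pi * fs / f_sampling)
    # Standard Chebyshev Type I order formula.
    ratio = math.sqrt((10 ** (rs_db / 10) - 1) / (10 ** (rp_db / 10) - 1))
    return math.ceil(math.acosh(ratio) / math.acosh(omega_s / omega_p))

F_SAMPLING_MHZ = 1000 / 300  # one sample every 300 s, expressed in mHz
order = cheby1_lowpass_order(fp=2 ** -7, fs=2 ** -6, rp_db=2, rs_db=30,
                             f_sampling=F_SAMPLING_MHZ)
print(order)  # 4
```

Under these assumptions the formula gives an order of 4, which agrees with the order reported for the lowpass filter.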

Table 3.1: Parameters of the implemented digital filters.

Band  Filter type  R_p (dB)  R_s (dB)  Order
1     lowpass      2         30        4
2     bandpass     2         30        3
3     bandpass     2         30        3
4     bandpass     2         30        3
5     highpass     2         30        3

Corner frequencies f_p and f_s (mHz), in the order and grouping in which they appear in the source table: 2^-7; 2^-6; 2^-6; 2^-5; 2^-4, 2^-3, 2^-2, 2^-1, 2^-1; 2^-7; 2^-4; 2^-3, 2^-2, 2^-1, 2^0, 2^-0.5.

The phase shift between the original signal and a filtered signal is a problem. One solution is to use zero-phase filters, so that the phase difference is zero [4]. This type of filter is a special case of a linear-phase filter in which the phase difference is zero. The filter first processes the signal forward; the output is then reversed and filtered again, which is equivalent to filtering backward in time.

3.3.3. Aggregation features

In this stage, 43 features were extracted from both cars and bands, which from now on are called generating features because they are the basis for generating new features. In total, 258 aggregation features were generated (43 features for each of the six generating features: cars and the five bands). These features were extracted at different time aggregations: 15, 20, 30, and 40 minutes, and 1, 1.5, 2, 2.6, 3, 6, 8, 12, and 24 hours. Table 3.2 summarizes the 43 aggregation features, whose formulation is given in more detail in Appendix B. Likewise, to obtain a representative value of the generating features at each time aggregation, the mean was used as the aggregation function, and the minimum value function was used for the temporal features. For the 5-minute sampling time, three consecutive observations were used to calculate this type of feature.

3.3.4. Feature selection

The feature selection process is carried out in two phases. In the first phase, features whose variance is zero are discarded. These features typically have binary results, such as the feature that indicates whether a value is duplicated within a subset of data, whose possible values are 1 or 0. Features with undetermined values are also discarded; their computation involves the variance, and some data has no variability, as is the case for the autocorrelation and partial autocorrelation features.
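To illustrate Sections 3.3.3 and 3.3.4, the following sketch computes a handful of the aggregation features over consecutive, non-overlapping windows and then applies the first selection phase, discarding features with zero variance. The definitions follow the usual meaning of each feature name (the exact formulations are those of Appendix B), and the function names, the window handling, and the toy signal are assumptions.

```python
from statistics import mean, pvariance
from typing import Dict, List

def aggregation_features(window: List[float]) -> Dict[str, float]:
    """A few aggregation features computed on one time window."""
    m = mean(window)
    return {
        "mean": m,
        "absolute_energy": sum(v * v for v in window),
        "absolute_sum_of_changes": sum(abs(b - a) for a, b in zip(window, window[1:])),
        "count_above_mean": float(sum(v > m for v in window)),
        "has_duplicate_max": float(window.count(max(window)) > 1),
    }

def aggregate(signal: List[float], size: int) -> List[Dict[str, float]]:
    """Split the signal into consecutive windows of `size` samples and featurize each."""
    return [aggregation_features(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

def drop_zero_variance(rows: List[Dict[str, float]]) -> List[str]:
    """First selection phase: keep only the features that vary across windows."""
    return [name for name in rows[0]
            if pvariance([row[name] for row in rows]) > 0]

# Hypothetical 5-minute car counts; three samples form one 15-minute window.
cars = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0, 2.0, 9.0, 1.0]
rows = aggregate(cars, size=3)
print(drop_zero_variance(rows))
```

In this toy signal, the count above the mean and the duplicate-maximum flag take the same value in every window, so both are discarded, while the other three features are kept.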

Table 3.2: Aggregation features used.

Feature                                      Feature
Absolute energy                              Absolute sum of changes
Aggregation function                         Approximate entropy
Autocorrelation                              Binned entropy
Non-linearity measure (c3)                   Complexity-invariant distance (cid)
Count above mean                             Count below mean
Energy ratio by chunks                       FFT aggregated by mean
FFT aggregated by variance                   FFT aggregated by skew
FFT aggregated by kurtosis                   FFT first coefficient
First location of the maximum value          First location of the minimum value
Has duplicate max value                      Has duplicate min value
Kurtosis                                     Large standard deviation
Last location of the maximum                 Last location of the minimum
Linear trend                                 Longest strike above mean
Longest strike below mean                    Maximum value
Mean value                                   Mean abs change
Mean change                                  Mean second derivative central
Median value                                 Minimum value
Number of peaks                              Partial autocorrelation
Quantile                                     Skewness
Standard deviation                           Sum values
Time-reversal asymmetry statistic            Variance
Variance larger than the standard deviation

For the second phase, the Mann-Whitney U and Fisher's exact statistical tests are used; the first is applied to features with real values and the second to features with binary values. Both tests quantify the significance of the association between a feature and the target class as a p-value. The null hypothesis H0 states that the selected feature has no influence on the target class. The alternative hypothesis Ha states that there is a dependency between the selected feature and the target class. If H0 is rejected, the feature is kept; otherwise, it is discarded. After calculating the p-values for all features, the Benjamini-Hochberg procedure [8] is used, which decides which features to keep and which to discard based only on the p-values.
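The Benjamini-Hochberg step can be sketched in a few lines of plain Python. The p-values would come from the Mann-Whitney U or Fisher's exact test of each feature; the values, the feature names, and the false discovery rate of 0.05 used below are illustrative assumptions, since the text does not state the level that was used.

```python
from typing import Dict, List

def benjamini_hochberg(p_values: Dict[str, float], fdr: float = 0.05) -> List[str]:
    """Return the features whose null hypothesis is rejected at the given FDR level."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    # Largest rank k (1-based) whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= k / m * fdr:
            cutoff = k
    # Every feature ranked at or below the cutoff is kept.
    return [name for name, _ in ranked[:cutoff]]

# Hypothetical p-values for five features.
p_values = {"mean": 0.001, "kurtosis": 0.025, "skewness": 0.028,
            "quantile": 0.041, "variance": 0.600}
selected = benjamini_hochberg(p_values)
print(selected)  # ['mean', 'kurtosis', 'skewness']
```

In this example, kurtosis is kept even though its own p-value (0.025) exceeds the threshold of its rank (0.02), because a feature ranked after it satisfies the condition; this is the step-up behavior of the procedure.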

3.4. Anomaly detection

3.4.1. Common anomaly detection techniques

Anomaly detection algorithms are based on previous data. It is assumed that the underlying processes that generated the data have not undergone, and will not undergo, significant changes because they adhere to certain general principles [56]. Hence, the statistics that characterize a system continue to characterize it in the future. Different algorithms have been proposed in the literature for the detection of anomalies. They are grouped according to the way they perform the detection. Among these groupings, the one proposed by Pimentel et al. [68] can be highlighted:

1. Probabilistic-based: They assume that the data follows some probability distribution and fit its parameters using training data, for example, Gaussian Mixture Models (GMM).
2. Distance-based: They include clustering or nearest-neighbor methods. These methods rely on well-defined distance metrics to compute the distance (similarity measure) between two data points, such as k-means and LOF.
3. Domain-based: They establish a boundary in the data and base their detection on that boundary; a clear example is the One-Class Support Vector Machine (OCSVM).
4. Reconstruction-based: They are based on the reconstruction of the original data after a decomposition. The detection is performed based on the error between the original and the reconstructed data, as autoencoders do.
5. Ensemble-based: They combine the results of multiple algorithms to provide the best results, such as LODA and Isolation Forest (iF).

3.4.2. Categories of anomaly detection algorithms

The techniques explained above can be classified into one of three categories, which are shown in Figure 3.4. Their characteristics are detailed as follows [1]:

1. Supervised: These techniques use a training dataset completely labeled as normal or anomalous. This category is also known as an imbalanced class problem.
2. Semi-supervised: Only examples of normal data or only examples of anomalous data may be available.
3. Unsupervised: Techniques that operate as unsupervised learning do not require training data. It is assumed that normal instances are much more frequent than outliers in the validation data.

Figure 3.4: Anomaly detection categories: supervised, semi-supervised, and unsupervised. The green, red, and gray blocks represent the normal, anomalous, and unlabeled observations, respectively. Training the supervised model requires both normal and anomalous data, the semi-supervised model needs either normal or anomalous data, and the unsupervised model does not require training data.

Anomaly detection is largely an unsupervised problem [1]. However, this work presents an analysis of the detection of AVT in both semi-supervised and unsupervised learning.

3.4.3. Selected anomaly detection algorithms

Four algorithms were selected to detect AVT. They were selected because they are used as benchmarks in different works [12, 25, 26, 74] and because they have proven useful in other fields such as healthcare, cybersecurity, credit card fraud, and personnel behavior. A general description of these algorithms is as follows:

(37)
Isolation Forest (iF) [49]. This algorithm has been applied to the detection of anomalous driving patterns [93], attack detection in Wireless Sensor Networks (WSNs) [30], and the detection of insider threat activity [31]. The iF algorithm uses the concept of isolation to detect anomalies in a database. It is based on the premise that anomalies are a minority and have values that are very different from those of normal observations, which makes anomalies more likely to be isolated from other instances when the database is randomly partitioned. The algorithm works by recursively partitioning the database at random until it reaches a given depth or isolates a point. The idea is that anomalies, since they lie farther from the rest of the observations, require fewer random partitions to become isolated, whereas normal observations need more.

One-Class SVM (OCSVM) [76]. This algorithm has been applied to anomaly detection in WSNs due to malfunctions [82] and attacks [30], and to urban anomalies [95]; it is also widely used as a benchmark [71, 92]. OCSVM is an unsupervised algorithm that separates the data from the origin. It maps the input data into a high-dimensional feature space; we employ a radial basis function kernel. It then iteratively finds the maximal-margin hyperplane that best separates the training data from the origin, which can be used as a classification rule to assign a label to a test example.

Local Outlier Factor (LOF) [10]. This algorithm has been used to detect traffic data outliers [52], for traffic anomaly detection [62], and as a benchmark [13, 42, 43]. LOF is a score that measures how likely a given data point is to be an anomaly. The first step is to compute the so-called k-distance of a point p, that is, the distance from p to its k-th nearest neighbor. Any distance measure can be used, but typically the Euclidean distance is chosen.
Then, the local reachability density (lrd) is computed, which is inversely proportional to the average reachability distance from p to its k nearest neighbors. LOF is then essentially the average ratio of the lrds of the neighbors of p to the lrd of p itself. A point is declared anomalous if it is significantly farther from its neighbors than they are from each other.

Angle-Based Outlier Detection (ABOD). This algorithm has been used for outlier detection over data streams [91], in a semiconductor manufacturing etching process [81], and as a benchmark [43].

(38)
ABOD is based on the observation that, for an outlier, the variance of the angles between pairs of the remaining objects becomes small. For each point $x_i$, consider all pairs of other points $(x_j, x_k) \in X$ with $i \neq j \neq k$, and compute the angle between them relative to point $x_i$. The sample variance of these angles determines the outlier score; lower variances indicate anomalous points. Because of the run-time complexity, the authors suggested two simple approximations. The first is to subsample the data and use the subsample as the reference set for computing angles. The other is to consider only the angles among the k nearest neighbors of $x_i$.
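The angle-variance score described above can be illustrated with a minimal sketch. It is not the implementation used in this work (the experiments rely on library implementations), and it follows only the plain description given here: the unweighted sample variance of the angles, with no subsampling and no nearest-neighbor approximation. The function name `angle_variance_scores` and the toy points are assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import variance


def angle_variance_scores(points):
    """For each point, return the sample variance of the angles (in radians)
    that it forms with every pair of the other points.
    Lower scores indicate more anomalous points."""
    scores = []
    for i, p in enumerate(points):
        # Difference vectors from p to every other point.
        vectors = [
            [qc - pc for qc, pc in zip(q, p)]
            for j, q in enumerate(points)
            if j != i
        ]
        angles = []
        for u, v in combinations(vectors, 2):
            norm_u = math.sqrt(sum(c * c for c in u))
            norm_v = math.sqrt(sum(c * c for c in v))
            if norm_u == 0.0 or norm_v == 0.0:
                continue  # a duplicate of p has no defined direction
            cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            # Clamp against rounding error before taking the arccosine.
            angles.append(math.acos(max(-1.0, min(1.0, cosine))))
        if len(angles) < 2:
            raise ValueError("need at least two defined angles per point")
        scores.append(variance(angles))
    return scores


# Toy data (made up): a tight cluster plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
scores = angle_variance_scores(points)
most_anomalous = min(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

In this toy set the distant point (10, 10) sees the whole cluster within a narrow cone, so its angles barely vary and it receives the lowest score. The exhaustive loop over all pairs is what makes the exact score expensive, which is the motivation for the two approximations mentioned above.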

(39) Chapter 4. Experimentation and Results

In this chapter, we present the results obtained by applying the proposed methodology to both real and synthetic vehicular traffic databases. The synthetic database was created using the traffic simulation package SUMO [2]. First, the characteristics of the databases are explained, followed by an analysis of the feature selection process; finally, we show the results of the AVT detection and the selection of the best algorithms. The algorithms used for the detection of anomalies were evaluated under both semi-supervised and unsupervised learning. The AUC [28] was used as the evaluation metric to determine which of the four proposed algorithms obtains the best results. We implemented the methodology and the SUMO simulation on a PC with an AMD Ryzen 3 2200G processor and 16 GB of RAM. The programming language used was Python. In addition, we used the algorithms from the libraries scikit-learn [66] and PyOD [96].

4.1. Databases.

We tested the proposed methodology on a real vehicular traffic database, referred to from now on as Dodgers. This database was acquired from an inductive loop sensor and consists of estimated vehicle counts every 5 minutes over 175 days. The sensor is located on the Glendale on-ramp to the 101-North freeway in Los Angeles [9]. It is close enough to Dodger Stadium that the unusual vehicular traffic caused by the games played there can be detected. There is a record of 81 games, which serves as ground truth. AVT should be detected as soon as

(40) the game ends. Figure 4.1 shows one week of vehicular traffic from the Dodgers database. This database contains 50,400 observations in total.

Figure 4.1: One week of vehicular traffic from the Dodgers database, from 2005-04-16 to 2005-04-23 (x-axis: time in days, from Sun 00:00 to Sun 00:00; y-axis: cars count, 0 to 60). A strong seasonal behavior is observed.

The synthetic database, referred to from now on as the Atizapan database, was obtained by simulating the Lechería-Chamapa freeway on-ramp, located at Bosques de Ixtacala, in the Atizapán municipality, Estado de México, Mexico. Figure 4.2 shows the freeway on-ramp (see footnote 1). A total of 202 days were simulated, with vehicle counts every 5 minutes. The number of days with AVT is 30, which serve as ground truth. Figure 4.3 shows one week of vehicular traffic from this database. Appendix C explains in detail how the simulation was built using SUMO. There are 58,176 observations. Both databases are available in a Google Drive repository (see footnote 2).

Footnote 1: https://www.google.com.mx/maps/@19.6088046,-99.2377058,15z
Footnote 2: https://drive.google.com/drive/folders/1Noqgm1PXNg-qmjuP0GQtnkV918JWIKMq?usp=sharing

4.2. Experimental setup.

For each time aggregation, the methodology was applied to 32 databases in order to analyze which features provide the most information for detecting AVT. These databases result from the combinations without repetition of the bands, taking from 0 to 5 bands at a time. In total, 480 databases were created for each of the two original databases. The anomaly detection algorithms used in this work are iF, OCSVM, LOF, and ABOD. The data were normalized before training the algorithms, which were trained under semi-supervised and unsupervised learning. In the first case, cross-validation with five iterations was used without

(41) considering the anomalies, as these were added only to the five different test sets. The AUC was calculated for each iteration to obtain an averaged AUC. In the second case, the entire database was used for validation, and the resulting AUC value was used as the evaluation metric.

Figure 4.2: The red circle indicates the Lechería-Chamapa freeway on-ramp.

Figure 4.3: One week of vehicular traffic from the Atizapan database (x-axis: time in days, from Sun 00:00 to Sun 00:00; y-axis: cars count, 0 to 400). Its distribution is similar to that of one week of Dodgers vehicular traffic.

We used statistical tests to determine which algorithm performs better. We compared the best algorithm identified by the statistical tests under unsupervised learning with the algorithm proposed by Ihler et al. [37], which we refer to from now on as MMPP. It is an unsupervised model that was developed for the detection of unusual events in the Dodgers database. It was only tested using the whole database with a sampling time of 5 min. In this dissertation, however, the experimentation is also carried out using all time aggregations, and we will compare the MMPP algorithm

(42) with the best model obtained from our methodology.

Figure 4.4: Number of missing values for each day that contains them (x-axis: day of year; y-axis: missing values, 0 to 300). The red line represents the threshold resulting from equation 3.1, whose value is 92. Days shown on the x-axis: 100, 101, 105, 106, 108, 111, 143, 144, 145, 174, 178, 179, 180, 181, 185, 186, 193, 194, 195, 201, 202, 216, 253, 254, 255, 256, 257, 258, 259, 260, 274.

4.3. Methodology implementation.

The following subsections describe the results obtained from the imputation algorithm applied to the Dodgers database, as it is the only one that presents missing values. Next, the feature extraction process applied to both the Atizapan and Dodgers databases is described.

4.3.1. Missing values imputation.

The Dodgers database has 31 days with missing values; Figure 4.4 shows the number of missing values for each of them. Among these, days 178, 179, and 253 belong to the ground truth and do not satisfy equation 3.1; thus, they are removed from the database. Therefore, the number of events in the Dodgers database is 78. Only 19 days have fewer missing values than the 92 allowed, so the imputation method proposed in Section 3.2.3 was applied to them. Figure 4.5(a) shows day 145 of the Dodgers database with its missing values marked by -1, and Figure 4.5(b) shows the same day with the missing values imputed by our algorithm. The Atizapan database, in contrast, does not have missing values.
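The day-level filtering described above can be sketched as follows. This is only an illustration, not the code used in this work: the function name `classify_days`, the dictionary layout, and the toy counts are assumptions, and the threshold is taken directly as the value 92 reported in the text rather than derived from equation 3.1. Reading "below the threshold" as strictly fewer than 92 missing values is also an assumption.

```python
MISSING = -1      # marker for a missing count, as in the Dodgers database
THRESHOLD = 92    # threshold value reported in the text for equation 3.1


def classify_days(counts_by_day, threshold=THRESHOLD, missing=MISSING):
    """Return (days_to_impute, days_to_remove) among days with missing values."""
    days_to_impute, days_to_remove = [], []
    for day in sorted(counts_by_day):
        n_missing = sum(1 for c in counts_by_day[day] if c == missing)
        if n_missing == 0:
            continue  # complete day: nothing to do
        # Assumption: "below the threshold" means strictly fewer than 92.
        if n_missing < threshold:
            days_to_impute.append(day)
        else:
            days_to_remove.append(day)
    return days_to_impute, days_to_remove


# Made-up toy days with 288 five-minute slots each (not the real counts).
toy = {
    100: [12] * 288,                    # no missing values
    145: [MISSING] * 10 + [20] * 278,   # few missing values: impute
    178: [MISSING] * 200 + [15] * 88,   # too many missing values: remove
}
print(classify_days(toy))
```

With these toy counts, day 145 falls under the threshold and would be passed to the imputation step, while day 178 exceeds it and would be removed. Days without any missing value are left untouched.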