Data preparation includes pre-processing steps to obtain NTS and PTS sets from raw individual vehicle data.
5.5.1.1. Raw Data Cleansing
Raw data are stored in text files, so it is necessary to check whether each data line (with the vehicle class excluded) can be converted into numerical data. Raw data can be corrupted in several ways, such as:
Reset text of detectors. Traffic detectors usually reset their counters after midnight, when traffic is low. Special text appears in the data files to indicate the reset. A reset usually lasts about 15 minutes.
Failure of detectors. The frequency of failure is higher in 2002, 2003, and 2004 at study site CH023. Special characters appear when the failure happens.
Errors within data lines. Data lines should be in the format presented in Figure 4-3. Otherwise, the data lines should be removed.
When all the corrupted text in raw data is removed, raw data are aggregated based on lanes.
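The line-level check described above can be sketched as follows. This is only a minimal illustration, not the procedure used in the study: the whitespace-separated layout, the field count, the position of the vehicle class (assumed here to be the last field), and the function names are all assumptions, since the actual line format is the one given in Figure 4-3.

```python
from typing import Iterable, List

# Hypothetical layout: whitespace-separated fields with the vehicle class
# (a non-numeric label) in the last position. The real layout is the one
# shown in Figure 4-3; both constants below are assumptions.
EXPECTED_FIELDS = 6
CLASS_FIELD_INDEX = EXPECTED_FIELDS - 1


def is_valid_line(line: str) -> bool:
    """Return True if every field except the vehicle class is numeric."""
    fields = line.split()
    if len(fields) != EXPECTED_FIELDS:
        return False
    for index, field in enumerate(fields):
        if index == CLASS_FIELD_INDEX:
            continue  # the vehicle class is excluded from the check
        try:
            float(field)
        except ValueError:
            return False
    return True


def clean_raw_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the lines that pass the numeric-conversion check."""
    return [line for line in lines if is_valid_line(line)]
```

Under these assumptions, reset text and lines containing special characters fail the conversion and are dropped together with malformed data lines.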
5.5.1.2. Aggregated Data Cleansing
As data are aggregated over 5-minute intervals in the current study, any interval with insufficient raw data is removed. This usually occurs when a detector reset starts or ends within an aggregation interval. Aggregation intervals start at minutes that are multiples of 5 (for example, at 12:00, 17:45, or 21:15). If the reset starts at 12:33 and finishes at 12:47, raw data from 12:33 to 12:47 are not available. The intervals 12:30-12:35 and 12:45-12:50 are therefore also removed, as they do not contain sufficient raw data. The same cleansing is applied to intervals affected by detector failure.
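The interval-removal rule can be illustrated with a short sketch that lists every 5-minute interval overlapping a gap in the raw data. The function names are hypothetical, the date is arbitrary, and the gap is treated as half-open (a gap ending exactly on an interval boundary does not affect the interval that starts there), which is an assumption rather than something the study states.

```python
from datetime import datetime, timedelta
from typing import List

INTERVAL = timedelta(minutes=5)


def floor_to_interval(t: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute interval."""
    return t.replace(minute=t.minute - t.minute % 5, second=0, microsecond=0)


def affected_intervals(gap_start: datetime, gap_end: datetime) -> List[datetime]:
    """Start times of every 5-minute interval that overlaps the gap."""
    starts = []
    current = floor_to_interval(gap_start)
    while current < gap_end:
        starts.append(current)
        current += INTERVAL
    return starts


# Reset from 12:33 to 12:47, as in the example in the text.
removed = affected_intervals(datetime(2003, 1, 1, 12, 33),
                             datetime(2003, 1, 1, 12, 47))
print([t.strftime("%H:%M") for t in removed])
# ['12:30', '12:35', '12:40', '12:45']
```

The two partially covered intervals (starting at 12:30 and 12:45) are returned along with the two fully covered ones, matching the example.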
In case of errors within data lines, corresponding lines are simply removed.
5.5.1.3. Missing Values
When data for two lanes in the same direction are aggregated, the aggregated data for one lane are matched with the aggregated data for the other lane to generate traffic situations. There are periods when one lane has no vehicles while the other lane does. In this case, data for the no-vehicle lane are generated to match the lane with vehicles. Seven lane-based parameters are generated for the no-vehicle lane (see Table 5-1):
Volume: 0. Occupancy: 0%. Average speed: 0 km/h.
Average headway, standard deviation of headway, standard deviation of speed, and percentage of heavy vehicles are also set to zero, as for average speed.
The zero values of volume and occupancy reflect reality, as there are no vehicles during the aggregation interval. The zero value for speed, however, is a dummy value: if there is no vehicle, there is no speed. Similarly, dummy zero values are assigned to average headway, standard deviation of headway, standard deviation of speed, and percentage of heavy vehicles. These dummy values should not influence the final results, as the corresponding traffic situations will be clustered into a special group.
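The zero-filling step can be sketched as below. This is a simplified illustration with hypothetical parameter keys, type aliases, and function names. It also assumes that intervals removed during cleansing have already been dropped from both lanes, so that an interval missing from only one lane really means that no vehicle passed on that lane.

```python
from typing import Dict, Tuple

# The seven lane-based parameters; the key names are hypothetical.
LANE_PARAMETERS = (
    "volume", "occupancy", "avg_speed", "avg_headway",
    "std_headway", "std_speed", "pct_heavy_vehicles",
)

LaneRecord = Dict[str, float]
LaneData = Dict[str, LaneRecord]  # interval start (e.g. "12:00") -> parameters


def zero_lane_record() -> LaneRecord:
    """Record for a lane with no vehicles: every parameter set to zero."""
    return {name: 0.0 for name in LANE_PARAMETERS}


def match_lanes(lane_1: LaneData,
                lane_2: LaneData) -> Dict[str, Tuple[LaneRecord, LaneRecord]]:
    """Pair both lanes per interval, zero-filling the lane without vehicles.

    Assumes an interval absent from one lane means no vehicle on that lane,
    not data removed during cleansing.
    """
    matched = {}
    for interval in sorted(set(lane_1) | set(lane_2)):
        matched[interval] = (
            lane_1.get(interval, zero_lane_record()),
            lane_2.get(interval, zero_lane_record()),
        )
    return matched
```

Only volume and occupancy are physically meaningful zeros here; the remaining five values are the dummy zeros discussed above.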
5.6. Summary
This chapter discusses the application of the first step of the methodology to the data collected from the study site. Twenty-two variables are used to characterize traffic situations, which are aggregated over 5-minute intervals. Before NTS and PTS are specified, crash time correction is discussed, with the crash time estimated using shockwave theory. To agree with the aggregation interval, the crash time is shifted earlier to coincide with the end of the last traffic situation.
Other important design choices are made for the pre-crash period such that, before the shifted crash time, there are 6 PTS characterizing the traffic evolution before the crash. Data within the pre-crash buffer period and the post-crash period are not used. The exclusion of these periods should not influence the overall performance of the developed models.
Finally, the working data set of the current research for the selected study site includes 1’160’831 NTS and 720 PTS (for 120 crashes). The imbalance ratio of the classes is IMRO = 1’612.27.
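As a check, these figures are consistent with six PTS per crash and with IMRO read as the number of NTS divided by the number of PTS. That reading is an assumption here (it reproduces the reported value), and the symbols N_NTS and N_PTS are introduced only for this illustration:

```latex
\begin{align*}
N_{\mathrm{PTS}} &= 120 \times 6 = 720, \\
\mathrm{IMRO} &= \frac{N_{\mathrm{NTS}}}{N_{\mathrm{PTS}}}
              = \frac{1\,160\,831}{720} \approx 1\,612.27 .
\end{align*}
```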
Chapter 6 Data Sampling & Traffic Regimes
This chapter presents the design choices for the data sampling process and provides an in-depth analysis of the traffic conditions obtained from the clustering process, which are called traffic regimes. Traffic situations are transformed before being sampled. After the data sampling process, traffic situations in their original form are used in subsequent chapters.
It is worth noting that the term “traffic regime” may be used elsewhere with entirely different meanings. Traffic regimes in the current research should not be linked to those of any other study.
6.1. Introduction
According to the preliminary crash analyses presented in section 4.5.4, most of the crashes occurring at the selected study site are rear-end and sideswipe crashes (120 out of 170 crashes in total). Figure 4-15 in particular shows that these crashes happen mostly under high-flow or congested conditions. This suggests that certain dominant traffic conditions must be present for the considered collisions to occur, and that these crash types are unlikely under other traffic conditions.
Section 3.3.1 presents an example of model development using all available pre-crash and non-crash data, without sampling the non-crash data, which yields poor results. This indicates that the imbalance between pre-crash and non-crash data lowers the performance of machine learning methods.
Previous studies in the literature also partly mention the need to sample non-crash data, as summarized in section 2.4.4. However, the data sampling approaches are rather arbitrary, and some non-crash data remain unused in the sampling process.
Here, the data sampling methodology proposed in section 3.3 is applied to the NTS and PTS data sets introduced in Chapter 5. Section 6.2 discusses the design choices of the non-crash data sampling methodology. Section 6.3 analyzes the results of the data sampling process, which are called traffic regimes. Section 6.4 discusses the link between NTS and PTS and the traffic regimes.