Study and Prediction of Air Quality in Smart Cities through Machine Learning Techniques Considering Spatiotemporal Components

Air quality is one of the major concerns for stakeholders in science, government and society. Finally, a fifth contribution is the implementation of spatiotemporal air quality prediction methods that were evaluated on the city of Madrid under several defined scenarios.

Abstract

Finally, a fifth contribution is the implementation of spatiotemporal air quality prediction models, evaluated on the city of Madrid under several defined scenarios. Overall, the novelties of the present work are: a study of the spatiotemporal components for the prediction of air quality (nitrogen dioxide);

Acronyms

Motivation

It is therefore essential to carry out a spatiotemporal analysis that captures and processes all of the above-mentioned dependencies. Before such an analysis can be performed, the first phase is to obtain data on the factors that shape and control air quality.

Research Questions and Objectives

RO5: Incorporate different data sources, including air quality, meteorological and traffic datasets, as well as the locations of air quality and meteorological monitoring stations and traffic measurement points, and process them by implementing different feature engineering techniques. Related chapter: Chapter 3.

Research Contributions

Thesis Structure

Air quality prediction in smart cities using machine learning technologies based on sensor data: a review. Applied Sciences 10, no. The first section presents the procedure for selecting studies related to air quality prediction using ML methods and the analysis of the extracted features and components.

Machine Learning Models for Air Quality Prediction

The distribution of data combinations in terms of the study area is presented in Figure 2.4. It may also be interesting to look at the distribution of study areas in terms of forecasting objectives.

Graph Neural Network for Air Quality Prediction

It is noticeable that most papers (seventeen out of eighteen) used graphs with weighted edges. It is also very interesting to see the distribution of the datasets in chronological order.

Summary

This chapter provides a detailed explanation of the proposed methodology and an extensive examination of the datasets used. We gave a description of the study area, i.e. the cities of Madrid and. Part of this chapter has previously been published as articles in the journals IJCIA, PloS one, IEEE Access and Data in Brief, and as papers at the AGILE and EnviroInfo conferences.

Description of Study Area and Prediction Target

During the windiest part of the year (January 27 - May 7) the average wind speed is about 3.5 m/s. Madrid City Layer, Air Quality Stations Layer, Meteorological Stations Layer and Traffic Measurement Points: Madrid City Council Open Data Portal D.

Data Preparation

This decline is related to the implementation of environmental legislation and technologies and to the consequences of the global economic crisis. Traffic Data - As traffic data attributes can be specific to a certain area, the selected traffic attributes for the city of Madrid are listed below with their definitions. Since the locations of air quality stations, meteorological stations and traffic measurement points differ, the three sources have to be combined spatially and temporally.
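One simple way to combine the sources spatially is to attach each air quality station to its nearest meteorological station or traffic measurement point. The sketch below assumes great-circle (haversine) distance as the matching rule and uses made-up station ids and coordinates. The text here does not state which rule was actually applied, so treat it as an illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(target, candidates):
    """Return the id of the candidate site closest to the target (lat, lon)."""
    return min(candidates, key=lambda sid: haversine_km(*target, *candidates[sid]))


# Hypothetical coordinates, for illustration only.
air_quality_station = (40.4200, -3.7000)
met_stations = {"met_a": (40.4500, -3.7200), "met_b": (40.4210, -3.6950)}

matched = nearest_site(air_quality_station, met_stations)
```

Once every air quality station has a matched site, the temporal combination reduces to joining the records on a shared timestamp.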

Exploratory Data Analysis

Regarding spatial correlation, Figure 3.6 shows heat maps of the correlation between the time series recorded at the stations. For example, Figure 3.10 shows that at the station with id=96 during January (calm wind: 3.63%), the predominant wind direction is northeast, but higher wind speeds are recorded in a westerly direction. This is because the average traffic speed is only available for the M30 road, which covers 15.8% of the study area (Figure 3.14 shows the average traffic speed over a period of one week).
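The kind of matrix behind such a heat map can be sketched as pairwise correlations between station time series. The example assumes the Pearson coefficient, which the text here does not confirm, and the station ids and values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def correlation_matrix(series):
    """Return sorted station ids and the matrix of pairwise correlations."""
    ids = sorted(series)
    matrix = [[pearson(series[a], series[b]) for b in ids] for a in ids]
    return ids, matrix


# Hypothetical hourly readings for three stations.
series = {
    "s1": [30.0, 42.0, 55.0, 38.0],
    "s2": [28.0, 40.0, 57.0, 35.0],
    "s3": [50.0, 41.0, 33.0, 45.0],
}

ids, matrix = correlation_matrix(series)
```

Each cell `matrix[i][j]` is the value a heat map would colour for the station pair `(ids[i], ids[j])`.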

Feature Engineering

The idea of inverse distance weighting (IDW) is to predict values at unknown points from the values of known points. Transformation: this technique was used to transform the wind direction in one of two ways: 1) converting it into categorical data with the categories north, east, south, west, southwest, northeast, southeast and northwest, which are then encoded with a One Hot Encoder, or 2) converting it to u and v components using the following equations (Eq. The division of the data into the above steps is given for each model in the following chapters, since it varies depending on the application that drives model development.
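Both techniques can be sketched in a few lines. The code makes two assumptions that this text does not fix: the IDW power parameter is set to 2, and the u/v conversion follows the common meteorological convention, where direction is the bearing the wind blows from, measured clockwise from north. The function names are hypothetical.

```python
import math


def idw(target, known, power=2.0):
    """Inverse distance weighting: estimate the value at `target` (x, y)
    from `known`, a list of ((x, y), value) pairs. `power` is an assumed default."""
    numerator = denominator = 0.0
    for (x, y), value in known:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # the target coincides with a known point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def wind_to_uv(speed, direction_deg):
    """Convert wind speed and direction (degrees, blowing FROM, clockwise
    from north) into eastward (u) and northward (v) components."""
    theta = math.radians(direction_deg)
    return -speed * math.sin(theta), -speed * math.cos(theta)


# A point halfway between two known points receives their average.
estimate = idw((0.0, 0.0), [((1.0, 0.0), 10.0), ((-1.0, 0.0), 20.0)])

# A 5 m/s wind from the west (270 degrees) blows toward the east.
u, v = wind_to_uv(5.0, 270.0)
```

The u/v form avoids the discontinuity between 359 and 0 degrees that a raw direction value carries, which is one common reason for preferring it over the angle itself.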

Machine Learning Methods

Machine Learning Concept
Artificial Neural Network
Proposed Methods

It uses previous outputs as input, meaning that the input consists of two elements: the present and the recent past. In the case of directed edges, an edge has a source node and a destination node, which means that information flows from the source node to the destination node. The alignment scores indicate how well the elements of the input sequence match the current output.
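The recurrence described above, where the present input is combined with the recent past, can be illustrated with a plain recurrent cell. This is a minimal sketch of the textbook update h_t = tanh(W_x x_t + W_h h_{t-1} + b), not the LSTM-based models proposed in the thesis; the weights and the input sequence are arbitrary illustrative values.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: combine the present input x_t with the
    previous hidden state h_prev (the 'recent past')."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(z))
    return h_t


def run_rnn(sequence, W_x, W_h, b):
    """Feed a sequence through the cell, starting from a zero hidden state."""
    h = [0.0] * len(b)
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


# Arbitrary weights: one input feature, two hidden units.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

final_state = run_rnn([[1.0], [2.0]], W_x, W_h, b)
```

Because each step depends on the previous hidden state, feeding the same values in a different order produces a different final state.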

Summary

Convolutional Long Short-Term Memory Network

Experimental Analysis

Treatment of undefined data: outliers were detected based on summary statistics of the datasets (defined in Section 3.4). The model with 16 filters performed better than the models with other filter counts. Likewise, the model with a 9 × 9 kernel performed better than the models with other kernel sizes.

Results and Discussion

Regarding the two periods, the pandemic period outperforms the non-pandemic period in the second scenario for all time intervals. In particular, for the best performance, detected at the 1-hour time interval, the pandemic period outperforms the non-pandemic period by 16.44% with ConvLSTM (pandemic period: 1.22 µg/m3, non-pandemic period: 1.46 µg/m3) and by 3.31% with LSTM (pandemic period: 1.46 µg/m3, non-pandemic period: 1.51 µg/m3). Although the variance of the pandemic year is lower than that of the non-pandemic year, the algorithms are trained and tested separately for each period, meaning that the models will most likely learn and generalize the existing patterns of each period during training. In terms of temporal granularity, the 1-h granularity outperformed the other granularities in all sub-scenarios, but no consistent trend holds among the remaining granularities, which may be related to the selection of historical time lags [124].
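The percentages quoted above follow from the reported error values if the improvement is taken relative to the non-pandemic error. The helper below is a hypothetical name introduced only to reproduce that arithmetic.

```python
def relative_improvement(baseline_error, improved_error):
    """Percentage reduction of the error relative to the baseline error."""
    return (baseline_error - improved_error) / baseline_error * 100.0


# Error values in µg/m3 as reported in the text (non-pandemic vs. pandemic period).
convlstm_gain = relative_improvement(1.46, 1.22)
lstm_gain = relative_improvement(1.51, 1.46)

print(f"ConvLSTM: {convlstm_gain:.2f}%  LSTM: {lstm_gain:.2f}%")
```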

Summary

One of the main objectives of this chapter was to address the impact of COVID-19 on pollution formation. The final results showed that the proposed model outperformed the LSTM, which can be explained by the ability of the ConvLSTM to generalize and transfer spatiotemporal information. In terms of datasets, the analyses performed with selected features produced better results than those performed with all features, owing to the drawbacks of high dimensionality.

Bidirectional Convolutional Long Short-Term Memory Network

Experimental Analysis

It should be mentioned that the wind direction was also chosen because of its correlation with the wind speed. The results showed that the 3-layer model performed better than the other two options. In general, the model architecture was built from the selected parameters by stacking three BiConvLSTM layers with a kernel size of 3 × 3 and 16 filters, trained with the Adam optimizer.

Results and Discussion

The results of the first sub-scenario were better than the results of the second sub-scenario with all features included. In particular, MI worsened the results of the first sub-scenario but improved the overall performance of the second sub-scenario. In the case of the first sub-scenario, the best combination of features is obtained when K=7 (RMSE = 3.44, MAE = 2.87).

Summary

The air quality stations in the city of Madrid (Figure 3.5) follow no specific spatial pattern; they are distributed without any particular order. To perform predictive analysis at the air quality monitoring stations while taking their spatiotemporal relationships into account, a GNN can be implemented, since it can handle non-Euclidean structured data. Part of this chapter previously appeared as an article in the journal IEEE Access and as a paper at the EnviroInfo conference.

Experimental Analysis

The following paragraphs describe the stages of the experimental analysis (Experimental Analysis), the output of this analysis and the discussion that follows from this output (Results and Discussion). Mathematically, the above procedure can be defined as a function of the air quality station network G and the feature matrix X (Eq. 6.2). Madrid city layer, air quality station layer: Open Data portal of the City Council of Madrid D.
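As a rough illustration of the two inputs, the sketch below builds a weighted adjacency matrix for a station graph G and a matrix X with one row of features per station. The Gaussian distance kernel, the cut-off weight, the coordinates and the feature values are all assumptions for illustration; the construction actually used in the thesis is not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def weighted_adjacency(coords, sigma_km=5.0, min_weight=0.1):
    """Symmetric adjacency matrix whose edge weights decay with distance.
    Edges with a weight below `min_weight` are dropped (weight 0)."""
    n = len(coords)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = haversine_km(*coords[i], *coords[j])
            weight = math.exp(-((distance / sigma_km) ** 2))
            if weight >= min_weight:
                adjacency[i][j] = adjacency[j][i] = weight
    return adjacency


# Hypothetical station coordinates (lat, lon); the third one is far away.
coords = [(40.42, -3.70), (40.43, -3.69), (40.60, -3.50)]
A = weighted_adjacency(coords)

# Hypothetical feature matrix X: one row per station, e.g. [NO2, wind speed].
X = [[35.0, 2.1], [38.0, 2.4], [12.0, 3.9]]
```

A graph-based model would then receive `A` (the structure of G) together with a sequence of such `X` matrices, one per time step.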

Results and Discussion

It should be mentioned that the analysis was performed in the Google Colab cloud service using the PyTorch Geometric Temporal library [126]. The results of the analysis are shown in Table 6.4 (the best results are shown in bold). Regarding the time interval pattern, the results at closer time intervals were better than those at more distant intervals when 256 hidden units were used.

Summary

RMSE and MAE are expressed in the unit of the target variable (NO2: µg/m3). It is important to mention that, when the baseline methods are compared only with one another, TGCN outperforms the other two methods (LSTM and GRU). Since TGCN is also a graph-based method, these findings highlight the advantage of graph-based methods, which can capture spatial dependencies in addition to temporal dependencies.
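The two metrics can be computed as in the following sketch, which uses their standard definitions. The observed and predicted values are invented for illustration; both results come out in the unit of the inputs (here µg/m3).

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical NO2 concentrations in µg/m3.
observed = [20.0, 35.0, 50.0]
predicted = [22.0, 33.0, 47.0]

rmse_value = rmse(observed, predicted)
mae_value = mae(observed, predicted)
```

RMSE is never smaller than MAE on the same data, because squaring gives larger errors more weight.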

Conclusions and Future work

Regarding the relationships between NO2 and the rest of the features, the most relevant feature appeared to be wind speed. For example, if we compare the results of the first sub-scenario with all features included in Table 5.4 (BiConvLSTM) with those in Table 6.4 (A3T-GCN), it can be seen that the results are close. Regarding the architecture of the proposed models, further modifications can be made; for example, in the case of A3T-GCN, several layers can be stacked.

Appendix A Publications

Features of the selected papers

Method: SSH-GNN – Self-Supervised Hierarchical Graph Neural Network, DGCN – Dual Graph Convolution Network, DP-DDGCN – Dual-path Dynamic Directed Graph Convolutional Network, ST-DGCN – Spatial-Temporal Dynamic Graph Convolution Neural Network, MST-GCN – Multi-scale Spatiotemporal Graph Convolution Network, ATGCN – Attentive Temporal Graph Convolutional Network, GAGNN – Group-aware Graph Neural Network; Target: AQI – Air Quality Index, PM2.5 – Particulate matter less than 2.5 microns in diameter, PM10 – Particulate matter less than 10 microns in diameter, NO2 – Nitrogen dioxide, CO – Carbon monoxide, O3 – Ozone; Dataset: MET – Meteorological, POI – Point of Interest, RND – Road network data; Evaluation metric: MAE – Mean Absolute Error, RMSE – Root Mean Square Error, ACC – Accuracy.

Method: HGNN – Hierarchical Graph Neural Network, LR – Linear Regression, ARIMA – Autoregressive Integrated Moving Average, MLP – Multilayer Perceptron, GCN – Graph Convolutional Network, STGCN – Spatial-Temporal Graph Convolutional Network, ASTGCN – Attention-based Spatial-Temporal Graph Convolution Network, GCRNN – Graph Convolutional Recurrent Neural Network, AQSTN – Air Quality Spatial-Temporal Network, SpAttRNN – Spatio-Attention embedded Recurrent Neural Network; Evaluation metric: R2 – Coefficient of determination, spRMSE – Spatiotemporal RMSE, MAPE – Mean Absolute Percentage Error, MSE – Mean Squared Error, IA – Index of Agreement, SMAPE – Symmetric Mean Absolute Percentage Error.

Appendix C

Reproducibility

The location of the air quality monitoring stations is available in .csv, .xlsx and .geo format. The location of the meteorological monitoring stations is available in .csv, .xlsx and .geo format. Madrid Graph Network.ipynb contains the procedure for building a graph network of the air quality stations located in the city of Madrid.

Appendix D

The Tools Used

Bibliography

Modeling interstation relations with attentive temporal graph convolution network for air quality forecasting. Spatial air quality index prediction model based on decomposition, adaptive boosting and three-stage feature selection: a case study in China. A data ensemble approach to real-time air quality forecasting using extremely randomized trees and deep neural networks.