
The descriptive statistics of the mean absolute errors (MAE) with the SM method are shown in Table 4.2. With the SM method, the lowest MAE, .1533, was obtained for CO monitoring station 7. The second lowest error, .2237, was obtained when imputing the missing data of station 6. Hence, the values imputed by the SM method came closest to the actual values at station 7. The descriptive statistics of mean absolute errors with the MNP method are shown in

Stations    N      Minimum   Maximum   MAE     Std. Deviation
Station 1   8783   .00       6.97      .2992   .45308
Station 2   8783   .00       7.49      .9134   .70834
Station 3   8783   .00       3.64      .2246   .25858
Station 4   8783   .00       7.20      .3551   .47763
Station 5   8783   .00       8.21      .4405   .42733
Station 6   8783   .00       5.19      .2237   .33987
Station 7   8783   .00       6.45      .1533   .25284

Table 4.2: Mean Absolute Errors with SM Method

Table 4.3. Table 4.3 shows that an MAE of .254 was obtained for station 7, the best result available with this method. The second best result was an MAE of .256 for station 3. Both MAEs were obtained by comparing the imputed values with the actual values. Further results for this method showed how close the MAEs were across the monitoring stations. The descriptive statistics of mean absolute errors with MDNP

Stations    N      Minimum   Maximum   MAE     Std. Deviation
Station 1   8544   .03       6.97      .3075   .45657
Station 2   8783   .01       7.49      .9242   .70189
Station 3   8783   .01       3.64      .2309   .25621
Station 4   8782   .00       7.20      .3645   .47509
Station 5   8781   .01       8.21      .4499   .42179
Station 6   8783   .01       5.19      .2283   .33967
Station 7   8771   .00       6.45      .1650   .25036

Table 4.3: Mean Absolute Errors with MNP Method

was obtained for station 7. However, the MAE is highest for station 2, at .9236. With this method of imputation, the lowest MAE against the actual values was .1607. The descriptive statistics

Stations    N      Minimum   Maximum   MAE     Std. Deviation
Station 1   8544   .03       6.97      .3075   .45657
Station 2   8783   .01       7.49      .9236   .70298
Station 3   8783   .00       3.65      .2295   .25715
Station 4   8782   .01       7.21      .3629   .47628
Station 5   8781   .01       8.21      .4494   .42276
Station 6   8783   .01       5.19      .2275   .33991
Station 7   8771   .01       6.46      .1607   .25324

Table 4.4: Mean Absolute Errors with MDNP Method

of mean absolute errors with the LI method are shown in Table 4.5. Table 4.5 shows that imputation of missing data through Linear Interpolation (LI) yielded an MAE of .1620 for station 7. The second best result for this method was an MAE of .2280 for station 6. The descriptive statistics of mean absolute errors with the LTP method are shown in Table 4.6. Table 4.6 shows that the best result with LTP is a minimum MAE of .1597 for station 7. The second best result was obtained with an MAE

Stations    N      Minimum   Maximum   MAE     Std. Deviation
Station 1   8544   .04       6.96      .3250   .44736
Station 2   8783   .01       7.49      .9220   .70219
Station 3   8783   .00       3.65      .2281   .25681
Station 4   8783   .00       7.20      .3616   .47535
Station 5   8783   .01       8.20      .4573   .41793
Station 6   8783   .00       5.19      .2269   .33914
Station 7   8773   .00       6.45      .1597   .25134

Table 4.5: Mean Absolute Errors with LI Method

of .2269 for station 6. Overall, the SM method performed best in prediction of

Stations    N      Minimum   Maximum   MAE     Std. Deviation
Station 1   8544   .03       6.97      .3075   .45657
Station 2   8783   .01       7.49      .9239   .70339
Station 3   8783   .00       3.65      .2305   .25769
Station 4   8783   .00       7.20      .3637   .47602
Station 5   8782   .01       8.21      .4505   .42326
Station 6   8783   .01       5.19      .2280   .34028
Station 7   8771   .00       6.46      .1620   .25268

Table 4.6: Mean Absolute Errors with LTP Method

missing data, having the lowest MAE of .1533 for station 7. This was followed by the LTP method, with an MAE of .1597, also for station 7. All five imputation methods used in this study performed relatively well, but the lowest MAE was obtained with the SM method, followed by the LTP method.
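To make the MAE comparison concrete, the following is a minimal sketch of how an imputation method can be scored against known values. It assumes that SM and LI denote series-mean and linear-interpolation imputation (the names of the corresponding SPSS options), uses a tiny hypothetical series rather than the CO data, and scores only the masked positions; the function names are illustrative and are not taken from this study.

```python
from statistics import mean

def series_mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def linear_interpolate(values):
    """Replace None entries by interpolating between the nearest observed
    neighbours; gaps at either edge copy the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out

def mean_absolute_error(actual, imputed, missing_idx):
    """Average absolute gap between actual and imputed values at the masked positions."""
    return mean(abs(actual[i] - imputed[i]) for i in missing_idx)

# Hypothetical series: hide one known value, impute it, then score the result.
actual = [1.0, 2.0, 4.0, 8.0, 16.0]
missing_idx = [2]
masked = [None if i in missing_idx else v for i, v in enumerate(actual)]

mae_sm = mean_absolute_error(actual, series_mean_impute(masked), missing_idx)
mae_li = mean_absolute_error(actual, linear_interpolate(masked), missing_idx)
```

On this toy series the interpolated value is closer to the hidden one than the series mean is. That says nothing about the ranking on the CO data, where the SM method gave the lowest MAE.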

We then classified the CO data set using each of the five imputation methods above, creating an SVM ensemble from the data imputed by each method. For each imputation method, classification accuracy was evaluated by creating an ensemble using the bagging and boosting algorithms.
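Classification accuracy in this section is reported from confusion matrices. As a reminder of that calculation, the sketch below computes accuracy as the sum of the diagonal (correct predictions) divided by the total count. The matrix entries are hypothetical and are not taken from the figures in this chapter.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 2-class confusion matrix: rows are true classes,
# columns are predicted classes.
example = [
    [50, 10],
    [15, 25],
]
accuracy = accuracy_from_confusion(example)  # (50 + 25) / 100
```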

Firstly, we deployed the SM method for imputation of missing data and created an SVM ensemble with this data. The ensemble built on SM-imputed data using the adaBoostM1 algorithm achieved a classification accuracy of 76.9%, based on the confusion matrix illustrated in Fig. 4-3. The ensemble obtained with

Figure 4-3: SM Method Confusion Matrix AdaBoostM1 Algorithm

the SM method using the bagging algorithm achieved 74.6% classification accuracy, based on the confusion matrix illustrated in Fig. 4-4. The ensemble obtained with the MDNP imputation method and the adaBoostM1 algorithm achieved 76.7% classification accuracy, based on the confusion matrix shown in Fig. 4-5. However, the ensemble using the MDNP method with the bagging algorithm achieved 75.0% classification accuracy, based on the confusion matrix illustrated in Fig. 4-6. The ensemble based on the MNP method achieved 76.7% classification accuracy using the adaBoostM1 algorithm, as shown in Fig. 4-7. With this method, the ensemble reached the same 76.7% using the bagging algorithm, based on the confusion matrix illustrated in Fig. 4-8. With the LI method, a classification accuracy of 76.9% was obtained using the adaBoostM1 algorithm, based on the confusion matrix shown in Fig. 4-9. The same 76.9% was obtained using the bagging algorithm with the LI method, based

Figure 4-4: SM Method Confusion Matrix Bagging Algorithm

Figure 4-5: MDNP Method Confusion Matrix AdaBoostM1 Algorithm

on the confusion matrix shown in Fig. 4-10. A classification accuracy of 76.5% was obtained in ensemble creation with the LTP method using the adaBoostM1 algorithm, based on the confusion matrix shown in Fig. 4-11. The same 76.5% was obtained using the bagging algorithm with the LTP method, as illustrated by the confusion matrix in Fig. 4-12. Based on the results of classification accuracy of all

Figure 4-6: MDNP Method Confusion Matrix Bagging Algorithm

Figure 4-7: MNP Method Confusion Matrix AdaBoostM1 Algorithm

imputation methods, we can conclude that the best result of classification accuracy

of 76.9% is obtained with the SM method using adaBoostM1 and bagging algorithms. The second best imputation method for filling missing data was MNP, which had a

Figure 4-8: MNP Method Confusion Matrix Bagging Algorithm

Figure 4-9: LI Method Confusion Matrix AdaboostM1 Algorithm

4.6 Conclusion

This study examined the effectiveness of existing SM, MNP, MDNP, LI and LTP

imputation methods in terms of their error and classification accuracy in ensemble creation. There are various other effective methods proposed for dealing with missing

Figure 4-10: LI Method Confusion Matrix Bagging Algorithm

Figure 4-11: LTP Method Confusion Matrix AdaboostM1 Algorithm

only the SM, MNP, MDNP, LI and LTP imputation methods that are already implemented in SPSS and are used by researchers who are gaining useful results. However, the performance of these imputation methods in ensemble creation has not been evaluated in previous research; hence this work has achieved a significant contribution to

Figure 4-12: LTP Method Confusion Matrix Bagging Algorithm

Results of experiments in this research identified that the SM method produced the lowest MAE compared with the other imputation methods in ensemble creation. Further, ensemble creation with the SM method resulted in better classification accuracy than the other methods using bagging and boosting algorithms in our research. Importantly, the accuracy margin among the imputation methods is not large, but the SM method gave comparatively better imputation results in our experiments.

This work is limited to a small data set, i.e. 8783 observations, and could be extended to larger data sets. Secondly, further work is required on the validity of the SM method results under various patterns of missing data. In the literature, patterns of missing data have been used successfully for imputation of missing data. Thirdly, how each of the imputation methods in this study influences the performance of classifiers in creating an ensemble remains to be examined. Fourthly, this study could be further extended by widening the number of performance indicators

Chapter 5

Proposed Scalable SVM Ensemble Learning Method (SSELM)

5.1 Introduction

Machine learning algorithms must keep up with the latest developments in environmental and other applications. The performance of machine learning algorithms depends on sufficient and reliable data coming from various sources. With the internet revolution, data are nowadays easily accessible through electronic sources. However, environmental data are sometimes incomplete and contain missing values because of equipment failure, human error, incorrect setup of monitoring stations' dimensions and so on. Therefore, present machine learning models for air pollution prediction analysis do not represent the true picture of air pollution. Further, the existing online models are confronted with the problem of accommodating the huge amount of spatio-temporal data. In this regard, the scalability of machine learning models and the aggregation of various model decisions play an important role in online spatio-temporal air pollution prediction. This chapter investigates the Scalable SVM Ensemble Learning Method (SSELM) in relation to spatio-temporal nitrogen dioxide (NO2), carbon monoxide (CO) and ozone (O3) predictions for Auckland across its region-wide monitoring stations.

we breathe. These air pollutants can cause hazy days and unpleasant smells, and have adverse effects on our health. Air quality depends on the amount of pollution released into the air by human and natural activities, the degree of diffusion caused by wind and weather effects, and chemical reactions among the pollutants (Xue et al., 2016).

In Auckland, concentrations of these pollutants are measured by the Auckland Regional Council (ARC) because they are known to endanger human well-being and health. The pollutant concentrations in New Zealand are measured against national standards or regional air quality targets. On average, every Aucklander breathes 11,000 litres of air every day. Although New Zealand's air is clean compared with that of other nations, the country has the worst asthma death rate (Mattke, Kelley, Scherer, Hurst, & Lapetra, 2006). Asthmatic people are sensitive to poor air quality, which in Auckland is primarily caused by motor vehicles and other sources such as domestic fires. Long-term exposure to air pollution results in an increase in cardiovascular and lung diseases, resulting in heavy medical costs to the general public (Perera, 2017), besides damage to other living organisms such as vegetation.

Air pollution threats to public health have led to the development of several machine learning models to predict future air quality. However, air pollution monitoring is a complex task. Monitoring stations consist of sensor devices which collect each pollutant concentration every minute across the whole year. Knowledge discovery from such a large amount of data is again a complex, time-consuming and quite expensive task. Air pollution is a spatio-temporal problem, and spatio-temporal data sets are always huge. Hence, machine learning practitioners are confronted with the problem of handling this large amount of data, and with computational time constraints, in solving the air pollution problem. In addition, practitioners have to deal with missing values in the data, which is a time-series problem. With missing values, a machine learning model would produce biased results in specific characteristics of spatio-temporal prediction tasks. Because of these problems, we propose a scalable SVM ensemble learning method (SSELM) for predicting future air pollution. SSELM will have the ability to handle a large amount of spatio-temporal data and to conduct environmental prediction on data with missing values. The proposed method will predict the air pollution status for the whole region based on CO, NO2 and ground-level O3 concentrations.

The remainder of this chapter is organised as follows: Section 5.2 describes the motivation for distributed computing. Section 5.3 discusses the proposed online scalable SVM ensemble learning method and provides insight into the method. Section 5.4 focuses on possible outcomes from the proposed method. Section 5.5 presents the capabilities of aggregated SVMs. Information on the data set and experimental setup is provided in Section 5.7. The performance evaluation criterion for the proposed model is given in Section 5.8. Section 5.9 focuses on experimental results and discussion, and finally Section 5.10 concludes the chapter.