Classification trees and PM10 dynamics in Bogotá, Colombia

Academic year: 2020

Universidad de los Andes

Undergraduate thesis submitted for the degree of Environmental Engineer.

Santiago Arango Piñeros

Advisor: Ricardo Morales Betancourt, Ph.D.


Abstract

In 1984, Breiman, Friedman, Olshen and Stone published the book Classification and Regression Trees [1], in which they developed the theory of a new nonparametric tool for data analysis. The proof of the method's value is that it remains in force today as one of the main techniques in statistical learning (see [6], [7]). Classification and Regression Trees have been used in various settings, from medicine to air quality. This thesis applies the classification techniques introduced by Breiman et al. to the analysis of particulate matter dynamics in the city of Bogotá. In section 2, we give a general introduction to the classification methodology with an algorithmic flavor, designed to help the reader understand the applications rather than the underlying mathematical theory. In section 3 we give a detailed description of the databases used and the properties of the studied variables. In section 4 we describe the methodology followed and the preliminary results obtained in the experimentation phase. We hope that this section will help the reader understand the choice of predictive variables and the selected format of the data. In section 5 we present the specific results for each studied station, and in section 6 we infer more general conclusions at the city level. Finally, in section 7 we discuss an application of the results obtained, and in section 8 we propose future lines of work on these topics.


Contents

1 Introduction
2 CART
2.1 Classification Trees
2.2 Handling of Missing Data
2.3 Hierarchy of Variable Importance
2.4 Node numbering
3 Data
3.1 Radiosonde data
4 Methodology
4.1 Methodology Justification
5 Results
5.1 Daily Average PM10 CART for station E2 - “Carvajal”
5.2 Daily Average PM10 CART for station E5 - “Centro de Alto Rendimiento”
5.3 Daily Average PM10 CART for station E7 - “Fontibón”
5.4 Daily Average PM10 CART for station E9 - “Kennedy”
5.5 Daily Average PM10 CART for station E10 - “Ferias”
5.6 Daily Average PM10 CART for station E13 - “Puente Aranda”
5.7 Daily Average PM10 CART for station E15 - “Suba”
5.8 Daily Average PM10 CART for station E18 - “Usaquén”
6 Conclusions
7 An application
8 Future Work
8.1 Maximum daily PM10
8.2 Regression Trees

List of Figures

1 Air quality monitoring network of Bogotá.
2 Surrogate variables.
3 PM10 average day profile, 2009-2011. (Station E2)
4 Histograms for the atmospheric stability variables.
5 First experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011.
6 Second experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011.
7 Third experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011.
8 Fourth experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011. Same as figure 7 but with Sundays and holidays excluded.
9 Mean PM10 histogram, station E2 - “Carvajal”.
10 Mean PM10 classification tree for station E2.
11 Mean PM10 histogram, station E5 - “Centro de Alto Rendimiento”.
12 Mean PM10 classification tree for station E5.
13 Mean.WndSp and Max.WndSp boxplots, station E7.
14 Mean PM10 histogram, station E7 - “Fontibón”.
15 Mean PM10 classification tree for station E7.
16 Mean PM10 histogram, station E9 - “Kennedy”.
17 Mean PM10 classification tree for station E9.
18 Mean PM10 histogram, station E10 - “Ferias”.
19 Mean PM10 classification tree for station E10.
20 Mean PM10 histogram, station E13 - “Puente Aranda”.
21 Mean PM10 classification tree for station E13.
22 Mean PM10 histogram, station E15 - “Suba”.
23 Mean PM10 classification tree for station E15.
24 Mean PM10 histogram, station E18 - “Usaquén”.
25 Mean PM10 classification tree for station E18.
26 Max PM10 histogram, station E7 - “Fontibón”.
27 Max. PM10 classification tree for station E7.
28 Mean PM10 regression tree for station E2.

List of Tables

1 BAQMN air quality monitoring stations.
2 Pollutants and meteorological variables measured in BAQMN.
3 Studied variables in both BAQMN and radiosonde databases from 2009 to 2011, for mean PM10.
4 1/7th-percentiles used in the mean PM10 category definitions for each studied station from 2009 to 2011.
5 Variable importance, Average PM10 CART, station E2 - “Carvajal”.
6 Variable importance, Average PM10 CART, station E5 - “Centro de Alto Rendimiento”.
7 Variable importance, Average PM10 CART, station E7 - “Fontibón”.
8 Variable importance, Average PM10 CART, station E9 - “Kennedy”.
9 Variable importance, Average PM10 CART, station E10 - “Ferias”.
10 Variable importance, Average PM10 CART, station E13 - “Puente Aranda”.
11 Variable importance, Average PM10 CART, station E15 - “Suba”.
12 Variable importance, Average PM10 CART, station E18 - “Usaquén”.
13 Studied variables in both BAQMN and radiosonde databases from 2009 to 2011, for daily maximum PM10.
14 1/7th-percentiles used in the max. PM10 category definitions for E7 - “Fontibón” from 2009 to 2011.

1 Introduction

Particulate matter (PM) stands out as one of the most important pollutants in cities for two reasons. Firstly, PM is generated primarily by transportation, industry and power generation, three sectors prominent in cities. Secondly, it induces serious health problems in the exposed population: several epidemiological studies have found an association between PM concentrations in urban air and short-term cardiopulmonary effects [2].

Bogotá's air quality monitoring network (BAQMN), see figure 1, has operated for over ten years, collecting hourly measurements of several meteorological and environmental variables. The measured variables are temperature, precipitation, wind speed and direction, concentrations of PM with aerodynamic diameter less than 10 micrometers (PM10), concentrations of PM with aerodynamic diameter less than 2.5 micrometers (PM2.5), NO2, SO2, CO and others. In addition, there is a radio probe located at El Dorado international airport which measures the atmosphere's temperature profile every day at 7:00 a.m.

The purpose of this work is to identify the key environmental variables that best explain the observed variations in pollutant concentrations, and to identify causal relations between them, in order to understand air quality dynamics in the city.

Statistical methods have proven successful in the task of developing air quality models. Since the relation between variables involved in air quality problems is often not linear, researchers have shifted from conventional regression methods to new statistical tools. One of the most attractive approaches nowadays is the development of artificial neural networks (ANN). Antanasijević, Pocajt, Povrenović, Ristić, and Perić-Grujić [5] developed an ANN model for forecasting annual PM10 using economic parameters (related to emission inventories). The model produces better results than traditional multi-linear regression and principal component analysis. Although this methodology accomplishes its objective of forecasting, we consider that it falls short of providing insight into air pollution dynamics, which is the main purpose of this work.

The paper by Sotomayor-Olmedo et al. [8] is a fine example of two other methods commonly used in forecasting settings, namely Support Vector Machines (SVM) and kernel functions. They obtained good accuracy in modelling concentrations of O3, NO2 and PM10 in Mexico City.

Another popular approach is the family of Classification and Regression Trees (CART) methods, first introduced by Breiman, Friedman, Olshen and Stone in 1984 [1]. CART are pattern recognition methods for constructing prediction models from data. The models are obtained by recursively partitioning a learning sample via binary questions on the prediction parameters, so the result can be interpreted graphically as a binary tree that fits a simple prediction rule on its terminal nodes. CART methods have been applied in a diverse range of settings, from medical diagnosis and prognosis to air quality. For example, F. Bruno, D. Cocchi and C. Trivisano use this approach to forecast daily exceedances of ozone standards in Italy [3].

We are going to use CART methods mainly for three reasons, listed in order of importance:

1. The outcome of CART is a binary tree model that provides a graphical map of the relation between the parameters. Since we are interested in understanding the dynamics of PM10 pollution, this output is an ideal result.

2. CART is able to rank the input variables by their pertinence in effectively classifying the output variable. When applied to air pollution, we believe this to be a very important result on its own, since it allows us to determine the critical variables that impact air quality at a particular location. This characteristic allows us to deduce important conclusions from simple observations of data.

3. CART algorithms are already implemented in the statistical software R. The library rpart includes several methods that simplify the calculations and provide the user with graphical outputs of the models.

2 CART

We explain the main concepts behind the method. The objective of this section is to enable the reader to interpret the results of this thesis. For a detailed exposition of this subject see [1, Chapter 3], [7, Chapter 14] or [6, Chapter 8].

2.1 Classification Trees

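Consider a learning data sample of $n$ observations of a variable $Y$ and predictor variables $\vec{X} = (X_1, \dots, X_r)$ in $\mathcal{X}$, an $r$-dimensional measurable space. $Y$ takes values in a set $\mathcal{C} = \{1, 2, \dots, C\}$ of different classes. The model constructs a classification rule $T : \mathcal{X} \to \mathcal{C}$ using the learning sample. The classification rule is a binary tree $T$ grown via binary questions on the sample:

[Diagram: a node $A$ is split into a left child $A_L$, which receives the observations with $X_i \le k$, and a right child $A_R$, which receives the observations with $X_i > k$.]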

The tree growing process follows three general steps:

1. A single variable which “best” splits the data is found.

2. The data is separated and the same process is applied separately to each node.

3. The process continues until the nodes reach a previously defined minimal size, or until no improvement can be made.

The best split is defined to be the question that achieves the best possible segregation of the classes. The goodness of a split is evaluated in terms of a node impurity function $f$ that attains its minimum when all the objects in the node belong to the same class, and its maximum when the proportion of each class is equal within a node. If $f$ is an impurity function, the impurity of the node $A$ is defined as

$$I(A) = \sum_{i=1}^{C} f(p_{i,A}) \tag{1}$$

where $p_{i,A}$ is the proportion of the class $i$ in the node $A$. Since we would like to have $I(A) = 0$ when $A$ is pure, $f$ must be concave with $f(0) = f(1) = 0$. Two options for the impurity function are:

1) the information index $f(p) = -p\log(p)$.

2) the Gini index $f(p) = p(1-p)$.

In these terms the best split will be the one with maximal impurity reduction

$$\Delta I = p(A)\,I(A) - p(A_L)\,I(A_L) - p(A_R)\,I(A_R) \tag{2}$$
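As an illustration, equations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch and not the rpart implementation used in this thesis: the function names are hypothetical, the toy data are made up, $p(\cdot)$ is estimated as the share of the learning sample that falls in a node, and candidate thresholds are taken as midpoints between consecutive observed values, which is an assumption.

```python
import math
from collections import Counter


def gini(p):
    """Gini index f(p) = p(1 - p)."""
    return p * (1.0 - p)


def information(p):
    """Information index f(p) = -p log(p), with f(0) = 0 by convention."""
    return -p * math.log(p) if p > 0 else 0.0


def node_impurity(labels, f=gini):
    """I(A) = sum over classes of f(p_{i,A}); absent classes contribute f(0) = 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(f(count / n) for count in Counter(labels).values())


def impurity_reduction(parent, left, right, n_total, f=gini):
    """Delta I = p(A) I(A) - p(A_L) I(A_L) - p(A_R) I(A_R), as in equation (2)."""
    def weight(node):
        return len(node) / n_total  # p(.) estimated as a share of the sample
    return (weight(parent) * node_impurity(parent, f)
            - weight(left) * node_impurity(left, f)
            - weight(right) * node_impurity(right, f))


def best_split(x, y, f=gini):
    """Try every midpoint k between consecutive values of one numeric predictor
    and return the pair (k, Delta I) with the largest impurity reduction."""
    best_k, best_gain = None, 0.0
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        k = (low + high) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= k]
        right = [yi for xi, yi in zip(x, y) if xi > k]
        gain = impurity_reduction(y, left, right, len(y), f)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


# Made-up toy sample: early-morning rain (mm) and a two-class label.
rain = [0.0, 0.0, 0.1, 0.4, 1.2, 2.0]
label = ["A", "A", "A", "G", "G", "G"]
k, gain = best_split(rain, label)
print(k, gain)
```

In this toy sample one threshold separates the two classes perfectly, so both children are pure and the impurity reduction equals the impurity of the root node.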

Notice that the splitting rule is fundamentally local, so it is possible to partition the learning sample until every leaf node in the tree $T$ contains a single datum. This leads to oversized trees that are unlikely to generalize. In order to build a right-sized tree one needs to define a pruning rule. The CART pruning rule is based on a cost-complexity measure. As stated in [1], let $R^*(T)$ be the true overall misclassification rate, that is, the probability of misclassifying a new sample given the learning sample. $R^*(T)$ is estimated using the resubstitution estimate. If the learning sample is $\{(\vec{X}_j, Y_j)\}_{j=1}^{n}$ and $\mathbf{1}(\cdot)$ is the indicator function, defined to be one if the statement inside the parentheses is true and zero otherwise, then

$$R(T) := \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\big(T(\vec{X}_j) \neq Y_j\big), \tag{3}$$

where $T(\vec{X}_j)$ is the class assigned by the tree $T$ to the predictor vector $\vec{X}_j$ with true class $Y_j$. The main problem of $R(T)$ is that it is calculated using the same data used to construct $T$, instead of an independent sample. In comparison to a regression model, the number of nodes, denoted by $|T|$, is analogous to the model degrees of freedom and $R(T)$ to the residual sum of squares.

The CART cost-complexity measure is defined as

$$R_\alpha(T) = R(T) + \alpha|T|, \qquad \alpha \ge 0 \tag{4}$$

where $\alpha$ is called the complexity parameter. It turns out to be possible to determine the smallest subtree of the complete model for which $R_\alpha(T)$ is minimized; this tree is named $T_\alpha$. Moreover, if $\alpha > \beta$ then $T_\alpha$ is a subtree of $T_\beta$. Clearly $T_0 = T$, and $T_\infty$ is the model with no splits. [1] proves that there exists a nested sequence of trees $\{T_0, T_{\alpha_1}, T_{\alpha_2}, \dots, T_\infty\}$ such that each tree is optimal for a range of $\alpha$. Afterwards, an algorithm called cross-validation is executed in order to choose the best value for $\alpha$. The pruned tree is therefore $T_\alpha$, where $\alpha$ is the optimal complexity parameter.
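To make the trade-off in equation (4) concrete, the following Python sketch evaluates $R_\alpha(T)$ over a small nested sequence of subtrees and picks the minimiser for several values of $\alpha$. The sequence, its error rates and its sizes are hypothetical numbers chosen only for illustration, and the choice of $\alpha$ by cross-validation is not shown.

```python
def cost_complexity(resub_error, size, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, as in equation (4)."""
    return resub_error + alpha * size


def select_subtree(candidates, alpha):
    """Return the candidate minimising R_alpha; ties go to the smaller tree.

    candidates: list of (name, R(T), |T|) for a nested sequence of subtrees.
    """
    return min(candidates,
               key=lambda c: (cost_complexity(c[1], c[2], alpha), c[2]))


# Hypothetical nested sequence, from the complete tree down to the root alone.
subtrees = [("T0", 0.10, 9), ("T1", 0.14, 5), ("T2", 0.22, 3), ("root", 0.60, 1)]

for alpha in (0.0, 0.02, 0.05, 0.5):
    name, error, size = select_subtree(subtrees, alpha)
    print(alpha, name, round(cost_complexity(error, size, alpha), 3))
```

With $\alpha = 0$ the complete tree wins because only the resubstitution error counts; as $\alpha$ grows, progressively smaller subtrees are selected, until only the root remains.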

2.2 Handling of Missing Data

Missing values are one of the problems of statistical models. Most procedures deal with them by ignoring them, but rpart is a little more ambitious. Any observation with values for the dependent variable and at least one independent variable will participate in the modeling. The quantity to be maximized is still equation (2). The leading term is the same for all variables and splits irrespective of missing data, but the two right terms are modified. Firstly, the impurity indices $I(A_R)$ and $I(A_L)$ are calculated only over the observations which are not missing a particular predictor. Secondly, the two probabilities $p(A_L)$ and $p(A_R)$ are also calculated only over the relevant observations, but they are then adjusted so that they sum to $p(A)$. This implies some extra accounting and computational cost as the tree is built, but ensures that the terminal node probabilities sum to 1.

Once a splitting variable and a split point for it have been decided, the following step is to estimate the missing datum using the other independent variables; rpart uses a variation of this to define surrogate variables. We explain this by way of an example.

As shown in figure 2, the chosen split for node number 1 was Accum.Prcip.Bef.Max < 0.05, with 10 missing values. Instead of ignoring them, rpart estimates the outcome of the missing datum using other variables, in this case Mean.Prcip, Max.Prcip and Accum.Prcip, which makes sense since these variables are probably correlated.

2.3 Hierarchy of Variable Importance

The measure of variable importance used is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness for all splits in which it was a surrogate. In the printout these are scaled to sum to 100 and the rounded values are shown, omitting any variable whose proportion is less than 1%.
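The scaling step described above can be sketched as follows. This is only an illustration in Python and not rpart's own code: the function name and the raw goodness-of-split totals are hypothetical.

```python
def scale_importance(raw, min_percent=1.0):
    """Scale raw importance totals to percentages that sum to 100, drop
    variables below min_percent, and round the remaining values."""
    total = sum(raw.values())
    percent = {name: 100.0 * value / total for name, value in raw.items()}
    shown = {name: round(p) for name, p in percent.items() if p >= min_percent}
    return dict(sorted(shown.items(), key=lambda item: item[1], reverse=True))


# Hypothetical raw totals (primary splits plus surrogate splits per variable).
raw = {"Day.Type": 12.0, "Accum.Prcip.Bef.Max": 7.2, "Mean.WndSp": 4.7, "Mean.T": 0.1}
print(scale_importance(raw))
```

In this made-up example Mean.T accounts for less than 1% of the total and is therefore omitted from the printout.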

2.4 Node numbering

The way we number the nodes on a binary tree is straightforward. The root node is always node number 1. For node number $n$, the left child node is number $2n$ and the right child node is number $2n + 1$. For example (each level of indentation is one level of the tree):

1
  2
    4
    5
      10
      11
  3
    6
    7
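The numbering rule can be checked with a short Python sketch. The helper names are hypothetical and are not part of rpart; they only restate the rule that node $n$ has children $2n$ and $2n + 1$.

```python
def children(n):
    """Left and right child of node n."""
    return 2 * n, 2 * n + 1


def parent(n):
    """Parent of node n (the root, node 1, has no parent)."""
    if n <= 1:
        raise ValueError("the root node has no parent")
    return n // 2


def depth(n):
    """Number of splits between the root and node n."""
    return n.bit_length() - 1


def path_from_root(n):
    """Node numbers visited when walking from the root down to node n."""
    path = []
    while n >= 1:
        path.append(n)
        n //= 2
    return path[::-1]


print(children(5))         # (10, 11)
print(path_from_root(11))  # [1, 2, 5, 11]
```

Reading a node number this way tells us where it sits in the tree: node 11, for instance, is reached from the root through nodes 2 and 5.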

3 Data

As mentioned above, the data used for this work come from the BAQMN and the airport's radio probe databases over a three-year period: 2009, 2010 and 2011. The original database comprises 18 stations. The stations and the measured variables are listed in tables 1 and 2. Each station contains more than ten years of hourly measurements of the variables listed in table 2 and some others. We concentrate our study on eight of these stations, namely E2, E5, E7, E9, E10, E13, E15 and E18, since they represent the general results of the other stations, have the most complete data, and are uniformly distributed across the city among the available stations (see figure 1).

We focused on PM10 as our outcome variable for several reasons. First, because of its relevance in urban areas, as mentioned in the first paragraph of the introduction. Secondly, this pollutant has the technical advantage of being the most complete variable in the BAQMN database. Initially, we considered one output variable: Mean.PM10, which is the daily average of PM10 at a particular station. A second analysis repeats the same process for Max.PM10, i.e. the daily maximum PM10.

In order to understand our selection of input variables, we need to make some remarks about the daily behaviour of PM10 in the city of Bogotá. PM10 concentration is directly influenced by traffic dynamics. Specifically in Bogotá, 65% of atmospheric pollutants come from vehicular emissions [4]. As shown in figure 3, the effect of heavy traffic early in the morning is that the daily maximum is almost always attained between 7:00 and 8:00 a.m. Afterwards, between 11:00 and 19:00, PM10 concentrations are almost constant, followed by a second, less significant local maximum between 20:00 and 22:00. The first peak is significantly higher than the second one because the surface stable layer present in the morning grows thicker with time, allowing a better mixture of pollutants later in the day. Therefore, daily mean and maximum PM10 are indicators of the concentrations to which people are exposed on a typical day: the first one indicates the scale of concentrations to which people are exposed for a longer time (11:00 to 19:00), and the second one indicates the high-risk concentrations attained in the morning (7:00 to 8:00).

Figure 3: PM10 average day profile, 2009-2011 (station E2). Axes: hour of the day versus PM10 (µg/m3).

ID   BAQMN Station                ID    BAQMN Station
E1   Cade Energía                 E10   Las Ferias
E2   Carvajal - Sevillana         E11   Ministerio de Ambiente
E3   Cazucá                       E12   Olaya
E4   Central de Mezclas           E13   Puente Aranda
E5   Centro de Alto Rendimiento   E14   San Cristóbal
E6   Chicó Lago (Sto. Tomás)      E15   Suba
E7   Fontibón                     E16   Tunal
E8   Guaymaral                    E17   Universidad Nacional
E9   Kennedy                      E18   Usaquén

Table 1: BAQMN air quality monitoring stations.

Pollutants   Meteorological
PM10         Relative humidity
PM2.5        Precipitation
NOx          Wind direction
SO2          Wind speed
O3           Barometric pressure

Table 2: Pollutants and meteorological variables measured in BAQMN.

ID    Variable              Type      Description
Y1    Mean.PM10             Class     Seven categories defined by the 1/7-th percentiles: A, B, C, D, E, F, G, where A lies above the 85.71% percentile and G below the 14.28% percentile.
X1    Mean.WndSp            Numeric   Daily mean wind speed. (m/s)
X2    Mean.WndDir           Numeric   Angle of the mean wind direction. (Degrees)
X3    Mean.WndDirCat        Class     Mean wind direction classified into 8 categories: N, NE, E, SE, S, SW, W and NW.
X4    Mean.T                Numeric   Daily mean temperature. (°C)
X5    Mean.Prcip            Numeric   Daily mean precipitation. (mm)
X6    Accum.Prcip           Numeric   Daily accumulated precipitation. (mm)
X7    Accum.Prcip.Bef.Max   Numeric   Accumulated precipitation until the time of maximum PM10. (mm)
X8    Day.Type              Class     Four types of days: weekday, holiday, Saturday and Sunday.
X9    Mean.PM10.1DB         Numeric   Mean PM10 one day before. (µg/m3)
X10   Max.PM10.1DB          Numeric   Max. PM10 one day before. (µg/m3)
X11   PWAT                  Numeric   Precipitable water. (mm)
X12   dH.Inv.Sup            Numeric   Surface inversion thickness. (m)
X13   dH.Est.Sup            Numeric   Stable surface layer thickness. (m)
X14   PBLH                  Numeric   Planetary boundary layer height. (m)

Table 3: Studied variables in both BAQMN and radiosonde databases from 2009 to 2011, for mean PM10.
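As an illustration of how some of the daily variables in this table relate to the hourly measurements, the following Python sketch derives them for a single day. It is not the R script used in this thesis: the function name is hypothetical, the day is assumed to have no missing hours, and the precipitation "until the time of maximum PM10" is assumed to cover the hours strictly before the hour of the maximum.

```python
def daily_features(pm10, precip):
    """Summarise one day of hourly PM10 (µg/m3) and precipitation (mm).

    Both lists are aligned by hour and assumed to be complete.
    """
    hour_of_max = max(range(len(pm10)), key=lambda h: pm10[h])
    return {
        "Mean.PM10": sum(pm10) / len(pm10),
        "Max.PM10": pm10[hour_of_max],
        "Mean.Prcip": sum(precip) / len(precip),
        "Accum.Prcip": sum(precip),
        # Assumption: rain in the hours strictly before the PM10 maximum.
        "Accum.Prcip.Bef.Max": sum(precip[:hour_of_max]),
    }


# Made-up four-hour excerpt, only to show the bookkeeping.
features = daily_features([20.0, 80.0, 150.0, 90.0], [0.0, 0.2, 0.0, 1.0])
print(features)
```

The variables with the suffix .1DB would then be obtained by shifting the corresponding daily series by one day.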


Station   14.28%   28.57%   42.85%   57.14%   71.43%   85.71%
E2        73.26    83.08    91.50    99.83    109.20   119.44
E5        21.32    30.31    36.92    44.86    51.98    61.23
E7        41.62    48.72    55.80    63.12    69.65    77.93
E9        63.00    71.50    81.33    89.94    100.05   113.26
E10       29.68    35.42    41.59    47.25    53.81    62.27
E13       37.96    48.35    56.26    63.46    72.63    83.78
E15       45.59    50.83    55.12    59.50    64.65    74.81
E18       31.83    42.12    50.36    58.00    67.96    80.54

Table 4: 1/7th-percentiles used in the mean PM10 category definitions for each studied station from 2009 to 2011. Values in µg/m3.
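The category definition can be illustrated with a short Python sketch that assigns a daily mean PM10 value to one of the seven classes using the E2 thresholds from table 4. The function name is hypothetical, and the treatment of a value that falls exactly on a threshold (here it goes to the more polluted class) is an assumption.

```python
from bisect import bisect_right

# Thresholds for station E2 from table 4 (µg/m3).
E2_THRESHOLDS = [73.26, 83.08, 91.50, 99.83, 109.20, 119.44]
CATEGORIES = "GFEDCBA"  # from the cleanest class (G) to the most polluted (A)


def pm10_category(mean_pm10, thresholds=E2_THRESHOLDS):
    """Return the class A..G of a daily mean PM10 value."""
    return CATEGORIES[bisect_right(thresholds, mean_pm10)]


print(pm10_category(60.0))   # below the first threshold
print(pm10_category(108.0))  # between the 57.14% and 71.43% thresholds
print(pm10_category(125.0))  # above the last threshold
```

Because the thresholds are station-specific, the same concentration can fall into different classes at different stations.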

3.1 Radiosonde data

The airport's radio probe measures temperature, relative humidity and wind velocity at different heights (pressure levels). Measurements are taken every day at 7:00 a.m., which is close to the time of maximum PM10 concentration in the city. Therefore, we may use these variables as predictors for Max.PM10. Several variables related to atmospheric stability can be constructed from the data; we constructed four:

• PWAT: The total precipitable water is the water contained in a column of unit cross section extending all the way from the earth's surface to the “top” of the atmosphere. Mathematically, if $f(p)$ is the mixing ratio at the pressure level $p$, then

$$\mathrm{PWAT} = \frac{1}{\rho g} \int_{0}^{p_0} f(p)\,dp,$$

where $p_0$ is the atmospheric pressure at the surface, $\rho$ represents the density of water and $g$ is the acceleration due to gravity.

• PBLH: It was defined as the height of the first non-superficial thermal inversion, that is, the height of the first layer not adjacent to the surface with $-dT/dz < 0$.

• dH.Inv.Sup: A surface layer was defined as a layer adjacent to the surface such that the temperature gradient $-dT/dz$ is negative, that is, a superficial layer in which the temperature increases with height. In principle, there may be several adjacent layers that satisfy this condition. dH.Inv.Sup was defined as the sum of the heights of these adjacent layers.

• dH.Est.Sup: A stable atmospheric layer was defined as every surface-adjacent layer for which $-dT/dz < 4$ K/km. Therefore, every surface layer lies inside a stable layer. Recall that atmospheric stability can be estimated by comparing the temperature gradient with the dry adiabatic rate $\Gamma_d = 9.8$ K/km. Furthermore, the wet adiabatic rate for Bogotá is $\Gamma_w = 4.5$ K/km. The value 4 K/km was chosen so that the stable layer would be stable with respect to both $\Gamma_d$ and $\Gamma_w$.
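The definitions above can be sketched in Python for a single sounding. This is an illustrative sketch under stated assumptions and not the procedure used to build the thesis database: the function names, the constants and the toy sounding are hypothetical, the integral is approximated with the trapezoidal rule over the reported pressure levels, and layers are taken to be the intervals between consecutive measurement heights.

```python
RHO_WATER = 1000.0  # density of water, kg/m^3
GRAVITY = 9.81      # acceleration due to gravity, m/s^2


def pwat_mm(pressure_pa, mixing_ratio):
    """Trapezoidal estimate of (1 / (rho g)) * integral of f(p) dp, in mm.

    pressure_pa: pressure levels in Pa, ordered from the surface upwards.
    mixing_ratio: water vapour mixing ratio (kg/kg) at each level.
    """
    total = 0.0
    for i in range(len(pressure_pa) - 1):
        dp = abs(pressure_pa[i] - pressure_pa[i + 1])
        total += 0.5 * (mixing_ratio[i] + mixing_ratio[i + 1]) * dp
    return 1000.0 * total / (RHO_WATER * GRAVITY)  # metres to millimetres


def surface_layer_thickness(height_m, temperature, max_lapse_k_per_km):
    """Add up the thickness of consecutive layers, starting at the surface,
    whose lapse rate -dT/dz stays below the given threshold (K/km)."""
    thickness = 0.0
    for i in range(len(height_m) - 1):
        dz = height_m[i + 1] - height_m[i]
        lapse = -(temperature[i + 1] - temperature[i]) / dz * 1000.0
        if lapse >= max_lapse_k_per_km:
            break
        thickness += dz
    return thickness


# Made-up sounding: heights (m) and temperatures (degrees C) from the surface up.
heights = [2550.0, 2650.0, 2800.0, 3100.0, 3600.0]
temps = [10.0, 10.5, 10.4, 9.5, 5.0]

dh_inv_sup = surface_layer_thickness(heights, temps, 0.0)  # inversion: -dT/dz < 0
dh_est_sup = surface_layer_thickness(heights, temps, 4.0)  # stable: -dT/dz < 4 K/km
print(dh_inv_sup, dh_est_sup)
print(pwat_mm([75000.0, 50000.0, 25000.0], [0.01, 0.01, 0.01]))
```

In the toy sounding the surface inversion is contained in the stable layer, in agreement with the remark that every surface layer lies inside a stable layer.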

When studying the results, it is important to take into account that this information may not be accurate for stations located far away from the airport. Let $C$ denote some pollutant's concentration. Since the higher the boundary layer, the more difficult it is for the pollutants to escape to the free atmosphere, PBLH is expected to be proportional to the pollutant's concentration: PBLH ∝ $C$. The concentration is $C = m/(\mathrm{dH.Inv.Sup} \times A)$; therefore, as the surface layer gets higher, concentrations get smaller: dH.Inv.Sup ∝ $1/C$. The same behaviour is expected from dH.Est.Sup. Finally, if the total precipitable water is small, there is little water vapour in the atmosphere. Since water vapour absorbs radiation, one would expect that low PWAT implies greater cooling in superficial layers and therefore that low PWAT implies a greater [...]

Figure 4: Histograms for the atmospheric stability variables. Panels (value versus frequency): PWAT (mm), PBLH (m), dH.Est.Sup (m), dH.Est.Sup (m).

4 Methodology

The methodology of this thesis consists of two parts. Part 1 concerns the local analysis of each one of the selected stations. We construct the classification tree for each station, identify local relations between predictor variables and summarize the fundamental conclusions. In part 2, we identify similarities and differences between the stations in order to extract general conclusions for the PM10 dynamics of the whole city.

4.1 Methodology Justification

The methodology of this thesis is a result of a previous step in which we explored different approaches to extract the information from the data. In this section we explain the process that led to our final R script and also discuss some interesting preliminary results.

First of all, we focused on a single station. We chose station E2 because it is the one with the least missing data in the BAQMN database for the period 2009 to 2011. In this first test we did not use the data from the airport's radio probe.

We classified mean PM10 into 7 categories (1/7-quantiles) and chose, out of curiosity, Mean.WndSp, Max.WndSp, Mean.WndDirCat, Mean.T, Accum.Prcip.Bef.Max and Day.Type as predictive variables. Initially, we classified Mean.WndDirCat into 36 categories, which turned out to be a mistake since the resulting classification trees were unreadable and the level of precision was unnecessary. Therefore, we classified it instead into the eight principal cardinal directions. The result was figure 5.
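The coarser classification of wind direction can be sketched in Python as follows. The function name is hypothetical, and two conventions are assumed here rather than taken from the thesis: angles are measured clockwise from north, and each of the eight sectors spans 45 degrees centred on its cardinal direction.

```python
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def wind_direction_category(angle_deg):
    """Map an angle in degrees (0 = north, clockwise) to one of 8 sectors."""
    index = int(((angle_deg % 360.0) + 22.5) // 45.0) % 8
    return SECTORS[index]


print(wind_direction_category(178.0))  # S
print(wind_direction_category(350.0))  # N
```

With 36 categories each sector would span only 10 degrees, which is the level of detail that made the first trees hard to read.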


Day.Type (50%), Accum.Prcip.Bef.Max (21%), Mean.WndSp (13%) and Mean.WndDirCat (6%) turned out to be the decisive variables. This first tree has some very interesting results:

• As expected, Sundays and holidays fall into the lowest category irrespective of meteorological conditions. This is likely due to the reduced traffic emissions.

• In node 2, the days with no rain before the time of maximum concentration are classified in the highest category, in contrast with days with rain before the time of maximum concentration, which are classified as E.

• In node 4, days with mean wind speed lower than 2.8 m/s are classified as A.

For experiment number 2, we decided to include data from previous days in the analysis to see if they had any predictive value; specifically, we introduced the variables Max.PM10.1DB and Accum.Prcip.1DB. The result was figure 6.

Figure 6: Second experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011.

Note that figure 6 has the same tree structure as figure 5, the main difference being that Mean.WndSp [...] previous day are relevant. We affirm that this new tree (figure 6) is better than the first one (figure 5), since it contains all the predictive variables from the first one.

For the third experiment, we decided to include all the predictive variables at our disposal. In addition to the ones already mentioned, we included PWAT, dH.Inv.Sup, dH.Est.Sup and PBLH from the airport's radio probe, and also Mean.PM10.1DB and Accum.Prcip.Bef.Max.1DB from the BAQMN database for station E2. The result is shown in figure 7.

Figure 7: Third experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011.

Although there are several changes, the main structure of the tree is preserved. Comparing figure 7 with figures 5 and 6, the main structural difference is that Mean.PM10.1DB replaces Accum.Prcip.Bef.Max in node number 2. The sub-tree with root node 5 in figure 7 has the same shape as the sub-tree with root node 2 in figures 5 and 6.

18.36% of the days in these three years are Sundays or holidays, exactly 201 days out of 1095. Since all of the experiments so far have classified these types of days in class G on the first split, we decided to exclude them in the fourth experiment.

Figure 8: Fourth experiment. CART for E2 - “Carvajal” for data in the period 2009 to 2011. Same as figure 7 but with Sundays and holidays excluded.

Removing Sundays and holidays from the tree construction contributes new information to the analysis. Two new splits arise in contrast with experiment number three. The first one is in node number 11: Saturdays are noticeably cleaner than weekdays, under certain conditions.

5 Results

In this section we start with part 1 of the methodology, that is the local analysis of each station.

5.1 Daily Average PM10 CART for station E2 - “Carvajal”.

Variable              Importance %
Accum.Prcip.Bef.Max   22
Mean.PM10.1DB         14
Mean.WndDir           10
Mean.WndDirCat        9
Max.PM10.1DB          6
Max.T                 6

Table 5: Variable importance, Average PM10 CART, station E2 - “Carvajal”.

Figure 9: Mean PM10 histogram, station E2 - “Carvajal”. Axes: mean PM10 (µg/m3) versus frequency.

Looking at node 1, we can infer our first local conclusion:

LC1) The probability of having a polluted day given that it rained before the time of maximum PM10

(that is, before 7:00 a.m) is less than 0.3. In general, previous day pollution is directly proportional to present day pollution.

Going down to the left to node number 2, observe in figure 9 that 108 µg/m3 is close to the 75th percentile for this station. In fact, the 5/7-percentile for this station is 109.2 µg/m3. This leads to our second local conclusion:

LC2) The probability of having a clean day (that is being classified as E, F or G) given that the previous day was classified B or higher and it didn’t rain before 7:00 a.m, is close to 10%. In general, rain before the time of maximum pollution reduces daily average PM10 concentrations.
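The probabilities quoted in LC1 and LC2 can be reproduced by adding up class proportions printed in the nodes of the E2 tree. The Python sketch below does this arithmetic. Two points are assumptions: the association of each printed distribution with its node is our reading of the tree's printout (the node covering 28% of the days for rain before the maximum, and the node covering 23% of the days for dry days after a polluted day), and "polluted" is taken to mean classes A, B or C, which the text does not define explicitly.

```python
CLASSES = "ABCDEFG"


def class_probability(distribution, classes):
    """Sum the node proportions of the requested classes."""
    return sum(p for label, p in zip(CLASSES, distribution) if label in classes)


# Class proportions (A..G) as printed in the E2 tree.
rain_before_max = [0.02, 0.05, 0.08, 0.10, 0.17, 0.24, 0.34]
dry_after_polluted_day = [0.38, 0.25, 0.15, 0.12, 0.05, 0.03, 0.03]

p_polluted_given_rain = class_probability(rain_before_max, "ABC")
p_clean_given_dry_polluted = class_probability(dry_after_polluted_day, "EFG")
print(round(p_polluted_given_rain, 2), round(p_clean_given_dry_polluted, 2))
```

Under these assumptions the first value is 0.15, below the bound of 0.3 stated in LC1, and the second is 0.11, close to the 10% stated in LC2.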

In order to understand the results from node 3, it is important to recall that station E2 is located on the center-west side of the city (see figure 1). Given the shape of the city, it makes sense that wind currents that come from the west or east are cleaner than those that come from the south or the north.

LC3) Wind currents from the East and Northwest are cleaner than the others at this station.

For nodes 5 and 11, notice that dH.Est.Sup = 168 m is a very high value and PBLH = 336 m is an

extremely low value (see figure 4). For very low values of PBLH the proportional relation explained in

section 3.1 is no longer valid, and this variable becomes inversely proportional to PM10 concentrations.

Therefore the results make sense, at least for node 11.

Another result (from node 23) that will be discussed in station E18 is the following:

LC4) Max.T is directly proportional to PM10 mean concentration.

Figure 10: Mean PM10 classification tree for station E2. Splits shown in the tree: Accum.Prcip.Bef.Max < 0.05; Mean.PM10.1DB >= 108; dH.Est.Sup >= 168; PBLH < 336; Max.T >= 20; Mean.WndDirCat = N, NE, S, SE, SW, W.

5.2 Daily Average PM10 CART for station E5 - “Centro de Alto Rendimiento”.

Variable         Importance %
Mean.WndDir      25
Mean.WndDirCat   22
Mean.PM10.1DB    16
Max.PM10.1DB     10
Mean.WndSp       6
dH.Est.Sup       6

Table 6: Variable importance, Average PM10 CART, station E5 - “Centro de Alto Rendimiento”.

Figure 11: Mean PM10 histogram, station E5 - “Centro de Alto Rendimiento”. Axes: mean PM10 (µg/m3) versus frequency.

Observe in table 6 that wind turns out to be one of the most relevant variables. Given the way in which variable importance is calculated (see section 2.3), it makes sense that two similar variables, such as Mean.WndDir and Mean.WndDirCat, are both ranked very high. Nodes 2 and 3 are another occurrence of LC1, observed in the previous tree, that is, previous day pollution is directly proportional to present day pollution. In node 5, one can observe another occurrence of LC2 from tree E2: rain before the time of maximum pollution reduces daily average PM10 concentrations.

Interestingly, PWAT appears as a split in node 4 even though it is not ranked among the top six important variables. Observe in figure 4 that 20 mm is close to the 50th percentile of this variable. Therefore, when total precipitable water is low, there are more thermal inversions, which is consistent with higher pollution:

Figure 12: Mean PM10 classification tree for station E5. Splits shown in the tree: Mean.WndDir >= 178; Mean.PM10.1DB >= 35; PWAT < 20; Mean.PM10.1DB >= 70; Accum.Prcip.Bef.Max < 0.15; Mean.PM10.1DB >= 31; Max.PM10.1DB < 130; dH.Est.Sup >= 252.

5.3 Daily Average PM10 CART for station E7 - “Fontibón”.

Variable               Importance (%)
Mean.PM10.1DB          21
Max.PM10.1DB           13
dH.Est.Sup             11
PWAT                   9
Accum.Prcip.Bef.Max    6
Mean.WndSp             6

Table 7: Variable importance, daily average PM10 CART, station E7 - “Fontibón”.

Figure 13: Mean.WndSp and Max.WndSp boxplots, station E7 (panels: mean wind speed and max wind speed, both in m/s).

[Figure: mean PM10 histogram (x-axis: mean PM10, µg/m3; y-axis: frequency).]

Observe that the airport's variables dH.Est.Sup and PWAT are very relevant for this station, unlike for other stations. This may be because it is the closest station to the airport among the ones studied here. The split value of 62 µg/m3 is close to the 50th percentile of mean PM10 for this station. As with station E2, a contaminated previous day raises the probability of a contaminated present day.

LC1) Previous day mean concentration of PM10 is directly proportional to present day mean PM10 concentration.

A new result from this tree concerns the splits in nodes 6 and 13. In both cases, the split value is close to the third quartile (see figure 13). In these nodes, low mean wind speeds and low max. wind speeds, respectively, result in cleaner PM10 classes.

LC6) Wind speeds (mean or max.) lower than the 3rd quartile result in cleaner mean PM10 classes.

[Figure: Daily average PM10 classification tree, station E7 - “Fontibón”. Splits: Mean.PM10.1DB >= 62; PWAT < 21; dH.Est.Sup >= 16; Mean.WndSp < 3.3; Accum.Prcip.Bef.Max < 0.15; Max.WndSp >= 6.7.]

5.4 Daily Average PM10 CART for station E9 - “Kennedy”.

Variable          Importance (%)
Mean.PM10.1DB     31
Max.PM10.1DB      19
dH.Est.Sup        10
dH.Inv.Sup        6
Mean.T            6
Max.T             6

Table 8: Variable importance, daily average PM10 CART, station E9 - “Kennedy”.

Figure 16: Mean PM10 histogram, station E9 - “Kennedy” (x-axis: mean PM10, µg/m3; y-axis: frequency).

First of all, recall that this station is located in the central west side of the city (see 1) and is also close to the airport. Just as for station E7, previous day mean and maximum lead the ranking and radiosonde variables appear in the top four: dH.Est.Sup with 10% and dH.Inv.Sup with 6%, the latter replacing PWAT in table 7.

Note that nodes 1, 2, 6, 7 and 28 are occurrences of LC1, while node 13 is an occurrence of LC2. Moreover, a new, expected conclusion can be inferred from nodes 3 and 114:

LC7) dH.Est.Sup and dH.Inv.Sup are inversely proportional to daily mean PM10 concentrations.

On the other hand, observe that in nodes 12 and 57 there is a similar behaviour to LC4. Temperature

[Figure: Daily average PM10 classification tree, station E9 - “Kennedy”. Splits: Mean.PM10.1DB >= 98; Mean.PM10.1DB >= 112; dH.Est.Sup >= 84; Max.PM10.1DB >= 134; Max.T >= 20; Accum.Prcip.Bef.Max < 0.05; Mean.PM10.1DB >= 55; Mean.WndSp < 3.3; Max.PM10.1DB >= 187; Mean.T < 14; dH.Inv.Sup >= 16; Mean.WndDir >= 251.]

5.5 Daily Average PM10 CART for station E10 - “Ferias”.

Variable          Importance (%)
Mean.PM10.1DB     22
Mean.WndDirCat    17
Mean.WndDir       13
Max.PM10.1DB      13
Mean.WndSp        10
Accum.Prcip       6

Table 9: Variable importance, daily average PM10 CART, station E10 - “Ferias”.

Figure 18: Mean PM10 histogram, station E10 - “Ferias” (x-axis: mean PM10, µg/m3; y-axis: frequency).

It is worth noting that the first split of the tree is a question on the mean wind direction. Once again, the phenomenon of cleaner wind currents coming from the eastern mountain region is evident. Although 21% of the days in node 3 are dirty and 21% of the days in node 2 are clean, the segregation is clear.

LC3) Wind currents from the East and Southwest are cleaner than the others at this station.
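The two percentages quoted above can be recovered from the node boxes of the tree plot, which list the class proportions from A to G. The following is a minimal sketch, assuming that "dirty" means classes A to C and "clean" means classes E to G; this grouping and the helper name are assumptions made for illustration, not definitions from the thesis.

```python
CLASSES = "ABCDEFG"

def share(proportions, wanted):
    """Sum the proportions of the classes listed in `wanted`."""
    total = sum(p for c, p in zip(CLASSES, proportions) if c in wanted)
    return round(total, 2)

# Class proportions read from the tree plot for station E10
node_2 = [.25, .22, .19, .13, .12, .07, .02]  # predicted class A, 48% of the days
node_3 = [.04, .07, .10, .16, .16, .21, .25]  # predicted class G, 52% of the days

print(share(node_3, "ABC"))  # dirty days in node 3 → 0.21
print(share(node_2, "EFG"))  # clean days in node 2 → 0.21
```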

Looking at nodes 2, 3 and 6, it is clear that contaminated previous days raise the probability of a contaminated present day.

LC1) Previous day mean concentration of PM10 is directly proportional to present day mean PM10 concentration.

The split in node 13 is very interesting. Note that 18 mm is slightly below the 50th percentile of PWAT (see figure 4). The split in this node states the same result as node 2 in the tree of station E7.

It is worth noticing that the split on node 27 contradicts the behaviour displayed by temperature on trees E2 and E9. Here, lower temperatures result in more polluted days.

[Figure: Daily average PM10 classification tree, station E10 - “Ferias”. Splits: Mean.WndDirCat = N,NW,S,W; Mean.PM10.1DB >= 43; Mean.PM10.1DB >= 32; Mean.PM10.1DB >= 49; PWAT < 18; Mean.T < 14.]

5.6 Daily Average PM10 CART for station E13 - “Puente Aranda”.

Variable          Importance (%)
Mean.WndDirCat    19
Mean.WndDir       18
Mean.PM10.1DB     17
Mean.WndSp        11
Max.PM10.1DB      8
dH.Est.Sup        7

Table 10: Variable importance, daily average PM10 CART, station E13 - “Puente Aranda”.

Figure 20: Mean PM10 histogram, station E13 - “Puente Aranda” (x-axis: mean PM10, µg/m3; y-axis: frequency).

This station shows the same behaviour as the other stations in the east of the city studied so far (E5 and E10): wind variables are clearly decisive and clean wind directions coincide with the position of the mountains (see nodes 1 and 41). In this case, clean wind directions are S, SE and E, as the first split of the tree shows.

This tree has the particular characteristic that almost every type of split seen so far appears again. For example, nodes 2, 3 and 40 are examples of LC1. Nodes 5 and 6 contribute evidence in favour of LC7. Node 21 is yet another example of LC4. Finally, the split in node 20 agrees with LC6.

[Figure: Daily average PM10 classification tree, station E13 - “Puente Aranda”. Splits: Mean.WndDirCat = N,NE,NW,SW,W; Mean.PM10.1DB >= 84; dH.Est.Sup >= 5.5; Mean.WndDir >= 275; PWAT < 20; Mean.PM10.1DB >= 54; Mean.PM10.1DB >= 42; Max.T >= 19; Mean.PM10.1DB >= 53; dH.Est.Sup >= 84.]

5.7 Daily Average PM10 CART for station E15 - “Suba”.

Variable          Importance (%)
Mean.PM10.1DB     35
PWAT              13
Max.PM10.1DB      13
Mean.Prcip        7
Accum.Prcip       7
dH.Inv.Sup        6

Table 11: Variable importance, daily average PM10 CART, station E15 - “Suba”.

Figure 22: Mean PM10 histogram, station E15 - “Suba” (x-axis: mean PM10, µg/m3; y-axis: frequency).

This station is the northernmost of the eight stations studied here. Strangely, PWAT appears as the second most important variable for the construction of the tree. Precipitation also plays an important role in the construction process: not the accumulated precipitation before the time of maximum PM10, as in other cases, but Mean.Prcip and Accum.Prcip.

Two other facts separate this station from the others. The first is that the clean air directions are practically inverted: clean air comes from the range between W and SE. The second is an irregularity in node 6, which contradicts LC7, since it suggests that dH.Inv.Sup is directly proportional to mean PM10 concentration.

The rest of the splits are similar to the ones encountered before. Nodes 1 and 28 contribute evidence in favour of LC1. Nodes 3 and 56 do the same for LC6. Finally, node 14 shows a similar behaviour to LC2.

[Figure: Daily average PM10 classification tree, station E15 - “Suba”. Splits: Mean.PM10.1DB >= 66; PWAT < 20; dH.Inv.Sup >= 30; Mean.WndDirCat = E,N,NE,NW; Mean.Prcip < 0.15; Mean.PM10.1DB >= 51; PWAT < 22.]

5.8 Daily Average PM10 CART for station E18 - “Usaquén”.

Variable          Importance (%)
Mean.WndDir       25
Mean.WndDirCat    18
Mean.PM10.1DB     11
Max.PM10.1DB      11
Max.WndSp         9
Accum.Prcip       6

Table 12: Variable importance, daily average PM10 CART, station E18 - “Usaquén”.

Figure 24: Mean PM10 histogram, station E18 - “Usaquén” (x-axis: mean PM10, µg/m3; y-axis: frequency).

Recall that station E18 is located in the north east of the city, close to the mountain border on the east. The first split of this tree and the split on node 7 confirm previous observations:

LC3) Wind currents from the mountains are cleaner than the currents that come from other directions for this station.

Once again, we observe in nodes 2 and 3 one of the most recurrent conclusions:

LC1) Previous day mean concentration of PM10 is directly proportional to present day mean PM10 concentration.

Furthermore, node 6 confirms conclusion 4 for station E10:

LC6) PWAT is inversely proportional to PM10 mean concentration.

The role of temperature in this tree and in the one for station E2 is not very well understood. Daily maximum temperature is normally attained around noon, and it is clear that high temperatures in the morning affect the structure of superficial layers. It is not clear, however, why hot days result in cleaner classes. Nevertheless, it is a result that may have predictive value and we recommend further analysis on this issue.

LC4) Max.T is inversely proportional to PM10 mean concentration.

[Figure: Daily average PM10 classification tree, station E18 - “Usaquén”. Splits: Mean.WndDir >= 207; Max.PM10.1DB >= 152; Mean.PM10.1DB >= 46; PWAT < 24; Max.T >= 17; dH.Est.Sup >= 400; Mean.WndDir >= 181.]

6 Conclusions

The CART method produced logical results. Most of them are what anyone with common sense and some meteorological knowledge would expect; others need further analysis. The principal general conclusions are the following:

0. Sundays and Holidays are the best days in terms of PM10 concentrations. They are always classified as G, and although this is not a new result, it is important to emphasize that the effect of traffic is decisive.

1. Wind direction and speed, mean and maximum PM10 values from one day before and cumulative precipitation are the key variables present in almost all of the stations.

2. We found consistent repetitions, in different circumstances, of the fact that high values of accumulated precipitation before the time of maximum PM10 result in cleaner categories. This represents significant statistical evidence that rain has a powerful washing effect on particulate matter in the atmosphere.

3. For all stations, wind currents from the east result in cleaner PM10 categories. Especially for stations located on the east side of the city, wind variables are ranked very high. Namely, stations E5, E10 and E18 rank Mean.WndDir and Mean.WndDirCat at or near the top of the table. Max.WndSp or Mean.WndSp is also ranked in the top five for all three stations. This last result suggests that wind speed is not as important as wind direction.

4. An interesting result, not from the CART method but from table 4, is that stations located at the west of the city (E7 and E2) have higher mean PM10 concentrations than the stations located closer to the east border of the city (E5, E10 and E18). This observation in conjunction with conclusion number 1 makes perfect sense, since wind currents from more polluted zones of the city result in higher PM10 concentrations.

5. As expected, low values of the variable PWAT imply higher values of the output variable. Although this result makes sense, its appearance is surprising, since precipitable water only affects pollutant concentrations indirectly by affecting thermal inversions.

6. We emphasize the fact that the defined radiosonde variables were well ranked only on the station closest to the airport. This suggests that these variables are very important but that distance significantly affects their predictive power.

7. Once Holidays and Sundays were removed from the data, the variable Day.Type never appeared as a split or amongst the top 6 important variables. This suggests that Saturdays and weekdays

7 An application

The mayor's office of Bogotá leads an environmental education project in which the use of private automobiles is prohibited between 6:00 and 19:30 hours, called “día sin carro” (DSC) in Spanish, “day without car” in English. Given the impact of traffic emissions on air quality, one would expect a significant reduction in PM10 concentrations, although this is not always the case. Nevertheless, the question remains of how much of this reduction in concentration is due just to the meteorological conditions of that given day. One could try to answer this question with the classification trees built here by following two simple steps:

1. Classify the mean PM10 of DSC for each station using table 4.

2. Run the classification tree for the input variables of DSC for each station and compare the results.

The DSC is always on the first Thursday of February. Consider for instance the 3rd of February of 2011 and station E2. That day's mean PM10 concentration was 120.43 µg/m3, therefore it is classified in class A (see table 4).

For the second step, look at the classification tree for station E2. The first split asks for the cumulative precipitation before the time of maximum PM10, which is 0 mm, and therefore we go to node number 2. Since February the 2nd had a mean PM10 concentration of 101.00 µg/m3 < 108.00 µg/m3, we go down to the right to node number 5. Finally, for that day dH.Est.Sup = 596 m > 168 m, so the tree assigns the day to class B, which means that it misclassified that day.
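The walk down the tree described above can be sketched as a short function. Only the two thresholds quoted in the text (108 µg/m3 and 168 m), the input values of that day and the resulting class B come from the thesis. The precipitation threshold `PRECIP_SPLIT`, the function name and the dictionary keys are hypothetical, and branches that the text does not describe return `None`.

```python
PRECIP_SPLIT = 0.05  # mm; assumed value, the text only says that 0 mm leads to node 2

def follow_e2_path(day):
    """Follow the branch of the E2 tree described in the text.

    Returns the predicted class, or None when the day leaves that branch.
    """
    if day["accum_prcip_bef_max"] >= PRECIP_SPLIT:
        return None  # rainy branch, not described in the text
    # Node 2: previous day mean PM10 against 108 µg/m3
    if day["mean_pm10_1db"] >= 108.0:
        return None  # node 4, not described in the text
    # Node 5: stable surface layer thickness against 168 m
    if day["dh_est_sup"] >= 168.0:
        return "B"
    return None  # other child of node 5, not described in the text

dsc_2011 = {
    "accum_prcip_bef_max": 0.0,  # mm, before the time of maximum PM10
    "mean_pm10_1db": 101.00,     # µg/m3, 2 February 2011
    "dh_est_sup": 596.0,         # m
}
observed = "A"                   # class of the observed mean, 120.43 µg/m3
predicted = follow_e2_path(dsc_2011)
print(predicted, predicted == observed)
```

Comparing `predicted` with `observed` is the second step of the procedure: a mismatch such as this one indicates that the meteorological inputs alone pointed to a slightly cleaner class than the one actually observed.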

Initially, one could think that the DSC was a total failure and that private traffic has no impact whatsoever on the mean concentration of PM10. Nevertheless, after running the tree for the input values of that day, it is clear that decisive factors such as previous day concentrations and the absence of rain contributed as well.

8 Future Work

There are many things that can be done to complement the analysis. This method is only as good as the input variables that are used, so it is important to complement the ones used in this thesis with others that may bring additional information. For instance, it makes sense that PM10 concentrations at nearby stations have an effect on the PM10 concentration at a given station, especially when the wind blows from those stations. It would therefore be interesting to include this variable in the tree construction. Also, the same analysis can be repeated for other output variables. We started to work on maximum daily values of PM10 but found that the format of our database could not be the same as the one for mean values.

8.1 Maximum daily PM10

The same methodology can be applied to daily mean and maximum values for every variable in table 2. The next step in this work was to repeat the same analysis for the output variable Max.PM10. Recall from figure 3 that the PM10 daily maximum is always attained between 7:00 and 8:00 hours. It therefore makes sense to use only input variables that are known before this time of the day, so we restrict ourselves to the radiosonde variables and the “one day before” variables in table 3:

ID    Variable               Type      Description
Y2    Max.PM10               Class     Seven categories defined by 1/7-th percentiles: A, B, C, D, E, F, G, where A is the top 85.71% and G is the bottom 14.28%.
X7    Accum.Prcip.Bef.Max    Numeric   Accumulated precipitation until the time of maximum PM10. (mm)
X8    Day.Type               Class     Four types of days: weekday, Holiday, Saturday and Sunday.
X9    Mean.PM10.1DB          Numeric   Mean PM10 one day before. (µg/m3)
X10   Max.PM10.1DB           Numeric   Max. PM10 one day before. (µg/m3)
X11   PWAT                   Numeric   Precipitable Water (mm)
X12   dH.Inv.Sup             Numeric   Surface inversion thickness (m)
X13   dH.Est.Sup             Numeric   Stable surface layer thickness (m)
X14   PBLH                   Numeric   Planetary boundary layer height (m)
X15   Accum.Prcip.1DB        Numeric   Previous day accumulated precipitation.
X16   Mean.WndSp.1DB         Numeric   Previous day mean wind speed (m/s).
X17   Max.WndSp.1DB          Numeric   Previous day max. wind speed (m/s).
X18   Mean.WndDirCat.1DB     Numeric   Previous day mean wind direction category.

Table 13: Studied variables in both BAQMN and radiosonde databases from 2009 to 2011, for daily maximum PM10.

The reduction in the number of input variables is significant, and it is very likely that previous day wind variables have very little influence on maximum PM10 concentrations. Furthermore, we have seen that the importance of radiosonde variables is strongly related to the distance from the station to the airport. Taking these remarks into account, we present the results for station E7, since this is one of the closest stations to the airport.

Station    14.28%    28.57%    42.85%    57.14%    71.43%    85.71%
E7         84.00     99.00     114.00    127.43    143.00    165.00

Table 14: 1/7th-percentiles (µg/m3) used in the max. PM10 category definitions for E7 - “Fontibón” from 2009 to 2011.
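The class definition can be sketched with the cut points of table 14. The treatment of values that fall exactly on a cut point (assigned here to the more polluted class) and the function name are assumptions made for illustration.

```python
from bisect import bisect_right

# 1/7-th percentiles of maximum PM10 for station E7 (table 14), in µg/m3
E7_MAX_CUTS = [84.00, 99.00, 114.00, 127.43, 143.00, 165.00]

def pm10_class(value, cuts):
    """Map a PM10 value to a class from G (cleanest) to A (most polluted)."""
    # bisect_right counts how many cut points are <= value
    return "GFEDCBA"[bisect_right(cuts, value)]

print(pm10_class(80.0, E7_MAX_CUTS))   # below the first cut point → G
print(pm10_class(120.0, E7_MAX_CUTS))  # between 114.00 and 127.43 → D
print(pm10_class(170.0, E7_MAX_CUTS))  # above the last cut point → A
```

The same function applies to the daily mean classes of the other stations once their own cut points are supplied.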

Variable          Importance (%)
dH.Est.Sup        21
Mean.PM10.1DB     17
PWAT              15
dH.Inv.Sup        14
Max.PM10.1DB      11
Day.Type          8

Table 15: Variable importance, max. PM10 CART, station E7 - “Fontibón”.

Figure 26: Max PM10 histogram, station E7 - “Fontibón” (x-axis: max. PM10, µg/m3; y-axis: frequency).

The first interesting observation about this classification tree is that Day.Type is once again an important variable. This may be due to the fact that there are fewer variables now. Nevertheless, the split in node 4 confirms that weekdays are slightly cleaner than Saturdays.

As expected, radiosonde variables are very well ranked. As with Day.Type, this may be a consequence of the reduction of input variables for this model. Nevertheless, the first split contradicts the expected behaviour of the variable dH.Est.Sup. Previous day maximum and mean concentrations are also very important and, as expected, previous day wind variables do not appear in the top 6 ranking.

We conclude that the contradictions displayed by this tree may be related to the small number of input variables that may have a strong correlation with Max.PM10. In order to study this

way, all of the studied variables would occur before the time of maximum PM10 and therefore they may have predictive value.

[Figure: Daily max. PM10 classification tree, station E7 - “Fontibón”. Splits: dH.Est.Sup >= 16; PWAT < 22; Day.Type = Semana; Mean.PM10.1DB >= 50; PBLH >= 1169.]

8.2 Regression Trees

One could use classification trees to calculate a numerical prediction of the output variable simply by calculating a weighted average with the class proportions on each terminal node. Nevertheless, there is a more precise method for using binary trees as regression rules: regression trees.
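The weighted average mentioned above can be sketched as follows. The proportions are illustrative values in the format of the node boxes of the trees, and the representative value attached to each class is a hypothetical choice, since the text does not fix one.

```python
def expected_pm10(proportions, representatives):
    """Weighted average of class representative values at a terminal node."""
    if len(proportions) != len(representatives):
        raise ValueError("one representative value per class is required")
    total = sum(proportions)  # rounded proportions may not add up to exactly 1
    return sum(p * r for p, r in zip(proportions, representatives)) / total

# Class proportions for A..G at one terminal node (illustrative)
leaf = [.02, .07, .08, .09, .17, .14, .43]
# Hypothetical representative PM10 value of each class A..G, in µg/m3
class_values = [180.0, 154.0, 135.0, 121.0, 106.0, 92.0, 75.0]

print(round(expected_pm10(leaf, class_values), 2))
```

The result depends heavily on the representative values chosen for the open-ended classes A and G, which is one reason a regression tree is the more direct tool for numerical prediction.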

A regression tree is very similar to a classification tree, except that it is used to predict a quantitative response rather than a qualitative one. Since this work had a qualitative goal, classification trees were more appropriate in this case. For future work in this area that may require prediction, regression trees may be better suited than classification trees. For a detailed treatment of the subject see [1, Chapter 3], [7, Chapter 14] or [6, Chapter 8].

As an example, we show here the regression tree constructed for mean PM10 with the data from station E2 - “Carvajal” in the period 2009-2011.

[Figure: Regression tree for mean PM10, station E2 - “Carvajal”, 2009-2011. Splits: Accum.Prcip.Bef.Max >= 0.05; Mean.PM10.1DB < 62; Mean.Prcip >= 0.16; Mean.PM10.1DB < 106; Mean.PM10.1DB < 108; dH.Est.Sup < 168; Mean.WndDir >= 306; Day.Type = Sabado; PWAT >= 20; PWAT >= 23; Mean.PM10.1DB < 129; dH.Inv.Sup < 88.]

References

[1] Leo Breiman et al. Classification and regression trees. CRC press, 1984.

[2] Bert Brunekreef and Stephen T. Holgate. “Air pollution and health”. In: The Lancet 360.9341 (2002), pp. 1233–1242.

[3] F Bruno, D Cocchi, and C Trivisano. “Forecasting daily high ozone concentrations by classification trees”. In: Environmetrics 15.2 (2004), pp. 141–153.

[4] Liliana Andrea Giraldo Amaya. “Estimación del inventario de emisiones de fuentes móviles para la ciudad de Bogotá e identificación de variables pertinentes”. In: (2005).

[5] Davor Z Antanasijević et al. “PM 10 emission forecasting using artificial neural networks and genetic algorithm input variable optimization”. In: Science of the Total Environment 443 (2013), pp. 511–519.

[6] Gareth James et al. An introduction to statistical learning. Vol. 6. Springer, 2013.

[7] Max Kuhn and Kjell Johnson. Applied predictive modeling. Springer, 2013.

[8] Artemio Sotomayor-Olmedo et al. “Forecast urban air pollution in Mexico City by using support
