
My predictor considers the total average daily traffic flow between 01-01-2016 and 01-06-2017 for Birmingham, UK, accounting for 711 sensors as described in Chapter 3.

Setting A - Extrapolation: I train on all data that sits within Birmingham’s BUA, accounting for 711 sensors. The remainder are removed for ground truth testing. To fully simulate extrapolation, I confirm that my training and hold-out sets are not correlated (i.e. SAC_{train,test} ∼ 0). A standard Moran’s I test is

Distance and Travel Time Cross-Validation for Urban Models

(a) A holdout method to simulate extrapolation.

(b) A holdout method to simulate a mixture of extrapolation and interpolation.

(d) A holdout method to simulate extrapolation.

(e) A holdout method to simulate a mixture of extrapolation and interpolation.

(f) A holdout method to simulate interpolation.

Figure 6.4: Producing a ground truth train and test set. The orange space represents the training area, the yellow space represents the ground truth test area, the blue points are ground truth testing locations and the white to red points represent the training set where the


(a) NRMSE for Ordinary Kriging on Coventry House Price with all KCV Methods Compared with Extrapolation.

(b) NRMSE for Ordinary Kriging on Coventry House Price with all KCV Methods Compared with a Mix of Interpolation and Extrapolation.

(d) NRMSE for Ordinary Kriging on Coventry House Price with the Winning KCV versus Random KCV with Standard Error Bars.

(e) NRMSE for Ordinary Kriging on Coventry House Price with the Winning KCV versus Random KCV with Standard Error Bars.

(f) NRMSE for Ordinary Kriging on Coventry House Price with the Winning KCV versus Random KCV with Standard Error Bars.

Figure 6.5: Results graphs for both case studies: dead-zone size versus NRMSE for all KCV methods and the ground truth.

Road Distance and Travel Time Cross-Validation for Urban Models: Results Table

Real Estate Case Study

          Random    Previous      My Work
                    Work [106]
          KCV       S-KCV         R-KCV     T-KCV     RT-KCV

Case A: Extrapolation (Train: 3412 - Test: 256)
100%      3298+     3298+         3254      3298+     3274
80%       3298+     3298+         3105      3156      3112
50%       2850      2628          2489      2201      2112

Case B: Mixed (Train: 3163 - Test: 256)
100%      2183      1931          1420      1391      1401
80%       2108      2006          1270      1308      1295
50%       1940      178           9         8         8

Case C: Interpolation (Train: 3290 - Test: 256)
100%      1489      201           10        8         8
80%       1417      199           8         6         6
50%       201       164           4         4         4

Traffic Flow Case Study

          Random    Previous      My Work
                    Work [106]
          KCV       S-KCV         R-KCV     T-KCV     RT-KCV

Case A: Extrapolation (Train: 711 - Test: 72)
100%      579+      579+          579+      579+      578
80%       579+      578           578       578       577
50%       576       573           573       573       566

Case B: Mixed (Train: 675 - Test: 72)
100%      498       487           487       478       442
80%       458       309           298       276       276
50%       62        57            68        71        73

Case C: Interpolation (Train: 639 - Test: 72)
100%      84        72            52        57        55
80%       67        60            42        52        46
50%       42        31            30        29        30

conducted between both datasets, showing a weak spatial relation such that I_observed = -0.008960041 and I_expected = 0.000201. As such, I confirm that my method can be tested against the split data for extrapolation generalisability; see Figure 6.4(d) for a visual representation.
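For readers unfamiliar with the statistic, the following is a minimal sketch of a global Moran's I computation. It is an illustration only: the binary distance-band weights, the `bandwidth` parameter and the function names are assumptions, and the way the training and hold-out sets are combined for the test reported here is not specified in this section.

```python
import numpy as np

def morans_i(values, coords, bandwidth):
    """Global Moran's I with binary distance-band weights (an assumed weighting scheme)."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = x.size
    # Pairwise Euclidean distances between all locations.
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    # Weight 1 for pairs of distinct points within the bandwidth, 0 otherwise.
    w = ((d <= bandwidth) & ~np.eye(n, dtype=bool)).astype(float)
    z = x - x.mean()
    numerator = n * np.sum(w * np.outer(z, z))
    denominator = w.sum() * np.sum(z ** 2)
    return numerator / denominator

def expected_i(n):
    """Expected Moran's I under the null hypothesis of no spatial autocorrelation."""
    return -1.0 / (n - 1)
```

Values of I close to the expected value indicate little spatial autocorrelation, while values approaching +1 or -1 indicate strong clustering or strong dispersion respectively.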

Setting B - Interpolation: I train on some of the data that sits within Birmingham’s BUA, accounting for 675 sensors.

Setting C - A Mixture of Interpolation and Extrapolation: I train on some of the data that sits within Birmingham’s BUA, accounting for 639 sensors.

A Competitor Case for Comparison - Blocking: My blocking approach uses 10 folds and is set up such that, for each fold, 10 random points within the training area are selected and a square block grows out from each point so that all blocks contain an equal number of points (± 1) and together sum to the same test set size as all other experiments (72 points). I only apply this in settings B and C because setting A contains no test points within the training area.
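The block-growing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the function name `grow_square_blocks` is hypothetical, seeds are drawn from the data points themselves, and "growing a square" is implemented by taking the nearest unassigned points in Chebyshev distance, which is equivalent to expanding an axis-aligned square around the seed until it holds the target count.

```python
import numpy as np

def grow_square_blocks(coords, n_blocks=10, n_test=72, seed=0):
    """Assign n_test points to n_blocks square blocks of (almost) equal size.

    Returns an array with the block index of each point, or -1 if the point
    stays in the training set.
    """
    rng = np.random.default_rng(seed)
    xy = np.asarray(coords, dtype=float)
    n = len(xy)
    # Block sizes differ by at most one and sum to n_test (72 -> two 8s, eight 7s).
    sizes = [n_test // n_blocks + (1 if i < n_test % n_blocks else 0)
             for i in range(n_blocks)]
    seeds = rng.choice(n, size=n_blocks, replace=False)
    block_of = np.full(n, -1)
    for b, (s, size) in enumerate(zip(seeds, sizes)):
        free = np.flatnonzero(block_of == -1)
        # Points within Chebyshev distance r of the seed lie in a square of half-width r.
        cheb = np.max(np.abs(xy[free] - xy[s]), axis=1)
        # Growing the square until it holds `size` free points = taking the nearest ones.
        chosen = free[np.argsort(cheb, kind="stable")[:size]]
        block_of[chosen] = b
    return block_of
```

Points with a block index of -1 remain available for training; the others form the blocked test points.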

Results

All methods in all settings have a test set of 72 points for comparison. In addition, each KCV method contains the same test points for each setting. Figures 6.5(d)-(f) show the NRMSE value for each cross-validation method (KCV, S-KCV, R-KCV, T-KCV and RT-KCV). Additionally, the graphs show equal training set random removal, blocking and each setting’s ground truth NRMSE. Each KCV method is run 10 times over 10 folds, showing that RT-KCV consistently outperforms all other approaches in all settings. Notably, the benefits of RT-KCV, although strong, are less significant for this case study than for my house price case study; this can be explained by the weaker spatial correlation, as seen in my Moran’s I value in Section 3.3. Finally, my dead-zone radius heuristic estimates that 577, 458 and 87 points need to be removed for extrapolation, mixed and interpolation respectively. Once implemented, I determine that the differences between the estimated NRMSE values (0.184, 0.172 and 0.1635) and the ground truth values (0.193265, 0.170 and 0.158) are relatively small compared to S-KCV and blocking (with the exception of interpolation, which is negligible). A t-test shows that for two out of three experiments (extrapolation and mixed), the number of points removed from the training set is significantly lower with my new RT-KCV approach than with the previous state-of-the-art, with a t-value of 0.01. In addition, Figure 6.5 empirically demonstrates a significant improvement in the estimation of generalisation, because one can see that the ‘mean operating point’ (my newly defined measure of generalisation performance) is significantly closer to the ground truth in all scenarios from extrapolation to interpolation, compared with S-KCV and blocking (the current state-of-the-art) for both case studies.
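To make the comparison of estimated and ground truth NRMSE concrete, the gaps can be computed directly from the values quoted above. The sketch below also includes an NRMSE helper; normalising the RMSE by the range of the observed values is an assumption made here for illustration, since the normalisation is not defined in this section.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observations (assumed normalisation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

# NRMSE values quoted in the text for the traffic flow case study.
estimated = {"extrapolation": 0.184, "mixed": 0.172, "interpolation": 0.1635}
ground_truth = {"extrapolation": 0.193265, "mixed": 0.170, "interpolation": 0.158}

# Absolute gap between the cross-validation estimate and the ground truth.
gaps = {k: abs(estimated[k] - ground_truth[k]) for k in estimated}
```

Under these figures the gaps are roughly 0.009 for extrapolation, 0.002 for the mixed setting and 0.006 for interpolation.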

6.6 Final Remarks

The purpose of cross-validation is to estimate how well a model will generalise to unseen data and, in spatial settings, to unlabelled locations. However, standard KCV assumes all data to be i.i.d. random variables and hence does not take into account the dependencies between the training and test sets, which causes biased and optimistic estimates of generalisation. SAC is always present in spatial data and as such needs to be accounted for. Traditional validation approaches such as KCV omit the effect of SAC when estimating performance at unseen locations in urban datasets. To account for SAC in urban data, I demonstrate that my new approach, termed RT-KCV, can be used to better estimate the generalisation ability and predictive performance of spatial models than the existing state-of-the-art approach (S-KCV). I also show that road distance and travel time can decrease the ‘dead-zone’ data removal required to capture SAC in urban spaces, leading to a more efficient use of labelled datasets. Finally, I confirm that RT-KCV is a superior approach for estimating model generalisation compared with all other CV methods.
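The dead-zone idea behind these methods can be sketched generically: for each fold, training points that lie within a given radius of any test point are discarded before fitting. The code below is a minimal sketch, not the RT-KCV implementation; the function name `dead_zone_folds` and its parameters are hypothetical. It takes a precomputed distance matrix, so a Euclidean, road distance or travel time matrix could be supplied.

```python
import numpy as np

def dead_zone_folds(dist, k=10, radius=0.0, seed=0):
    """Yield (train_idx, test_idx, n_removed) for k folds with dead-zone removal.

    dist is an (n, n) matrix of pairwise distances in any metric.
    With radius = 0 nothing is removed and this reduces to ordinary K-fold CV.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for test_idx in folds:
        is_test = np.zeros(n, dtype=bool)
        is_test[test_idx] = True
        # Distance from every point to its nearest test point in this fold.
        nearest_test = dist[:, test_idx].min(axis=1)
        # Training points strictly inside the dead zone are dropped.
        in_dead_zone = ~is_test & (nearest_test < radius)
        train_idx = np.flatnonzero(~is_test & ~in_dead_zone)
        yield train_idx, np.sort(test_idx), int(in_dead_zone.sum())
```

The third element of each tuple counts the discarded training points, which is the quantity a more informative distance measure aims to reduce.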

I recommend that RT-KCV be used wherever dependence structures exist in a dataset with restricted space (such as cities), even if no structure is visible in the fitted model residuals, or if the fitted models account for such correlations (for example in Kriging). I note that standard KCV is only appropriate for pure interpolation where the internal dependence structure is present in the unknown values. Notably, I show that, for urban data, a combination of road distance and travel time captures SAC better than Euclidean distance.

Further avenues for research include: (1) developing techniques to better map SAC in other dependent datasets, such as ‘stream’ distances (along a river or canal); (2) optimising the operating point on the RT-KCV curve to better match the ground truth performance; and (3) learning the convex combination parameters for the combined RDTT distance, i.e. removing the requirement to manually select some weighting of road distance and travel time.
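For illustration, a convex combination of the two distance measures with a manually chosen weight might look as follows. This is a sketch under assumptions: the function name `combined_rdtt`, the weight `alpha` and the rescaling of each matrix by its maximum (so that metres and seconds are comparable) are not taken from the thesis.

```python
import numpy as np

def combined_rdtt(road_dist, travel_time, alpha=0.5):
    """Convex combination of road distance and travel time matrices.

    alpha = 1 uses road distance only; alpha = 0 uses travel time only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    rd = np.asarray(road_dist, dtype=float)
    tt = np.asarray(travel_time, dtype=float)
    # Rescale each matrix to [0, 1] so the units are comparable (assumed step).
    rd_scaled = rd / rd.max()
    tt_scaled = tt / tt.max()
    return alpha * rd_scaled + (1.0 - alpha) * tt_scaled
```

Learning the weight, as proposed in avenue (3), would replace the fixed `alpha` with a value chosen by some optimisation criterion.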

In the remaining two chapters, I will (1) put forth a set of answers to the research questions (RQ) posed in Section 1.1, match those answers with my results, and discuss the implications of my work for urban science, geostatistics and real estate, and (2) conclude all of my findings and put forth a set of research avenues that are opened up by this thesis.

Discussion and Applications

“The only true voyage... would be not to visit strange lands but to possess other eyes, to see the universe through the eyes of another, of a hundred others, to see the hundred universes that each of them sees”

Marcel Proust (1923), La Prisonnière, from Remembrance of Things Past.

Cities are inherently spatial; urban proximity is related to mobility; and restricted road networks can measure urban space: three statements which the findings in Chapters 4-6 confirm. Additionally, non-Euclidean distances can improve (1) geostatistical urban models and (2) the estimation of the generalisation performance of a (spatial or otherwise) model for all interpolation-extrapolation scenarios.

The above summary of findings is examined in detail throughout this Chapter. Section 7.1 outlines the thesis contributions in response to the research questions put forward in Chapter 1. Thereafter, the implications of this thesis research on urban science, geostatistics and the real estate industry are considered in Sections 7.2-7.4. Finally, the potential limitations to the generalisation of this research are introduced in Section 7.5.

7.1 Answers to Research Questions (RQ)

At the start of this thesis three research questions were put forth:

1. RQ1: Which distance function best models spatial interactions in an urban setting?

2. RQ2: When, if ever, are non-Euclidean distance functions valid for urban spatial models?

3. RQ3: How should one estimate the generalisation performance of urban spatial models?

Each research question (RQ) motivates the contributions throughout this thesis, and I explore these contributions below.
