New virtual university scenarios

THE NEW TEACHING PARADIGM

3.5. New virtual university scenarios

This chapter includes a comprehensive analysis of the spatial distribution of data points, with the aim of helping to optimize interpolation parameters. The first spatial characterization of the set3 case study was the average distance between sampling points. Although it is a very basic parameter, it proved important for the QMI profiles, Moving Windows (MW) averaging and variography calculations.

The high level of clustering of set3 was quantified by different methods, such as Voronoi polygon statistics, sandbox counting, box-counting and the Morisita index. With the use of validity domains, including a domain constrained to the built-up area, the relative clustering was measured. On the national scale, the sandbox method indicated that the actual national indoor radon sampling is more clustered (Df = 1.3) than the built-up area (Df = 1.76). Meanwhile, the built-up area has a clustering coefficient similar to that of a random distribution of points. The high degree of spatial clustering for indoor radon samples is mainly due to preferential sampling of certain areas and partially due to the sampling domain, which is the built-up zone in Switzerland.
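As an illustration of how a fractal dimension such as Df can be obtained, the following is a minimal box-counting sketch, not the procedure used in the study. The function name `box_counting_dimension`, the unit-square extent and the list of grid resolutions are assumptions made for this example; the sandbox method mentioned above works differently (it counts points inside growing circles centered on the samples).

```python
import math

def box_counting_dimension(points, boxes_per_side_list, extent=1.0):
    """Estimate the box-counting dimension of 2-D points in [0, extent)^2.

    For each grid resolution k (boxes per side), count the boxes that hold
    at least one point, then fit log(N) against log(k) by least squares.
    The slope of that fit is the dimension estimate.
    """
    log_k, log_n = [], []
    for k in boxes_per_side_list:
        occupied = set()
        for x, y in points:
            # Clamp so that a point exactly on the upper edge stays in the grid.
            ix = min(int(x / extent * k), k - 1)
            iy = min(int(y / extent * k), k - 1)
            occupied.add((ix, iy))
        log_k.append(math.log(k))
        log_n.append(math.log(len(occupied)))
    mean_k = sum(log_k) / len(log_k)
    mean_n = sum(log_n) / len(log_n)
    num = sum((a - mean_k) * (b - mean_n) for a, b in zip(log_k, log_n))
    den = sum((a - mean_k) ** 2 for a in log_k)
    return num / den

# Synthetic point sets (hypothetical data, not the radon samples).
regular_grid = [((i + 0.5) / 16, (j + 0.5) / 16) for i in range(16) for j in range(16)]
diagonal_line = [((i + 0.5) / 64, (i + 0.5) / 64) for i in range(64)]
```

A point set that fills the plane evenly gives a value close to 2, while points aligned on a line give a value close to 1; clustered sampling falls in between, which is how the values 1.3 and 1.76 quoted above should be read.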

The idea of using quantiles was introduced in order to make comparative clustering calculations for the functional box-counting method and the Morisita index. The so-called Quantile Morisita Index (QMI) analyzes the spatial distribution of equal-sized subsets defined according to quantile thresholds for a certain scale. The QMI diagram confirmed, more clearly, the results obtained with the sandbox and box-counting methods: set3 presents higher spatial clustering for both low and high values at the average distance.
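The quantile idea can be sketched as follows. This is a simplified illustration under stated assumptions: the Morisita index is computed on a regular grid with the standard formula I = Q · Σ n_i(n_i − 1) / (N(N − 1)), and the samples are split into equal-sized subsets by sorted value. The function names, the unit-square extent and the handling of leftover samples are choices made for this example and may differ from the QMI implementation used in the study.

```python
def morisita_index(points, cells_per_side, extent=1.0):
    """Morisita index of 2-D points on a regular grid over [0, extent)^2."""
    q = cells_per_side ** 2
    counts = [0] * q
    for x, y in points:
        ix = min(int(x / extent * cells_per_side), cells_per_side - 1)
        iy = min(int(y / extent * cells_per_side), cells_per_side - 1)
        counts[iy * cells_per_side + ix] += 1
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Values near 1 suggest a random pattern, above 1 clustering,
    # below 1 a regular (dispersed) pattern.
    return q * sum(c * (c - 1) for c in counts) / (n * (n - 1))

def quantile_morisita(points, values, n_subsets, cells_per_side, extent=1.0):
    """Morisita index of each equal-sized subset, ordered from low to high values.

    Assumption: samples left over when len(values) is not divisible by
    n_subsets (the highest values) are dropped.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(order) // n_subsets
    result = []
    for s in range(n_subsets):
        idx = order[s * size:(s + 1) * size]
        subset = [points[i] for i in idx]
        result.append(morisita_index(subset, cells_per_side, extent))
    return result
```

Comparing the index across subsets at a fixed cell size shows whether low values, high values or both are more clustered than the rest of the data.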

The cell and polygonal declustering methods were applied to approximate the unknown global statistics of indoor radon. It was observed that data transformation is more coherent when using weights with lower variation. This was obtained by limiting the cells and polygons to a more constrained sampling space. It was reaffirmed that the built-up area is not only the optimal but also the natural sampling domain to be used. A very constrained Voronoi polygon coverage, or a cell declustering using windows of around 600 m, can fit within the sampling domain. In particular, the cell declustering method is the one that adapts best to a manifold domain such as the built-up area.
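A minimal sketch of cell declustering is given below, assuming the common formulation in which each sample receives a weight inversely proportional to the number of samples sharing its cell. The function names, the absence of any origin-shifting of the grid, and the sample values are assumptions for illustration only.

```python
from collections import Counter

def cell_declustering_weights(points, cell_size):
    """Return one weight per sample; weights sum to 1.

    Each occupied cell gets the same total weight, shared equally among
    the samples that fall inside it.
    """
    cells = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    per_cell = Counter(cells)
    n_occupied = len(per_cell)
    return [1.0 / (per_cell[c] * n_occupied) for c in cells]

def weighted_mean(values, weights):
    """Declustered mean: weighted average of the sample values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical coordinates in metres: three clustered samples and one isolated sample.
points = [(100, 100), (200, 150), (300, 250), (5000, 5000)]
values = [10.0, 10.0, 10.0, 2.0]
weights = cell_declustering_weights(points, cell_size=600)
declustered = weighted_mean(values, weights)
```

In this toy case the plain mean is 8, while the declustered mean is 6, because the three clustered samples share the weight of a single cell.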

Methods used for interpolation, like KNNR or variography, can also be used in this modeling phase as exploratory tools for spatial properties. The convexity of the KNNR cross-validation (CV) optimization curve gave a first measurement of spatial continuity. Variography over set3 showed high local variability, which prevents the data from being fitted to a structured model.

Moving Windows (MW) averaging was proposed as a method to reduce local variability. Based on the coherent results of clustering characterization using different methods, the window size used for the national set was 1700 m. With MW, a local mean was calculated and assigned to a regular configuration. This procedure helped reduce the high local variability and find variography structures on a national scale.
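The local-mean step can be sketched as below. This is a simplified version that assumes non-overlapping square windows and assigns each local mean to the window center; the study may have used overlapping or differently positioned windows. The function name and the sample coordinates are hypothetical.

```python
from collections import defaultdict

def moving_window_means(points, values, window):
    """Average the values falling in each square window of side `window`.

    Returns a dict mapping the window center (x, y) to the local mean.
    Windows without samples are simply absent from the result.
    """
    sums = defaultdict(lambda: [0.0, 0])  # window index -> [sum, count]
    for (x, y), z in zip(points, values):
        key = (int(x // window), int(y // window))
        sums[key][0] += z
        sums[key][1] += 1
    return {
        ((ix + 0.5) * window, (iy + 0.5) * window): total / count
        for (ix, iy), (total, count) in sums.items()
    }

# Hypothetical samples in metres, with the 1700 m window mentioned above.
local_means = moving_window_means(
    [(100, 100), (1600, 1600), (1800, 100)], [50.0, 150.0, 300.0], window=1700
)
```

The resulting regular configuration of local means can then be passed to variography in place of the raw samples.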

The mean and skewness parameters after an MW analysis of set3 indicated the presence of a transition zone in the middle of the area. The presence of this transition zone was also made clear by a lognormality skewness test. The recurrent indication of a transition zone for set3 led to the idea of partitioning the data spatially based on local statistics.

An important output of the spatial data analysis is that partitioning the data into areas of homogeneous distribution (in particular for high values) can provide a way to improve modeling. As this is a matter of using different scales and domains, a multiscale and partition analysis was proposed. The scales of interest defined for the study of indoor radon in Switzerland were the national domain, the natural regions domain, the Moving Windows scales, and the administrative units domains.

During the multivariate analysis, it was concluded that geotechnical units and elevation play an important role in explaining the spatial distribution of indoor radon. Natural regions in Switzerland have distinctive characteristics based on these two variables. Therefore, they constitute an important domain for a regional analysis of indoor radon. Indoor radon samples for the Jura region presented a structured variogram, while spatial continuity for the Alps region was not clear. The Plateau region had clustering for low and high values. This clustering feature, reflecting high local variability, was also seen for set3.

As for the administrative domain, the cantons of Ticino, Bern and Neuchatel show clustering of either low or high values. The resulting variograms for these cantons present some structure, with Bern being the least structured. The historical analysis of samples for the canton of Neuchatel shows an evolution from an even distribution before 2005 to clustering of high values during the 2006-2008 campaign. This information can be used to approximate the global statistics on a cantonal level.

Spatial interpolation with regression methods

Properly tuning model parameters and hyper-parameters is almost the entire task of interpolation. The optimization of parameters becomes more difficult when hyper-parameters such as scale or neighborhood are also considered. In the previous chapter, the issue of scaling was considered and some solutions were proposed, in particular for variography modeling. In addition, neighborhoods were characterized with the use of the KNNR method. Other methods are more robust for handling data without stationarity assumptions.

This chapter will be dedicated to optimizing modeling for different interpolation methods. It is an attempt to compare and integrate methods to achieve a common goal, which is the proper parameter definition. The basic assumption is that neighborhood and continuity are common properties of interpolation methods, but they are expressed in different ways.

It can be said that methods can be differentiated based on the spatial property they are oriented to express. Neighborhood, or the level of similarity of values in space, can be depicted with a simple method such as K Nearest Neighbors Regression (KNNR). Variography aims to describe the continuity of values in space. Density functions, describing the distribution of points around a central point, are used in General Regression Neural Networks (GRNN).
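To make the neighborhood idea concrete, the sketch below implements KNNR with a leave-one-out cross-validation curve over candidate values of k. It assumes Euclidean distance and an unweighted mean of the k nearest samples; the function names and the tie-breaking rule (the first candidate with the lowest error wins) are choices made for this example.

```python
import math

def knn_predict(train_pts, train_vals, query, k):
    """Mean of the values at the k training points nearest to `query`."""
    order = sorted(range(len(train_pts)), key=lambda i: math.dist(train_pts[i], query))
    nearest = order[:k]
    return sum(train_vals[i] for i in nearest) / len(nearest)

def loo_cv_mse(points, values, k):
    """Leave-one-out cross-validation mean squared error for a given k."""
    errors = []
    for i in range(len(points)):
        pts = points[:i] + points[i + 1:]
        vals = values[:i] + values[i + 1:]
        errors.append((knn_predict(pts, vals, points[i], k) - values[i]) ** 2)
    return sum(errors) / len(errors)

def best_k(points, values, k_candidates):
    """Return (k with the lowest CV error, dict of CV errors per k)."""
    scores = {k: loo_cv_mse(points, values, k) for k in k_candidates}
    return min(scores, key=scores.get), scores

# Hypothetical data: two well-separated groups with different levels.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
values = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
k_opt, cv_curve = best_k(points, values, [1, 2, 3, 4])
```

In this toy case the error stays at zero while the neighborhood remains inside one group and rises once k reaches into the other group, which is the kind of behavior the shape of the CV curve summarizes.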

In the present chapter, linear and deterministic models, like KNNR and IDW, are presented first. Then, the linear geostatistical models (the kriging family) are applied, followed by the non-linear GRNN regression method. These methods are referred to as regression interpolation since, in one way or another, they aim to reduce an error measure in order to obtain a general model: the optimum k for KNNR, unbiased estimators for kriging, or a generalized regression model for GRNN. The multiscale analysis of chapter 3 is complemented here with the MW analysis for GRNN.
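For reference, inverse distance weighting (IDW) can be sketched in a few lines. This is a minimal version that uses all samples and a default power of 2; the function name, the default power and the absence of a search neighborhood are assumptions for this example rather than settings taken from the study.

```python
import math

def idw_predict(points, values, query, power=2.0):
    """Inverse distance weighted prediction at `query`.

    Each sample is weighted by 1 / distance**power. If the query coincides
    with a sample, that sample's value is returned directly.
    """
    num = 0.0
    den = 0.0
    for p, z in zip(points, values):
        d = math.dist(p, query)
        if d == 0.0:
            return z
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den

# Hypothetical one-dimensional layout: two samples, queried in between.
midpoint = idw_predict([(0, 0), (2, 0)], [1.0, 3.0], (1, 0))
```

The power acts as the tunable parameter here: larger values make the prediction depend more strongly on the closest samples.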

On the one hand, the goal is to find the method that adapts best to the data characteristics; on the other hand, one must examine how suitable the data are for making estimations. These two aspects, method robustness and data consistency, are addressed at the end of the chapter.

4.1 Estimations and Predictions

4.1.1 Notes on terminology

The final objective of data sampling is to calculate new values for a dependent variable. In space, a new value is calculated for an unvisited location with a defined position. When this is done within the limits of the spatial sampling domain, it is referred to as interpolation; when the location lies outside the domain, it is known as extrapolation. In statistics, the calculation of a new value based on samples is referred to as a prediction, while the term estimation is used when speaking of parameters. Spatial interpolation is often referred to as an estimation to differentiate it from temporal predictions. This is particularly the case in geostatistical jargon (31). As will be seen, the estimation of parameters is the essential part of geostatistics; hence, interpolation with kriging-family interpolators implies estimating values and the corresponding uncertainty. For simulations, the estimation concept is even clearer, since the parameters of a local distribution are estimated for each unsampled location. In the following text, the term estimation will be used to refer to interpolation at unsampled points when statistical modeling is involved, as is done in the geostatistical literature.

In contrast to methods that make use of statistical modeling, there are deterministic methods, which are simpler. Even though it is understood that most earth science processes are uncertain and cannot follow a well-defined physical model, the term 'deterministic' is used for methods where this uncertainty is simply not taken into account. In the literature, the term spatial prediction is preferred for the results of deterministic methods. As will be seen, the results of both types of methods can be given either a deterministic or a probabilistic interpretation. Regardless of this interpretation, interpolation methods can also be distinguished by their use of either linear or non-linear models.

4.1.2 Error measures

A first error measure to be considered is the training error obtained with cross-validation, as explained in chapter 3. It is an error measure for the model itself and is not necessarily valid for predictions. The training error for subsets can be an effective measure of the statistical consistency between samples and the global distribution.

Additionally, there are measures of model uncertainties particular to a certain method. For instance, kriging provides a local measure of training error called the kriging variance. GRNN provides a measure of data density. Both measures are indicators of areas prone to generating high estimation uncertainty due to the lack of data.
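The GRNN density measure can be illustrated with a Nadaraya-Watson style estimator, which is the usual formulation of GRNN. The sketch below assumes an isotropic Gaussian kernel with a single bandwidth `sigma` and returns the unnormalized sum of kernel weights as the density indicator; the function name and these choices are assumptions for the example.

```python
import math

def grnn_predict(points, values, query, sigma):
    """Kernel-weighted estimate and data density indicator at `query`.

    Returns (estimate, density), where density is the sum of the Gaussian
    kernel weights. A low density flags a location with few nearby samples.
    """
    weights = [
        math.exp(-math.dist(p, query) ** 2 / (2.0 * sigma ** 2)) for p in points
    ]
    density = sum(weights)
    if density == 0.0:
        # All weights underflowed: no usable data near the query.
        return float("nan"), 0.0
    estimate = sum(w * z for w, z in zip(weights, values)) / density
    return estimate, density

# Hypothetical samples: the second query lies away from the data.
near = grnn_predict([(0, 0), (2, 0)], [1.0, 3.0], (1, 0), sigma=1.0)
far = grnn_predict([(0, 0), (2, 0)], [1.0, 3.0], (5, 0), sigma=1.0)
```

The density returned for the distant query is much lower than for the query between the samples, which is the sense in which the measure points to areas of high estimation uncertainty.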

Errors can be represented in different ways; the basic measure is the difference between the predicted value $\hat{z}$ and the real value $z$. These real values constitute the validation set and are independent samples that were not used for modeling or prediction. A good measure of error is the mean of the squared errors (MSE), represented as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{z}_i - z_i)^2 \tag{4.1}$$

It is a convenient measure because it relates to the statistics of the error distribution: $\mathrm{MSE} = \text{variance} + \text{bias}^2$. The bias is the mean of the error distribution and it indicates a tendency of the prediction model to produce either underestimates or overestimates. The prediction error variance can be used as an indicator of model adjustment to data. Prediction model parameters are tuned based on the error reduction concept. For the sake of precision, it is always good to include other error measures, such as the correlation between real and predicted values.
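The decomposition can be checked numerically with a short sketch. The function name and the toy numbers are assumptions for illustration and are not results from the study; the variance is the population variance of the errors (divided by n), which is what makes the identity exact.

```python
import math

def error_summary(predicted, observed):
    """MSE, bias, error variance and Pearson correlation for a validation set."""
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    mse = sum(e * e for e in errors) / n
    bias = sum(errors) / n  # mean error: positive means overestimation
    variance = sum((e - bias) ** 2 for e in errors) / n
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    spread = math.sqrt(
        sum((p - mean_p) ** 2 for p in predicted)
        * sum((o - mean_o) ** 2 for o in observed)
    )
    return {"mse": mse, "bias": bias, "variance": variance, "correlation": cov / spread}

# Toy validation set (hypothetical values).
summary = error_summary([2.0, 4.0, 6.0, 9.0], [1.0, 3.0, 5.0, 7.0])
# summary["mse"] equals summary["variance"] + summary["bias"] ** 2 up to rounding.
```

Here the errors are 1, 1, 1 and 2, so the MSE is 1.75, the bias is 1.25 and the error variance is 0.1875; the model overestimates systematically even though the correlation with the observed values is high.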
