
There are two separate approaches to the development of a sampling plan, the design-based approach and the model-based approach. In the design-based approach, also referred to as the probability sampling approach (Valliant et al., 2000), the population in the study region is viewed as having a fixed set of values (Haining, 2003, p. 96). Sampling locations are selected according to some randomization scheme. This scheme is designed to ensure that it yields a parameter estimate with the desired statistical properties (e.g., unbiasedness, minimum variance). This reliance on a proper design is the source of the term “design-based” (Webster and Oliver, 1990, p. 28). The random and stratified sampling plans discussed in Section 5.3 are examples of the design-based approach. There has been an increase in interest in using the model-based approach in spatial sampling (Haining, 2003; Griffith, 2005), and for this reason, we will give a simple example of how it might be applied to the artificial data set used in this chapter.

In the model-based approach, the population itself is considered to be one realization of a stochastic process. Other realizations are possible, and the population properties of interest, such as the mean and variance, are functions of the values of a random process. These properties are therefore themselves random variables and technically are predicted rather than estimated (only fixed parameters of a population are estimated). For this reason, model-based sampling is also called prediction-based sampling (Valliant et al., 2000). The process of developing a sampling plan involves developing a model for the random process generating the data and then estimating the parameters of this model. Since the population in the study region is viewed as a random variable, there is no need to introduce randomness in the sampling pattern. Therefore, systematic sampling plans such as the grid-based plans described in Section 5.3 can be studied statistically using a model-based formalism.

To make this distinction a bit clearer, consider the simple example in which a finite population of size N is being sampled, and the objective is to choose a sample of size n to estimate the population mean:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} Y_i \tag{5.8} \]

by computing the sample mean

\[ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i . \tag{5.9} \]

In the design-based approach, the population {Yi, i = 1, …, N} is considered as fixed, and the quantity μ is a fixed parameter. The sample {Yi, i = 1, …, n} is a random quantity dependent on the random selection of the n values to sample. The objective is to select a randomization process that generates a value Ȳ that, according to some measure, optimally estimates μ.

In the model-based approach, the population {Yi, i = 1, …, N} is viewed as a realization of a random process, and therefore μ(Y1, Y2, …, YN) defined by Equation 5.8 is a random variable. The objective of sampling is to develop a sampling plan that optimally predicts μ.

Suppose, for example, that the sampling plan was to sample every 10th value of the population. Under the assumptions of the design-based approach, since the population is fixed, this sampling pattern, if applied repeatedly, would yield the same sample values and the same estimate Ȳ each time it was applied. Under the assumptions of the model-based approach, each time the sampling pattern was applied it would sample a different realization of a random process, so the sample values and the value of Ȳ would be random variables. There is, of course, nothing special about the mean, and the same idea can be applied to the prediction of any other function of the random variable Y.
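The contrast between the two views of the "every 10th value" plan can be sketched with a small simulation. This is only an illustration under stated assumptions: the population size, the normal distribution, and its mean and standard deviation are hypothetical choices, and the values are generated independently, with none of the spatial structure of the artificial data set.

```r
set.seed(1)  # arbitrary seed so that the sketch is reproducible

N <- 200                     # hypothetical population size
idx <- seq(10, N, by = 10)   # systematic plan: every 10th unit

# Design-based view: one fixed population, sampled repeatedly
# with the same systematic pattern.
Y.fixed <- rnorm(N, mean = 4500, sd = 800)
est.design <- replicate(5, mean(Y.fixed[idx]))

# Model-based view: each application of the pattern samples a new
# realization of the (assumed) random process.
est.model <- replicate(5, mean(rnorm(N, mean = 4500, sd = 800)[idx]))

print(est.design)  # the same value five times
print(est.model)   # five different values
```

Because the systematic pattern contains no randomization, the only source of variability in the second set of values is the random process itself, which is the sense in which a systematic plan can be studied statistically in the model-based formalism.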

If one has a model for the relationship between Y and some explanatory variable X, then one can use this model to improve the accuracy of prediction. Valliant et al. (2000, p. 2) introduce the concept of model-based sampling with a simple example based on the number of patients discharged per day from a hospital versus the number of beds. We will begin our discussion with an analogous presentation using the relationship between wheat yield of the artificial data set and observed weed level at the nearest sample point. In their example, Valliant et al. (2000) estimate the total number of patients discharged in 33 hospitals given a sample consisting of the total number of patients discharged in 32 of them (equivalently, they estimate the number of patients discharged in the 1 nonsampled hospital). We will generate a similar example using the artificial yield population from Field 4.2.

To some extent, comparing design-based and model-based plans is a matter of comparing apples and oranges. Nevertheless, we will carry out an informal comparison using 32 sample points, the same value as the minimum size of the random and grid sample methods. This application also provides the opportunity to introduce the raster package (Hijmans and van Etten, 2011). The functions in this package can be used to manipulate raster objects in the ways described by Lo and Yeung (2007, p. 183). Here, we make only the simplest use of the package’s capabilities to compute a simple linear regression between yield and May IR image digital number. We will use the fact (Kutner et al., 2005, p. 24) that the regression line between Y and X passes through (X̄, Ȳ) to estimate μ by computing the value of the regression line at X̄. We emphasize that this is not necessarily the best way to carry out a model-based sampling plan, but it does illustrate the idea, and it will enable us to demonstrate some issues associated with model-based sampling.

Figure 1.2 indicates that there is an apparently close relationship between the IR band digital number of the May aerial image of Field 4.2 and the yield. We will base our prediction on this relationship. We start by loading the object data.May.ras, which contains the image band information.

> library(raster)
> data.May.ras <- raster("set4\\set4.20596.tif")
> class(data.May.ras)
[1] "RasterLayer"
attr(,"package")
[1] "raster"

By default, the function raster() loads band 1 of the TIFF file, which is the band that we want. We will use the spatial locations of the 32 grid sample points in Figure 5.6. We can apply the function extract() to the raster object to place the IR values of the cell containing each of the 32 sample points into the data frame data.samp created in Section 5.3.

> data.samp$IRvalue <- extract(data.May.ras, spsamp.pts)

Figure 5.12 shows a plot of yield versus IR band digital number together with the least squares regression fit. The code to generate the fit and compute the value of Ȳ is as follows.

> Yield.band1 <- lm(Yield ~ IRvalue, data = data.samp)
> summary(Yield.band1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14197.131   1207.300   11.76 9.24e-13 ***
IRvalue       -69.511      8.581   -8.10 4.84e-09 ***

A discussion of this code, as well as the code to produce the plot, is given in Section 9.3.

The fit is reasonably good (r² = 0.69), which provides justification for the use of this model in our model-based sampling plan. The estimate of the population mean is obtained as the mean of the predicted values based on the linear regression model.

> print(Y.bar <- mean(predict(Yield.band1) ), digits = 5)

[1] 4456.9

> print(abs(Y.bar - pop.mean)/pop.mean, digits = 3)

[1] 0.0161

The error of 1.6% is about the same as that of grid sampling.
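The property that the least squares line passes through (X̄, Ȳ) can be checked numerically. The following is a minimal sketch using simulated data, not the data of Field 4.2: the data frame samp, its values, and the coefficients used to generate them are hypothetical, chosen only to resemble the scale of the fit above. It shows that, for a regression with an intercept, the value of the line at the sample mean of X, the mean of the fitted values, and the sample mean of Y all coincide.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical sample of 32 points (simulated, not the book's data)
n <- 32
IRvalue <- runif(n, min = 110, max = 160)
Yield <- 14000 - 70 * IRvalue + rnorm(n, sd = 500)
samp <- data.frame(IRvalue = IRvalue, Yield = Yield)

fit <- lm(Yield ~ IRvalue, data = samp)

# Value of the regression line at the sample mean of X
at.mean.x <- unname(predict(fit,
  newdata = data.frame(IRvalue = mean(samp$IRvalue))))

# Mean of the fitted values at the sample points
mean.fitted <- mean(predict(fit))

print(c(at.mean.x = at.mean.x, mean.fitted = mean.fitted,
        Y.bar = mean(samp$Yield)))
```

One consequence is that an estimate formed this way depends on the sample only through its mean, so its accuracy rests on how well the sample locations represent the population.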

It is evident that the estimate of μ is highly dependent on the accuracy of the model. This in turn depends in part on the choice of sample locations. This choice is important both for design-based sampling plans and for model-based sampling plans, but the use of a model-based plan provides an opportunity to illustrate this effect very graphically. We will consider three cases (Figure 5.13). The first is to sample seven points in a north to south transect in a high-weed area, the second is to sample seven points in a north to south transect in a low-weed area, and the third is to sample along an east to west transect spanning the field. The predicted yield means of the three samples are Ȳ1 = 3208 kg ha−1 (29% error) for transect 1, Ȳ2 = 5424 kg ha−1 (20% error) for transect 2, and Ȳ3 = 5000 kg ha−1 (1.5% error) for transect 3. These results are summarized in Figure 5.14. The figure shows the data, the regression line, and the estimated mean of each of the transects. The first two transects (Figure 5.14a and b) give an incorrect representation of the yield–IR relationship because they are from regions of high and low IR values, respectively, whereas the east–west transect covers the range of IR values and provides a fairly accurate representation. The estimate from the first transect (Figure 5.14a) is biased downward, that from the second transect is biased upward (Figure 5.14b), and that from the east–west transect is relatively accurate (Figure 5.14c). It is evident that care needs to be taken in selecting the sample locations. This applies equally to the results of a sampling plan using the design-based approach.

[FIGURE 5.12: yield (kg ha−1) plotted against IR digital number.]

The problem with both Ȳ1 and Ȳ2 is that they are based on samples from only a small geographic part of the field. Since the field has a strong geographic trend, this is equivalent to saying that they are based on only a small subset of the possible values of IR value and yield. Partly as a result, the regression models based on the first two sets of samples are incorrect, and since the samples are from either a higher than average range of IR values (in transect 1) or a lower than average range of IR values (in transect 2), the estimates are not robust to these incorrect models. Since the model based on the east–west transect is accurate, we cannot say anything about whether this sample is correspondingly robust to an incorrect model. We can say, however, that the most accurate estimate comes from a sample that is “representative” of the range of values of IR and yield. This concept of “representativeness” can be formalized through the property that a sample must be balanced (Valliant et al., 2000, p. 53), which means, roughly speaking, that each sample value represents, or is close to, about an equal fraction of the totality of values in the population.
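The effect of sampling from a restricted part of a field with a trend can be reproduced in a small simulation. This sketch makes several assumptions that are not part of the analysis above: the "field" is a hypothetical one-dimensional strip of 400 units, the IR digital number increases linearly from west to east, and yield decreases linearly with IR, with coefficients chosen only to resemble the scale of the earlier fit. Three samples of seven units each are compared, using the same estimate (the mean of the fitted values) as in the text.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical one-dimensional field with a west-to-east trend in IR
N <- 400
easting <- seq(0, 1, length.out = N)
IR <- 110 + 50 * easting + rnorm(N, sd = 3)
Yield <- 14000 - 70 * IR + rnorm(N, sd = 400)
pop.mean <- mean(Yield)

# Relative error of the regression-based estimate of the mean
rel.error <- function(idx) {
  samp <- data.frame(IR = IR[idx], Yield = Yield[idx])
  fit <- lm(Yield ~ IR, data = samp)
  Y.bar <- mean(predict(fit))
  abs(Y.bar - pop.mean) / pop.mean
}

high.IR <- seq(N - 6, N)                      # seven units at the high-IR edge
low.IR <- seq(1, 7)                           # seven units at the low-IR edge
spanning <- round(seq(1, N, length.out = 7))  # seven units spanning the field

errs <- c(high.IR = rel.error(high.IR),
          low.IR = rel.error(low.IR),
          spanning = rel.error(spanning))
print(round(errs, 3))
```

In this setting, the two samples confined to one edge miss the population mean by a wide margin, while the sample that spans the range of IR values comes much closer, which is the informal sense of a "representative" sample described above.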

5.8 Further Reading

Cochran (1977) is the classical reference for sampling, although it contains little in a spatial context. The two books by Webster and Oliver (1990, 2001) contain a wealth of valuable material on sampling spatial data, as do Ripley (1981) and Haining (1990). Odeh et al. (1998) provide an excellent example of the use of information at multiple scales (see Chapter 6) to direct sampling. Brus (1994) describes a design-based stratified soil sampling plan. Valliant et al. (2000) provide a good introduction to model-based sampling. Olea (1984) and Lesch et al. (1995) discuss sampling plans that have model-based aspects. Edwards (2000) and Brus and DeGruijter (1997) provide a good overview of sampling concepts for ecological data. The notion of distinguishing sampling error from nonsampling error is discussed by Biemer and Lyberg (2003). This concept goes back at least to Fisher (1935), who provides an excellent discussion of this issue.

[FIGURE 5.13: locations of the three sample transects (Sample 1, Sample 2, Sample 3), plotted as Northing against Easting.]

The comparison of sampling plans in this chapter applies only to sampling a rectangular region. van Groenigen and Stein (1998) and van Groenigen et al. (1999) provide mathematical methods for generating sampling plans on irregularly shaped regions that minimize the kriging variance. The mathematics of these plans is quite intricate, but simply by looking at the figures in the papers one can gain a good idea of how these sampling schemes relate to one that would be generated on a rectangular region.

FIGURE 5.14

Regression relationships of the grid-based sample together with the models and the estimates based on the three sample transects in Figure 5.13: (a) transect 1, (b) transect 2, and (c) transect 3. Each panel plots yield (kg ha−1) against IR digital number.

Exercises

5.1 Read about the sp function spDistsN1(). Create a new function closest.point() that uses this function.

5.2 (a) Use the boundary file of Field 1 of Data Set 4 created in Exercise 2.11 and the function spsample() to create a regular grid sample plan with 100 sampling sites for the field. Use the function points() to add a plot of the sample point locations to the map.

(b) Use the function class() to check the object class of the sampling plan created in part (a). Use the function str() to display the structure of the object. Use the function coordinates() to display the coordinates of the first 10 sample locations in the object.

(c) Does the number of points in the sample plan equal the number you specified? Answer the question without counting the points (use the information provided by str() ).

5.3 Use the function expand.grid() to create a grid of sample points in Field 1 of Data Set 4 with the same spacing as that in Exercise 5.2. Use the function coordinates() to convert the object created by expand.grid() into a SpatialPoints object. Create a map showing the field boundary and the two sets of data locations, each with a different symbol.

5.4 It sometimes happens with an irregularly shaped boundary that the function expand.grid() creates sample locations outside the sample area boundary. Use the function overlay() to create a SpatialPoints object that does not include locations in the set created in Exercise 5.2 lying outside the field boundary. Create a map that shows this sample plan together with the field boundary.

5.5 Assume the 86 sample values are the entire population (i.e., N = 86) of clay content values in Field 1 of Data Set 4. Suppose you want to estimate the mean.

(a) Compute the error in estimating the mean based on a random sample of six clay values.

(b) Suppose you have EM38 values (which are much easier to measure) at all 86 locations, taken on April 25 from the beds (see Appendix B.4). You can collect six soil cores. Use the EM38 data to stratify the sample, creating two zones, one of high clay and one of low clay (make them the same size). Collect a random sample totaling seven samples within each zone and compute the error in estimating the mean. Remember that the strata cannot be of equal size.

(c) Compare the results of parts (a) and (b) with the estimate obtained by taking a north–south transect of seven soil cores consisting of every other data location in the middle column of the data starting from sample point 4.

5.6 Suppose in the problem of Exercise 5.5 you have taken a north–south transect of seven soil cores consisting of every other data location in the middle column of the data starting from sample point 4. Use this together with the EM38 data to construct a model-based estimate of the mean of the 86 values of clay content.


6 Preparing Spatial Data for Analysis

6.1 Introduction

Georeferenced data usually require considerable manipulation and checking before they can be subjected to a statistical analysis. Point sample data, whether they are manually sampled (e.g., soil core data) or automatically sampled (e.g., LiDAR data), must often be converted from a spreadsheet or database format into a geographic information system (GIS) data file format such as the ESRI shapefile. Automatically sampled data often contain numerous outliers that must be detected and dealt with. Image data must often be georegistered to the earth’s surface. Moreover, spatial data are often misaligned, that is, they are not recorded at the same location and, in a sense that will be made clearer later in this chapter, they are often measured at different scales.

One of the issues that must be decided in the initial data processing phase is the projection in which to represent the data. In the Northern Hemisphere, the two most common choices are longitude–latitude and Universal Transverse Mercator (UTM) coordinates. This book does not provide any discussion of geographic coordinate systems. A complete discussion of this topic is given by Lo and Yeung (2007, chapter 2). Most global positioning systems (GPSs) permit the user to select either longitude–latitude in the WGS84 datum (Lo and Yeung, 2007, p. 40) or UTM. The advantage of working in UTM is that it is a conformal projection (Lo and Yeung, 2007, p. 43), that is, shapes of reasonably small areas on the ground are preserved on the map. A major disadvantage of UTM coordinates is that the UTM zones are only 6° of longitude in width, and as a result, many geographic features span more than one zone. For example, Data Set 2 is located in UTM Zones 10 and 11. As a general rule, it is a good practice when recording data in the field to record in the same projection as that of other layers in the data set, or, if these are in different native projections, to record in the same projection as that of the data set considered the most geographically accurate. R provides the capability to transform the projection of a data set (Section 2.4.3). Also, the functions in the sp package (Pebesma and Bivand, 2005; Bivand et al., 2008) can correct for the spherical shape of the earth when performing distance calculations using longitude and latitude.
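Two of the points above, the 6° width of the UTM zones and the need to account for the earth's curvature when computing distances from longitude and latitude, can be illustrated in base R. This is a rough sketch and is not taken from the sp package: the function names utm.zone() and haversine.km() are hypothetical, the zone formula ignores the special zones around Norway and Svalbard, and the distance assumes a spherical earth of radius 6371 km.

```r
# Standard 6-degree UTM zone number from longitude in decimal degrees
# (hypothetical helper; ignores the Norway and Svalbard exceptions).
utm.zone <- function(lon) {
  (floor((lon + 180) / 6) %% 60) + 1
}

# Great-circle distance in km by the haversine formula, assuming a
# spherical earth of radius R km (an approximation).
haversine.km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlon <- (lon2 - lon1) * to.rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# Two hypothetical points on either side of the 120 degrees W meridian
zones <- utm.zone(c(-120.5, -119.5))
d.km <- haversine.km(-120.5, 38, -119.5, 38)
print(zones)
print(d.km)
```

The two points are 1° of longitude apart yet fall in different zones, so their UTM eastings are not directly comparable, whereas a distance computed from longitude and latitude is unaffected by the zone boundary.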

In this chapter, we discuss data quality evaluation and refinement. Attribute data are often represented as a table in which the columns are the data fields and the rows are the data records (Section 2.4.1; Lo and Yeung, 2007, p. 76). In this representation, a data field is a specific attribute item such as mean annual precipitation, and a data record is the collection of the values of all the data fields for one occurrence of the data. We begin in Section 6.2 with a discussion of quality control of attribute data. Sections 6.3 and 6.4 deal with the spatial component. Section 6.3 presents a brief discussion of geostatistical interpolation procedures to estimate attribute values at different locations. Section 6.4 deals with the application of these procedures to the resolution of misalignment problems.
