Infraestructura Básica: - DOCUMENTOS DE LICITACIÓN DSC-L-08/2014

The typical Bayesian method of model comparison is the posterior odds ratio, which is the relative probability of two completely specified models. However, there are some cases where the researcher is interested in investigating the performance of a model in some absolute sense, not relative to a specific alternative model. Also, there are many cases in which the researcher might want to use an improper, noninformative prior and, as discussed in Chapter 3, posterior odds can be meaningless if such a prior is used on parameters which are not common to all models. In both such situations, the posterior predictive p-value approach is a sensible alternative to the posterior odds ratio. The reader interested in more details about this approach is referred to Gelman and Meng (1996). For some innovative extensions of the basic approach outlined below, the reader is referred to Bayarri and Berger (2000).

To motivate the posterior predictive p-value approach, it is important to distinguish between $y$, the data actually observed, and $y^{\dagger}$, observable data which could be generated from the model under study (i.e. $y^{\dagger}$ is an $N \times 1$ random vector with p.d.f. $p(y^{\dagger}|\theta)$, where the latter is the likelihood function without $y$ plugged in). Let $g(\cdot)$ be some function of interest. $p(g(y^{\dagger})|y)$ summarizes everything our model says about $g(y^{\dagger})$ after seeing the data. In other words, it tells us the types of data sets that our model can generate. For the observed data we can directly calculate $g(y)$. If $g(y)$ is in the extreme tails of $p(g(y^{\dagger})|y)$, then the model cannot do a good job of explaining $g(y)$ (i.e. $g(y)$ is not the sort of data characteristic that can plausibly be generated by the model). Formally, we can obtain tail area probabilities in a manner similar to frequentist p-value calculations. In particular, the posterior predictive p-value is the probability of a model yielding a data set with more extreme properties than that actually observed (i.e. analogous to a frequentist p-value; you may wish to present either a one-tailed or two-tailed p-value).

$p(g(y^{\dagger})|y)$ can be calculated using simulation methods in a manner which is very similar to the one we used for predictive inference. That is, analogous to (4.28) and the discussion which follows it, we can write

$$p(g(y^{\dagger})|y) = \int p(g(y^{\dagger})|\theta, y)\, p(\theta|y)\, d\theta = \int p(g(y^{\dagger})|\theta)\, p(\theta|y)\, d\theta \qquad (5.13)$$

where the last equality follows from the fact that, conditional on $\theta$, the actual data provides no additional information about $y^{\dagger}$. The posterior simulator provides draws from $p(\theta|y)$, and we can simulate from $p(g(y^{\dagger})|\theta)$ by merely simulating artificial data from the model for a given parameter value in a manner identical to that used for prediction (see (4.30)–(4.32) of Chapter 4).
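The simulation idea behind (5.13) can be sketched as follows. This is only an illustration under assumptions that are not in the text: a toy model $y_i \sim N(\theta, 1)$ with a flat prior (so that $\theta|y \sim N(\bar{y}, 1/N)$ and posterior draws are available directly), the choice $g(y) = \max_i y_i$, and the hypothetical helper name `simulate_g_predictive`.

```python
import numpy as np

def simulate_g_predictive(theta_draws, simulate_data, g, rng):
    """Approximate p(g(y_dagger) | y): for each posterior draw theta^(s),
    simulate one artificial data set and evaluate g on it."""
    return np.array([g(simulate_data(theta, rng)) for theta in theta_draws])

# Toy setting (an assumption for illustration, not the model in the text):
# y_i ~ N(theta, 1) with a flat prior, so theta | y ~ N(mean(y), 1/N).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)
N = y.size
theta_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(N), size=2000)

g_dagger = simulate_g_predictive(
    theta_draws,
    simulate_data=lambda theta, r: r.normal(loc=theta, scale=1.0, size=N),
    g=np.max,  # illustrative function of interest
    rng=rng,
)

# Tail-area probability: share of simulated g(y_dagger) at least as large as g(y).
tail_prob = np.mean(g_dagger >= np.max(y))
```

Each element of `g_dagger` is one draw from $p(g(y^{\dagger})|y)$, so locating $g(y)$ within these draws gives the tail area probability.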

Posterior predictive p-values can be used in two different ways. First, they can be used as a measure of fit, of how likely the model was to have generated the data in an absolute sense. Secondly, they can be used to compare different models. That is, if one model yields posterior predictive p-values which are much lower than another, this is evidence against the former model. However, most Bayesians prefer posterior odds ratios for the latter unless the use of noninformative priors makes the posterior odds ratios meaningless or difficult to interpret.

The posterior predictive p-value approach requires the selection of a function of interest, $g(\cdot)$. The exact choice of $g(\cdot)$ will vary depending upon the empirical application. To take a practical example, let us return to the nonlinear regression model. For this model, we have

$$y_i^{\dagger} = f(X_i, \gamma) + \varepsilon_i$$

for $i = 1, \ldots, N$. Alternatively, given the assumptions we have made about the errors,

$$p(y^{\dagger}|\gamma, h) = f_N(y^{\dagger}|f(X, \gamma), h^{-1} I_N) \qquad (5.14)$$

where $f(X, \gamma)$ is the $N$-vector defined in (5.2). Note that, for given values of the parameters of the model, simulating values for $y^{\dagger}$ is quite simple, involving only taking draws from the multivariate Normal. This simplicity is common to many models, making the posterior predictive p-value easy to calculate in a wide variety of cases.

For the nonlinear regression model with noninformative prior given in (5.4), (5.14) can be simplified even further, since $h$ can be integrated out. In particular, using a derivation virtually identical to that required to go from (5.5) to (5.6), it can be shown that

$$p(y^{\dagger}|\gamma) = f_t(y^{\dagger}|f(X, \gamma), s^2 I_N, N) \qquad (5.15)$$

where

$$s^2 = \frac{[y - f(X, \gamma)]'[y - f(X, \gamma)]}{N} \qquad (5.16)$$
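A draw from the multivariate t density in (5.15) can be sketched as below. The sketch relies on the standard representation of a multivariate t vector as a Normal vector divided by the square root of an independent chi-square variate over its degrees of freedom. The function names `s_squared` and `draw_y_dagger` are hypothetical, and `f_x` is assumed to hold the already-evaluated vector $f(X, \gamma)$.

```python
import numpy as np

def s_squared(y, f_x):
    """Equation (5.16): s^2 = (y - f)'(y - f) / N."""
    resid = y - f_x
    return resid @ resid / y.size

def draw_y_dagger(y, f_x, rng):
    """One draw from the multivariate t in (5.15): location f_x,
    scale matrix s^2 I_N, and N degrees of freedom."""
    n = y.size
    z = rng.standard_normal(n)  # N(0, I_N) vector
    w = rng.chisquare(n)        # a single chi-square(N) draw shared by all elements
    return f_x + np.sqrt(s_squared(y, f_x) * n / w) * z

# Usage with made-up numbers (assumptions for illustration only).
rng = np.random.default_rng(1)
y_obs = np.array([1.0, 2.0, 4.0])
f_vals = np.array([1.5, 2.5, 3.0])
y_dagger = draw_y_dagger(y_obs, f_vals, rng)
```

Sharing one chi-square draw across all $N$ elements is what makes this a multivariate t draw rather than $N$ independent univariate t draws.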

Hence, conditional on $\gamma$, draws of $y^{\dagger}$ can be taken using the multivariate t distribution. These draws can be interpreted as reflecting the sorts of data sets that this model can generate. The posterior predictive p-value approach uses the idea that, if the model is a reasonable one, the actual observed data set should be of the type which is commonly generated by the model. Finding out at what percentile the point $g(y)$ lies in the density $p(g(y^{\dagger})|y)$ is the formal metric used.

To make things more concrete, let us digress briefly to motivate a few choices for $g(\cdot)$. It is common to evaluate the fit of a model through residual analysis. The frequentist econometrician might calculate OLS estimates of the errors, $\varepsilon_i$, and call these residuals. The properties of these residuals can then be investigated to shed light on whether the assumptions underlying the model are reasonable. In the Bayesian context, the errors are given, for $i = 1, \ldots, N$, by

$$\varepsilon_i = y_i - f(X_i, \gamma)$$

We have assumed these errors to have various properties. In particular, we have assumed them to be i.i.d. $N(0, h^{-1})$. These assumptions might be unreasonable in a particular data set and, hence, the researcher may wish to test them. The brief statement ‘the errors are i.i.d. Normal’ involves many assumptions (e.g. the assumption that the errors are independent of one another, that they have a common variance, etc.), and the researcher might choose to investigate any of them. Here we will focus on aspects of the Normality assumption. Two of the properties of the Normal are that it is symmetric, and its tails have a particular shape. In terms of statistical jargon, the Normal distribution does not exhibit skewness and its tails have a particular kurtosis (see Appendix B, Definition B.8). Skewness and kurtosis are measured in terms of the third and fourth moments of the distribution and, for the standard Normal (i.e. the $N(0, 1)$), the third moment is zero and the fourth moment is three. The Normality assumption thus implies that the following commonly-used measures of skewness and excess kurtosis should both be zero:

$$\text{Skew} = \frac{\sqrt{N} \sum_{i=1}^{N} \varepsilon_i^3}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{3/2}} \qquad (5.17)$$

and

$$\text{Kurt} = \frac{N \sum_{i=1}^{N} \varepsilon_i^4}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{2}} - 3 \qquad (5.18)$$
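The two measures in (5.17) and (5.18) translate directly into code. The following is a minimal sketch; the function names `skew` and `excess_kurt` are illustrative choices, and the input is assumed to be a vector of errors evaluated at some given parameter value.

```python
import numpy as np

def skew(eps):
    """Equation (5.17): sqrt(N) * sum(eps^3) / (sum(eps^2))^(3/2)."""
    eps = np.asarray(eps, dtype=float)
    return np.sqrt(eps.size) * np.sum(eps**3) / np.sum(eps**2) ** 1.5

def excess_kurt(eps):
    """Equation (5.18): N * sum(eps^4) / (sum(eps^2))^2 - 3."""
    eps = np.asarray(eps, dtype=float)
    return eps.size * np.sum(eps**4) / np.sum(eps**2) ** 2 - 3.0

# A sample that is symmetric about zero has zero skewness.
symmetric = skew([-1.0, 0.0, 1.0])    # 0.0
# A sample with one large positive error has positive skewness.
right_tail = skew([-1.0, -1.0, 2.0])  # 1 / sqrt(2), about 0.707
```

Note that both measures use raw moments of the errors without subtracting a sample mean, exactly as written in (5.17) and (5.18).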

These measures of skewness and excess kurtosis cannot be calculated directly since $\varepsilon_i$ is unobserved. The frequentist econometrician would replace $\varepsilon_i$ by the appropriate OLS residual in the preceding formulae and use the result to carry out a test for skewness or excess kurtosis. A finding of either skewness or excess kurtosis in the residuals indicates that the Normality assumption is an inappropriate one.

A Bayesian analogue to this frequentist procedure would be to calculate the expected value of either (5.17) or (5.18) and see whether it is reasonable. Formally,

$$E[\text{Skew}|y] = E\left\{ \left. \frac{\sqrt{N} \sum_{i=1}^{N} [y_i - f(X_i, \gamma)]^3}{\left[\sum_{i=1}^{N} [y_i - f(X_i, \gamma)]^2\right]^{3/2}} \,\right|\, y \right\}$$

is something that we can calculate in a straightforward fashion once we have a posterior simulator. That is, Skew is simply a function of the model parameters (and the data) and, hence, its posterior mean can be calculated in the same way as the posterior mean of any function of interest can be calculated (e.g. see (5.7)). $E[\text{Kurt}|y]$ can be calculated in the same fashion. If the Normality assumption is a reasonable one, $E[\text{Kurt}|y]$ and $E[\text{Skew}|y]$ should both be roughly zero.

Let us now return to the topic of posterior predictive p-values, which can be used to formalize the ideas of the previous paragraph. As stressed in the previous paragraph, $E[\text{Skew}|y]$ and $E[\text{Kurt}|y]$ are functions of the observed data and can be calculated using the posterior simulator. For any observable data, $y^{\dagger}$, $E[\text{Skew}|y^{\dagger}]$ and $E[\text{Kurt}|y^{\dagger}]$ can be calculated in the same fashion. If we calculate these latter functions for a wide variety of observable data sets, we can obtain distributions of values for skewness and excess kurtosis, respectively, that this model is able to generate. If either $E[\text{Skew}|y]$ or $E[\text{Kurt}|y]$ lies far out in the tails of the distribution of $E[\text{Skew}|y^{\dagger}]$ or $E[\text{Kurt}|y^{\dagger}]$, respectively, this is evidence against the assumption of Normality. It is worth stressing that $E[\text{Skew}|y]$ and $E[\text{Kurt}|y]$ are both simply numbers, whereas $E[\text{Skew}|y^{\dagger}]$ and $E[\text{Kurt}|y^{\dagger}]$ are both random variables with probability distributions calculated using (5.13). In terms of our previous notation, we are setting $g(y) = E[\text{Skew}|y]$ or $E[\text{Kurt}|y]$ and $g(y^{\dagger}) = E[\text{Skew}|y^{\dagger}]$ or $E[\text{Kurt}|y^{\dagger}]$.

In practice, a program which calculates posterior predictive p-values for skewness for the nonlinear regression model using the noninformative prior would involve the following steps. The case of excess kurtosis (or any other function of interest) can be done in the same manner. These steps assume that you have derived a posterior simulator (i.e. a Metropolis–Hastings algorithm which is producing draws from the posterior). Details for how such a posterior simulator can be programmed up are given in the previous section.

Step 1: Take a draw, $\gamma^{(s)}$, using the posterior simulator.

Step 2: Generate a representative data set, $y^{\dagger(s)}$, from $p(y^{\dagger}|\gamma^{(s)})$ using (5.15).

Step 3: Set $\varepsilon_i^{(s)} = y_i - f(X_i, \gamma^{(s)})$ for $i = 1, \ldots, N$ and evaluate (5.17) at this point to obtain $\text{Skew}^{(s)}$.

Step 4: Set $\varepsilon_i^{\dagger(s)} = y_i^{\dagger(s)} - f(X_i, \gamma^{(s)})$ for $i = 1, \ldots, N$ and evaluate (5.17) at this point to obtain $\text{Skew}^{\dagger(s)}$.

Step 5: Repeat Steps 1, 2, 3 and 4 $S$ times.

Step 6: Take the average of the $S$ draws $\text{Skew}^{(1)}, \ldots, \text{Skew}^{(S)}$ to give an estimate of $E[\text{Skew}|y]$.

Step 7: Calculate the proportion of the $S$ draws $\text{Skew}^{\dagger(1)}, \ldots, \text{Skew}^{\dagger(S)}$ which are smaller than your estimate of $E[\text{Skew}|y]$ from Step 6. If this number is less than 0.5 then it is your estimate of the posterior predictive p-value. Otherwise the posterior predictive p-value is one minus this number.

There is no hard and fast rule for exactly what value of the posterior predictive p-value should be taken as evidence against a model. A useful rule of thumb is to take a posterior predictive p-value of less than 0.05 (or 0.01) as evidence against a model. Remember that, if the posterior predictive p-value for skewness is equal to 0.05, then we can say “This model generates measures of skewness greater than the one actually observed only five percent of the time. Hence, it is fairly unlikely that this model generated the observed data.”
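Steps 1 to 7 can be put together in a short program. The sketch below rests on several assumptions that are not taken from the text: the regression function `f` is a hypothetical two-parameter example, the data are simulated, and the posterior simulator is a plain random-walk Metropolis chain on the marginal posterior of $\gamma$, which under the noninformative prior is proportional to the sum of squared errors raised to the power $-N/2$. The step size, burn-in and number of draws are arbitrary illustrative values, not tuned recommendations.

```python
import numpy as np

def f(x, gamma):
    # Hypothetical regression function, chosen only for illustration.
    return gamma[0] * x ** gamma[1]

def skew(eps):
    # Equation (5.17).
    return np.sqrt(eps.size) * np.sum(eps**3) / np.sum(eps**2) ** 1.5

def log_post(gamma, x, y):
    # log p(gamma | y) up to a constant under the noninformative prior:
    # -(N/2) times the log of the sum of squared errors.
    resid = y - f(x, gamma)
    return -0.5 * y.size * np.log(resid @ resid)

def ppp_skew(x, y, gamma0, n_draws=3000, burn=500, step=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n = y.size
    gamma = np.asarray(gamma0, dtype=float)
    lp = log_post(gamma, x, y)
    skews, skews_dagger = [], []
    for s in range(burn + n_draws):  # Step 5: repeat S times after burn-in
        # Step 1: random-walk Metropolis update giving gamma^(s).
        cand = gamma + step * rng.standard_normal(gamma.size)
        lp_cand = log_post(cand, x, y)
        if np.log(rng.uniform()) < lp_cand - lp:
            gamma, lp = cand, lp_cand
        if s < burn:
            continue
        fx = f(x, gamma)
        resid = y - fx
        # Step 2: y_dagger^(s) from the multivariate t in (5.15).
        s2 = resid @ resid / n
        y_dag = fx + np.sqrt(s2 * n / rng.chisquare(n)) * rng.standard_normal(n)
        # Steps 3 and 4: skewness of the actual and the simulated errors.
        skews.append(skew(resid))
        skews_dagger.append(skew(y_dag - fx))
    # Step 6: estimate of E[Skew | y].
    e_skew = np.mean(skews)
    # Step 7: proportion of simulated values below the estimate,
    # folded so that the smaller tail is reported.
    prop = np.mean(np.array(skews_dagger) < e_skew)
    return e_skew, min(prop, 1.0 - prop)

# Simulated data with Normal errors (an assumption for illustration).
rng = np.random.default_rng(42)
x = rng.uniform(1.0, 3.0, size=100)
y = f(x, np.array([2.0, 0.5])) + 0.2 * rng.standard_normal(100)
e_skew, p_value = ppp_skew(x, y, gamma0=[2.0, 0.5])
```

Replacing `skew` with a function implementing (5.18) gives the corresponding p-value for excess kurtosis. In a real application the chain would need to be checked for convergence before the p-value is interpreted.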

5.7 MODEL COMPARISON:
