• No se han encontrado resultados

El Centro de Agro-Cultura para refugiados

独 Centro de Agro-cultura para refugiados 独

5. El Centro de Agro-Cultura para refugiados

Implementation of the Bayesian approach as indicated in the previous sub-section depends on a willingness to assign probability distributions not only to data variables like y, but also to parameters like

may or may not be consistent with the usual long-run frequency notion of probability. For example, if

= true probability of success for a new surgical procedure, then it is possible (at least conceptually) to think of

of the observed success rate as the procedure is independently repeated again and again. But if

= true proportion of U.S. men who are HIV-positive,

the long-term frequency notion does not apply - it is not possible to even imagine "running the HIV epidemic over again" and reobserving

over, the randomness in

an accurate census of all men and their HIV status were available, be computed exactly. Rather, here

to us, though we may have some feelings about it (say, that more likely than

in subjective probability, wherein we quantify whatever feelings (however vague) we may have about

in (2.1) or (2.3)) with the resulting posterior distribution reflecting a blend of the information in the data and the prior.

Historically, a major impediment to widespread use of the Bayesian paradigm has been that determination of the appropriate form of the prior distributions are specified based on information accumulated from past studies or from the opinions of subject-area experts. In order to streamline the elicitation process, as well as simplify the subsequent computational burden, experimenters often limit this choice somewhat by restricting some familiar distributional family. An even simpler alternative, available

. Such a requirement

as the limiting value

. More-does not arise from any real-world mechanism; if could is random only because it is unknown

= .05 is

= .50). Bayesian analysis is predicated on such a belief before we look at the data y in a distribution . This distribution is then updated by the data via Bayes' Theorem (as

(and perhaps the hyperprior h) is often an arduous task. Typically, these

to

in some cases, is to endow the prior distribution with little informative content, so that the data from the current study will be the dominant force in determining the posterior distribution. We address each of these

approaches in turn.

2.2.1 Elicited Priors

Suppose for the moment that to specifying

values deemed "possible," and subsequently to assign probability masses to these values in such a way that their sum is 1, their relative contribu-tions reflecting the experimenter's prior beliefs as closely as possible. If is discrete-valued, such an approach may be quite natural, though perhaps quite time-consuming. If ,

to intervals on the real line, rather than to single points, resulting in a his-togram prior for

may seem inappropriate, especially in concert with a continuous likelihood but may in fact be more appropriate if the integrals required to compute the posterior distribution must be done numerically, since com-puters are themselves discrete finite instruments. Moreover, a histogram prior may have as many bins as the patience of the elicitee and the accu-racy of his prior opinion will allow. It is vitally important, however, that the range of the histogram be sufficiently wide, since, as can be seen from (2.1), the support of the posterior will necessarily be a subset of that of the prior. That is, the data cannot lend credence to intervals for

"impossible" in the prior.

Alternatively, we might simply assume that the prior for a parametric distributional family

matches the elicitee's true prior beliefs as nearly as possible. For exam-ple, if

mean and the variance) or two quantiles (say, the 50th and 95th) would be sufficient to determine its exact value. This approach limits the effort required of the elicitee, and also overcomes the finite support problem in-herent in the histogram approach. It may also lead to simplifications in the posterior computation, as we shall see in Subsection 2.2.2.

A limitation of this approach is of course that it may not be possible for the elicitee to "shoehorn" his or her prior beliefs into any of the standard parametric forms. In addition, two distributions which look virtually iden-tical may in fact have quite different properties. For example, Berger (1985, p. 79) points out that the Cauchy(0,1) and Normal(0, 2.19) distributions have identical 25th, 50th, and 75th percentiles (-1, 0, and l, respectively) and density functions that appear very similar when plotted, yet may lead to quite different posterior distributions.

An opportunity to experiment with these two types of prior elicitation is univariate. Perhaps the simplest approach is first to limit consideration to a manageable collection of

is continuous we must instead assign the masses . Such a histogram (necessarily over a bounded region)

deemed belongs to choosing so that the result were two-dimensional, then specification of two moments (say, the

in a simple setting is afforded below in Exercise 4. We now illustrate mul-tivariate elicitation in a challenging, real-life setting.

Example 2.3 In attempting to model the probability

response by person j in cue group i on a working memory exam question having k "simple conditions" and "query complexity" level l, Carlin, Kass, Lerch, and Huguenard (1992) consider the model

In this model,

to the number of simple conditions in the question,

increase in the logit error rate in moving from complexity level 0 to 1, and does the same in moving from complexity level 1 to 2. Prior information concerning the parameters was based primarily on rough impressions, pre-cluding precise prior elicitation. As such,

product of independent, normally distributed components, namely, (2.6) where N represents the univariate normal distribution and

the bivariate normal distribution. Hyperparameter mean parameters were chosen by eliciting "most likely" error rates, and subsequently converting to the logit scale. Next, variances were determined by considering the effect on the error rate scale of a postulated standard deviation (and resulting

95% confidence set) on the logit scale. Note that the symmetry imposed by the normal priors on the logit scale does not carry over to the error rate (probability) scale. While neither normality nor prior independence was deemed totally realistic, prior (2.6) was able to provide a suitable starting place, subsequently updated by the information in the experimental data.

Even when scientifically relevant prior information is available, elicitation of the precise forms for the prior distribution from the experimenters can be a long and tedious process. Some progress has been made in this area for normal linear models, most notably by Kadane et al. (1980). In the main, however, prior elicitation issues tend to be application- and elicitee-specific, meaning that general purpose algorithms are typically unavailable. Wolfson (1995) and Chaloner (1996) provide overviews of the various philosophies of elicitation (based on the ways that humans think about and update probabilistic statements), as well as reviews of the methods proposed for various models and distributional settings.

The difficulty of prior elicitation has been ameliorated somewhat through of an incorrect

denotes the effect for subject ij, denotes the effect due measures the marginal

was defined as a

represents

the addition of interactive computing, especially dynamic graphics, and object-oriented computer languages such as S (Becker, Chambers, and Wilks, 1988) or XLISP-STAT (Tierney, 1990).

Example 2.4 In the arena of monitoring clinical trials, Chaloner et al.

(1993) show how to combine histogram elicitation, matching a functional form, and interactive graphical methods. Following the advice of Kadane et al. (1980), these authors elicit a prior not on

servable proportional hazards regression parameter), but on corresponding observable quantities familiar to their medically-oriented elicitees, namely the proportion of individuals failing within two years in a population of control patients,

a population of treated patients,

t for the controls as S(t), under the proportional hazards model we then have that

(2.7) gives the relationship between

Since the effect of a treatment is typically thought of by clinicians as being relative to the baseline (control) failure probability

elicit a best guess for this rate,

then elicit an entire distribution for p1, beginning with initial guesses for the upper and lower quartiles of p1's distribution. These values determine initial guesses for the parameters

corresponds on the

functional form is then displayed on the computer screen and the elicitee is allowed to experiment with new values for

an even better fit to his or her true prior beliefs. Finally, fine tuning of the density is allowed via mouse input directly onto the screen. The density is restandardized after each such change, with updated quantiles computed for the elicitee's approval. At the conclusion of this process, the final prior distribution is discretized onto a suitably fine grid of points ( ), and finally converted to a histogram-type prior on

use in computing its posterior distribution.

2.2.2 Conjugate priors

In choosing a prior belonging to a specific distributional family

choices may be more computationally convenient than others. In particular, it may be possible to select a member of that family which is conjugateto the likelihood f(y|

(1983b) showed that exponential families, from which we typically draw our likelihood functions, do in fact have conjugate priors, so that this approach

and

Conditional on this modal value, they they first

via equation (2.7), for and in order to obtain scale to an extreme value distribution. This smooth of a smooth density function that and

Hence the equation

= 1 - S(2)

= 1 - S(2) and

and the corresponding two-year failure proportion in . Writing the survivor function at time (in this case, an

unob-some

), that is, one that leads to a posterior distribution p(- |y) belonging to the same distributional family as the prior. Morris

will often be available in practice. The computational advantages are best illustrated through an example.

Example 2.5 Suppose that X is the number of pregnant women arriving at a particular hospital to deliver their babies during a given month. The discrete count nature of the data plus its natural interpretation as an arrival rate suggest adopting a Poisson likelihood,

To effect a Bayesian analysis we require a prior distribution for

support on the positive real line. A reasonably flexible choice is provided by the gamma distribution,

or

distribution has mean one-tailed (

affected by

the posterior density, we have

(2.8) Notice that since our intended result is a normalized function of

able to drop any multiplicative functions that do not depend on

example, in the first line we have dropped the marginal distribution m(x) in the denominator, since it is free of

that it is proportional to a gamma distribution with parameters and

that integrates to 1 and density functions uniquely determine distributions, we know that the posterior distribution for

the gamma is the conjugate family for the Poisson likelihood.

In this example, we might have guessed the identity of the conjugate prior by looking at the Poisson likelihood as a function of

x as we typically do. Such a viewpoint reveals the

is also the basis of the gamma density. This technique is useful for deter-mining conjugate priors in many exponential families, as we shall see as we continue. Since conjugate priors can permit posterior distributions to emerge without numerical integration, we shall use them in many of our subsequent examples, in order to concentrate on foundational concerns and defer messy computational issues until Chapter 5.

instead of structure that

), and that is indeed G(

. Since this is the only function proportional to (2.8) . ) But now looking at (2.8), we see . (For but its shape is not.) Using Bayes' Theorem (2.1) to obtain

1) or two-tailed ( > 1). (The scale of the distribution is variance , , and can have a shape that is either

since we assume it to be known. The gamma 's dependence on = ( )

in distributional shorthand. Note that we have suppressed having

we are

For multiparameter models, independent conjugate priors may often be specified for each parameter, leading to corresponding conjugate forms for

each conditionalposterior distribution.

Example 2.6 Consider once again the normal likelihood of Example 2.1, where now both

Looking at f as a function of to

exp(-normal distribution. Thus we assume that = fore. Next, looking at f as a function of

to

In fact, this is actually the reciprocal of a gamma: if Y= 1/X

Appendix Section A.2)

Thus we assume that

and scale parameters, respectively. Finally, we assume that prioriindependent, so that

conjugate prior specification, the two conditional posterior distributions emerge as expression (2.2) and

We remark that in the normal setting, closed forms may also emerge for the marginal posterior densities

as an exercise. In any case, the ability of conjugate priors to produce at least unidimensional conditional posteriors in closed form enables them to retain their importance even in very high dimensional settings. This occurs through the use of Markov chain Monte Carlo integration techniques, which construct a sample from the joint posterior by successively sampling from the individual conditional posterior distributions.

Finally, while a single conjugate prior may be inadequate to accurately reflect available prior knowledge, a finite mixture of conjugate priors may be sufficiently flexible (allowing multimodality, heavier tails, etc.) while still enabling simplified posterior calculations. For example, suppose we are given the likelihood

the two-component mixture prior where both

and are assumed unknown, so that

alone, we see an expression proportional /b), so that the conjugate prior is again given by the as be-we see a form proportional which is vaguely reminiscent of the gamma distribution.

then Inverse Gamma( with density function (see

where a and b are the shape and are a With this componentwise

and we leave the details

for the univariate parameter and adopt and belong to the family of priors conjugate for f. Then

But this distribution is improper,in that and hence does not When we move to unbounded parameter spaces, the situation is even less clear. Suppose that

prior would appear to be

Then the appropriate uniform is arguably noninformative for

as we shall see later in this subsection and in the example of Subsec-tion 2.3.4).

does not favor any one of the candidate such, is noninformative for

parameter space, say distribution

2.2.3 Noninformative Priors

As alluded to earlier, often no reliable prior information concerning ists, or an inference based solely on the data is desired. At first, it might

appear that Bayesian inference would be inappropriate in such settings, but this conclusion is a bit hasty. Suppose we could find a distribution

that contained "no information" about favor one

ble). We might refer to such a distribution as a noninformative prior for and argue that all of the information resulting in the posterior arose from the data, and hence all resulting inferences were completely ob-jective, rather than subjective. Such an approach is likely to be important if Bayesian methods are to compete successfully in practice with their pop-ular likelihood-based counterparts (e.g., maximum likelihood estimation).

But is such a "noninformative approach" possible?

In some cases, the answer is an unambiguous "yes." For example, suppose the parameter space is discrete and finite, i.e.,

clearly the distribution where

mixture of conjugate priors leads to a mixture of conjugate posteriors.

(though this conclusion can be questioned, then the uniform If instead we have a bounded continuous

values over any other, and, as Then value over another (provided both values were logically possi-in the sense that it did not

ex-and Hence a

a prior that is clearly not uniform. Thus, using uniformity as a universal appear appropriate for use as a prior. But even here, Bayesian inference is still possible if the integral with respect to

some finite value K. (Note that this will not necessarily be the case: the definition of f as a joint probability density requires only the finiteness of its integral with respect to x.) Then

Since the integral of this function with respect to

posterior density, and hence Bayesian inference may proceed as usual. Still, it is worth reemphasizing that care must be taken when using improper priors, since proper posteriors will not always result. A good example (to which we return in Chapter 5 in the context of computation for high-dimensional models) is given by random effects models for longitudinal data, where the dimension of the parameter space increases with the sample size. In such models, the information in the data may be insufficient to identify all the parameters and, consequently, at least some of the prior distributions on the individual parameters must be informative.

Noninformative priors are closely related to the notion of a reference though precisely what is meant by this term depends on the author.

Kass and Wasserman (1996) define such a prior as a conventional or "de-fault" choice - not necessarily noninformative, but a convenient place to begin the analysis. The definition of Bernardo (1979) stays closer to non-informativity, using an expected utility approach to obtain a prior which represents ignorance about

this approach (and others that are similar), a unique representation for the reference prior is not always possible. Box and Tiao (1973, p. 23) suggest that all that is important is that the data dominate whatever information is contained in the prior, since as long as this happens, the precise form of the prior is unimportant. They use the analogy of a properly run jury trial, wherein the weight of evidence presented in court dominates whatever prior ideas or biases may be held by the members of the jury.

Putting aside these subtle differences in definition for a moment, we ask the question of whether a uniform prior is always a good "reference."

Sadly, the answer is no, since it is not invariant under reparametrization.

As a simple example, suppose we claim ignorance concerning a univariate parameter

1,

this converts the support of the parameter to the whole real line. Then the prior on

the inverse transformation. Here we have and hence

of the likelihood equals

is 1, it is indeed a proper

in some formal sense. Unfortunately, using

defined on the positive real line, and adopt the prior p(

> 0. A sensible reparametrization is given by since

is given by where J = the Jacobian of

so that J =

definition of prior ignorance, it is possible that "ignorance about not imply "ignorance about

rely on the particular modeling context to provide the most reasonable parametrization and, subsequently, apply the uniform prior on this scale.

The prior (Jeffreys, 1961, p.181) offers a fairly easy-to-compute alternative that is invariant under transformation. In the univariate case, this prior is given by

where

Notice that the form of the likelihood helps to determine the prior (2.10), but not the actual observed data since we are averaging over x in equation (2.11).

It is left as an exercise to show that the Jeffreys prior is invariant to 1-1 transformation. That is, in the notation of the previous paragraph, it is true that

(2.12) so that computing the Jeffreys prior for

as computing the Jeffreys prior for Jacobian transformation to the

In the multiparameter case, the Jeffreys prior is given by

(2.13) where

information matrix, having ij-element

While (2.13) provides a general recipe for obtaining noninformative pri-ors, it can be cumbersome to use in high dimensions. A more common approach is to obtain a noninformative prior for each parameter individu-ally, and then form the joint prior simply as the product of these individual priors. This action is often justified on the grounds that "ignorance" is con-sistent with "independence," although since noninformative priors are often

While (2.13) provides a general recipe for obtaining noninformative pri-ors, it can be cumbersome to use in high dimensions. A more common approach is to obtain a noninformative prior for each parameter individu-ally, and then form the joint prior simply as the product of these individual priors. This action is often justified on the grounds that "ignorance" is con-sistent with "independence," although since noninformative priors are often