独 Centro de Agro-cultura para refugiados 独
5. El Centro de Agro-Cultura para refugiados
Implementation of the Bayesian approach as indicated in the previous sub-section depends on a willingness to assign probability distributions not only to data variables like y, but also to parameters like
may or may not be consistent with the usual long-run frequency notion of probability. For example, if
= true probability of success for a new surgical procedure, then it is possible (at least conceptually) to think of
of the observed success rate as the procedure is independently repeated again and again. But if
= true proportion of U.S. men who are HIV-positive,
the long-term frequency notion does not apply - it is not possible to even imagine "running the HIV epidemic over again" and reobserving
over, the randomness in
an accurate census of all men and their HIV status were available, be computed exactly. Rather, here
to us, though we may have some feelings about it (say, that more likely than
in subjective probability, wherein we quantify whatever feelings (however vague) we may have about
in (2.1) or (2.3)) with the resulting posterior distribution reflecting a blend of the information in the data and the prior.
Historically, a major impediment to widespread use of the Bayesian paradigm has been that determination of the appropriate form of the prior distributions are specified based on information accumulated from past studies or from the opinions of subject-area experts. In order to streamline the elicitation process, as well as simplify the subsequent computational burden, experimenters often limit this choice somewhat by restricting some familiar distributional family. An even simpler alternative, available
. Such a requirement
as the limiting value
. More-does not arise from any real-world mechanism; if could is random only because it is unknown
= .05 is
= .50). Bayesian analysis is predicated on such a belief before we look at the data y in a distribution . This distribution is then updated by the data via Bayes' Theorem (as
(and perhaps the hyperprior h) is often an arduous task. Typically, these
to
in some cases, is to endow the prior distribution with little informative content, so that the data from the current study will be the dominant force in determining the posterior distribution. We address each of these
approaches in turn.
2.2.1 Elicited Priors
Suppose for the moment that to specifying
values deemed "possible," and subsequently to assign probability masses to these values in such a way that their sum is 1, their relative contribu-tions reflecting the experimenter's prior beliefs as closely as possible. If is discrete-valued, such an approach may be quite natural, though perhaps quite time-consuming. If ,
to intervals on the real line, rather than to single points, resulting in a his-togram prior for
may seem inappropriate, especially in concert with a continuous likelihood but may in fact be more appropriate if the integrals required to compute the posterior distribution must be done numerically, since com-puters are themselves discrete finite instruments. Moreover, a histogram prior may have as many bins as the patience of the elicitee and the accu-racy of his prior opinion will allow. It is vitally important, however, that the range of the histogram be sufficiently wide, since, as can be seen from (2.1), the support of the posterior will necessarily be a subset of that of the prior. That is, the data cannot lend credence to intervals for
"impossible" in the prior.
Alternatively, we might simply assume that the prior for a parametric distributional family
matches the elicitee's true prior beliefs as nearly as possible. For exam-ple, if
mean and the variance) or two quantiles (say, the 50th and 95th) would be sufficient to determine its exact value. This approach limits the effort required of the elicitee, and also overcomes the finite support problem in-herent in the histogram approach. It may also lead to simplifications in the posterior computation, as we shall see in Subsection 2.2.2.
A limitation of this approach is of course that it may not be possible for the elicitee to "shoehorn" his or her prior beliefs into any of the standard parametric forms. In addition, two distributions which look virtually iden-tical may in fact have quite different properties. For example, Berger (1985, p. 79) points out that the Cauchy(0,1) and Normal(0, 2.19) distributions have identical 25th, 50th, and 75th percentiles (-1, 0, and l, respectively) and density functions that appear very similar when plotted, yet may lead to quite different posterior distributions.
An opportunity to experiment with these two types of prior elicitation is univariate. Perhaps the simplest approach is first to limit consideration to a manageable collection of
is continuous we must instead assign the masses . Such a histogram (necessarily over a bounded region)
deemed belongs to choosing so that the result were two-dimensional, then specification of two moments (say, the
in a simple setting is afforded below in Exercise 4. We now illustrate mul-tivariate elicitation in a challenging, real-life setting.
Example 2.3 In attempting to model the probability
response by person j in cue group i on a working memory exam question having k "simple conditions" and "query complexity" level l, Carlin, Kass, Lerch, and Huguenard (1992) consider the model
In this model,
to the number of simple conditions in the question,
increase in the logit error rate in moving from complexity level 0 to 1, and does the same in moving from complexity level 1 to 2. Prior information concerning the parameters was based primarily on rough impressions, pre-cluding precise prior elicitation. As such,
product of independent, normally distributed components, namely, (2.6) where N represents the univariate normal distribution and
the bivariate normal distribution. Hyperparameter mean parameters were chosen by eliciting "most likely" error rates, and subsequently converting to the logit scale. Next, variances were determined by considering the effect on the error rate scale of a postulated standard deviation (and resulting
95% confidence set) on the logit scale. Note that the symmetry imposed by the normal priors on the logit scale does not carry over to the error rate (probability) scale. While neither normality nor prior independence was deemed totally realistic, prior (2.6) was able to provide a suitable starting place, subsequently updated by the information in the experimental data.
Even when scientifically relevant prior information is available, elicitation of the precise forms for the prior distribution from the experimenters can be a long and tedious process. Some progress has been made in this area for normal linear models, most notably by Kadane et al. (1980). In the main, however, prior elicitation issues tend to be application- and elicitee-specific, meaning that general purpose algorithms are typically unavailable. Wolfson (1995) and Chaloner (1996) provide overviews of the various philosophies of elicitation (based on the ways that humans think about and update probabilistic statements), as well as reviews of the methods proposed for various models and distributional settings.
The difficulty of prior elicitation has been ameliorated somewhat through of an incorrect
denotes the effect for subject ij, denotes the effect due measures the marginal
was defined as a
represents
the addition of interactive computing, especially dynamic graphics, and object-oriented computer languages such as S (Becker, Chambers, and Wilks, 1988) or XLISP-STAT (Tierney, 1990).
Example 2.4 In the arena of monitoring clinical trials, Chaloner et al.
(1993) show how to combine histogram elicitation, matching a functional form, and interactive graphical methods. Following the advice of Kadane et al. (1980), these authors elicit a prior not on
servable proportional hazards regression parameter), but on corresponding observable quantities familiar to their medically-oriented elicitees, namely the proportion of individuals failing within two years in a population of control patients,
a population of treated patients,
t for the controls as S(t), under the proportional hazards model we then have that
(2.7) gives the relationship between
Since the effect of a treatment is typically thought of by clinicians as being relative to the baseline (control) failure probability
elicit a best guess for this rate,
then elicit an entire distribution for p1, beginning with initial guesses for the upper and lower quartiles of p1's distribution. These values determine initial guesses for the parameters
corresponds on the
functional form is then displayed on the computer screen and the elicitee is allowed to experiment with new values for
an even better fit to his or her true prior beliefs. Finally, fine tuning of the density is allowed via mouse input directly onto the screen. The density is restandardized after each such change, with updated quantiles computed for the elicitee's approval. At the conclusion of this process, the final prior distribution is discretized onto a suitably fine grid of points ( ), and finally converted to a histogram-type prior on
use in computing its posterior distribution.
2.2.2 Conjugate priors
In choosing a prior belonging to a specific distributional family
choices may be more computationally convenient than others. In particular, it may be possible to select a member of that family which is conjugateto the likelihood f(y|
(1983b) showed that exponential families, from which we typically draw our likelihood functions, do in fact have conjugate priors, so that this approach
and
Conditional on this modal value, they they first
via equation (2.7), for and in order to obtain scale to an extreme value distribution. This smooth of a smooth density function that and
Hence the equation
= 1 - S(2)
= 1 - S(2) and
and the corresponding two-year failure proportion in . Writing the survivor function at time (in this case, an
unob-some
), that is, one that leads to a posterior distribution p(- |y) belonging to the same distributional family as the prior. Morris
will often be available in practice. The computational advantages are best illustrated through an example.
Example 2.5 Suppose that X is the number of pregnant women arriving at a particular hospital to deliver their babies during a given month. The discrete count nature of the data plus its natural interpretation as an arrival rate suggest adopting a Poisson likelihood,
To effect a Bayesian analysis we require a prior distribution for
support on the positive real line. A reasonably flexible choice is provided by the gamma distribution,
or
distribution has mean one-tailed (
affected by
the posterior density, we have
(2.8) Notice that since our intended result is a normalized function of
able to drop any multiplicative functions that do not depend on
example, in the first line we have dropped the marginal distribution m(x) in the denominator, since it is free of
that it is proportional to a gamma distribution with parameters and
that integrates to 1 and density functions uniquely determine distributions, we know that the posterior distribution for
the gamma is the conjugate family for the Poisson likelihood.
In this example, we might have guessed the identity of the conjugate prior by looking at the Poisson likelihood as a function of
x as we typically do. Such a viewpoint reveals the
is also the basis of the gamma density. This technique is useful for deter-mining conjugate priors in many exponential families, as we shall see as we continue. Since conjugate priors can permit posterior distributions to emerge without numerical integration, we shall use them in many of our subsequent examples, in order to concentrate on foundational concerns and defer messy computational issues until Chapter 5.
instead of structure that
), and that is indeed G(
. Since this is the only function proportional to (2.8) . ) But now looking at (2.8), we see . (For but its shape is not.) Using Bayes' Theorem (2.1) to obtain
1) or two-tailed ( > 1). (The scale of the distribution is variance , , and can have a shape that is either
since we assume it to be known. The gamma 's dependence on = ( )
in distributional shorthand. Note that we have suppressed having
we are
For multiparameter models, independent conjugate priors may often be specified for each parameter, leading to corresponding conjugate forms for
each conditionalposterior distribution.
Example 2.6 Consider once again the normal likelihood of Example 2.1, where now both
Looking at f as a function of to
exp(-normal distribution. Thus we assume that = fore. Next, looking at f as a function of
to
In fact, this is actually the reciprocal of a gamma: if Y= 1/X
Appendix Section A.2)
Thus we assume that
and scale parameters, respectively. Finally, we assume that prioriindependent, so that
conjugate prior specification, the two conditional posterior distributions emerge as expression (2.2) and
We remark that in the normal setting, closed forms may also emerge for the marginal posterior densities
as an exercise. In any case, the ability of conjugate priors to produce at least unidimensional conditional posteriors in closed form enables them to retain their importance even in very high dimensional settings. This occurs through the use of Markov chain Monte Carlo integration techniques, which construct a sample from the joint posterior by successively sampling from the individual conditional posterior distributions.
Finally, while a single conjugate prior may be inadequate to accurately reflect available prior knowledge, a finite mixture of conjugate priors may be sufficiently flexible (allowing multimodality, heavier tails, etc.) while still enabling simplified posterior calculations. For example, suppose we are given the likelihood
the two-component mixture prior where both
and are assumed unknown, so that
alone, we see an expression proportional /b), so that the conjugate prior is again given by the as be-we see a form proportional which is vaguely reminiscent of the gamma distribution.
then Inverse Gamma( with density function (see
where a and b are the shape and are a With this componentwise
and we leave the details
for the univariate parameter and adopt and belong to the family of priors conjugate for f. Then
But this distribution is improper,in that and hence does not When we move to unbounded parameter spaces, the situation is even less clear. Suppose that
prior would appear to be
Then the appropriate uniform is arguably noninformative for
as we shall see later in this subsection and in the example of Subsec-tion 2.3.4).
does not favor any one of the candidate such, is noninformative for
parameter space, say distribution
2.2.3 Noninformative Priors
As alluded to earlier, often no reliable prior information concerning ists, or an inference based solely on the data is desired. At first, it might
appear that Bayesian inference would be inappropriate in such settings, but this conclusion is a bit hasty. Suppose we could find a distribution
that contained "no information" about favor one
ble). We might refer to such a distribution as a noninformative prior for and argue that all of the information resulting in the posterior arose from the data, and hence all resulting inferences were completely ob-jective, rather than subjective. Such an approach is likely to be important if Bayesian methods are to compete successfully in practice with their pop-ular likelihood-based counterparts (e.g., maximum likelihood estimation).
But is such a "noninformative approach" possible?
In some cases, the answer is an unambiguous "yes." For example, suppose the parameter space is discrete and finite, i.e.,
clearly the distribution where
mixture of conjugate priors leads to a mixture of conjugate posteriors.
(though this conclusion can be questioned, then the uniform If instead we have a bounded continuous
values over any other, and, as Then value over another (provided both values were logically possi-in the sense that it did not
ex-and Hence a
a prior that is clearly not uniform. Thus, using uniformity as a universal appear appropriate for use as a prior. But even here, Bayesian inference is still possible if the integral with respect to
some finite value K. (Note that this will not necessarily be the case: the definition of f as a joint probability density requires only the finiteness of its integral with respect to x.) Then
Since the integral of this function with respect to
posterior density, and hence Bayesian inference may proceed as usual. Still, it is worth reemphasizing that care must be taken when using improper priors, since proper posteriors will not always result. A good example (to which we return in Chapter 5 in the context of computation for high-dimensional models) is given by random effects models for longitudinal data, where the dimension of the parameter space increases with the sample size. In such models, the information in the data may be insufficient to identify all the parameters and, consequently, at least some of the prior distributions on the individual parameters must be informative.
Noninformative priors are closely related to the notion of a reference though precisely what is meant by this term depends on the author.
Kass and Wasserman (1996) define such a prior as a conventional or "de-fault" choice - not necessarily noninformative, but a convenient place to begin the analysis. The definition of Bernardo (1979) stays closer to non-informativity, using an expected utility approach to obtain a prior which represents ignorance about
this approach (and others that are similar), a unique representation for the reference prior is not always possible. Box and Tiao (1973, p. 23) suggest that all that is important is that the data dominate whatever information is contained in the prior, since as long as this happens, the precise form of the prior is unimportant. They use the analogy of a properly run jury trial, wherein the weight of evidence presented in court dominates whatever prior ideas or biases may be held by the members of the jury.
Putting aside these subtle differences in definition for a moment, we ask the question of whether a uniform prior is always a good "reference."
Sadly, the answer is no, since it is not invariant under reparametrization.
As a simple example, suppose we claim ignorance concerning a univariate parameter
1,
this converts the support of the parameter to the whole real line. Then the prior on
the inverse transformation. Here we have and hence
of the likelihood equals
is 1, it is indeed a proper
in some formal sense. Unfortunately, using
defined on the positive real line, and adopt the prior p(
> 0. A sensible reparametrization is given by since
is given by where J = the Jacobian of
so that J =
definition of prior ignorance, it is possible that "ignorance about not imply "ignorance about
rely on the particular modeling context to provide the most reasonable parametrization and, subsequently, apply the uniform prior on this scale.
The prior (Jeffreys, 1961, p.181) offers a fairly easy-to-compute alternative that is invariant under transformation. In the univariate case, this prior is given by
where
Notice that the form of the likelihood helps to determine the prior (2.10), but not the actual observed data since we are averaging over x in equation (2.11).
It is left as an exercise to show that the Jeffreys prior is invariant to 1-1 transformation. That is, in the notation of the previous paragraph, it is true that
(2.12) so that computing the Jeffreys prior for
as computing the Jeffreys prior for Jacobian transformation to the
In the multiparameter case, the Jeffreys prior is given by
(2.13) where
information matrix, having ij-element
While (2.13) provides a general recipe for obtaining noninformative pri-ors, it can be cumbersome to use in high dimensions. A more common approach is to obtain a noninformative prior for each parameter individu-ally, and then form the joint prior simply as the product of these individual priors. This action is often justified on the grounds that "ignorance" is con-sistent with "independence," although since noninformative priors are often
While (2.13) provides a general recipe for obtaining noninformative pri-ors, it can be cumbersome to use in high dimensions. A more common approach is to obtain a noninformative prior for each parameter individu-ally, and then form the joint prior simply as the product of these individual priors. This action is often justified on the grounds that "ignorance" is con-sistent with "independence," although since noninformative priors are often