Integral Prior Distributions for linear models and multiple comparison
Salmer´on D. and Cano J.A.
Universidad de Murcia
Integral priors for model selection and testing
The Markov chains associated with the integral priors Computation of Bayes factors with integral priors Multiple comparison
Variable selection
2 / 65
The problem
Two models
Mi : fi(· | θi), θi ∈ Θi, i = 1, 2 are under consideration to explain the data x
The Bayes factor
B21= m2(x)
m1(x) = R f2(x | θ2)π2(θ2)dθ2 R f1(x | θ1)π1(θ1)dθ1 requires specification ofπ1(θ1) andπ2(θ2)
Estimation priors do not work
Default priors, πiN(θi) (Jeffreys or reference priors) are often used for estimation
Usually improper priors
πNi (θi) = cihi(θi), ci > 0, Z
hi(θi)dθi = +∞
The Bayes factor is not well-defined B21N = R f2(x | θ2)πN2(θ2)dθ2
R f1(x | θ1)πN1(θ1)dθ1
= c2 c1
R f2(x | θ2)h2(θ2)dθ2 R f1(x | θ1)h1(θ1)dθ1
because c2/c1 is arbitrary
4 / 65
Proposals for model selection priors
Proposals for model selection priors
6 / 65
Proposals for variable selection priors
Zellner’s g-priors (1986) and Mixtures (Liang et al. (2008)) Robust prior (Bayarri et al. (2012))
Power-expected-posterior priors (Fouskakis et al. (2015))
Intrinsic priors for nested models
{πI1(θ1), π2I(θ2)}
π2I(θ2)= Z
πI2(θ2 | θ1)π1I(θ1)dθ1
πI2(θ2| θ1) =R π2N(θ2| x)f1(x | θ1)dx x is an imaginary minimal training sample
πI1(θ1) is free!!!
Usually π1I(θ1):= πN1(θ1) (Moreno et al. (1998))
8 / 65
Intrinsic priors for nested models
{πI1(θ1), π2I(θ2)}
π2I(θ2)= Z
πI2(θ2 | θ1)π1I(θ1)dθ1
πI2(θ2| θ1) =R π2N(θ2| x)f1(x | θ1)dx x is an imaginary minimal training sample
πI1(θ1) is free!!!
Expected posterior priors
π1E(θ1):=
Z
πN1(θ1 | x)m(x)dx π2E(θ2):=
Z
πN2(θ2 | x)m(x)dx
where x is an imaginary minimal training sample and m(x ) can be any predictive distribution, proper or not
Two proposals for m(x ) are the empirical distribution of the data, and the predictive distribution of the simplest model
9 / 65
Expected posterior priors for nested models
If M1 is nested in M2, and m(x ) := mN1(x ), then πEi (θi) = πiI(θi), i = 1, 2
The expected posterior priors can be seen as a generalization of intrinsic priors
Integral priors
11 / 65
Integral priors
Integral priors are the solutions π1(θ1) and π2(θ2) to the system of integral equations
π1(θ1) = Z
πN1(θ1| x)m2(x )dx π2(θ2) =
Z
πN2(θ2| x)m1(x )dx where
mi(x ) = Z
fi(x | θi)πi(θi)dθi, i = 1, 2, and x is an imaginary minimal training sample
Integral priors
Because of mi(x ) =R fi(x | θi)πi(θi)dθi
π1(θ1)= Z
π1N(θ1| x)f2(x | θ2)π2(θ2)dx dθ2 π2(θ2)=
Z
π2N(θ2| x)f1(x | θ1)π1(θ1)dx dθ1
Therefore we have a system of two integral equations, and the integral priors are its solution
13 / 65
Justification
Model selection priors should be close to the initial default priors Any prior π1(θ1) satisfies
π1(θ1) = Z
π1(θ1 | x)m1(x )dx ,
and a sensible way to get a prior π1(θ1) close to π1N(θ1) is by means of π1(θ1) =
Z
πN1(θ1| x)m1(x )dx
Justification
Model selection priors should be close to the initial default priors Any prior π1(θ1) satisfies
π1(θ1) = Z
π1(θ1 | x)m1(x )dx ,
and a sensible way to get a prior π1(θ1) close to π1N(θ1) is by means of π1(θ1) =
Z
π1N(θ1| x)m1(x )dx
14 / 65
Justification
The predictive distributionsmi(x ) =R fi(x | θi)πi(θi)dθi, i = 1, 2, should be as close as possible.
Any prior π1(θ1) satisfies π1(θ1) =
Z
π1(θ1 | x)m1(x )dx ,
and a sensible way to get m1(x ) and m2(x ) to be close is by means of π1(θ1) =
Z
π1(θ1| x)m2(x )dx
Justification
The predictive distributionsmi(x ) =R fi(x | θi)πi(θi)dθi, i = 1, 2, should be as close as possible.
Any prior π1(θ1) satisfies π1(θ1) =
Z
π1(θ1 | x)m1(x )dx ,
and a sensible way to get m1(x ) and m2(x ) to be close is by means of π1(θ1) =
Z
π1(θ1| x)m2(x )dx
15 / 65
17 / 65
In summary, a sensible way to get priors close to the initial default priors, and
with predictive distributions as close as possible (predictive
matching, see Berger and Pericchi (2001), and Bayarri et al. (2012)) is by means of
π1(θ1) = Z
πN1(θ1| x)m2(x )dx π2(θ2) =
Z
πN2(θ2| x)m1(x )dx and these are the integral priors!!!
Intrinsic and Integral priors
If M1 is nested in M2, then the intrinsic priors satisfy π2(θ2) =
Z
π2I(θ2 | θ1)π1(θ1)dθ1 If we add the symmetrical equation
π1(θ1) = Z
π1I(θ1 | θ2)π2(θ2)dθ2
Again we have the integral priors!!!
19 / 65
Intrinsic and Integral priors
If M1 is nested in M2, then the intrinsic priors satisfy π2(θ2) =
Z
π2I(θ2 | θ1)π1(θ1)dθ1 If we add the symmetrical equation
π1(θ1) = Z
π1I(θ1 | θ2)π2(θ2)dθ2
Again we have the integral priors!!!
The Markov chains associated with the integral priors
20 / 65
The associated Markov chains
The integral prior π1(θ1) is the invariant σ-finite measure of the Markov chain with transition θ1 → θ01 defined by the following four steps
If this Markov chain is Harris recurrent, then the integral prior π1(θ1) can be approximated by simulation
Therefore, the transition θ1 → θ10 and the integral priors are essentially the same thing
There exists a parallel Markov chain for θ2 with the same properties;
in particular, if one is (Harris) recurrent then so is the other
22 / 65
The first example
One-sided testing for the exponential distribution
M1 : E xp(θ1), θ1 < 1 M2 : E xp(θ2), θ2 > 1
π1N(θ1) ∝ θ1−11(0,1)(θ1) π2N(θ2) ∝ θ2−11(1,+∞)(θ2)
Markov chain for θ1
1 x0 = −θ1log u1
2 θ2= −x0/ log(u2(1 − e−x0) + e−x0)
3 x = −θ2log u3
4 θ01= (1 − 1xlog u4)
u1, u2, u3, u4 ∼ U(0, 1)
24 / 65
M1 : θ1 < 1 M2 : θ2> 1
π1(θ1) π2(θ2)
0.0 0.2 0.4 0.6 0.8 1.0
01234
1 3 5 7 9 11 13 15 17 19
0.00.10.20.30.40.5
Integral priors can be applied to nested and non-nested situations Priors close to the initial default priors, and with predictive distributions as close as possible
The integral prior for each model takes into account the existence of the other model
26 / 65
Group invariance
An important situation is when M1 and M2 have the same group invariance structure
In this situation right-Haar priors are exact predictive matching for minimal training samples (Berger, Pericchi and Varshavsky (1998)) Right-Haar priors are Integral priors when these priors are the initial default priors
Location models
M1: N(θ, 1), π1N(θ) = c1
M2: DE (λ, 1), π2N(λ) = c2
π1(θ) = 1 and π2(λ) = 1 are the integral priors
Because these priors are improper, we expect a lack of stability in their associated Markov chains
28 / 65
30 / 65
1 x0∼ f1(x0 | θ1)
2 θ2 ∼ πN2(θ2 | x0)
3 x ∼ f2(x | θ2)
4 θ10 ∼ πN1(θ01 | x)
1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x )
2 θ2 ∼ π2N(θ2 | x0)
3 x ∼ f2A(x | θ2) ∝ f2(x0 | θ2)IA(x )
4 θ01 ∼ π1N(θ10 | x)
32 / 65
1 x0∼ f1(x0 | θ1)
2 θ2 ∼ πN2(θ2 | x0)
3 x ∼ f2(x | θ2)
4 θ10 ∼ πN1(θ01 | x)
1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x )
2 θ2∼ π2N(θ2 | x0)
3 x ∼ f2A(x | θ2) ∝ f2(x0 | θ2)IA(x )
4 θ01∼ π1N(θ10 | x)
M1: N(θ, 1), π1N(θ) = c1 and M2 : DE (λ, 1), π2N(λ) = c2 The constraint x ∈ A = [−10, 10] on the imaginary trainig samples prevents the explosion of the chain
0 10000 20000 30000 40000
−30−20−100102030
Our recommendation is keeping the imaginary training samples within an interval ±5s about the sample mean
33 / 65
The only thing one needs to apply this methodology is
To simulate minimal training samples from fi(x | θi), which seems easy to do, and
To simulate from the posteriors πiN(θi | x), which usually is also easy to do, or it can be done using MCMC
The one way heteroscedastic ANOVA
M1 : µ1= µ2 = · · · = µk = µ M2 : all the µi0s are not equal π1N(µ, σ1, . . . , σk) ∝ (σ1· · · σk)−1 πN2(µ1, . . . , µk, σ1, . . . , σk) ∝ (σ1· · · σk)−1
Here the simulation from the posterior π1N(θ1| x) can not be performed directly
πN1(µ, σ1, . . . , σk | x) ∝
k
Y
i =1
σ−3i exp
−(xi 1− µ)2+ (xi 2− µ)2 2σi2
.
We use Gibbs sampling within this step
35 / 65
1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x0)
2 θ2∼ π2N(θ2 | x0)
3 x ∼ f2A(x | θ2) ∝ f2(x | θ2)IA(x )
4 θ01∼ π1N(θ10 | x) : Gibbs sampling with h ≥ 1 iterations
Four populations. 100,000 iterations for the Markov chain and h = 1, 10, 100, 200 iterations of the Gibbs sampling
There are no differences from h = 10 to h = 100 or larger, so h = 10 is enough for the Gibbs algorithm
Integral prior for model M2 concentrates mass in favor of model M1
37 / 65
Cano, J. A., Kessler, M. and Salmer´on, D. (2007a). Integral priors for the one way random effects model. Bayesian Analysis, 2-1, 59–68.
Cano, J. A., Kessler, M. and Salmer´on, D. (2007b). A synopsis of integral priors for the one way random effects model. Bayesian Statistics, 8, 577–582. Oxford University Press.
Cano, J. A., Salmer´on, D. and Robert, C. P. (2008). Integral equation solutions as prior distributions for Bayesian model selection. Test, 17-3, 493–504.
Cano, J. A. and Salmer´on, D. (2013). Integral Priors and Constrained Imaginary Training Samples for Nested and Non-nested Bayesian Model Comparison. Bayesian Analysis, 8-2, 361–380.
Salmer´on, D., Cano, J. A. and Robert, C. P. (2015). Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Statistica Sinica, 25-3, 1009–1023.
Cano, J.A. and Salmer´on, D. (2016). A Review of the Developments on Integral Priors for Bayesian Model Selection. BEIO, 32-2, 96-111.
Cano, J.A., Iniesta M. and Salmer´on, D. Integral priors for Bayesian model selection: How they operate from simple to complex cases. Submitted.
39 / 65
Computation of Bayes factors with integral priors
Monte Carlo
Laplace approximation Importance sampling
Monte Carlo
The Markov chain θ(1)i , θ(2)i , . . . for πi(θi)
L→+∞lim 1 L
L
X
t=1
fi(x | θit) = mi(x) = Z
fi(x | θi)πi(θi)dθi
Very large values of L are needed if fi(x | θi) is concentrated relative to πi(θi)
41 / 65
Laplace approximation
mi(x) = R fi(x | θi)πi(θi)dθi ˆ
πi is a nonparametric estimate of the integral prior πi
ˆ
mi(x) = (2π)dim(Θi )2 | ˆΣi|1/2fi(x | ˆθi)ˆπi( ˆθi)
θˆi = MLE
Σˆ−1i observed information matrix under Mi
Importance sampling I
ˆ
πi is a nonparametric estimate of the integral prior πi
mi(x) =Z
fi(x | θi)πi(θi)dθi ≈ Z
fi(x | θi)ˆπi(θi)dθi
=
Z fi(x | θi)ˆπi(θi)
p(θi |x) p(θi |x)dθi
p(θi |x) the importance density
43 / 65
Importance sampling II
m2(x) =Z
f2(x | θ2)π2(θ2)dθ2 = Z
f2(x | θ2)π2N(θ2| z2)m1(z2)dz2dθ2
=
Z f2(x | θ2)π2N(θ2| z2)
p(θ2 |x, z2) p(θ2 |x, z2)m1(z2)dz2dθ2
p(θ2 |x, z2) the importance density
The simulation of the Markov chain gives us simulations from m1(z2)
Multiple comparison
45 / 65
Integral priors for Multiple comparison
𝑀𝑖
𝑀1 𝑀2
𝑀𝑞
𝑄𝑖1(𝜃𝑖′|𝜃𝑖) 𝑄𝑖2(𝜃𝑖′|𝜃𝑖)
𝑄𝑖𝑞(𝜃𝑖′|𝜃𝑖)
𝑄𝑖 𝜃𝑖′ 𝜃𝑖 = 𝑗≠𝑖𝑄𝑖𝑗(𝜃𝑖′|𝜃𝑖) 𝑞 − 1
Integral priors for Multiple comparison
Definition: πi(θi) forMi
The integral priorπi(θi) is the invariant σ-finite measure of the Markov chain with transition θi → θi0 defined by the following four steps
𝑀
𝑖𝑀
𝑗47 / 65
Testing for the exponential distribution
M1: E xp(θ1), θ1∈ I1 = (0, 1) M2: E xp(θ2), θ2∈ I2 = (1, +∞) M3: E xp(θ3), θ3∈ I3 = (0, +∞)
πNi (θi) ∝ θ−1i 1Ii(θi) ξi = log θi
i = 1, 2, 3
ξ1 < 0 ξ2 > 0 ξ3∈ R
−20 −15 −10 −5 0
0.00.10.20.30.40.50.6
0 5 10 15 20 25
0.00.10.20.30.4
−20 −10 0 10 20
0.000.020.040.060.080.100.120.14
49 / 65
Variable selection
Variable selection for the linear regression model
Full model
y = X β + ε, ε ∼ Nn(0, σ2I) πN(β, σ) ∝ 1/σ
β ∈ Rk, σ > 0
X = [x1, . . . , xk] an n × k full rank matrix and n > k xj = (x1j, . . . , xnj)0, j = 1, . . . , k
Usually x1 = 1n
51 / 65
Submodels
The full model is represented by the matrix X = (xij) ∈ Rn×k R a subsequence of I = {1, . . . , n} representing rows of X C a subsequence of J = {1, . . . , k} representing columns of X XR,C = (xij)i ∈R,j ∈C and XC= XI,C
The submodel MC is represented by the matrix XC
MC :y = XCβC+ εC, εC ∼ Nn(0, σC2I) πN(βC, σC) ∝ 1/σC
M
C1versus M
C2(βC1, σC1) → (βC0
1, σ0C
1)
Select random sequences, R and S, from I = {1, . . . , n}, with
|R| = |C2| + 1 and |S| = |C1| + 1, such that XRC2 and XSC1 be full rank matrices.
1 Simulate a training sample y2∼ N(XRC1βC1, σ2C
1I)
2 Simulate the posterior πN(βC2, σC2 | y2, XRC2)
3 Simulate a training sample y1∼ N(XSC2βC2, σC22I)
4 Simulate the posterior πN(βC0
1, σ0C
1 | y1, XSC1)
54 / 65
The caterpillar dataset: 2
10= 1024 models
Y = log of the average number of nests of caterpillars per tree in an area k = 10 potential explanatory variables defined on n = 33 areas
x1 altitude x2 slope
x3 number of pines in the area ...
Bayesian core: a practical approach to computational Bayesian statistics.
Jean-Michel Marin and Christian P. Robert. Springer.
Full model: Integral prior and π
N(θ | y)
56 / 65
Full model: Integral prior and π
N(θ | y)
−100 0 100 200
0.0000.0040.0080.012
−0.10 −0.05 0.00 0.05 0.10
05101520
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
0.00.51.01.5
−5 0 5
0.000.050.100.150.200.250.30
0.000.020.040.060.080.100.120.000.020.040.060.080.100.12 050100150200250050100150200250 051015051015 0123401234
Full model: Integral prior and π
N(θ | y)
−30 −20 −10 0 10 20 30
0.000.020.040.06
−4 −2 0 2 4 6
0.00.10.20.3
−100 −50 0 50 100
0.0000.0050.0100.0150.020
−60 −40 −20 0 20 40 60
0.0000.0100.0200.030
−3 −2 −1 0
0.00.10.20.30.40.50.60.7
−3 −2 −1 0
0.00.10.20.30.40.50.60.7
−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6
0123
−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6
0123
−6 −4 −2 0 2 4
0.000.050.100.150.200.25
−6 −4 −2 0 2 4
0.000.050.100.150.200.25
−3 −2 −1 0 1 2 3
0.00.10.20.30.4
−3 −2 −1 0 1 2 3
0.00.10.20.30.4
58 / 65
Full model: Integral prior and π
N(θ | y)
−10 −5 0 5 10 15
0.000.050.100.15
−60 −40 −20 0 20 40
0.000.010.020.030.04
−40 −20 0 20
0.000.010.020.030.040.050.06
0 5 10 15 20
0.000.100.200.30
0.00.51.01.50.00.51.01.5 0.00.10.20.30.40.00.10.20.30.4 0.00.10.20.30.40.50.00.10.20.30.40.5 0.00.51.01.52.02.53.00.00.51.01.52.02.53.0
The marginal distributions with integral priors
m(y) =Z
f (y | θ)π(θ)dθ =Z
f (y | θ)πN(θ) π(θ) πN(θ)dθ
≈ Z
f (y | θ)πN(θ) π(θ)ˆ
πN(θ)dθ ≈ π(ˆˆ θ) πN(ˆθ)
Z
f (y | θ)πN(θ)dθ
m(y) ≈ π(ˆˆ θ)
πN(ˆθ)mN(y)
60 / 65
Variable selection for Generalized linear models
For two Binomial regression models with a general link function
It can be done for multiple comparison!
1 Simulate a training sample y2
2 Simulate the posterior given y2
3 Simulate a training sample y1
4 Simulate the posterior given y1
62 / 65
Variable selection for Generalized linear models
For two Binomial regression models with a general link function
It can be done for multiple comparison!
1 Simulate a training sample y2
2 Simulate the posterior given y2
3 Simulate a training sample y1
Variable selection for Nonlinear regression models
y = g(X , β) + ε, ε ∼ Nn(0, σ2I)
It can be done!
(βC1, σC1) → (βC0
1, σ0C
1)
1 Simulate a training sample y2:
it is easy
2 Simulate the posterior given y2: MCMC if it is necessary
3 Simulate a training sample y1: it is easy
4 Simulate the posterior given y1: MCMC if it is necessary
63 / 65
Variable selection for Nonlinear regression models
y = g(X , β) + ε, ε ∼ Nn(0, σ2I)
It can be done!
(βC1, σC1) → (βC0
1, σ0C
1)
1 Simulate a training sample y2: it is easy
2 Simulate the posterior given y2:
MCMC if it is necessary
3 Simulate a training sample y1: it is easy
4 Simulate the posterior given y1: MCMC if it is necessary
Variable selection for Nonlinear regression models
y = g(X , β) + ε, ε ∼ Nn(0, σ2I)
It can be done!
(βC1, σC1) → (βC0
1, σ0C
1)
1 Simulate a training sample y2: it is easy
2 Simulate the posterior given y2: MCMC if it is necessary
3 Simulate a training sample y1:
it is easy
4 Simulate the posterior given y1: MCMC if it is necessary
63 / 65
Variable selection for Nonlinear regression models
y = g(X , β) + ε, ε ∼ Nn(0, σ2I)
It can be done!
(βC1, σC1) → (βC0
1, σ0C
1)
1 Simulate a training sample y2: it is easy
2 Simulate the posterior given y2: MCMC if it is necessary
3 Simulate a training sample y1: it is easy
4 Simulate the posterior given y1:
MCMC if it is necessary
Variable selection for Nonlinear regression models
y = g(X , β) + ε, ε ∼ Nn(0, σ2I)
It can be done!
(βC1, σC1) → (βC0
1, σ0C
1)
1 Simulate a training sample y2: it is easy
2 Simulate the posterior given y2: MCMC if it is necessary
3 Simulate a training sample y1: it is easy
4 Simulate the posterior given y1: MCMC if it is necessary
63 / 65
Conclusions and oncoming research
The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiN(θi |z)
This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies
Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress
Conclusions and oncoming research
The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiN(θi |z)
This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies
Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress
64 / 65
Conclusions and oncoming research
The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiN(θi |z)
This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies
Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general
Computation of Bayes factors with integral priors is work in progress
Conclusions and oncoming research
The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiN(θi |z)
This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies
Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress
64 / 65