Integral Prior Distributions for linear models and multiple comparison

(1)

Integral Prior Distributions for linear models and multiple comparison

Salmer´on D. and Cano J.A.

Universidad de Murcia

(2)

Integral priors for model selection and testing

The Markov chains associated with the integral priors Computation of Bayes factors with integral priors Multiple comparison

Variable selection

2 / 65

(3)

The problem

Two models

Mi : fi(· | θi), θi ∈ Θ_i, i = 1, 2 are under consideration to explain the data x

The Bayes factor

B21= m2(x)

m₁(x) = R f₂(x | θ₂)π₂(θ₂)dθ₂ R f₁(x | θ₁)π₁(θ₁)dθ₁ requires specification ofπ₁(θ₁) andπ₂(θ₂)

(4)

Estimation priors do not work

Default priors, π_i^N(θ_i) (Jeffreys or reference priors) are often used for estimation

Usually improper priors

π^N_i (θ_i) = c_ih_i(θ_i), c_i > 0, Z

h_i(θ_i)dθ_i = +∞

The Bayes factor is not well-defined B₂₁^N = R f₂(x | θ₂)π^N₂(θ₂)dθ₂

R f₁(x | θ1)π^N₁(θ1)dθ1

= c₂ c1

R f₂(x | θ₂)h₂(θ₂)dθ₂ R f₁(x | θ1)h1(θ1)dθ1

because c2/c1 is arbitrary

4 / 65

(5)

Proposals for model selection priors

(6)

Proposals for model selection priors

6 / 65

(7)

Proposals for variable selection priors

Zellner’s g-priors (1986) and Mixtures (Liang et al. (2008)) Robust prior (Bayarri et al. (2012))

Power-expected-posterior priors (Fouskakis et al. (2015))

(8)

Intrinsic priors for nested models

{π^I₁(θ1), π₂^I(θ2)}

π₂^I(θ2)= Z

π^I₂(θ2 | θ₁)π₁^I(θ1)dθ1

π^I₂(θ₂| θ₁) =R π₂^N(θ₂| x)f₁(x | θ₁)dx x is an imaginary minimal training sample

π^I₁(θ1) is free!!!

Usually π₁^I(θ₁):= π^N₁(θ₁) (Moreno et al. (1998))

8 / 65

(9)

Intrinsic priors for nested models

{π^I₁(θ1), π₂^I(θ2)}

π₂^I(θ2)= Z

π^I₂(θ2 | θ₁)π₁^I(θ1)dθ1

π^I₂(θ₂| θ₁) =R π₂^N(θ₂| x)f₁(x | θ₁)dx x is an imaginary minimal training sample

π^I₁(θ1) is free!!!

(10)

Expected posterior priors

π₁^E(θ1):=

Z

π^N₁(θ1 | x)m(x)dx π₂^E(θ₂):=

Z

π^N₂(θ₂ | x)m(x)dx

where x is an imaginary minimal training sample and m(x ) can be any predictive distribution, proper or not

Two proposals for m(x ) are the empirical distribution of the data, and the predictive distribution of the simplest model

9 / 65

(11)

Expected posterior priors for nested models

If M1 is nested in M2, and m(x ) := m^N₁(x ), then π^E_i (θi) = π_i^I(θi), i = 1, 2

The expected posterior priors can be seen as a generalization of intrinsic priors

(12)

Integral priors

11 / 65

(13)

Integral priors

Integral priors are the solutions π₁(θ₁) and π₂(θ₂) to the system of integral equations

π₁(θ₁) = Z

π^N₁(θ₁| x)m₂(x )dx π₂(θ₂) =

Z

π^N₂(θ₂| x)m₁(x )dx where

mi(x ) = Z

fi(x | θi)πi(θi)dθi, i = 1, 2, and x is an imaginary minimal training sample

(14)

Integral priors

Because of m_i(x ) =R f_i(x | θ_i)π_i(θ_i)dθ_i

π₁(θ₁)= Z

π₁^N(θ₁| x)f₂(x | θ₂)π₂(θ₂)dx dθ₂ π2(θ2)=

Z

π₂^N(θ2| x)f₁(x | θ1)π1(θ1)dx dθ1

Therefore we have a system of two integral equations, and the integral priors are its solution

13 / 65

(15)

Justification

Model selection priors should be close to the initial default priors Any prior π1(θ1) satisfies

π₁(θ₁) = Z

π₁(θ₁ | x)m₁(x )dx ,

and a sensible way to get a prior π₁(θ₁) close to π₁^N(θ₁) is by means of π₁(θ₁) =

Z

π^N₁(θ₁| x)m₁(x )dx

(16)

Justification

Model selection priors should be close to the initial default priors Any prior π1(θ1) satisfies

π₁(θ₁) = Z

π₁(θ₁ | x)m₁(x )dx ,

and a sensible way to get a prior π₁(θ₁) close to π₁^N(θ₁) is by means of π₁(θ₁) =

Z

π₁^N(θ₁| x)m₁(x )dx

14 / 65

(17)

Justification

The predictive distributionsmi(x ) =R f_i(x | θi)πi(θi)dθi, i = 1, 2, should be as close as possible.

Any prior π1(θ1) satisfies π₁(θ₁) =

Z

π₁(θ₁ | x)m₁(x )dx ,

and a sensible way to get m1(x ) and m2(x ) to be close is by means of π1(θ1) =

Z

π1(θ1| x)m₂(x )dx

(18)

Justification

The predictive distributionsmi(x ) =R f_i(x | θi)πi(θi)dθi, i = 1, 2, should be as close as possible.

Any prior π1(θ1) satisfies π₁(θ₁) =

Z

π₁(θ₁ | x)m₁(x )dx ,

and a sensible way to get m1(x ) and m2(x ) to be close is by means of π1(θ1) =

Z

π1(θ1| x)m₂(x )dx

15 / 65

(19)

(20)

17 / 65

(21)

In summary, a sensible way to get priors close to the initial default priors, and

with predictive distributions as close as possible (predictive

matching, see Berger and Pericchi (2001), and Bayarri et al. (2012)) is by means of

π₁(θ₁) = Z

π^N₁(θ₁| x)m₂(x )dx π₂(θ₂) =

Z

π^N₂(θ₂| x)m₁(x )dx and these are the integral priors!!!

(22)

Intrinsic and Integral priors

If M1 is nested in M2, then the intrinsic priors satisfy π₂(θ₂) =

Z

π₂^I(θ₂ | θ₁)π₁(θ₁)dθ₁ If we add the symmetrical equation

π1(θ1) = Z

π₁^I(θ1 | θ₂)π2(θ2)dθ2

Again we have the integral priors!!!

19 / 65

(23)

Intrinsic and Integral priors

If M1 is nested in M2, then the intrinsic priors satisfy π₂(θ₂) =

Z

π₂^I(θ₂ | θ₁)π₁(θ₁)dθ₁ If we add the symmetrical equation

π1(θ1) = Z

π₁^I(θ1 | θ₂)π2(θ2)dθ2

Again we have the integral priors!!!

(24)

The Markov chains associated with the integral priors

20 / 65

(25)

The associated Markov chains

The integral prior π1(θ1) is the invariant σ-finite measure of the Markov chain with transition θ₁ → θ⁰₁ defined by the following four steps

(26)

If this Markov chain is Harris recurrent, then the integral prior π1(θ1) can be approximated by simulation

Therefore, the transition θ1 → θ₁⁰ and the integral priors are essentially the same thing

There exists a parallel Markov chain for θ₂ with the same properties;

in particular, if one is (Harris) recurrent then so is the other

22 / 65

(27)

The first example

One-sided testing for the exponential distribution

M₁ : E xp(θ₁), θ₁ < 1 M2 : E xp(θ2), θ2 > 1

π₁^N(θ₁) ∝ θ₁⁻¹1_(0,1)(θ₁) π₂^N(θ2) ∝ θ₂⁻¹1_(1,+∞)(θ2)

(28)

Markov chain for θ₁

1 x⁰ = −θ₁log u₁

2 θ2= −x⁰/ log(u2(1 − e^−x⁰) + e^−x⁰)

3 x = −θ₂log u₃

4 θ⁰₁= (1 − ¹_xlog u4)

u1, u2, u3, u4 ∼ U(0, 1)

24 / 65

(29)

M₁ : θ₁ < 1 M₂ : θ₂> 1

π1(θ1) π2(θ2)

0.0 0.2 0.4 0.6 0.8 1.0

01234

1 3 5 7 9 11 13 15 17 19

0.00.10.20.30.40.5

(30)

Integral priors can be applied to nested and non-nested situations Priors close to the initial default priors, and with predictive distributions as close as possible

The integral prior for each model takes into account the existence of the other model

26 / 65

(31)

Group invariance

An important situation is when M₁ and M₂ have the same group invariance structure

In this situation right-Haar priors are exact predictive matching for minimal training samples (Berger, Pericchi and Varshavsky (1998)) Right-Haar priors are Integral priors when these priors are the initial default priors

(32)

Location models

M1: N(θ, 1), π₁^N(θ) = c1

M2: DE (λ, 1), π₂^N(λ) = c2

π1(θ) = 1 and π2(λ) = 1 are the integral priors

Because these priors are improper, we expect a lack of stability in their associated Markov chains

28 / 65

(33)

(34)

30 / 65

(35)

(36)

1 x⁰∼ f₁(x⁰ | θ₁)

2 θ₂ ∼ π^N₂(θ₂ | x⁰)

3 x ∼ f2(x | θ2)

4 θ₁⁰ ∼ π^N₁(θ⁰₁ | x)

1 x⁰ ∼ f₁^A(x⁰ | θ₁) ∝ f₁(x⁰| θ₁)IA(x )

2 θ2 ∼ π₂^N(θ2 | x⁰)

3 x ∼ f₂^A(x | θ₂) ∝ f₂(x⁰ | θ₂)IA(x )

4 θ⁰₁ ∼ π₁^N(θ₁⁰ | x)

32 / 65

(37)

1 x⁰∼ f₁(x⁰ | θ₁)

2 θ₂ ∼ π^N₂(θ₂ | x⁰)

3 x ∼ f2(x | θ2)

4 θ₁⁰ ∼ π^N₁(θ⁰₁ | x)

1 x⁰ ∼ f₁^A(x⁰ | θ₁) ∝ f₁(x⁰| θ₁)IA(x )

2 θ2∼ π₂^N(θ2 | x⁰)

3 x ∼ f₂^A(x | θ2) ∝ f2(x⁰ | θ₂)IA(x )

4 θ⁰₁∼ π₁^N(θ₁⁰ | x)

(38)

M₁: N(θ, 1), π₁^N(θ) = c₁ and M₂ : DE (λ, 1), π₂^N(λ) = c₂ The constraint x ∈ A = [−10, 10] on the imaginary trainig samples prevents the explosion of the chain

0 10000 20000 30000 40000

−30−20−100102030

Our recommendation is keeping the imaginary training samples within an interval ±5s about the sample mean

33 / 65

(39)

The only thing one needs to apply this methodology is

To simulate minimal training samples from fi(x | θi), which seems easy to do, and

To simulate from the posteriors π_i^N(θi | x), which usually is also easy to do, or it can be done using MCMC

(40)

The one way heteroscedastic ANOVA

M1 : µ1= µ2 = · · · = µk = µ M₂ : all the µ_i⁰s are not equal π₁^N(µ, σ₁, . . . , σ_k) ∝ (σ₁· · · σ_k)⁻¹ π^N₂(µ₁, . . . , µ_k, σ₁, . . . , σ_k) ∝ (σ₁· · · σ_k)⁻¹

Here the simulation from the posterior π₁^N(θ1| x) can not be performed directly

π^N₁(µ, σ1, . . . , σ_k | x) ∝

k

Y

i =1

σ⁻³_i exp

−(xi 1− µ)²+ (xi 2− µ)² 2σ_i²

.

We use Gibbs sampling within this step

35 / 65

(41)

1 x⁰ ∼ f₁^A(x⁰ | θ₁) ∝ f₁(x⁰| θ₁)IA(x⁰)

2 θ₂∼ π₂^N(θ₂ | x⁰)

3 x ∼ f₂^A(x | θ2) ∝ f2(x | θ2)IA(x )

4 θ⁰₁∼ π₁^N(θ₁⁰ | x) : Gibbs sampling with h ≥ 1 iterations

(42)

Four populations. 100,000 iterations for the Markov chain and h = 1, 10, 100, 200 iterations of the Gibbs sampling

There are no differences from h = 10 to h = 100 or larger, so h = 10 is enough for the Gibbs algorithm

Integral prior for model M₂ concentrates mass in favor of model M₁

37 / 65

(43)

Cano, J. A., Kessler, M. and Salmer´on, D. (2007a). Integral priors for the one way random effects model. Bayesian Analysis, 2-1, 59–68.

Cano, J. A., Kessler, M. and Salmer´on, D. (2007b). A synopsis of integral priors for the one way random effects model. Bayesian Statistics, 8, 577–582. Oxford University Press.

Cano, J. A., Salmer´on, D. and Robert, C. P. (2008). Integral equation solutions as prior distributions for Bayesian model selection. Test, 17-3, 493–504.

Cano, J. A. and Salmer´on, D. (2013). Integral Priors and Constrained Imaginary Training Samples for Nested and Non-nested Bayesian Model Comparison. Bayesian Analysis, 8-2, 361–380.

(44)

Salmer´on, D., Cano, J. A. and Robert, C. P. (2015). Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Statistica Sinica, 25-3, 1009–1023.

Cano, J.A. and Salmer´on, D. (2016). A Review of the Developments on Integral Priors for Bayesian Model Selection. BEIO, 32-2, 96-111.

Cano, J.A., Iniesta M. and Salmer´on, D. Integral priors for Bayesian model selection: How they operate from simple to complex cases. Submitted.

39 / 65

(45)

Computation of Bayes factors with integral priors

Monte Carlo

Laplace approximation Importance sampling

(46)

Monte Carlo

The Markov chain θ⁽¹⁾_i , θ⁽²⁾_i , . . . for π_i(θ_i)

L→+∞lim 1 L

L

X

t=1

f_i(x | θi^t) = m_i(x) = Z

f_i(x | θi)π_i(θ_i)dθ_i

Very large values of L are needed if f_i(x | θi) is concentrated relative to π_i(θ_i)

41 / 65

(47)

Laplace approximation

m_i(x) = R fi(x | θi)π_i(θ_i)dθ_i ˆ

π_i is a nonparametric estimate of the integral prior π_i

ˆ

m_i(x) = (2π)^{dim(Θi )}² | ˆΣ_i|^1/2f_i(x | ˆθi)ˆπ_i( ˆθ_i)

θˆ_i = MLE

Σˆ⁻¹_i observed information matrix under M_i

(48)

Importance sampling I

ˆ

π_i is a nonparametric estimate of the integral prior π_i

m_i(x) =Z

f_i(x | θi)π_i(θ_i)dθ_i ≈ Z

f_i(x | θi)ˆπ_i(θ_i)dθ_i

=

Z f_i(x | θi)ˆπ_i(θ_i)

p(θ_i |x) p(θ_i |x)dθi

p(θ_i |x) the importance density

43 / 65

(49)

Importance sampling II

m₂(x) =Z

f₂(x | θ2)π₂(θ₂)dθ₂ = Z

f₂(x | θ2)π₂^N(θ₂| z₂)m₁(z₂)dz₂dθ₂

=

Z f2(x | θ2)π₂^N(θ2| z₂)

p(θ2 |x, z2) p(θ2 |x, z2)m1(z2)dz2dθ2

p(θ2 |x, z2) the importance density

The simulation of the Markov chain gives us simulations from m1(z2)

(50)

Multiple comparison

45 / 65

(51)

Integral priors for Multiple comparison

𝑀_𝑖

𝑀₁ 𝑀₂

𝑀_𝑞

𝑄_𝑖1(𝜃_𝑖^′|𝜃_𝑖) 𝑄_𝑖2(𝜃_𝑖^′|𝜃_𝑖)

𝑄_𝑖𝑞(𝜃_𝑖^′|𝜃_𝑖)

𝑄_𝑖 𝜃_𝑖^′ 𝜃_𝑖 = ^𝑗≠𝑖𝑄_𝑖𝑗(𝜃_𝑖^′|𝜃_𝑖) 𝑞 − 1

(52)

Integral priors for Multiple comparison

Definition: π_i(θ_i) forM_i

The integral priorπ_i(θ_i) is the invariant σ-finite measure of the Markov chain with transition θi → θ_i⁰ defined by the following four steps

𝑀

_𝑖

𝑀

_𝑗

47 / 65

(53)

Testing for the exponential distribution

M₁: E xp(θ₁), θ₁∈ I₁ = (0, 1) M2: E xp(θ2), θ2∈ I₂ = (1, +∞) M3: E xp(θ3), θ3∈ I₃ = (0, +∞)

π^N_i (θ_i) ∝ θ⁻¹_i 1_I_i(θ_i) ξi = log θi

i = 1, 2, 3

(54)

ξ₁ < 0 ξ₂ > 0 ξ₃∈ R

−20 −15 −10 −5 0

0.00.10.20.30.40.50.6

0 5 10 15 20 25

0.00.10.20.30.4

−20 −10 0 10 20

0.000.020.040.060.080.100.120.14

49 / 65

(55)

Variable selection

(56)

Variable selection for the linear regression model

Full model

y = X β + ε, ε ∼ Nn(0, σ²I) π^N(β, σ) ∝ 1/σ

β ∈ R^k, σ > 0

X = [x₁, . . . , x_k] an n × k full rank matrix and n > k x_j = (x_1j, . . . , x_nj)⁰, j = 1, . . . , k

Usually x₁ = 1_n

51 / 65

(57)

Submodels

The full model is represented by the matrix X = (xij) ∈ R^n×k R a subsequence of I = {1, . . . , n} representing rows of X C a subsequence of J = {1, . . . , k} representing columns of X XR,C = (xij)i ∈R,j ∈C and XC= XI,C

The submodel MC is represented by the matrix XC

MC :y = XCβC+ εC, εC ∼ N_n(0, σ_C²I) π^N(βC, σC) ∝ 1/σC

(58)

M

_C₁

versus M

_C₂

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

Select random sequences, R and S, from I = {1, . . . , n}, with

|R| = |C₂| + 1 and |S| = |C₁| + 1, such that X_RC₂ and XSC₁ be full rank matrices.

1 Simulate a training sample y₂∼ N(XRC₁βC₁, σ²_C

1I)

2 Simulate the posterior π^N(βC₂, σC₂ | y₂, XRC₂)

3 Simulate a training sample y1∼ N(X_SC₂βC2, σ_C²₂I)

4 Simulate the posterior π^N(β_C⁰

1, σ⁰_C

1 | y₁, XSC₁)

54 / 65

(59)

The caterpillar dataset: 2

¹⁰

= 1024 models

Y = log of the average number of nests of caterpillars per tree in an area k = 10 potential explanatory variables defined on n = 33 areas

x₁ altitude x₂ slope

x3 number of pines in the area ...

Bayesian core: a practical approach to computational Bayesian statistics.

Jean-Michel Marin and Christian P. Robert. Springer.

(60)

Full model: Integral prior and π

^N

(θ | y)

56 / 65

(61)

Full model: Integral prior and π

^N

(θ | y)

−100 0 100 200

0.0000.0040.0080.012

−0.10 −0.05 0.00 0.05 0.10

05101520

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

0.00.51.01.5

−5 0 5

0.000.050.100.150.200.250.30

0.000.020.040.060.080.100.120.000.020.040.060.080.100.12 050100150200250050100150200250 051015051015 0123401234

(62)

Full model: Integral prior and π

^N

(θ | y)

−30 −20 −10 0 10 20 30

0.000.020.040.06

−4 −2 0 2 4 6

0.00.10.20.3

−100 −50 0 50 100

0.0000.0050.0100.0150.020

−60 −40 −20 0 20 40 60

0.0000.0100.0200.030

−3 −2 −1 0

0.00.10.20.30.40.50.60.7

−3 −2 −1 0

0.00.10.20.30.40.50.60.7

−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6

0123

−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6

0123

−6 −4 −2 0 2 4

0.000.050.100.150.200.25

−6 −4 −2 0 2 4

0.000.050.100.150.200.25

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

58 / 65

(63)

Full model: Integral prior and π

^N

(θ | y)

−10 −5 0 5 10 15

0.000.050.100.15

−60 −40 −20 0 20 40

0.000.010.020.030.04

−40 −20 0 20

0.000.010.020.030.040.050.06

0 5 10 15 20

0.000.100.200.30

0.00.51.01.50.00.51.01.5 0.00.10.20.30.40.00.10.20.30.4 0.00.10.20.30.40.50.00.10.20.30.40.5 0.00.51.01.52.02.53.00.00.51.01.52.02.53.0

(64)

The marginal distributions with integral priors

m(y) =Z

f (y | θ)π(θ)dθ =Z

f (y | θ)π^N(θ) π(θ) π^N(θ)dθ

≈ Z

f (y | θ)π^N(θ) π(θ)ˆ

π^N(θ)dθ ≈ π(ˆˆ θ) π^N(ˆθ)

Z

f (y | θ)π^N(θ)dθ

m(y) ≈ π(ˆˆ θ)

π^N(ˆθ)m^N(y)

60 / 65

(65)

(66)

Variable selection for Generalized linear models

For two Binomial regression models with a general link function

It can be done for multiple comparison!

1 Simulate a training sample y2

2 Simulate the posterior given y₂

4 Simulate the posterior given y₁

62 / 65

(67)

Variable selection for Generalized linear models

For two Binomial regression models with a general link function

It can be done for multiple comparison!

2 Simulate the posterior given y₂

(68)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ²I)

It can be done!

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

1 Simulate a training sample y2:

it is easy

2 Simulate the posterior given y₂: MCMC if it is necessary

3 Simulate a training sample y1: it is easy

4 Simulate the posterior given y₁: MCMC if it is necessary

63 / 65

(69)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ²I)

It can be done!

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

2 Simulate the posterior given y₂:

MCMC if it is necessary

(70)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ²I)

It can be done!

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

3 Simulate a training sample y1:

it is easy

63 / 65

(71)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ²I)

It can be done!

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

4 Simulate the posterior given y1:

MCMC if it is necessary

(72)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ²I)

It can be done!

(βC1, σC1) → (β_C⁰

1, σ⁰_C

1)

4 Simulate the posterior given y1: MCMC if it is necessary

63 / 65

(73)

Conclusions and oncoming research

The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions π_i^N(θ_i |z)

This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress

(74)

Conclusions and oncoming research

64 / 65

(75)

Conclusions and oncoming research

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general

Computation of Bayes factors with integral priors is work in progress

(76)

Conclusions and oncoming research

64 / 65

(77)