• No se han encontrado resultados

Integral Prior Distributions for linear models and multiple comparison

N/A
N/A
Protected

Academic year: 2023

Share "Integral Prior Distributions for linear models and multiple comparison"

Copied!
77
0
0

Texto completo

(1)

Integral Prior Distributions for linear models and multiple comparison

Salmer´on D. and Cano J.A.

Universidad de Murcia

(2)

Integral priors for model selection and testing

The Markov chains associated with the integral priors Computation of Bayes factors with integral priors Multiple comparison

Variable selection

2 / 65

(3)

The problem

Two models

Mi : fi(· | θi), θi ∈ Θi, i = 1, 2 are under consideration to explain the data x

The Bayes factor

B21= m2(x)

m1(x) = R f2(x | θ222)dθ2 R f1(x | θ111)dθ1 requires specification ofπ11) andπ22)

(4)

Estimation priors do not work

Default priors, πiNi) (Jeffreys or reference priors) are often used for estimation

Usually improper priors

πNii) = cihii), ci > 0, Z

hii)dθi = +∞

The Bayes factor is not well-defined B21N = R f2(x | θ2N22)dθ2

R f1(x | θ1N11)dθ1

= c2 c1

R f2(x | θ2)h22)dθ2 R f1(x | θ1)h11)dθ1

because c2/c1 is arbitrary

4 / 65

(5)

Proposals for model selection priors

(6)

Proposals for model selection priors

6 / 65

(7)

Proposals for variable selection priors

Zellner’s g-priors (1986) and Mixtures (Liang et al. (2008)) Robust prior (Bayarri et al. (2012))

Power-expected-posterior priors (Fouskakis et al. (2015))

(8)

Intrinsic priors for nested models

I11), π2I2)}

π2I2)= Z

πI22 | θ11I1)dθ1

πI22| θ1) =R π2N2| x)f1(x | θ1)dx x is an imaginary minimal training sample

πI11) is free!!!

Usually π1I1):= πN11) (Moreno et al. (1998))

8 / 65

(9)

Intrinsic priors for nested models

I11), π2I2)}

π2I2)= Z

πI22 | θ11I1)dθ1

πI22| θ1) =R π2N2| x)f1(x | θ1)dx x is an imaginary minimal training sample

πI11) is free!!!

(10)

Expected posterior priors

π1E1):=

Z

πN11 | x)m(x)dx π2E2):=

Z

πN22 | x)m(x)dx

where x is an imaginary minimal training sample and m(x ) can be any predictive distribution, proper or not

Two proposals for m(x ) are the empirical distribution of the data, and the predictive distribution of the simplest model

9 / 65

(11)

Expected posterior priors for nested models

If M1 is nested in M2, and m(x ) := mN1(x ), then πEii) = πiIi), i = 1, 2

The expected posterior priors can be seen as a generalization of intrinsic priors

(12)

Integral priors

11 / 65

(13)

Integral priors

Integral priors are the solutions π11) and π22) to the system of integral equations

π11) = Z

πN11| x)m2(x )dx π22) =

Z

πN22| x)m1(x )dx where

mi(x ) = Z

fi(x | θiii)dθi, i = 1, 2, and x is an imaginary minimal training sample

(14)

Integral priors

Because of mi(x ) =R fi(x | θiii)dθi

π11)= Z

π1N1| x)f2(x | θ222)dx dθ2 π22)=

Z

π2N2| x)f1(x | θ111)dx dθ1

Therefore we have a system of two integral equations, and the integral priors are its solution

13 / 65

(15)

Justification

Model selection priors should be close to the initial default priors Any prior π11) satisfies

π11) = Z

π11 | x)m1(x )dx ,

and a sensible way to get a prior π11) close to π1N1) is by means of π11) =

Z

πN11| x)m1(x )dx

(16)

Justification

Model selection priors should be close to the initial default priors Any prior π11) satisfies

π11) = Z

π11 | x)m1(x )dx ,

and a sensible way to get a prior π11) close to π1N1) is by means of π11) =

Z

π1N1| x)m1(x )dx

14 / 65

(17)

Justification

The predictive distributionsmi(x ) =R fi(x | θiii)dθi, i = 1, 2, should be as close as possible.

Any prior π11) satisfies π11) =

Z

π11 | x)m1(x )dx ,

and a sensible way to get m1(x ) and m2(x ) to be close is by means of π11) =

Z

π11| x)m2(x )dx

(18)

Justification

The predictive distributionsmi(x ) =R fi(x | θiii)dθi, i = 1, 2, should be as close as possible.

Any prior π11) satisfies π11) =

Z

π11 | x)m1(x )dx ,

and a sensible way to get m1(x ) and m2(x ) to be close is by means of π11) =

Z

π11| x)m2(x )dx

15 / 65

(19)
(20)

17 / 65

(21)

In summary, a sensible way to get priors close to the initial default priors, and

with predictive distributions as close as possible (predictive

matching, see Berger and Pericchi (2001), and Bayarri et al. (2012)) is by means of

π11) = Z

πN11| x)m2(x )dx π22) =

Z

πN22| x)m1(x )dx and these are the integral priors!!!

(22)

Intrinsic and Integral priors

If M1 is nested in M2, then the intrinsic priors satisfy π22) =

Z

π2I2 | θ111)dθ1 If we add the symmetrical equation

π11) = Z

π1I1 | θ222)dθ2

Again we have the integral priors!!!

19 / 65

(23)

Intrinsic and Integral priors

If M1 is nested in M2, then the intrinsic priors satisfy π22) =

Z

π2I2 | θ111)dθ1 If we add the symmetrical equation

π11) = Z

π1I1 | θ222)dθ2

Again we have the integral priors!!!

(24)

The Markov chains associated with the integral priors

20 / 65

(25)

The associated Markov chains

The integral prior π11) is the invariant σ-finite measure of the Markov chain with transition θ1 → θ01 defined by the following four steps

(26)

If this Markov chain is Harris recurrent, then the integral prior π11) can be approximated by simulation

Therefore, the transition θ1 → θ10 and the integral priors are essentially the same thing

There exists a parallel Markov chain for θ2 with the same properties;

in particular, if one is (Harris) recurrent then so is the other

22 / 65

(27)

The first example

One-sided testing for the exponential distribution

M1 : E xp(θ1), θ1 < 1 M2 : E xp(θ2), θ2 > 1

π1N1) ∝ θ1−11(0,1)1) π2N2) ∝ θ2−11(1,+∞)2)

(28)

Markov chain for θ1

1 x0 = −θ1log u1

2 θ2= −x0/ log(u2(1 − e−x0) + e−x0)

3 x = −θ2log u3

4 θ01= (1 − 1xlog u4)

u1, u2, u3, u4 ∼ U(0, 1)

24 / 65

(29)

M1 : θ1 < 1 M2 : θ2> 1

π11) π22)

0.0 0.2 0.4 0.6 0.8 1.0

01234

1 3 5 7 9 11 13 15 17 19

0.00.10.20.30.40.5

(30)

Integral priors can be applied to nested and non-nested situations Priors close to the initial default priors, and with predictive distributions as close as possible

The integral prior for each model takes into account the existence of the other model

26 / 65

(31)

Group invariance

An important situation is when M1 and M2 have the same group invariance structure

In this situation right-Haar priors are exact predictive matching for minimal training samples (Berger, Pericchi and Varshavsky (1998)) Right-Haar priors are Integral priors when these priors are the initial default priors

(32)

Location models

M1: N(θ, 1), π1N(θ) = c1

M2: DE (λ, 1), π2N(λ) = c2

π1(θ) = 1 and π2(λ) = 1 are the integral priors

Because these priors are improper, we expect a lack of stability in their associated Markov chains

28 / 65

(33)
(34)

30 / 65

(35)
(36)

1 x0∼ f1(x0 | θ1)

2 θ2 ∼ πN22 | x0)

3 x ∼ f2(x | θ2)

4 θ10 ∼ πN101 | x)

1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x )

2 θ2 ∼ π2N2 | x0)

3 x ∼ f2A(x | θ2) ∝ f2(x0 | θ2)IA(x )

4 θ01 ∼ π1N10 | x)

32 / 65

(37)

1 x0∼ f1(x0 | θ1)

2 θ2 ∼ πN22 | x0)

3 x ∼ f2(x | θ2)

4 θ10 ∼ πN101 | x)

1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x )

2 θ2∼ π2N2 | x0)

3 x ∼ f2A(x | θ2) ∝ f2(x0 | θ2)IA(x )

4 θ01∼ π1N10 | x)

(38)

M1: N(θ, 1), π1N(θ) = c1 and M2 : DE (λ, 1), π2N(λ) = c2 The constraint x ∈ A = [−10, 10] on the imaginary trainig samples prevents the explosion of the chain

0 10000 20000 30000 40000

−30−20−100102030

Our recommendation is keeping the imaginary training samples within an interval ±5s about the sample mean

33 / 65

(39)

The only thing one needs to apply this methodology is

To simulate minimal training samples from fi(x | θi), which seems easy to do, and

To simulate from the posteriors πiNi | x), which usually is also easy to do, or it can be done using MCMC

(40)

The one way heteroscedastic ANOVA

M1 : µ1= µ2 = · · · = µk = µ M2 : all the µi0s are not equal π1N(µ, σ1, . . . , σk) ∝ (σ1· · · σk)−1 πN21, . . . , µk, σ1, . . . , σk) ∝ (σ1· · · σk)−1

Here the simulation from the posterior π1N1| x) can not be performed directly

πN1(µ, σ1, . . . , σk | x) ∝

k

Y

i =1

σ−3i exp



−(xi 1− µ)2+ (xi 2− µ)2i2

 .

We use Gibbs sampling within this step

35 / 65

(41)

1 x0 ∼ f1A(x0 | θ1) ∝ f1(x0| θ1)IA(x0)

2 θ2∼ π2N2 | x0)

3 x ∼ f2A(x | θ2) ∝ f2(x | θ2)IA(x )

4 θ01∼ π1N10 | x) : Gibbs sampling with h ≥ 1 iterations

(42)

Four populations. 100,000 iterations for the Markov chain and h = 1, 10, 100, 200 iterations of the Gibbs sampling

There are no differences from h = 10 to h = 100 or larger, so h = 10 is enough for the Gibbs algorithm

Integral prior for model M2 concentrates mass in favor of model M1

37 / 65

(43)

Cano, J. A., Kessler, M. and Salmer´on, D. (2007a). Integral priors for the one way random effects model. Bayesian Analysis, 2-1, 59–68.

Cano, J. A., Kessler, M. and Salmer´on, D. (2007b). A synopsis of integral priors for the one way random effects model. Bayesian Statistics, 8, 577–582. Oxford University Press.

Cano, J. A., Salmer´on, D. and Robert, C. P. (2008). Integral equation solutions as prior distributions for Bayesian model selection. Test, 17-3, 493–504.

Cano, J. A. and Salmer´on, D. (2013). Integral Priors and Constrained Imaginary Training Samples for Nested and Non-nested Bayesian Model Comparison. Bayesian Analysis, 8-2, 361–380.

(44)

Salmer´on, D., Cano, J. A. and Robert, C. P. (2015). Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Statistica Sinica, 25-3, 1009–1023.

Cano, J.A. and Salmer´on, D. (2016). A Review of the Developments on Integral Priors for Bayesian Model Selection. BEIO, 32-2, 96-111.

Cano, J.A., Iniesta M. and Salmer´on, D. Integral priors for Bayesian model selection: How they operate from simple to complex cases. Submitted.

39 / 65

(45)

Computation of Bayes factors with integral priors

Monte Carlo

Laplace approximation Importance sampling

(46)

Monte Carlo

The Markov chain θ(1)i , θ(2)i , . . . for πii)

L→+∞lim 1 L

L

X

t=1

fi(x | θit) = mi(x) = Z

fi(x | θiii)dθi

Very large values of L are needed if fi(x | θi) is concentrated relative to πii)

41 / 65

(47)

Laplace approximation

mi(x) = R fi(x | θiii)dθi ˆ

πi is a nonparametric estimate of the integral prior πi

ˆ

mi(x) = (2π)dim(Θi )2 | ˆΣi|1/2fi(x | ˆθi)ˆπi( ˆθi)

θˆi = MLE

Σˆ−1i observed information matrix under Mi

(48)

Importance sampling I

ˆ

πi is a nonparametric estimate of the integral prior πi

mi(x) =Z

fi(x | θiii)dθi ≈ Z

fi(x | θi)ˆπii)dθi

=

Z fi(x | θi)ˆπii)

p(θi |x) p(θi |x)dθi

p(θi |x) the importance density

43 / 65

(49)

Importance sampling II

m2(x) =Z

f2(x | θ222)dθ2 = Z

f2(x | θ22N2| z2)m1(z2)dz22

=

Z f2(x | θ22N2| z2)

p(θ2 |x, z2) p(θ2 |x, z2)m1(z2)dz22

p(θ2 |x, z2) the importance density

The simulation of the Markov chain gives us simulations from m1(z2)

(50)

Multiple comparison

45 / 65

(51)

Integral priors for Multiple comparison

𝑀𝑖

𝑀1 𝑀2

𝑀𝑞

𝑄𝑖1(𝜃𝑖|𝜃𝑖) 𝑄𝑖2(𝜃𝑖|𝜃𝑖)

𝑄𝑖𝑞(𝜃𝑖|𝜃𝑖)

𝑄𝑖 𝜃𝑖 𝜃𝑖 = 𝑗≠𝑖𝑄𝑖𝑗(𝜃𝑖|𝜃𝑖) 𝑞 − 1

(52)

Integral priors for Multiple comparison

Definition: πii) forMi

The integral priorπii) is the invariant σ-finite measure of the Markov chain with transition θi → θi0 defined by the following four steps

𝑀

𝑖

𝑀

𝑗

47 / 65

(53)

Testing for the exponential distribution

M1: E xp(θ1), θ1∈ I1 = (0, 1) M2: E xp(θ2), θ2∈ I2 = (1, +∞) M3: E xp(θ3), θ3∈ I3 = (0, +∞)

πNii) ∝ θ−1i 1Iii) ξi = log θi

i = 1, 2, 3

(54)

ξ1 < 0 ξ2 > 0 ξ3∈ R

−20 −15 −10 −5 0

0.00.10.20.30.40.50.6

0 5 10 15 20 25

0.00.10.20.30.4

−20 −10 0 10 20

0.000.020.040.060.080.100.120.14

49 / 65

(55)

Variable selection

(56)

Variable selection for the linear regression model

Full model

y = X β + ε, ε ∼ Nn(0, σ2I) πN(β, σ) ∝ 1/σ

β ∈ Rk, σ > 0

X = [x1, . . . , xk] an n × k full rank matrix and n > k xj = (x1j, . . . , xnj)0, j = 1, . . . , k

Usually x1 = 1n

51 / 65

(57)

Submodels

The full model is represented by the matrix X = (xij) ∈ Rn×k R a subsequence of I = {1, . . . , n} representing rows of X C a subsequence of J = {1, . . . , k} representing columns of X XR,C = (xij)i ∈R,j ∈C and XC= XI,C

The submodel MC is represented by the matrix XC

MC :y = XCβC+ εC, εC ∼ Nn(0, σC2I) πNC, σC) ∝ 1/σC

(58)

M

C1

versus M

C2

C1, σC1) → (βC0

1, σ0C

1)

Select random sequences, R and S, from I = {1, . . . , n}, with

|R| = |C2| + 1 and |S| = |C1| + 1, such that XRC2 and XSC1 be full rank matrices.

1 Simulate a training sample y2∼ N(XRC1βC1, σ2C

1I)

2 Simulate the posterior πNC2, σC2 | y2, XRC2)

3 Simulate a training sample y1∼ N(XSC2βC2, σC22I)

4 Simulate the posterior πNC0

1, σ0C

1 | y1, XSC1)

54 / 65

(59)

The caterpillar dataset: 2

10

= 1024 models

Y = log of the average number of nests of caterpillars per tree in an area k = 10 potential explanatory variables defined on n = 33 areas

x1 altitude x2 slope

x3 number of pines in the area ...

Bayesian core: a practical approach to computational Bayesian statistics.

Jean-Michel Marin and Christian P. Robert. Springer.

(60)

Full model: Integral prior and π

N

(θ | y)

56 / 65

(61)

Full model: Integral prior and π

N

(θ | y)

−100 0 100 200

0.0000.0040.0080.012

−0.10 −0.05 0.00 0.05 0.10

05101520

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

0.00.51.01.5

−5 0 5

0.000.050.100.150.200.250.30

0.000.020.040.060.080.100.120.000.020.040.060.080.100.12 050100150200250050100150200250 051015051015 0123401234

(62)

Full model: Integral prior and π

N

(θ | y)

−30 −20 −10 0 10 20 30

0.000.020.040.06

−4 −2 0 2 4 6

0.00.10.20.3

−100 −50 0 50 100

0.0000.0050.0100.0150.020

−60 −40 −20 0 20 40 60

0.0000.0100.0200.030

−3 −2 −1 0

0.00.10.20.30.40.50.60.7

−3 −2 −1 0

0.00.10.20.30.40.50.60.7

−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6

0123

−0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6

0123

−6 −4 −2 0 2 4

0.000.050.100.150.200.25

−6 −4 −2 0 2 4

0.000.050.100.150.200.25

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

58 / 65

(63)

Full model: Integral prior and π

N

(θ | y)

−10 −5 0 5 10 15

0.000.050.100.15

−60 −40 −20 0 20 40

0.000.010.020.030.04

−40 −20 0 20

0.000.010.020.030.040.050.06

0 5 10 15 20

0.000.100.200.30

0.00.51.01.50.00.51.01.5 0.00.10.20.30.40.00.10.20.30.4 0.00.10.20.30.40.50.00.10.20.30.40.5 0.00.51.01.52.02.53.00.00.51.01.52.02.53.0

(64)

The marginal distributions with integral priors

m(y) =Z

f (y | θ)π(θ)dθ =Z

f (y | θ)πN(θ) π(θ) πN(θ)dθ

≈ Z

f (y | θ)πN(θ) π(θ)ˆ

πN(θ)dθ ≈ π(ˆˆ θ) πN(ˆθ)

Z

f (y | θ)πN(θ)dθ

m(y) ≈ π(ˆˆ θ)

πN(ˆθ)mN(y)

60 / 65

(65)
(66)

Variable selection for Generalized linear models

For two Binomial regression models with a general link function

It can be done for multiple comparison!

1 Simulate a training sample y2

2 Simulate the posterior given y2

3 Simulate a training sample y1

4 Simulate the posterior given y1

62 / 65

(67)

Variable selection for Generalized linear models

For two Binomial regression models with a general link function

It can be done for multiple comparison!

1 Simulate a training sample y2

2 Simulate the posterior given y2

3 Simulate a training sample y1

(68)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ2I)

It can be done!

C1, σC1) → (βC0

1, σ0C

1)

1 Simulate a training sample y2:

it is easy

2 Simulate the posterior given y2: MCMC if it is necessary

3 Simulate a training sample y1: it is easy

4 Simulate the posterior given y1: MCMC if it is necessary

63 / 65

(69)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ2I)

It can be done!

C1, σC1) → (βC0

1, σ0C

1)

1 Simulate a training sample y2: it is easy

2 Simulate the posterior given y2:

MCMC if it is necessary

3 Simulate a training sample y1: it is easy

4 Simulate the posterior given y1: MCMC if it is necessary

(70)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ2I)

It can be done!

C1, σC1) → (βC0

1, σ0C

1)

1 Simulate a training sample y2: it is easy

2 Simulate the posterior given y2: MCMC if it is necessary

3 Simulate a training sample y1:

it is easy

4 Simulate the posterior given y1: MCMC if it is necessary

63 / 65

(71)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ2I)

It can be done!

C1, σC1) → (βC0

1, σ0C

1)

1 Simulate a training sample y2: it is easy

2 Simulate the posterior given y2: MCMC if it is necessary

3 Simulate a training sample y1: it is easy

4 Simulate the posterior given y1:

MCMC if it is necessary

(72)

Variable selection for Nonlinear regression models

y = g(X , β) + ε, ε ∼ Nn(0, σ2I)

It can be done!

C1, σC1) → (βC0

1, σ0C

1)

1 Simulate a training sample y2: it is easy

2 Simulate the posterior given y2: MCMC if it is necessary

3 Simulate a training sample y1: it is easy

4 Simulate the posterior given y1: MCMC if it is necessary

63 / 65

(73)

Conclusions and oncoming research

The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiNi |z)

This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress

(74)

Conclusions and oncoming research

The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiNi |z)

This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress

64 / 65

(75)

Conclusions and oncoming research

The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiNi |z)

This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general

Computation of Bayes factors with integral priors is work in progress

(76)

Conclusions and oncoming research

The application of integral priors just needs to simulate imaginay minimal training samples z from the involved models, and their posterior distributions πiNi |z)

This methodology can directly be applied to the comparison of nonnested models, that is a common restriction in other methodologies

Integral priors are obtained by simulation, and therefore the predictive distribution and the Bayes factors have not a closed form in general Computation of Bayes factors with integral priors is work in progress

64 / 65

(77)

Referencias

Documento similar