
2.4.1 Inferring Distributions and Normalizing Constants

Bayesian inference for mathematical models requires integrating out the parameter distribution in order to obtain posterior predictive distributions and marginalized likelihoods. Sometimes, this can be done analytically, namely when a certain likelihood is used with a so-called conjugate prior (Schlaifer and Raiffa, 1961). Then, the corresponding posterior can be inferred analytically (Box and Tiao, 1973). This holds for the variables of individual models and, in the M-closed setting, even on both levels of inference (MacKay, 1992), i.e., also for the distribution over all models. Conjugacy holds mostly for distributions of the exponential family such as the Binomial and the Gaussian, as respective examples of discrete and continuous distributions. The Gaussian distribution is even self-conjugate (with respect to its mean, for known variance): a Gaussian prior is a conjugate prior for the Gaussian likelihood (Equation 8), and together they yield a Gaussian posterior as the analytical solution to Bayesian inference. A small collection of other conjugate examples (e.g., DeGroot, 2005) is given in Table 2.

Table 2: Examples of conjugate distributions: the corresponding posterior distributions are analytically tractable if the specified likelihood is used with a conjugate prior.

likelihood p(D|Θm)    conjugate prior p(Θm)    posterior p(Θm|D)
------------------    ---------------------    -----------------
Gaussian              Gaussian                 Gaussian
Poisson               Gamma                    Gamma
Binomial              Beta                     Beta
Categorical           Dirichlet                Dirichlet
Multinomial           Dirichlet                Dirichlet
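As an illustration of the Binomial row of Table 2, the following minimal sketch performs the conjugate update in closed form. The prior parameters and the data (7 successes in 10 trials) are hypothetical values chosen only for this example, and the function name is not taken from the thesis.

```python
from fractions import Fraction


def beta_binomial_update(a, b, successes, trials):
    """Return the parameters of the Beta posterior.

    A Beta(a, b) prior combined with a Binomial likelihood
    (successes out of trials) yields a Beta(a + successes,
    b + failures) posterior, so no numerical integration is needed.
    """
    failures = trials - successes
    return a + successes, b + failures


# Hypothetical example: uniform Beta(1, 1) prior, 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(1, 1, successes=7, trials=10)

# The mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = Fraction(a_post, a_post + b_post)
print(a_post, b_post, posterior_mean)
```

The posterior is obtained by simple parameter bookkeeping, which is what makes conjugate pairs attractive whenever they are applicable.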

However, such analytical solutions are rarely applicable. A typical case in which they do apply is a Gaussian likelihood in linear regression with normally distributed regression parameters. Usually, and in particular for nonlinear models, Bayesian inference becomes analytically intractable and other methods have to be employed.

Most of these methods are based on statistical sampling of the distributions. In principle, the methods exploit that, first, Bayesian inference and uncertainty quantification rest upon the proportionality posterior ∝ likelihood · prior (see Equation 4), and, second, that Bayesian model rating rests upon turning this proportionality into an equality by marginalization.

The most straightforward numerical approach that provides both is plain Monte Carlo (MC) integration (e.g., Hammersley and Handscomb, 1964). With MC, we numerically draw N_MC independent samples ζ_i from a random variable Z with the distribution of interest q(ζ), e.g., a prior parameter pdf. With these samples, we can approximate the expected value E_q[f(ζ)] of a function f(ζ) (e.g., a likelihood function) over q(ζ), which is defined as:

\[ \mathbb{E}_q[f(\zeta)] = \int f(\zeta)\, q(\zeta)\, \mathrm{d}\zeta \approx \frac{1}{N_{MC}} \sum_{i=1}^{N_{MC}} f(\zeta_i) \qquad (22) \]
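Equation 22 can be sketched as follows for the case named in the text, where q(ζ) is a prior parameter pdf and f(ζ) is a likelihood function. The model (a single observation with Gaussian noise and a Gaussian prior on its mean) and all numerical values are assumptions made for this illustration; this conjugate setup is chosen only because the exact integral is known and can serve as a check.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def mc_expectation(f, samples):
    """Plain MC estimate of E_q[f] from independent samples of q (Equation 22)."""
    return np.mean(f(samples))


rng = np.random.default_rng(0)

# Hypothetical setup: prior theta ~ N(mu0, tau2), likelihood d ~ N(theta, sigma2).
mu0, tau2 = 0.0, 1.0
sigma2 = 0.25
d_obs = 0.8
n_mc = 200_000

# Draw N_MC independent samples from the prior q.
theta = rng.normal(mu0, np.sqrt(tau2), size=n_mc)

# Average the likelihood over the prior samples.
integral_mc = mc_expectation(lambda t: gaussian_pdf(d_obs, t, sigma2), theta)

# For this conjugate pair the integral is known: d ~ N(mu0, sigma2 + tau2).
integral_exact = gaussian_pdf(d_obs, mu0, sigma2 + tau2)
print(integral_mc, integral_exact)
```

The two printed values should agree up to the MC error, which shrinks as N_MC grows.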

BME is the likelihood function p(D|Θm) integrated (marginalized) over the parameter prior pdf p(Θm) and can therefore be evaluated with plain MC via Equation 22. MC is computationally demanding but converges reliably: convergence of the sample mean to the expectation is guaranteed by the law of large numbers (see Section 3.2.2). Therefore, I employ MC in this thesis (see Chapter 4).

Advanced (and related) methods like Markov Chain Monte Carlo (MCMC) allow for sampling from q(ζ) if it is not directly tractable but can be evaluated at ζ_i (see, e.g., Andrieu et al., 2003). Briefly, MCMC "jumps" across the target distribution and stores the accepted samples in a so-called chain. The acceptance of samples follows strict rules that assure convergence to the target distribution. In Bayesian inference, these jumps are in accordance with the above proportionality. Hence, the posterior is sampled and represented by the collection of samples in the chain, yet only up to an unknown normalizing constant. For normalization, further processing of the samples has to be applied (e.g., Vehtari et al., 2017). Such advanced techniques enable us to perform Bayesian inference efficiently by accessing the involved distributions with more sophisticated sampling strategies than plain MC; for example, some MCMC techniques employ several chains. They usually provide more informative samples faster but, as a trade-off, they typically have methodical parameters that require fine-tuning. Although these methods are not employed here, I provide more specifics on numerical methods and their ties to Bayesian inference in Appendix A.
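To make the idea of "jumping" over an unnormalized target concrete, the following is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC variants. It is an illustration only and not a method used in this thesis; the Gaussian model, the step size, the chain length and the burn-in are assumptions chosen so that the result can be compared with the known conjugate posterior.

```python
import numpy as np


def log_unnormalized_posterior(theta, d_obs=0.8, sigma2=0.25, mu0=0.0, tau2=1.0):
    """log(likelihood * prior) up to an additive constant (hypothetical model)."""
    log_like = -0.5 * (d_obs - theta) ** 2 / sigma2
    log_prior = -0.5 * (theta - mu0) ** 2 / tau2
    return log_like + log_prior


def metropolis(log_target, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis: only ratios of the target are needed."""
    chain = np.empty(n_steps)
    theta, log_p = theta0, log_target(theta0)
    for i in range(n_steps):
        proposal = theta + step_size * rng.normal()
        log_p_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
        # On rejection the current state is stored again.
        chain[i] = theta
    return chain


rng = np.random.default_rng(0)
chain = metropolis(log_unnormalized_posterior, theta0=0.0,
                   n_steps=50_000, step_size=1.0, rng=rng)
samples = chain[5_000:]  # discard an (assumed) burn-in phase

# Exact conjugate posterior for comparison: N(0.64, 0.2).
print(samples.mean(), samples.var())
```

The chain reproduces the shape of the posterior without ever evaluating its normalizing constant, which is why additional processing is needed if that constant is itself the quantity of interest.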

2.4.2 Bayesian Bootstrap

Like every other numerically estimated quantity, the model weights from BMS/BMA, Pseudo-BMS/BMA and Bayesian Stacking are subject to inferential uncertainty:

• First, the evaluation of weights in each method rests on quantities that are marginalized over the whole considered parameter distribution (p(Θm) or p(Θm|D∅)). Since these distributions are approximated by numerical samples, there is always uncertainty about the convergence of the marginalized quantities. This is a question of appropriate sampling algorithms and sufficient numerical sample sizes to ensure full convergence (Schöniger et al., 2014).

• Second, it remains unclear whether the used observations D are a sufficient proxy for the whole unknown data distribution q(y|Mtrue) (see the problem of finite data (Nearing and Gupta, 2018) in Section 1.1). Especially in predictive model selection or combination like Pseudo-BMS/BMA and Stacking, this uncertainty propagates to the estimated model weights and has to be accounted for, e.g., via a Bayesian Bootstrap approach as in Yao et al. (2017) for statistical models.

The Bayesian Bootstrap (BB) introduced by Rubin (1981) can be considered the Bayesian analogue of the frequentist bootstrap (Efron, 1979), which evaluates the uncertainty of sampled distributions by resampling (Efron and Tibshirani, 1994). Both provide a non-parametric approximation to the distribution of a random variable. The BB employs a uniform Dirichlet distribution, i.e., a distribution over distributions: the data points themselves stem from the data distribution q(y|Mtrue), and every available data point D_o in D is a sample from this distribution. Yet, D represents only one instance of y that follows q(y|Mtrue). Hence, any derived quantities like ln p(D_o|D∅, M_m) in Pseudo-BMS/BMA are also only a special [...] randomized samples, the Dirichlet distribution lends itself to be suitable.

The Dirichlet distribution is the conjugate prior (cf. Section 2.4.1) for both the multinomial and the categorical likelihood, the latter being a special case of the former (see Table 2). Let us take i.i.d. elements in a vector ζ as samples of some random variable Z. Via the Dirichlet distribution, their occurrences are assigned a posterior probability of 1 since they are contained in ζ, expressed as Dirichlet(1) with 1 ≡ (1, ..., 1) of length N_s. Samples of Z that are not in ζ are assigned zero probability since they have no probability under the sample cumulative distribution function (Rubin, 1981). Figuratively speaking, each sample has its own bin. The prior distribution of samples taken from these bins resembles a categorical distribution, with one "category" for each bin. The multinomial distribution generalizes this over multiple drawings from all these bins, and the Dirichlet distribution describes the respective posterior. The BB is explained in the following for ζ being the vector of logarithmic LOO predictive densities of model M_m, i.e., with ζ_{o,m} = ln p(D_o|D∅, M_m). Per bootstrapping replication b, with b = 1, ..., N_BB, the drawn posterior probabilities α_{1:N_s,b} for ζ follow (Yao et al., 2017):

\[ \alpha_{1:N_s,b} \sim \mathrm{Dirichlet}(\mathbf{1}) \qquad (23) \]

In the Bayesian Bootstrap procedure, we now draw N_BB statistically plausible and varying alternatives of ζ. These so-called bootstrapping replications yield the sampling-based bootstrapping distribution of the distribution of Z, over which any statistical moment can be inferred (Rubin, 1981), for instance the mean (Yao et al., 2017):

\[ \bar{\zeta}_{b,m} = \sum_{o=1}^{N_s} \alpha_{o,b}\, \zeta_{o,m} \qquad (24) \]

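From these, the N_BB replicates of the model weight of model M_m are simply estimated by:

\[ w_{b,m} = \frac{\exp(N_s \bar{\zeta}_{b,m})}{\sum_{m'=1}^{N_M} \exp(N_s \bar{\zeta}_{b,m'})} \qquad (25) \]

The expected weight over the whole BB distribution is then given by:

\[ w_m^{BB} = \frac{1}{N_{BB}} \sum_{b=1}^{N_{BB}} w_{b,m} \]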
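The procedure in Equations 23 to 25 and the subsequent averaging can be sketched as follows. The function name, the array layout (rows are data points o, columns are models m) and the randomly generated log predictive densities are assumptions made for this illustration; in practice, ζ_{o,m} would come from the LOO evaluation of each model. Subtracting the row maximum before exponentiating is a numerical safeguard added here and does not change the normalized weights.

```python
import numpy as np


def bb_model_weights(log_loo, n_bb, rng):
    """Bayesian Bootstrap of model weights.

    log_loo : array of shape (n_s, n_models) holding zeta_{o,m}.
    Returns the replicate weights w_{b,m}, shape (n_bb, n_models),
    and their mean over all replications, shape (n_models,).
    """
    n_s, _ = log_loo.shape
    # Equation 23: alpha_{1:N_s,b} ~ Dirichlet(1), one draw per replication.
    alpha = rng.dirichlet(np.ones(n_s), size=n_bb)   # shape (n_bb, n_s)
    # Equation 24: weighted mean of zeta per replication and model.
    zeta_bar = alpha @ log_loo                       # shape (n_bb, n_models)
    # Equation 25: normalized exponentials, stabilized by the row maximum.
    scores = n_s * zeta_bar
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Expected weight over the BB distribution.
    return weights, weights.mean(axis=0)


rng = np.random.default_rng(0)
# Hypothetical log LOO predictive densities: 30 data points, 3 models.
log_loo = rng.normal(loc=[-1.0, -1.1, -1.5], scale=0.5, size=(30, 3))

weights_b, weights_bb = bb_model_weights(log_loo, n_bb=2000, rng=rng)
print(weights_bb)
```

Because the function only reuses the already computed ζ_{o,m}, its cost is limited to a few array operations per replication.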

Bootstrapping typically counteracts extreme weights of 0 or 1 (Yao et al., 2017). The major strength of the BB is, however, that it allows likelihood statements to be formulated about the moments of the BB distribution (Rubin, 1981). This means that the BB mean of weights w_m^BB is, accounting for the uncertainty in the definiteness of the data D, more likely than the weights calculated directly without bootstrapping. If the weights calculated without bootstrapping are the same as after applying it, bootstrapping can be seen as a confirmation. The additional computational costs of the BB are very small because the quantities required to apply it (here, ζ_{o,m} = ln p(D_o|D∅, M_m)) are already available for calculating the model rating scores.
