Bayesian inference methods have become the standard for cosmological analyses for a number of reasons: the prevalence of incomplete astronomical data sets, the need to combine multiple heterogeneous data sets, and the benefits of quantifying beliefs as probabilistic distributions. The power of Bayesian analysis is the capability to deal with partial data, by updating existent or ‘prior’ beliefs with each new or additional data set. This method is mathematically described in Bayes’ theorem:
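\begin{align}
P(\Theta \mid D, H) &= \frac{P(D \mid \Theta, H)\, P(\Theta \mid H)}{P(D \mid H)} \tag{3.5} \\
&\propto L(\Theta)\, P(\Theta \mid H) \tag{3.6}
\end{align}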
where H is a hypothesis or model, described by a vector Θ of parameters, and D represents the data set as a whole (i.e. all observations which can be compared to the model). Equation 3.5 is the cornerstone of Bayesian analysis; simply put, it states that the posterior P(Θ|D, H) is proportional to the product of the likelihood L(Θ) = P(D|Θ, H) and the prior P(Θ|H).
These symbols can be broken down as beliefs about Θ and D. The prior probability distribution function (PDF) reflects the existing beliefs about the parameters Θ, before encountering data. The PDF is defined over a subset of n-dimensional space (where n is the number of parameters in Θ). The likelihood is the probability of the model parameters Θ producing the observed data D; that is, the probability of measuring D, given Θ. A function of both the data and the theoretical model, the likelihood is the mechanism for updating beliefs about the probability of the model parameters Θ. The posterior PDF reflects the updated beliefs about the parameters, or the probability distribution of Θ given (having measured) the data D. The evidence P(D|H) in the denominator reflects all observations and is constant (independent of the hypothesised parameters), leading to the second, more commonly seen, form of Bayes’ theorem as a proportionality relation (Equation 3.6). Often, the hypothesis H is implied, and omitted from the above equations.
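As a minimal illustration of Equations 3.5 and 3.6, the sketch below evaluates a posterior on a grid for a single parameter. The toy setup is assumed purely for illustration and is not taken from the analyses in this work: a Gaussian likelihood with known scatter, a flat prior, and the hypothetical names `data`, `sigma` and `theta_grid`.

```python
import numpy as np

# Hypothetical toy data: five measurements of one quantity with known scatter.
data = np.array([0.9, 1.1, 1.3, 0.8, 1.0])
sigma = 0.2

# Grid over the single model parameter theta (the underlying mean).
theta_grid = np.linspace(0.0, 2.0, 2001)
dtheta = theta_grid[1] - theta_grid[0]

# Flat (unnormalised) prior P(theta | H) on [0, 2].
prior = np.ones_like(theta_grid)

# Log-likelihood log P(D | theta, H) for independent Gaussian errors,
# with theta-independent constants dropped.
residuals = data[:, None] - theta_grid[None, :]
log_like = -0.5 * np.sum((residuals / sigma) ** 2, axis=0)

# Unnormalised posterior (Equation 3.6), then normalise so it integrates to 1.
unnorm = np.exp(log_like - log_like.max()) * prior
evidence = np.sum(unnorm) * dtheta  # proportional to P(D | H); constant in theta
posterior = unnorm / evidence

# Summary statistics of the posterior PDF.
post_mean = np.sum(theta_grid * posterior) * dtheta
post_std = np.sqrt(np.sum((theta_grid - post_mean) ** 2 * posterior) * dtheta)
print(f"posterior mean = {post_mean:.3f}, std = {post_std:.3f}")
```

A grid is only practical for one or two parameters; the cost grows exponentially with dimension, which is the motivation for the sampling methods discussed later in this section.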
The common goal of most cosmological analyses is to glean more information about a model of the Universe. This can occur in several ways: testing the fit of a hypothesised model, selecting between multiple competing models, or constraining the parameters of a model. Bayesian inference is used for all of these; Hobson et al. (2014) review a myriad of applications of Bayesian methods to cosmology. We concentrate on parameter estimation: unless otherwise specified, the problem we want to solve is to determine (i.e. obtain posterior PDFs of) the parameters of the standard ΛCDM or wCDM cosmological model.
§3.2 Bayesian methods for parameter estimation 43
While Bayes’ theorem appears simple, the computations involved are often complex, or even computationally intractable. Rather than applying Equation 3.5 directly, probabilistic methods are used to find the posterior PDF. These rely on sampling efficiently and faithfully over the prior distribution, making use of Monte Carlo or other simulation methods, and on the convergence of sample averages over large numbers of draws (the law of large numbers, with the central limit theorem describing the remaining scatter). We first introduce the basics of parameter estimation, then discuss the most common approach, Markov Chain Monte Carlo (MCMC), in Section 3.2.2.
3.2.1 Classical curve fitting
Perhaps the best-known form of parameter estimation is fitting a line or curve to data. The simplest form this can take is a line f(x) = mx + b. For a data set (x, y), many lines parametrised by gradient-intercept pairs (m, b) are tested, and the one ‘closest’ to the data under some metric is selected. This holds more generally for any curve and number of dimensions. We can formalise this by replacing the vector x of data with the matrix X, whose columns are individual data points and whose rows are different variables, such that each observation is represented by a vector X_i. The curve or function to fit is generalised to f(X_i, Θ), for a vector X_i of data and model parameters Θ. Under a Euclidean metric (i.e. a least-squares fit), the statistic to minimise is
\begin{equation}
\sum_i \left( y_i - f(X_i, \Theta) \right)^2, \tag{3.7}
\end{equation}
where the sum is over all data points. The data points in Equation 3.7 have equal weight, whereas in practice some points often have larger errors, and the fit should tolerate larger residuals at these points. A modification of Equation 3.7 to account for this minimises the χ² statistic:
\begin{equation}
\chi^2 = \sum_i \frac{\left( y_i - f(X_i, \Theta) \right)^2}{\sigma_{y_i}^2}. \tag{3.8}
\end{equation}
However, Equation 3.8 still assumes that the errors σ_{y_i} in the data are uncorrelated. Neglecting to account for correlated systematic errors can lead to biases and/or gross underestimation of errors in results, an effect which motivates most of the work within this thesis. Correlations between data uncertainties can be written using a covariance matrix:
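\begin{equation}
C_{ij} = \langle \sigma_{y_i} \sigma_{y_j} \rangle. \tag{3.9}
\end{equation}
Generalising Equation 3.8 further gives
\begin{equation}
\chi^2 = \left( y - f(X, \Theta) \right) C^{-1} \left( y - f(X, \Theta) \right)^T, \tag{3.10}
\end{equation}
where the summation (as in Equation 3.8) is implicit. Assuming normal distributions, the likelihood can be written, up to a normalisation constant, as a function of the data and covariance as
\begin{align}
L(\Theta) &= \exp\left( -\frac{\chi^2(\Theta)}{2} \right) \tag{3.11} \\
-\log L &= \frac{1}{2} \left( y - f(X, \Theta) \right) C^{-1} \left( y - f(X, \Theta) \right)^T. \tag{3.12}
\end{align}
The covariance matrices described in Section 3.4 fit into this picture within the likelihood term, as a way of quantifying correlated uncertainties in the data.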
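The χ² of Equation 3.10 and the negative log-likelihood of Equation 3.12 can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function names `chi2_cov` and `neg_log_like`, the three data points, and the size of the correlated systematic are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def chi2_cov(y, model, cov):
    """chi^2 = r C^{-1} r^T (Equation 3.10) for the residual r = y - model."""
    r = np.asarray(y, dtype=float) - np.asarray(model, dtype=float)
    # Solve C x = r rather than forming C^{-1} explicitly (more stable).
    return float(r @ np.linalg.solve(cov, r))

def neg_log_like(y, model, cov):
    """-log L = chi^2 / 2 (Equation 3.12), dropping the normalisation constant."""
    return 0.5 * chi2_cov(y, model, cov)

# Hypothetical example: three points and a straight-line model f(x) = m x + b.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 1.2, 1.9])
m, b = 1.0, 0.0
sigma = np.array([0.1, 0.2, 0.1])

# A diagonal covariance reproduces the uncorrelated chi^2 of Equation 3.8.
cov_diag = np.diag(sigma**2)
# A fully correlated systematic of size 0.1 fills in the off-diagonal terms.
cov_sys = cov_diag + 0.1**2 * np.ones((3, 3))

print(chi2_cov(y, m * x + b, cov_diag), chi2_cov(y, m * x + b, cov_sys))
```

Using `np.linalg.solve` avoids inverting the covariance matrix directly; for large matrices evaluated many times, a Cholesky factorisation computed once would be a natural alternative.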
3.2.2 Monte Carlo sampling methods
This section discusses some Bayesian methods for estimating cosmological parameters, specifically those which use Monte Carlo simulation to explore a parameter space. We use Markov Chain Monte Carlo (MCMC) extensively in the analyses in Chapters 4 and 6, and nested sampling in the higher-dimensional fits in Chapter 4. Section 6.5.1 contains a description of our use of MCMC in analysing DES data, while Section 4.4 refers to our fits using nested sampling with the MultiNest algorithm.
In Bayesian inference, the goal is to estimate a posterior probability distribution function (PDF) from the data. This is often achieved by sampling or ‘walking’ the parameter space, given a prior PDF reflecting the beliefs held before examining the data, and a likelihood that is a function of the data and model. From these two ingredients, the posterior PDF can be retrieved through Monte Carlo methods. For each parameter, summary statistics condense the information contained in the posterior into a smaller but sufficient set of numbers, for example the mean, the standard deviation, and the maximum-likelihood value. In large parameter spaces, many parameters in Θ are necessary for computing the likelihood but are not parameters of interest; these are called nuisance parameters and are marginalised over, i.e. the posterior is integrated over all values of those parameters.
In higher dimensional parameter spaces, the computational expense of calculating and integrating the likelihood necessitates Monte Carlo techniques to statistically sample the parameter space.
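Marginalisation over a nuisance parameter is particularly simple once samples from the joint posterior are available: the nuisance column is ignored. The sketch below assumes a stand-in for a real chain, namely draws from a correlated two-dimensional Gaussian in a parameter of interest (labelled `w` here) and a hypothetical nuisance parameter `M`; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for a posterior sample: a correlated 2D Gaussian in
# (w, M), where w is the parameter of interest and M a nuisance parameter.
rng = np.random.default_rng(3)
mean = np.array([-1.0, 0.0])
cov = np.array([[0.04, 0.018],
                [0.018, 0.01]])
samples = rng.multivariate_normal(mean, cov, size=50000)

# Marginalising over M: with samples, simply ignore the M column.
w_samples = samples[:, 0]
w_mean, w_std = w_samples.mean(), w_samples.std()

# Fixing the nuisance parameter instead (conditioning on a value of M) gives
# a narrower distribution for w when the two parameters are correlated.
w_std_fixed = np.sqrt(cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1])

print(w_mean, w_std, w_std_fixed)
```

The comparison with `w_std_fixed` shows why marginalising rather than fixing nuisance parameters matters: holding them at a single value would understate the uncertainty on the parameter of interest.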
Markov Chain Monte Carlo
MCMC is by far the best known and most widely used Bayesian sampling technique in cosmology and the wider sciences. A Markov chain is a discrete-time stochastic process in which each step depends only on the state before it, or on the n previous states (that is, it has no longer-term memory), and ‘Monte Carlo’ refers to the broader technique of randomly sampling a distribution many times to solve a numerical problem. The combination of the two, MCMC, describes a class of methods now standard in cosmology (Lewis & Bridle, 2002), characterised by a series of points (walkers) which move through the parameter space as dictated by algorithms determined from the likelihood function, and converge to the desired posterior.
In all MCMC methods, a likelihood and prior PDF are required, the latter often taken from a family of tractable distributions, e.g. uniform or Gaussian. A number of samplers or walkers have their initial positions generated from the prior distribution. At each step of each walk, a transition to a new position is proposed by drawing from a proposal density distribution, and accepted or rejected according to some criterion determined from the likelihood. There are numerous algorithms for computing this process, the most common being the Metropolis-Hastings and Gibbs algorithms. The cumulative trajectories of the walkers are referred to as chains, and the distribution of their positions in the parameter space informs the posterior PDF. Sensitivity to the starting position of each walker can be mitigated by burn-in (discarding the first N points of each chain), and by various tests for convergence.
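The propose, accept or reject, and burn-in steps described above can be sketched with a random-walk Metropolis sampler, the simplest case of Metropolis-Hastings (a symmetric proposal). This is an illustrative single-walker example, not the sampler used in this work: the target `log_posterior`, the starting point, and the step size are all assumptions.

```python
import numpy as np

def log_posterior(theta):
    """Hypothetical target: a flat prior on [-10, 10] times a unit Gaussian
    likelihood centred on 1.0 (log scale, constants dropped)."""
    if abs(theta) > 10.0:
        return -np.inf
    return -0.5 * (theta - 1.0) ** 2

def metropolis(log_post, start, n_steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    chain = np.empty(n_steps)
    current, current_lp = start, log_post(start)
    for i in range(n_steps):
        proposal = current + step_size * rng.normal()
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current  # a rejected move repeats the current point
    return chain

rng = np.random.default_rng(42)
# Start deliberately far from the peak to show why burn-in is discarded.
chain = metropolis(log_posterior, start=8.0, n_steps=20000, step_size=1.0, rng=rng)
samples = chain[2000:]  # discard burn-in
print(samples.mean(), samples.std())
```

Working with the log-posterior avoids numerical underflow, and only the ratio of posteriors enters the acceptance step, so the evidence in Equation 3.5 never needs to be computed.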
Limitations of MCMC include its reliance on an explicitly computed likelihood, which may be computationally complex; its potential dependence on the starting position (where it is possible to get stuck in a local extremum); and difficulties in testing for convergence. The common algorithms for determining transitions, Metropolis-Hastings and Gibbs, can be slow, particularly in higher dimensions; some of the following variants of MCMC (sometimes referred to as ‘MCMC-like’ techniques, or included within the MCMC umbrella) perform better in some circumstances.
Nested sampling and other variants
Nested sampling (Skilling, 2004) is a similar technique to MCMC, with the notable characteristic of using the likelihood function to map the many-dimensional parameter space into one dimension. It does this by determining a sequence of subspaces of the prior (with size called ‘prior mass’) enclosed by contours of equal likelihood, each of which encloses the next. This effective reduction in dimensionality means nested sampling performs better than MCMC in higher-dimensional spaces; for this reason we choose it for fits with more than approximately six parameters. A common nested sampling algorithm, MultiNest (Feroz & Hobson, 2008; Feroz et al., 2009, 2013), with Python implementation PyMultiNest, is designed for posteriors which may have multiple peaks or ‘modes’, or large degeneracies (Hobson et al., 2014). In MultiNest, the likelihood is evaluated at a set of live points (similar to the ‘walkers’ in MCMC), drawn initially from the prior distribution. At each step the live point with the lowest likelihood is replaced with a new point drawn from within its iso-likelihood contour. In this way the live points are iteratively replaced, and the sequence of contours (specifically, their prior masses) shrinks monotonically until convergence, at which point the posterior PDF is recovered from the positions and histories of the live points, which play a role similar to MCMC chains.
A variant of MCMC called simulated annealing uses thermodynamic analogies to ‘speed up’ (increase the gradient of) walkers when in areas of low probability, with the effect of searching more broadly in those areas, while sampling the higher-probability areas with greater resolution. Similarly, Hamiltonian Monte Carlo models the ‘energy’ of the system using a Hamiltonian function and favours states with lower energy (that is, higher posterior probability), improving performance compared to random-walk MCMC. Sequential (or Particle) Monte Carlo increases computational efficiency by pooling points in ‘particles’, where a transition kernel function determines the move to the next stage.
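The nested sampling loop described above (discard the lowest-likelihood live point, shrink the prior mass, draw a replacement inside the contour) can be sketched for a one-dimensional toy problem. This is not MultiNest: the likelihood, the prior width, and the function name `nested_sampling` are assumptions, and the replacement step exploits the symmetry of the toy likelihood, whereas MultiNest bounds the live points with ellipsoids.

```python
import math
import random

def likelihood(x):
    """Hypothetical toy likelihood: a unit Gaussian centred on zero."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nested_sampling(n_live=200, n_iter=2000, half_width=5.0, seed=1):
    """Estimate the evidence Z for a uniform prior on [-half_width, half_width]."""
    rng = random.Random(seed)
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    evidence = 0.0
    x_prev = 1.0  # prior mass enclosed by the current contour
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=lambda k: likelihood(live[k]))
        l_worst = likelihood(live[worst])
        x_now = math.exp(-i / n_live)  # expected shrinkage of the prior mass
        evidence += l_worst * (x_prev - x_now)
        x_prev = x_now
        # Replace the worst point with a prior draw inside its iso-likelihood
        # contour. For this symmetric 1D toy the contour is the interval
        # |x| < |x_worst|, so it can be sampled directly.
        bound = abs(live[worst])
        live[worst] = rng.uniform(-bound, bound)
    # Add the contribution of the remaining live points.
    evidence += x_prev * sum(likelihood(x) for x in live) / n_live
    return evidence

z_est = nested_sampling()
z_true = math.erf(5.0 / math.sqrt(2.0)) / 10.0  # analytic evidence for this toy
print(z_est, z_true)
```

The estimate is stochastic, with a scatter in log Z that shrinks roughly as the inverse square root of the number of live points. Posterior samples could be obtained from the same run by weighting each discarded point by its likelihood times the prior-mass interval it was assigned.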
3.2.3 Beyond MCMC
The above Bayesian methods for parameter estimation, MCMC and nested sampling, have served astronomers well thus far; however, their limitations include the need for theoretically prescribed functions in the likelihood, namely models of the data, including its systematic errors. Computing these explicitly and to the degree of precision required can be challenging in modern and future cosmological surveys, where robustness to systematics is especially important. Throughout the analyses in this work, we make good estimates of both the data and systematics: a model for distances to spectroscopically normal SNe Ia, assuming a standard FLRW model of the Universe with wCDM or ΛCDM cosmology, with statistical and systematic errors for these data quantified by covariance matrices (described primarily in Section 3.4, but also in Sections 4.5.3 and 6.5.2). We work using these robust assumptions, but are aware that we rely on them, and also recognise some recent developments of Bayesian methods which are independent of these assumptions, or likelihood-free. We describe the motivations for and details of two such methods: approximate Bayesian computation and Bayesian hierarchical methods.
Approximate Bayesian computation
There are benefits to forgoing an explicit likelihood term: it may be difficult to model (e.g. selection effects in SN Ia cosmology), and/or covariance matrices can be computationally expensive to evaluate and invert at every point. Approximate Bayesian Computation (ABC) circumvents this by using forward model simulation: for each set of model parameters in parameter space, reliably generating a realisation of the data. For supernovae, any simulation package that can generate realistic observations (lightcurves) from a cosmology and SN parameters can be used, with the most common examples being SNANA (Kessler et al., 2009a) and sncosmo (Barbary et al., 2016). Recently ABC has been formalised for cosmology in Jennings & Madigan (2017) and applied to supernovae in Jennings et al. (2016). In these works Sequential Monte Carlo is used for efficiency.
By going straight from the model to simulated ‘model’ data which can be compared with observed data, ABC allows likelihood-free parameter estimation: the aim is to model the posterior faithfully without evaluating a likelihood. A mathematical model for the data is still necessary, specifically for turning parameters drawn from the prior into simulated data; however, the likelihood no longer needs to be calculated. The comparison is then between the forward-modelled data and the actual observed data. Points drawn from the prior are accepted if their simulated data are sufficiently or ‘approximately’ close in distribution to the observations. Eventually, the distribution of accepted points is a model for the posterior (as with MCMC, in finite time this is approximate). As with MCMC, a PDF for each model parameter is obtained by selectively rejecting sample points according to some criterion.
In continuous systems, two distributions will match exactly with probability zero. Thus the two distributions (the observed data, and the data forward-modelled from points drawn from the prior) are compared using a distance metric (e.g. Euclidean, evaluated at some number of points in the distribution) together with a threshold ε. Instead of comparing a full forward-modelled distribution, it is often useful to compare a small number of summary statistics (e.g. the mean), which is easier.
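The accept-if-close logic can be sketched with the simplest form of ABC, a rejection sampler. Note that this is plain rejection ABC rather than the Sequential Monte Carlo variant used in the works cited above, and that everything specific here is assumed for illustration: the Gaussian forward model `simulate`, the sample mean as the summary statistic, the flat prior range, and the threshold `epsilon`.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Observed" data: hypothetical draws from a Gaussian with unknown mean.
true_mu = 0.7
observed = rng.normal(true_mu, 1.0, size=200)

def simulate(mu, size, rng):
    """Forward model: generate a realisation of the data given parameter mu."""
    return rng.normal(mu, 1.0, size=size)

def summary(data):
    """Summary statistic used for the comparison (here the sample mean)."""
    return data.mean()

def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated summary lies within epsilon of the data."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-3.0, 3.0)  # draw from a flat prior
        s_sim = summary(simulate(mu, observed.size, rng))
        if abs(s_sim - s_obs) < epsilon:  # distance below the threshold
            accepted.append(mu)
    return np.array(accepted)

posterior_samples = abc_rejection(observed, n_draws=20000, epsilon=0.05, rng=rng)
print(len(posterior_samples), posterior_samples.mean(), posterior_samples.std())
```

No likelihood is evaluated anywhere in this sketch; only the ability to simulate data is required. Shrinking `epsilon` makes the accepted sample a better approximation to the posterior at the cost of a lower acceptance rate, which is the inefficiency that sequential schemes are designed to reduce.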
Bayesian hierarchical methods
A limitation of maximum likelihood methods is that they only allow for errors in the ‘Y’ or dependent data (in this context the distance moduli µ of supernovae) to be taken into account (via covariance matrices), and not errors in the ‘X’ or independent data (the redshift z). This limitation was one motivation (Gull, 1989) for Bayesian hierarchical methods (BHM), which include layers of hierarchy in the model parameters. In BHM there is a distinction between observed quantities and hidden latent variables, to reflect that observation is inexact, and that there are degrees of knowledge obtained through observations. While performing simulations for BHM, the ‘true’ values of parameters are separated from their observed values. Similarly, intrinsic variation in observables is separated from observational uncertainties. A model in BHM is characterised by its hyperparameters, in this context the parameters of its prior distributions (rather than of the model), defined by informative priors (hyperpriors). Instances of BHM applied to supernovae include Mandel et al. (2009); March et al. (2011); Rubin et al. (2015); Shariff et al. (2016); Wolf et al. (in prep.); Hinton et al. (in prep.); the mathematics of various applications of BHM is formalised in these works.
Of these, Hinton et al. (in prep.) and Wolf et al. (in prep.) describe methods applied to the DES data discussed in Chapter 6. Hinton et al. (in prep.) simulate layers of supernova observables including selection effects, separating observed quantities from simulated hyperparameters of the parent SN Ia populations and underlying cosmology. A likelihood function is computed explicitly, but in a large number of parameters; the many-dimensional parameter space is sampled using Hamiltonian Monte Carlo. In contrast, BAMBIS (Wolf et al., in prep.), like ABC, involves forward modelling SN Ia lightcurves at each point in parameter space. The model is allowed to include stochastic effects, assessed through repeated simulations from the same point in parameter space to realise sampling variations. Monte Carlo simulations, rather than prescribed values, are used for the model mean and covariance. The distributions (modelled and observed) are compared, and selection effects and sampling variance are taken into account in the MC simulations of the underlying distributions.