The Bayesian probability of a model M given data D can be written via Bayes’ theorem (Eq. 2.1),
P(M|D) = p(D|M)×P(M)/p(D), (2.16)
where P is a probability and p is a pdf. This probability is proportional to our prior belief regarding the model, P(M). This is meaningful if we consider a ratio
of probabilities for models Ma and Mb, eliminating p(D) and permitting only relative prior belief in a model,
P(Ma|D)/P(Mb|D) | {z } Posterior odds =|p(D|Ma){z/p(D|Mb}) Bayes-factor ×|P(Ma){z/P(Mb}) Prior odds . (2.17) The Bayes-factor indicates how our prior odds ought to change because of the data, resulting in our posterior odds. The Bayes-factor is, in fact, a ratio of evidences found from parameter inferences in the models in the denominator in Eq. 2.4.
Individual evidences are meaningless — it is necessary to compare against a reference model with a Bayes-factor. If the Bayes-factor is greater than (less than) one, the model in the numerator (denominator) is favoured. The interpretation of Bayes-factors is somewhat subjective, though a popular choice is the Jeffreys’ scale, Table 2.4.1, to ascribe qualitative meanings to Bayes-factors. Prior odds are somewhat subjective and there might exist a spectrum of assigned prior odds amongst investigators. All investigators, however, will make identical conclusions from the posterior odds, if the Bayes-factor is sufficiently large.
The Bayes-factor quantitatively incorporates a principle of economy widely- known as Occam’s razor [58] and in physics as “fine-tuning” or “naturalness” [59– 61]. It is insightful to consider the evidenceZ = p(D|M)a function of the data
normalised to unity, i.e., as a sampling distribution [62,63]. Natural models “spend” their probability mass near the obtained data — a large fraction of their parameter space agrees with the data. Complicated models squander their probability mass away from the obtained data. This is illustrated in Fig. 2.4.1. Whilst it might be that a model is complicated because it has many parameters, and so makes a range of predictions for the data, a model with many parameters can be simple. Simple models are falsified by experiments more easily than complicated models, because
2.4. Evidences and p-values 41 they make sharper predictions. It might be said that Bayesian statistics reifies Occam’s razor, “fine-tuning” and “naturalness” arguments. “Naturalness” is no longer a nebulous, aesthetic criterion; it is formalised and justified by Bayesian statistics [64].
The frequentist test is a χ2 test: we assume that the χ2
Min has a particular distribution (its sampling distribution) were the experiments repeated ad infinitum, and calculate the p-value — the probability of obtaining a χ2 as large or larger than the observed χ2
Min:
p-value=1−Fχ2Min, N, (2.18)
where F is a cumulative χ2-distribution with N degrees of freedom. We assume that the distribution is χ2N, where N, the number of degrees of freedom, is the number of contributions to the χ2minus the number of parameters that were fitted to achieve χ2
Min. This assumption is reasonable if and only if the contributions to the χ2are approximately Gaussian and the parameter space completely maps the observables space. It could, in principle, be checked via Monte Carlo — we could repeat the following procedure many times: draw pseudo-experimental data from the sampling distributions (assuming that the best-fit point is true), fit the model’s parameters, and note the χ2
Min achieved, and lastly plot a histogram of the χ2Min. This, however, is completely unfeasible, because of the CPU time required to fit the model’s parameters.
One solution is to treat the pseudo-experimental data as a “perturbation,” and re-minimise the χ2only in the vicinity of the observed best-fit point. The feasibility of this, though, is not clear.
An easier but less robust check is to check only the sampling distribution of the χ2 at the best-fit point, rather than the distribution of χ2
Min. The procedure is identical, except that the parameters are not refitted. We expect the result will be χ2-distribution with a number of degrees of freedom equal to the number of contributions to the χ2, i.e., we do not check whether subtracting the number of fitted parameters to account for fitting is reasonable, only that the contributions to the χ2are approximately Gaussian.
We will consider p-values of less than 5% to be significant. This is aggressive; we have a appreciable chance of a type-II error.∗ One might wonder whether a
42 Chapter 2. Statistics
Grade Bayes-factor, B Interpretation
0 K>1 Favours hypothesis in numerator.
1 1>K>10−0.5 Barely worth mentioning.
2 10−0.5>K >10−1 Substantial. 3 10−1>K >10−1.5 Strong. 4 10−1.5>K >10−2 Very strong. 5 K<10−2 Decisive.
Table 2.4.1: The Jeffreys’ scale for interpreting Bayes-factors [57], which are ratios of evidences. We assume that the favoured model is in the denominator, though this could be readily inverted.
Observed data, DObs Data, D Evidence, Z = p( D |M )
Good simple model: concentrated probability mass at observed data.
Bad simple model: probability mass wasted here.
OK complicated model: spreads probability mass thinly.
”Naturalness” or Occam’s razor Good simple model
Bad simple model Complicated model
Figure 2.4.1: Illustration of the evidence, interpreted as a sampling distribution, originally from Ref. [63]. The observed evidence is the evidence evaluated at the observed data. The blue line shows a model that concentrates its probability mass at the observed data: it is a good, simple model. The green line shows a model that concentrates its probability mass away from the observed data: it is a bad, simple model. The red line shows a model that thinly spreads its probability mass around the observed data: it is an OK, complicated model. The complicated model might achieve a maximum likelihood greater than that of the good, simple model, but the simple model is favoured by the Bayes-factor.
2.5. Prior choices and effects 43