3.4.1 Adaptive Random Scan Gibbs Sampler
We can directly apply Theorem 11 to the ARSGS Algorithm 14 presented in Chapter 2. Let $p = (p_1, \dots, p_s)$ be a probability vector and assume that the target distribution sits on a product space $X_1 \times \dots \times X_s$. Recall that the RSGS proceeds at each iteration by first choosing a coordinate $i$ with probability $p_i$, and then updating that coordinate from its full conditional distribution. In the ARSGS Algorithms 9 and 14, the adaptations of the selection probabilities $p$ are separated by $k_i$ RSGS iterations. Therefore, if the sequence $k_i$ is chosen to be non-decreasing, the ARSGS already fits into the AirMCMC framework.
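The RSGS mechanism recalled above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the target (a zero-mean bivariate normal with unit variances and correlation `rho`) and the function name `rsgs_bivariate_normal` are hypothetical choices made here, not taken from Chapter 2.

```python
import numpy as np

def rsgs_bivariate_normal(p, rho, n_iter, rng):
    """Random scan Gibbs sampler for a zero-mean bivariate normal with unit
    variances and correlation rho (an illustrative target, assumed here).

    p is the vector of selection probabilities (p_1, p_2).
    """
    p = np.asarray(p, dtype=float)
    x = np.zeros(2)
    chain = np.empty((n_iter, 2))
    cond_sd = np.sqrt(1.0 - rho**2)
    for n in range(n_iter):
        i = rng.choice(2, p=p)   # choose coordinate i with probability p_i
        j = 1 - i
        # full conditional: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
        x[i] = rho * x[j] + cond_sd * rng.standard_normal()
        chain[n] = x
    return chain

chain = rsgs_bivariate_normal([0.5, 0.5], rho=0.5, n_iter=20000,
                              rng=np.random.default_rng(0))
```

An adaptive version would change `p` between blocks of such iterations; within a block the kernel is a fixed RSGS kernel.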
As we mentioned in Section 3.3, it is hard to verify the simultaneous geometric drift condition (3.9) for the ARSGS. On the other hand, the local simultaneous geometric drift condition (3.11) is a natural property for the ARSGS as long as the RSGS Markov kernel is geometrically ergodic for at least some selection probability vector $p = (p_1, \dots, p_s)$ (see Theorem 5 of Chapter 2). We summarise our observations in the following theorem.
Theorem 15. Let $\pi$ be a target distribution on $X_1 \times \dots \times X_s$, where $X_i = \mathbb{R}^{d_i}$ for some positive integers $d_1, \dots, d_s$. Consider a collection of RSGS kernels $P_p$ parametrised by the sampling weights $p = (p_1, \dots, p_s)$. Assume that the kernels $P_p$ satisfy Assumption 1 and that, for some $p = (p_1, \dots, p_s)$, $P_p$ is geometrically ergodic, i.e., (3.9) holds. Then:
1. The collection of kernels $P_p$ satisfies the local simultaneous drift condition (3.11).
2. The modified ARSGS Algorithm 14 described in Chapter 2, with the corresponding sequence of lags between adaptations $k_i = \lfloor c\, i^{\beta} \rfloor$ for some $\beta > 0$, $c > 0$, is an example of an AirMCMC algorithm for which i)-iii) of Theorem 11 hold.
Proof of Theorem 15. Part 1 follows from Theorem 5 of Chapter 2. Part 2 follows by a simple application of Theorem 11.
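The lag sequence in Theorem 15 is easy to tabulate. The sketch below computes $k_i = \lfloor c\, i^{\beta} \rfloor$ and the resulting iteration counts at which adaptations occur; the helper names `lags` and `adaptation_times` are hypothetical, and taking the adaptation times to be the cumulative sums of the lags is an assumption about how the lags are used.

```python
import math
from itertools import accumulate

def lags(c, beta, n_adapt):
    """Lags k_i = floor(c * i**beta) for i = 1, ..., n_adapt."""
    return [math.floor(c * i**beta) for i in range(1, n_adapt + 1)]

def adaptation_times(c, beta, n_adapt):
    """Iteration counts at which adaptation happens, taken here to be the
    cumulative sums of the lags (an assumption of this sketch)."""
    return list(accumulate(lags(c, beta, n_adapt)))

print(lags(1, 2, 4))              # [1, 4, 9, 16]
print(adaptation_times(1, 2, 4))  # [1, 5, 14, 30]
```

For any $\beta > 0$ the sequence is non-decreasing, as the AirMCMC framework requires. For small $c$ the first lags can round down to zero, which this sketch does not treat specially.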
Remark. To derive the CLT using iv) of Theorem 11, one needs the adapted selection probabilities to converge. We do not have a proof that the adapted selection probabilities converge at all. However, one could choose the learning rate $a_m$ in the settings of the ARSGS so that the adapted probabilities converge to a suboptimal value (i.e., take $a_m$ such that $\sum_{m=1}^{\infty} a_m < \infty$, where $a_m$ is defined in the settings of Algorithm 9 and used as a learning rate in Step 3.2 of the algorithm). In that case we are in a position to use iv) of Theorem 11 in order to verify the CLT.
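To see why a summable learning rate forces convergence, consider the following short argument. It assumes, purely for illustration, that the update in Step 3.2 has the additive form $p^{(m+1)} = p^{(m)} + a_m d_m$ with increments bounded by a constant, $|d_m| \le D$; the exact form of the update in Algorithm 9 is not reproduced here, and any projection step is ignored.

```latex
\sum_{m=1}^{\infty} \bigl| p^{(m+1)} - p^{(m)} \bigr|
  \;\le\; D \sum_{m=1}^{\infty} a_m \;<\; \infty .
```

Under this assumption the sequence $p^{(m)}$ is Cauchy and hence converges, for instance with $a_m = m^{-2}$. The limit depends on the whole trajectory, which is why it is in general suboptimal.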
3.4.2 Kernel Adaptive Metropolis-Hastings
Our results are also applicable to the Kernel Adaptive Metropolis-Hastings (KAMH) algorithm presented in Sejdinovic et al. [2014]. The idea behind the KAMH is to locally adapt the variance of a symmetric random walk proposal based on a subsample of the whole previous chain history. Thus, the adaptive chain $(X_n, \gamma_n)$ is not Markovian, so the results of Andrieu and Atchadé [2007] and Atchadé and Fort [2010] do not apply. However, one may easily put the algorithm into the Air framework. We shall provide conditions which ensure that i)-iii) of Theorem 10 hold for the AirKAMH and thus establish the SLLN and MSE convergence for the algorithm.
KAMH is an Adaptive Metropolis algorithm with a family of local proposals
$$Q_{Z,\nu}(x, \cdot) = N\bigl(x, \kappa I + \nu^2 M(x, Z)\bigr), \qquad (3.16)$$
where $M(x, Z)$ is a $d \times d$ positive semidefinite matrix that depends on the current position $x \in \mathbb{R}^d$ and a $d \times t$ matrix $Z$ (see (3.17) for the precise representation). Here each column $Z_i$, $i = 1, \dots, t$, of $Z$ is a randomly chosen state from the adaptive chain history, $\kappa$ is a fixed scale parameter (e.g., $\kappa = 0.2$), and $\nu$ is tuned on the fly in order to keep the average acceptance ratio around 0.234 (see e.g., Andrieu and Thoms [2008], Rosenthal [2011], Roberts et al. [1997]). Let $\{p_i\}$ be a sequence of probability weights slowly decaying to zero. Let $q_{Z,\nu}$ be the density corresponding to (3.16). The KAMH proceeds by iterating through the following steps:
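1. With probability $p_n$, subsample $Z = (Z_1, \dots, Z_t)$ from the whole current output $\{X_1, \dots, X_n\}$;
2. Sample a proposal $Y \sim Q_{Z,\nu}(X_n, \cdot)$;
3. Accept/reject the proposal using the standard Metropolis acceptance ratio $\alpha(X_n, Y) = \min\left(1, \frac{\pi(Y)\, q_{Z,\nu}(Y, X_n)}{\pi(X_n)\, q_{Z,\nu}(X_n, Y)}\right)$, where $q_{Z,\nu}(x, y)$ is the density of $Q_{Z,\nu}(x, \cdot)$ at $y$;
4. Tune the proposal variance $\nu$ to keep the average acceptance ratio around 0.234: $\nu := \exp\left(\log(\nu) + \frac{1}{\sqrt{n}}\{\alpha(X_n, Y) - 0.234\}\right)$.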
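The steps above can be sketched as a single function. This is a minimal illustration, not the reference implementation of Sejdinovic et al. [2014]: the names `kamh_step`, `log_normal_pdf`, `M_fn` and `log_pi` are hypothetical, the matrix $M(x, Z)$ is supplied by the caller as a callable, and subsampling with replacement is a choice made here so that the sketch also runs on short histories. The acceptance ratio uses the convention that $q_{Z,\nu}(x, y)$ is the density of $Q_{Z,\nu}(x, \cdot)$ at $y$.

```python
import numpy as np

def log_normal_pdf(y, mean, cov):
    """Log density of N(mean, cov) evaluated at y."""
    d = y.shape[0]
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def kamh_step(x, history, Z, nu, n, log_pi, M_fn, p_n, t, rng, kappa=0.2):
    """One KAMH iteration. M_fn(x, Z) must return the d x d matrix M(x, Z);
    it is a hypothetical callable supplied by the user."""
    # Step 1: with probability p_n, resample Z from the chain history
    # (with replacement: an assumption of this sketch)
    if rng.random() < p_n:
        idx = rng.choice(len(history), size=t, replace=True)
        Z = np.asarray(history)[idx].T   # d x t matrix of past states
    d = x.shape[0]
    cov_x = kappa * np.eye(d) + nu**2 * M_fn(x, Z)
    # Step 2: propose Y ~ N(x, kappa I + nu^2 M(x, Z))
    y = rng.multivariate_normal(x, cov_x)
    cov_y = kappa * np.eye(d) + nu**2 * M_fn(y, Z)
    # Step 3: acceptance probability pi(y) q(y, x) / (pi(x) q(x, y)), capped at 1
    log_ratio = (log_pi(y) + log_normal_pdf(x, y, cov_y)
                 - log_pi(x) - log_normal_pdf(y, x, cov_x))
    alpha = np.exp(min(0.0, log_ratio))
    x_new = y if rng.random() < alpha else x
    # Step 4: tune nu towards an average acceptance rate of 0.234
    nu_new = np.exp(np.log(nu) + (alpha - 0.234) / np.sqrt(n))
    return x_new, Z, nu_new, alpha
```

In an Air version, Steps 1 and 4 would be executed only at the scheduled adaptation times rather than at every iteration.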
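Implicitly, $M(x, Z)$ depends on a covariance kernel $k(x, y)$ in $\mathbb{R}^d$:
$$M(x, Z) = V(x, Z)\left(I_t - \tfrac{1}{t}\mathbf{1}_t\right) V^{\top}(x, Z), \qquad (3.17)$$
$$V(x, Z) = 2\bigl(\nabla_x k(x, z_1), \dots, \nabla_x k(x, z_t)\bigr),$$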
where $z_1, \dots, z_t$ are the columns of $Z$, $I_t$ is the $t \times t$ identity matrix and $\mathbf{1}_t$ is the $t \times t$ matrix of ones. If $k(x, y)$ is a linear kernel (i.e., $k(x, y) = x^{\top} y$), then $M(x, Z) = M(Z)$ does not depend on $x$ and approximates the global covariance structure of the target distribution. With more complicated kernels $k(x, y)$, e.g., the Gaussian or Matérn kernel (see Sejdinovic et al. [2014] for the definitions), the proposals $Q_{Z,\nu}(x, \cdot)$ allow for local approximation of the covariance structure. Thus, KAMH has the potential to adapt to distributions with complicated shapes.
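The matrix in (3.17) can be computed directly from the kernel gradients. The sketch below does so for the linear kernel and for a Gaussian kernel; the function names are hypothetical, and the parametrisation $k(x, y) = \exp(-|x - y|^2 / (2\sigma^2))$ with bandwidth `sigma` is an assumption made here (the text defers to Sejdinovic et al. [2014] for the definitions). For the linear kernel, $V = 2Z$ and so $M(Z) = 4 \sum_i (z_i - \bar z)(z_i - \bar z)^{\top}$, a multiple of the empirical covariance of the subsample.

```python
import numpy as np

def grad_linear(x, Z):
    """Columns are grad_x k(x, z_i) for the linear kernel k(x, y) = x^T y."""
    return Z.copy()

def grad_gaussian(x, Z, sigma=1.0):
    """Columns are grad_x k(x, z_i) for k(x, y) = exp(-|x - y|^2 / (2 sigma^2)).
    This parametrisation of the Gaussian kernel is an assumption."""
    diff = Z - x[:, None]                                    # columns z_i - x
    k = np.exp(-np.sum(diff**2, axis=0) / (2.0 * sigma**2))  # k(x, z_i)
    return diff * k / sigma**2

def M_matrix(x, Z, grad_k):
    """M(x, Z) = V (I_t - (1/t) 1_t) V^T with V = 2 * grad_k(x, Z)."""
    t = Z.shape[1]
    V = 2.0 * grad_k(x, Z)
    H = np.eye(t) - np.ones((t, t)) / t   # centring matrix I_t - (1/t) 1_t
    return V @ H @ V.T

rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 5))
M_lin = M_matrix(np.zeros(2), Z, grad_linear)     # same for every x
M_loc = M_matrix(np.zeros(2), Z, grad_gaussian)   # changes with x
```

Because the centring matrix is a projection, the result is positive semidefinite for any kernel gradient, so $\kappa I + \nu^2 M(x, Z)$ is a valid covariance matrix whenever $\kappa > 0$.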
Below we show that, if the target distribution has super-exponential tails and $(Z, \nu)$ are restricted to a compact domain, one can establish the simultaneous geometric ergodicity Assumption 2.
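Proposition 9. Assume that the target distribution $\pi$ in $\mathbb{R}^d$ has a density w.r.t. Lebesgue measure which is differentiable, bounded, has super-exponential tails, i.e.,
$$\limsup_{|x| \to \infty} \left\langle \frac{x}{|x|}, \nabla \log \pi(x) \right\rangle = -\infty,$$
and satisfies the curvature condition
$$\limsup_{|x| \to \infty} \left\langle \frac{x}{|x|}, \frac{\nabla \log \pi(x)}{|\nabla \log \pi(x)|} \right\rangle < 0,$$
where $|\cdot|$ and $\langle \cdot, \cdot \rangle$ are the norm and the scalar product in $\mathbb{R}^d$ respectively. Let $k(x, y)$ be a Gaussian or Matérn kernel. Then the collection of Metropolis kernels with proposals (3.16) satisfies Assumption 1 and the simultaneous geometric drift Assumption 2 for any compact set $\Gamma$ in $\mathbb{R}^{d \times t + 1}$.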
Proof of Proposition 9. See Appendix A.
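As a quick check of the two tail conditions in Proposition 9, take the standard Gaussian target $\pi(x) \propto \exp(-|x|^2/2)$ (an illustrative choice, not an example from the text), for which $\nabla \log \pi(x) = -x$:

```latex
\left\langle \frac{x}{|x|}, \nabla \log \pi(x) \right\rangle = -|x| \to -\infty ,
\qquad
\left\langle \frac{x}{|x|}, \frac{\nabla \log \pi(x)}{|\nabla \log \pi(x)|} \right\rangle = -1 < 0 .
```

Both conditions hold. By contrast, for a radially exponential tail $\pi(x) \propto \exp(-|x|)$ one has $\nabla \log \pi(x) = -x/|x|$ away from the origin, so the first inner product equals $-1$ and does not diverge; such a target is not super-exponentially tailed.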
For the Air version of the KAMH, we update $Z$ in Step 1 at the pre-specified times $N_i$ defined in (1.6), whereas the proposal parameter $\nu$ in Step 4 could be updated at the times $\lfloor N_i / l \rfloor$ for some integer $l \ge 1$, in the same manner as in Algorithm 16 of Section 3.1.
Theorem 16. Assume that the target distribution is super-exponentially tailed, differentiable, bounded, and $(Z, \nu)$ are restricted to any compact domain $\Gamma \subset \mathbb{R}^{d \times t + 1}$. Then for an Air version of the KAMH, i)-iii) of Theorem 10 hold.
Proof of Theorem 16. Follows from Proposition 9.
Remark. Due to Step 1 of the KAMH, the adapted parameter $\gamma = (\nu, Z)$ does not converge, since we randomly subsample $Z$ infinitely often. Thus, we cannot apply iv) of Theorem 10 to derive the CLT.
3.5 Discussion
In this chapter we introduced a class of AMCMC algorithms, AirMCMC, where adaptations are separated by a sequence of increasing lags $\{n_k\}$. In Section 3.2 we proved that the simultaneous or local simultaneous drift Assumptions 2 or 4 imply the SLLN, MSE convergence and, if the adapted parameter converges, the CLT for the AirMCMC. The same technique was used to prove the SLLN and CLT under the simultaneous polynomial drift Assumption 3.
In Sections 3.1 and 3.4 we have demonstrated that many of the known AMCMC algorithms can be put into the Air framework (Algorithms 4 and 17). In Section 3.4 we have seen that this can lead to algorithms with theoretical underpinning for the asymptotic convergence properties of the averages (3.1). Moreover, in Section 3.1 we have demonstrated empirically that including a lag between the adaptations does not necessarily slow down convergence of the adaptive algorithm. On the contrary, in Section 3.1.2 we observed a computational speed-up, since the Air version of the adaptive algorithm spent less time adapting the parameter.
Our settings differ from what we have seen in the literature, since the diminishing adaptation condition (3.15) does not necessarily hold. As we have seen in Section 3.2.4, without the diminishing adaptation condition, the AirMCMC algorithm might fail to converge in distribution. This does not affect the properties of the ergodic averages (3.1), and it is easy to impose a condition which guarantees convergence in distribution, as we have proven in Theorem 13 of Section 3.5.
We have discussed in Section 3.3 that our settings are closely related to those of Gilks et al. [1998], where the authors consider AMCMC with adaptations allowed to happen only at the regeneration times of the underlying Markov chains. It follows that, in the settings of Gilks et al. [1998], one can establish the MSE convergence and the CLT of the AMCMC. Unfortunately, the framework of Gilks et al. [1998] is not useful in high-dimensional settings, since the regeneration rates deteriorate to zero exponentially in the dimension. On the other hand, by introducing a sequence of increasing lags $\{n_k\}$ between adaptations that grow sufficiently fast, the underlying Markov chains between the adaptations regenerate with probability increasing to 1, which allows us to exploit the technique of Gilks et al. [1998] in the proofs of the main results.
An important open question about the design of AirMCMC algorithms is the optimal choice of the sequence $\{n_k\}$, which could potentially be established through information-theoretic arguments (see MacKay [2003]).