In this section, we review several Variance Reduction (VR) methods. In the context of stochastic optimization algorithms, VR methods are used to obtain an estimator of the recourse function with a smaller variance and sampling error than the Crude Monte Carlo (CMC) method, while using fewer samples. The VR methods studied here are: Antithetic Variates (AV), Latin Hypercube Sampling (LHS), Quasi-Monte Carlo (QMC) and Importance Sampling (IS).
2.8.1 Antithetic Variates
The Antithetic Variates (AV) method is a Variance Reduction method. Its principle is to generate negatively correlated pairs $(\xi^j, \bar{\xi}^j)$, $j = 1, \dots, N/2$, such that $\xi^j$ is generated from the uniform distribution $U[0,1]^{d_\xi}$ and $\bar{\xi}^j$ is given as $1 - \xi^j$. The AV estimator is then calculated as:

$$\frac{1}{N}\sum_{j=1}^{N/2}\left[Q(x,\xi^j) + Q(x,\bar{\xi}^j)\right] \tag{2.48}$$
AV is an unbiased estimator. Its variance is given by:

$$\frac{\sigma^2(x)}{N} + \frac{\mathrm{Cov}\left(Q(x,\xi^j),\, Q(x,\bar{\xi}^j)\right)}{N} \tag{2.49}$$
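The variance expression (2.49) can be checked in a few lines. The sketch below assumes that the $N/2$ pairs are mutually independent and that $Q(x,\xi^j)$ and $Q(x,\bar{\xi}^j)$ each have variance $\sigma^2(x)$, which holds because $\xi^j$ and $1 - \xi^j$ are both uniformly distributed. The symbol $\hat{Q}_{AV}$ for the estimator (2.48) is notation introduced here for illustration only.

$$
\begin{aligned}
\mathrm{Var}\left(\hat{Q}_{AV}\right)
&= \frac{1}{N^2}\sum_{j=1}^{N/2}\mathrm{Var}\left(Q(x,\xi^j) + Q(x,\bar{\xi}^j)\right) \\
&= \frac{1}{N^2}\cdot\frac{N}{2}\left[2\sigma^2(x) + 2\,\mathrm{Cov}\left(Q(x,\xi^j),\, Q(x,\bar{\xi}^j)\right)\right] \\
&= \frac{\sigma^2(x)}{N} + \frac{\mathrm{Cov}\left(Q(x,\xi^j),\, Q(x,\bar{\xi}^j)\right)}{N}
\end{aligned}
$$

The first term is the variance of the CMC estimator with $N$ samples, so the sign of the covariance alone decides whether AV helps.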
According to Equation (2.49), the variance reduction of AV depends on the term $\mathrm{Cov}\left(Q(x,\xi^j),\, Q(x,\bar{\xi}^j)\right)/N$. If this term is negative, the variance of the AV estimator is smaller than that of the CMC estimator; otherwise, it is at least as large. In other words, AV is useful if and only if $Q(x,\xi^j)$ and $Q(x,\bar{\xi}^j)$ are negatively correlated. This property has been shown to hold for two-stage stochastic linear programs in which the uncertainty occurs only in the right-hand side (RHS) (Higle (1998)). When the monotonicity of the objective function is lost, the variance of the AV estimator can be larger than that of the CMC estimator (Koivu (2005)).
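The effect of the covariance term can be seen in a small numerical experiment. The following is a minimal sketch, not an implementation from the cited works: the function `recourse` is a hypothetical monotone stand-in for $Q(x, \xi)$ at a fixed $x$, and the sample sizes, dimension and seed are arbitrary assumptions.

```python
import numpy as np


def recourse(xi):
    """Hypothetical monotone stand-in for Q(x, xi) at a fixed x (an assumption)."""
    return np.exp(xi).sum(axis=1)


def cmc_estimate(n, dim, rng):
    """Crude Monte Carlo: average over n independent uniform samples."""
    xi = rng.random((n, dim))
    return recourse(xi).mean()


def av_estimate(n, dim, rng):
    """AV estimator: n/2 uniform samples plus their mirrored images 1 - xi."""
    half = n // 2
    xi = rng.random((half, dim))
    xi_bar = 1.0 - xi
    return (recourse(xi).sum() + recourse(xi_bar).sum()) / (2 * half)


def compare(n=100, dim=3, reps=2000, seed=0):
    """Empirical variance of both estimators over independent replications."""
    rng = np.random.default_rng(seed)
    cmc = np.array([cmc_estimate(n, dim, rng) for _ in range(reps)])
    av = np.array([av_estimate(n, dim, rng) for _ in range(reps)])
    return cmc.var(), av.var()


if __name__ == "__main__":
    var_cmc, var_av = compare()
    print(f"CMC variance: {var_cmc:.3e}")
    print(f"AV variance:  {var_av:.3e}")
```

Because the stand-in function is monotone in every coordinate, the two members of each pair are negatively correlated and the AV variance should come out smaller. Replacing `recourse` with a non-monotone function, for example one that is symmetric around 0.5 in each coordinate, would illustrate the opposite case.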
2.8.2 Latin Hypercube Sampling
Stratified Sampling reduces the variance of the estimator by splitting the sample space $\Xi$ into $K$ strata and then generating samples from every stratum. The number of samples generated from each stratum is proportional to the probability of that stratum. Hence, the samples spread across the entire sample space. Latin Hypercube Sampling (LHS) (McKay et al. (1979)) applies this idea with $K = N$ equiprobable strata along each coordinate. In the two-dimensional case, the sample space is considered as a square grid and LHS places exactly one sample in each row and each column. In the multidimensional case, LHS places exactly one sample in each axis-aligned hyperplane. The variance of the LHS estimator has been shown to be bounded in terms of the variance of the CMC estimator (McKay et al. (1979)). In particular, their relationship is given as:

$$\mathrm{Var}_{LHS} \le \frac{N}{N-1}\,\mathrm{Var}_{CMC} \tag{2.50}$$

Since $\mathrm{Var}_{CMC}$ is proportional to $1/N$, (2.50) means that LHS with $N$ samples is never worse than CMC with $N - 1$ samples.
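The construction of an LHS sample on the unit hypercube can be sketched as follows. This is a minimal illustration under the assumption that every coordinate is uniform on $[0, 1)$ and is split into $N$ equal-width strata; the function names are hypothetical.

```python
import numpy as np


def latin_hypercube(n, dim, rng):
    """Return n points in [0, 1)^dim with one point per stratum in each coordinate."""
    # Uniform position of each point inside its stratum.
    jitter = rng.random((n, dim))
    # An independent random permutation of the n stratum indices per coordinate.
    strata = np.column_stack([rng.permutation(n) for _ in range(dim)])
    return (strata + jitter) / n


def strata_counts(points):
    """Count the samples falling in each stratum along each coordinate."""
    n, dim = points.shape
    idx = np.floor(points * n).astype(int)
    return np.stack([np.bincount(idx[:, d], minlength=n) for d in range(dim)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = latin_hypercube(8, 2, rng)
    print(sample)
    print(strata_counts(sample))
```

For an LHS sample, `strata_counts` should contain only ones, which is the "one sample per row and column" property described above. Samples for non-uniform distributions could be obtained by applying the inverse cumulative distribution function to each coordinate.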
LHS has also been widely used in stochastic optimization algorithms (Bailey et al. (1999), Shapiro et al. (2002), Linderoth et al. (2006), Homem-de-Mello et al. (2011), Freimer et al. (2012)). It has been proven that the convergence rate of stochastic optimization algorithms using LHS is never worse than that of algorithms using CMC (Owen (1992), Drew and Homem-de-Mello (2012)).
2.8.3 Quasi-Monte Carlo
Quasi-Monte Carlo (QMC) is considered one of the most popular VR methods (Niederreiter (1992), Lemieux (2009), Dick and Pillichshammer (2010)). The principle of QMC is that, instead of generating samples randomly from the uniform distribution on $[0,1]^{d_\xi}$, the samples are generated in a specific, deterministic way that yields a high-quality estimator. According to the Koksma-Hlawka inequality (Niederreiter (1992)), the quality of an estimator is determined by two quantities. The first is the quality of the generated samples, measured by the difference between their empirical distribution and the uniform distribution; this measure is known as the “star-discrepancy”, and good samples make it small. The second is the total variation of the integrand. A large number of research papers address the construction of low-discrepancy sequences (Bastin et al. (2006), Freimer et al. (2012)). Examples of such sequences are the Halton and Sobol’ sequences.
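For reference, the Koksma-Hlawka inequality can be stated as follows. The notation is introduced here for illustration and is an assumption of this sketch: $h$ is the integrand on the unit hypercube, $u^1, \dots, u^N$ are the sample points, $V(h)$ is the total variation of $h$ in the sense of Hardy and Krause, and $D_N^*$ is the star-discrepancy of the points.

$$
\left|\frac{1}{N}\sum_{j=1}^{N} h(u^j) - \int_{[0,1]^{d_\xi}} h(u)\,du\right| \;\le\; V(h)\, D_N^*\left(u^1,\dots,u^N\right)
$$

The bound separates the two quantities discussed above: $D_N^*$ depends only on the points, and $V(h)$ depends only on the integrand.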
In practice, the total variation of the integrand can be very difficult to calculate. To overcome this problem, some randomness is introduced into the generated QMC samples so that the error can be estimated using standard methods such as Multiple Independent Replications. Examples of such randomization methods are the Cranley-Patterson method and scrambling algorithms (e.g., the Owen scrambling algorithm for Sobol’ sequences and the Reverse-Radix 2 algorithm for Halton sequences). It has been shown that the sampling error of the randomized QMC estimator converges at the rate $O\left((\log N)^{0.5(d_\xi - 1)} / N^{1.5}\right)$ (Bastin et al. (2006), Freimer et al. (2012)). This error rate depends on the dimension of the random variables, showing that QMC may converge slowly for high-dimensional problems. To address this, one can determine the effective dimension of the problem and then apply QMC to these dimensions while applying CMC, LHS, or another efficient sampling method to the remaining dimensions (Owen (1998), Owen (2003)).
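The combination of a low-discrepancy sequence, a random shift, and independent replications can be sketched as follows. This is a minimal illustration only: it uses a plain Halton sequence with a Cranley-Patterson shift (no scrambling), the function names and the limit of six dimensions are assumptions, and the integrand passed in is a hypothetical stand-in.

```python
import numpy as np

# First primes used as Halton bases; supports up to six dimensions (an assumption).
PRIMES = (2, 3, 5, 7, 11, 13)


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, frac = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * frac
        frac /= base
    return result


def halton(n, dim):
    """First n points of the Halton sequence in [0, 1)^dim (index starts at 1)."""
    if dim > len(PRIMES):
        raise ValueError("this sketch supports at most 6 dimensions")
    return np.array([[radical_inverse(i, PRIMES[d]) for d in range(dim)]
                     for i in range(1, n + 1)])


def cranley_patterson(points, rng):
    """Shift all points by one uniform random vector, modulo 1."""
    shift = rng.random(points.shape[1])
    return (points + shift) % 1.0


def rqmc_estimate(func, n, dim, replications, rng):
    """Mean and standard error over independently shifted copies of the point set."""
    base = halton(n, dim)
    estimates = np.array([func(cranley_patterson(base, rng)).mean()
                          for _ in range(replications)])
    return estimates.mean(), estimates.std(ddof=1) / np.sqrt(replications)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean, stderr = rqmc_estimate(lambda u: u.sum(axis=1), 256, 2, 10, rng)
    print(f"estimate = {mean:.5f}, standard error = {stderr:.2e}")
```

Each shifted copy is still a low-discrepancy point set, but the estimates from different shifts are independent and unbiased, so their spread gives an error estimate without computing the total variation.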
In the context of stochastic optimization, QMC has been used for many years (Kalagnanam and Diwekar (1997), Drew and Homem-de-Mello (2006)). The convergence of QMC in stochastic optimization has been studied by Pennanen and Koivu (2005) and Koivu (2005). These papers show that QMC can significantly improve the convergence rate of stochastic optimization algorithms.
2.8.4 Importance Sampling
Importance Sampling (IS) increases the quality of the estimator by generating samples from a distribution different from the distribution of interest. The new distribution is called the importance sampling distribution. The quality improves because the samples generated from this distribution are concentrated in the regions that contribute the most to the estimator. The principle of IS can be explained as follows. We want to calculate the expected value of a given function $Q$:
$$E_f\{Q(\xi)\} = \int Q(\xi)\, f(\xi)\, d\xi \tag{2.51}$$
where $\xi \sim f$. Equation (2.51) is usually a very high-dimensional integral, so it is usually very difficult to evaluate. Using the Crude Monte Carlo (CMC) approach, Equation (2.51) can be approximated as:
$$E_f\{Q(\xi)\} \approx \frac{1}{n}\sum_{i=1}^{n} Q(\xi^i) \tag{2.52}$$
where $\xi^i \in \mathbb{R}^d$ are independent and identically distributed (iid) samples, assumed to be drawn efficiently from the distribution $f$. Although the CMC approximation is widely used in sampling methods due to its simple implementation, it has drawbacks. Firstly, it is not always possible to draw samples efficiently from the distribution $f$. Secondly, the variance of the estimator (2.52) is $\sigma^2(Q(\xi))/n$. This is usually large, making the error of the approximation large too. One possible solution is to take more samples in order to reduce the error. However, taking more samples means that the computation becomes more expensive or even intractable. It is therefore essential to find a way to reduce the variance while keeping the same number of samples. In other words, given the same number of samples, we want to find an alternative distribution, say $G$, that gives a much lower variance while correcting for the fact that we sample from $G$ instead of the original distribution $f$. The distribution $G$ is called the IS distribution, and its probability density function (pdf) is denoted by $g$. If we multiply and divide the integrand of Equation (2.51) by $g(\xi)$, the expected value remains the same:

$$E_f\{Q(\xi)\} = \int Q(\xi)\, f(\xi)\, d\xi = \int Q(\xi)\,\frac{f(\xi)}{g(\xi)}\, g(\xi)\, d\xi \tag{2.53}$$
under the condition that $g(\xi) = 0 \iff f(\xi) = 0$. Equation (2.53) shows that $E_f\{Q(\xi)\}$ can be approximated as:
$$E_f\{Q(\xi)\} \approx \frac{1}{n}\sum_{i=1}^{n} Q(\xi^i)\,\frac{f(\xi^i)}{g(\xi^i)} \tag{2.54}$$

where $\xi^i \sim g$. Define:

$$\lambda = \frac{f(\xi^i)}{g(\xi^i)} \tag{2.55}$$
λ is known as the likelihood ratio. This ratio is necessary for keeping the IS estimator unbiased.
The motivation for using Equation (2.54) instead of Equation (2.52) is that the variance of (2.52) is $\frac{1}{n}\sigma^2(Q(\xi))$ while the variance of (2.54) is $\frac{1}{n}\sigma^2\left(Q(\xi)\frac{f(\xi)}{g(\xi)}\right)$, where the latter variance is taken with respect to $g$. This suggests that if we can select a good $g(\xi)$, we can achieve a large reduction in the variance. Choosing a good IS distribution, however, is a challenging process that is difficult to generalize and has motivated many papers in the statistics and simulation literature. We refer the interested reader to Asmussen and Glynn (2007) for a review of IS.
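As a concrete instance of (2.54), the following sketch estimates a rare-event probability. All specifics are assumptions chosen for illustration and are not taken from the cited works: $\xi$ is a scalar with $f$ the standard normal density, $Q(\xi)$ is the indicator of $\xi > c$ with $c = 4$, and the IS density $g$ is a normal density with mean $c$ and unit variance.

```python
import math

import numpy as np


def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal distribution with the given mean."""
    return np.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def cmc_tail_probability(threshold, n, rng):
    """CMC estimate of P(xi > threshold) with xi drawn from f = N(0, 1)."""
    xi = rng.standard_normal(n)
    return np.mean(xi > threshold)


def is_tail_probability(threshold, n, rng):
    """IS estimate: draw xi from g = N(threshold, 1) and reweight by f / g."""
    xi = rng.normal(loc=threshold, scale=1.0, size=n)
    likelihood_ratio = normal_pdf(xi) / normal_pdf(xi, mean=threshold)
    return np.mean((xi > threshold) * likelihood_ratio)


def exact_tail_probability(threshold):
    """Closed-form value of P(xi > threshold) for a standard normal xi."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, n = 4.0, 20000
    print(f"exact: {exact_tail_probability(c):.3e}")
    print(f"CMC:   {cmc_tail_probability(c, n, rng):.3e}")
    print(f"IS:    {is_tail_probability(c, n, rng):.3e}")
```

With this sample size, CMC typically observes no or very few samples beyond the threshold, whereas about half of the samples drawn from $g$ land in the region of interest and the likelihood ratio corrects for the change of distribution. Shifting the mean to the threshold is only one simple choice of $g$ and is not claimed to be optimal.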
The IS estimator (2.54) maintains all of the attractive properties of the CMC estimator: it is unbiased; it is consistent, i.e., the approximate value converges to the true value with probability one as the number of samples $n$ goes to infinity; its variance goes to zero as $n$ goes to infinity; and its convergence rate is independent of the dimension of $\xi$.
IS has been used in stochastic optimization for a long time (Dantzig and Glynn (1990), Infanger (1992), Dantzig and Infanger (1993)). These papers have shown many advantages of IS when applied to stochastic optimization algorithms. For example, large-scale multistage portfolio optimization problems can be solved efficiently (Dantzig and Infanger (1993)). In addition, applying IS in stochastic optimization helps capture rare events (e.g., a power outage or a sudden rise in demand in the context of power generation and expansion planning for electric utilities) much more effectively than the CMC method (Infanger (1992)).