Communication-avoiding Krylov subspace methods in finite precision
Erin Carson, UC Berkeley
Advisor: James Demmel
Model Problem: 2D Poisson on grid, . Equilibration (diagonal scaling) used. RHS set s.t. elements of true solution .
Roundoff error can cause a decrease in attainable accuracy of Krylov subspace methods.
This effect can be worse for “communication-avoiding” ($s$-step) Krylov subspace methods – limits practical applicability!
Residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods.
We can extend this strategy to communication-avoiding variants. Accuracy can be improved for minimal performance cost!
What is communication?
• Algorithms have two costs: communication and computation
• Communication: moving data between levels of the memory hierarchy (sequential) or between processors (parallel)
[Figure: sequential communication and parallel communication]
• On modern computers, communication is expensive, computation is cheap
– Flop time << 1/bandwidth << latency
– Communication bottleneck is a barrier to achieving scalability
– Also expensive in terms of energy cost!
• Towards exascale: we must redesign algorithms to avoid communication
Communication bottleneck in KSMs
• A Krylov Subspace Method is a projection process onto the Krylov subspace
• Linear systems, eigenvalue problems, singular value problems, least squares, etc.
Projection process in each iteration:
1. Add a dimension to the Krylov subspace
• Sparse Matrix-Vector Multiplication (SpMV)
• Parallel: communicate vector entries with neighbors
• Sequential: read $A$/vectors from slow memory
2. Orthogonalize (with respect to some subspace)
• Inner products
• Parallel: global reduction
Dependencies between communication-bound kernels (SpMV, orthogonalize) in each iteration limit performance!
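Example: Classical conjugate gradient (CG) for solving $Ax = b$
Let $p_0 = r_0 = b - Ax_0$
for $m = 0, 1, \dots$ until convergence do
    $\alpha_m = r_m^T r_m / p_m^T A p_m$
    $x_{m+1} = x_m + \alpha_m p_m$
    $r_{m+1} = r_m - \alpha_m A p_m$
    $\beta_{m+1} = r_{m+1}^T r_{m+1} / r_m^T r_m$
    $p_{m+1} = r_{m+1} + \beta_{m+1} p_m$
end for
SpMVs and inner products require communication in each iteration!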
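To make the communication pattern of classical CG concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the code behind the slides: the operator is a 1D Poisson matrix tridiag(-1, 2, -1) rather than the 2D model problem, and the names `poisson1d_matvec`, `dot`, and `cg` are hypothetical helpers. Each iteration performs one SpMV and two inner products, which are the communication-bound kernels named above.

```python
import math

def poisson1d_matvec(v):
    """Apply tridiag(-1, 2, -1) to v; stands in for the SpMV."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out[i] = 2.0 * v[i] - left - right
    return out

def dot(u, v):
    """Inner product; in parallel this would be a global reduction."""
    return sum(a * b for a, b in zip(u, v))

def cg(matvec, b, tol=1e-10, max_iter=1000):
    """Classical CG for an SPD system, starting from x_0 = 0."""
    x = [0.0] * len(b)
    r = list(b)                       # r_0 = b - A x_0 = b
    p = list(r)                       # p_0 = r_0
    rr = dot(r, r)
    stop = tol * tol * dot(b, b)      # squared stopping threshold
    for m in range(max_iter):
        if rr <= stop:
            return x, m
        Ap = matvec(p)                            # SpMV
        alpha = rr / dot(p, Ap)                   # inner product
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rr_new = dot(r, r)                        # inner product
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

n = 50
b = poisson1d_matvec([1.0] * n)       # right-hand side for x_true = ones
x, iters = cg(poisson1d_matvec, b)
max_err = max(abs(xi - 1.0) for xi in x)
print(iters, max_err)
```

The stopping test uses the updated residual $r_m$, as is usual in practice, so it says nothing directly about the true residual $b - Ax_m$.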
Communication-avoiding CA-KSMs
• Krylov methods can be reorganized to reduce communication cost
• Many CA-KSMs ($s$-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR
Related references:
(Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), GMRES (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).
• Reduction in communication can translate to speedups on practical problems
• Recent results: x speedup with for CA-BICGSTAB in GMG bottom-solve;
Benchmark timing breakdown
• Plot: Net time spent on different operations over one MG solve using cores
• Hopper at NERSC (Cray XE6), 4 6-core Opteron chips per node, Gemini network, 3D torus
• 3D Helmholtz equation
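CA-CG overview
Starting at iteration $sk$ for $k \ge 0$, it can be shown that to compute the next $s$ steps (iterations $sk + j$ where $j \le s$),
$p_{sk+j},\ r_{sk+j},\ x_{sk+j} - x_{sk} \in \mathcal{K}_{s+1}(A, p_{sk}) + \mathcal{K}_s(A, r_{sk})$.
1. Compute basis matrix
$Y_k = [\rho_0(A) p_{sk}, \dots, \rho_s(A) p_{sk},\ \rho_0(A) r_{sk}, \dots, \rho_{s-1}(A) r_{sk}]$
of dimension $n$-by-$(2s+1)$, where $\rho_j$ is a polynomial of degree $j$. This gives the recurrence relation
$A \underline{Y}_k = Y_k B_k$,
where $B_k$ is $(2s+1)$-by-$(2s+1)$ and is 2x2 block diagonal with upper Hessenberg blocks, and $\underline{Y}_k$ is $Y_k$ with columns $s+1$ and $2s+1$ set to 0.
• Communication cost: Same latency cost as one SpMV using the matrix powers kernel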
The matrix powers kernel
Avoids communication:
• In serial, by exploiting temporal locality:
• Reading $A$, reading vectors
• In parallel, by doing only 1 ‘expand’ phase (instead of $s$).
• Requires sufficiently low ‘surface-to-volume’ ratio
Tridiagonal Example:
[Figure: sequential and parallel computation of $v$, $Av$, $A^2v$, $A^3v$ for a tridiagonal matrix. Parallel panel legend: black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies]
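The single 'expand' phase in the tridiagonal example can be sketched as follows. This is an illustrative sketch, not the kernel used in the talk: `powers_local` is a hypothetical helper that imitates one processor owning rows `lo..hi-1`. It fetches a ghost zone of width `s` once, then computes all `s` products without further communication. Values at the edge of the ghost zone become stale by one more entry per step, but the stale region never reaches the owned rows.

```python
def tridiag_matvec(v):
    """Apply tridiag(-1, 2, -1) to v with zero boundary values."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def powers_global(v, s):
    """Reference: [v, Av, ..., A^s v] computed with s dependent matvecs."""
    basis = [list(v)]
    for _ in range(s):
        basis.append(tridiag_matvec(basis[-1]))
    return basis

def powers_local(v, lo, hi, s):
    """Rows lo..hi-1 of [v, Av, ..., A^s v] after ONE ghost-zone fetch."""
    n = len(v)
    g_lo, g_hi = max(0, lo - s), min(n, hi + s)
    cur = v[g_lo:g_hi]                 # the only "communication" step
    rows = [v[lo:hi]]
    for _ in range(s):
        # Entries near a cut edge of the ghost zone are wrong after this
        # call, but the error moves inward by only one index per step.
        cur = tridiag_matvec(cur)
        rows.append(cur[lo - g_lo:hi - g_lo])
    return rows

v = [float((3 * i * i + 1) % 7) for i in range(20)]
s, lo, hi = 3, 8, 12
reference = powers_global(v, s)
local_rows = powers_local(v, lo, hi, s)
print(local_rows[s], reference[s][lo:hi])
```

The redundant work on ghost entries is the price of avoiding communication, which is why a low surface-to-volume ratio matters.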
2. Orthogonalize: Encode dot products between basis vectors by computing Gram matrix $G_k = Y_k^T Y_k$ of dimension $(2s+1)$-by-$(2s+1)$ (or compute Tall-Skinny QR)
• Communication cost: that of one global reduction
3. Perform $s$ iterations of updates
• Using $B_k$ and $G_k$, this requires no communication
• Represent $n$-vectors by their length-$(2s+1)$ coordinates in $Y_k$
• Perform $s$ iterations of updates on coordinate vectors
4. Basis change to recover CG vectors
• Compute $[x_{sk+s} - x_{sk},\ r_{sk+s},\ p_{sk+s}] = Y_k [x'_{k,s},\ r'_{k,s},\ p'_{k,s}]$ locally, no communication
[Side-by-side listings: CG and CA-CG. In each outer loop, CA-CG computes $Y_k$ such that $A\underline{Y}_k = Y_k B_k$, computes $G_k = Y_k^T Y_k$, runs an inner loop of $s$ coordinate updates, then recovers the CG vectors as in step 4]
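The reason step 3 needs no communication is that an inner product of two vectors lying in the span of the basis can be read off the small Gram matrix: $(Y u')^T (Y w') = u'^T G\, w'$. The sketch below checks this identity numerically. It is an illustration under assumptions: a monomial basis, a tridiagonal stand-in operator, and arbitrary vectors `p`, `r` and coordinate vectors `u_c`, `w_c` chosen only for the demonstration.

```python
import math

def tridiag_matvec(v):
    """Apply tridiag(-1, 2, -1) to v with zero boundary values."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(cols, coeffs):
    """Return Y @ coeffs for a basis stored as a list of columns."""
    n = len(cols[0])
    return [sum(c * col[i] for c, col in zip(coeffs, cols)) for i in range(n)]

s, n = 2, 30
p = [math.sin(0.3 * i) for i in range(n)]   # illustrative vectors only
r = [math.cos(0.7 * i) for i in range(n)]

# Monomial basis Y = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]
Y = [p]
for _ in range(s):
    Y.append(tridiag_matvec(Y[-1]))
Y.append(r)
for _ in range(s - 1):
    Y.append(tridiag_matvec(Y[-1]))

# Gram matrix: the one global reduction per outer loop
G = [[dot(yi, yj) for yj in Y] for yi in Y]

u_c = [1.0, -0.5, 0.25, 2.0, 0.0]           # length 2s+1 coordinates
w_c = [0.0, 1.0, 0.0, -1.0, 0.5]

direct = dot(combine(Y, u_c), combine(Y, w_c))       # length-n inner product
via_gram = sum(u_c[i] * G[i][j] * w_c[j]
               for i in range(len(Y)) for j in range(len(Y)))
print(direct, via_gram)
```

In exact arithmetic the two numbers agree; in finite precision they differ by rounding, and that difference grows with the conditioning of the basis.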
Krylov methods in finite precision
• Finite precision errors have effects on algorithm behavior (known to
Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for in-depth survey)
• Lanczos:
• Basis vectors lose orthogonality
• Appearance of multiple Ritz approximations to some eigenvalues
• Delays convergence of Ritz values to other eigenvalues
• CG:
• Residual vectors (proportional to Lanczos basis vectors) lose orthogonality
• Appearance of duplicate Ritz values delays convergence of CG approximate solution
• Residual vectors and true residuals deviate
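Loss of orthogonality in Lanczos is easy to observe numerically. The following sketch runs the three-term recurrence without reorthogonalization and reports $\max_{i \ne j} |v_i^T v_j|$, which would be zero in exact arithmetic. The diagonal test matrix, its eigenvalues, and the helper names are assumptions made for the illustration; they are not taken from the talk.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(diag, v0, m):
    """Three-term Lanczos on diag(diag), no reorthogonalization."""
    nrm = math.sqrt(dot(v0, v0))
    v = [a / nrm for a in v0]
    v_prev = [0.0] * len(v)
    beta = 0.0
    V, alphas = [v], []
    for _ in range(m):
        w = [d * vi - beta * pi for d, vi, pi in zip(diag, v, v_prev)]
        alpha = dot(v, w)
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        alphas.append(alpha)
        if beta == 0.0:               # breakdown: Krylov space exhausted
            break
        v_prev, v = v, [wi / beta for wi in w]
        V.append(v)
    return V, alphas

def orthogonality_loss(V):
    """max |v_i^T v_j| over i != j; zero in exact arithmetic."""
    return max(abs(dot(V[i], V[j]))
               for i in range(len(V)) for j in range(i))

# Hypothetical spectrum with well-separated large eigenvalues
eigs = [1.0 + 0.1 * i * i for i in range(50)]
V, alphas = lanczos(eigs, [1.0] * len(eigs), 40)

loss_short = orthogonality_loss(V[:6])   # first few vectors only
loss_full = orthogonality_loss(V)        # all computed vectors
print(loss_short, loss_full)
```

Comparing the two printed values shows how orthogonality among the first few vectors compares with orthogonality across the whole computed basis.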
CA-Krylov methods in finite precision
• CA-KSMs mathematically equivalent to classical KSMs
• But convergence delay and loss of accuracy get worse with increasing $s$!
• Obstacle to solving practical problems
• Decrease in attainable accuracy → some problems that KSM can solve can’t be solved with CA variant
• Delay of convergence → if # iterations increases more than time per iteration decreases due to CA techniques, no speedup expected!
This raises the questions…
For CA-KSMs,
• How bad can the effects of roundoff error be?
• And what can we do about it?
For solving linear systems:
• Bound on maximum attainable accuracy for CG and CA-BICG
• Residual replacement strategy: uses bound to improve accuracy
For eigenvalue problems:
• Extension of Chris Paige’s analysis of Lanczos to CA variants:
• Bounds on local rounding errors in CA-Lanczos
• Bounds on accuracy and convergence of Ritz values
• Loss of orthogonality → convergence of Ritz values
• Some ideas on how to use this new analysis…
• Basis orthogonalization
• Dynamically select/change basis size ($s$ parameter)
• Mixed precision variants
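Maximum attainable accuracy of CG
• In classical CG, iterates are updated by $x_{m+1} = x_m + \alpha_m p_m$ and $r_{m+1} = r_m - \alpha_m A p_m$
• Formulas for $x_m$ and $r_m$ do not depend on each other - rounding errors cause the true residual, $b - Ax_m$, and the updated residual, $r_m$, to deviate
• The size of the true residual is bounded by $\|b - Ax_m\| \le \|r_m\| + \|b - Ax_m - r_m\|$
• When $\|r_m\| \gg \|b - Ax_m - r_m\|$, $\|r_m\|$ and $\|b - Ax_m\|$ have similar magnitude
• When $\|r_m\| \to 0$, $\|b - Ax_m\|$ depends on $\|b - Ax_m - r_m\|$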
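The deviation between the true and updated residuals can be tracked directly by recomputing $b - Ax_m$ alongside the recurrence. The sketch below does this for classical CG on a small 1D Poisson system; the problem, the iteration count, and the helper `cg_residual_gap` are assumptions for illustration. Recomputing the true residual costs an extra SpMV per iteration, so this is a diagnostic, not something to do in a production solver.

```python
import math

def poisson1d_matvec(v):
    """Apply tridiag(-1, 2, -1) to v with zero boundary values."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cg_residual_gap(matvec, b, num_iters):
    """Run CG and record (||r_m||, ||b - A x_m||, ||b - A x_m - r_m||)."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    history = []
    for _ in range(num_iters):
        Ap = matvec(p)
        pAp = dot(p, Ap)
        if rr == 0.0 or pAp == 0.0:   # guard against underflow/breakdown
            break
        alpha = rr / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]   # updated residual
        true_r = [bi - ai for bi, ai in zip(b, matvec(x))]
        gap = [ti - ri for ti, ri in zip(true_r, r)]
        history.append((norm(r), norm(true_r), norm(gap)))
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return history

n = 50
b = poisson1d_matvec([1.0] * n)
history = cg_residual_gap(poisson1d_matvec, b, 100)
updated, true, gap = history[-1]
print(updated, true, gap)
```

Once the updated residual norm falls below the gap, the true residual can no longer follow it, which is the stagnation the bound above describes.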
Example: Comparison of convergence of true and updated residuals for CG vs. CA-CG using a monomial basis, for various $s$ values
Model problem (2D Poisson on grid)
• Better conditioned polynomial bases can be used instead of monomial.
• Two common choices: Newton and Chebyshev - see, e.g., (Philippe and Reichel, 2012).
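A rough way to see why the monomial basis is poorly conditioned is to compare column norms: the ratio of the largest to the smallest column norm is a lower bound on the condition number of the basis matrix. The sketch below builds a monomial and a shifted-and-scaled Chebyshev basis for a 1D Poisson operator. The starting vector, the spectral interval $[0, 4]$ (a Gershgorin estimate), and all helper names are assumptions for the illustration. Note that a small ratio does not by itself prove good conditioning.

```python
import math

def poisson1d_matvec(v):
    """Apply tridiag(-1, 2, -1) to v with zero boundary values."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def monomial_basis(matvec, v, s):
    """Columns v, Av, ..., A^s v."""
    cols = [list(v)]
    for _ in range(s):
        cols.append(matvec(cols[-1]))
    return cols

def chebyshev_basis(matvec, v, s, lam_min, lam_max):
    """Columns T_j((A - c I)/d) v, j = 0..s, via the three-term recurrence."""
    c = 0.5 * (lam_max + lam_min)
    d = 0.5 * (lam_max - lam_min)

    def shifted(u):                    # ((A - c I) / d) u
        return [(ai - c * ui) / d for ai, ui in zip(matvec(u), u)]

    cols = [list(v)]
    if s >= 1:
        cols.append(shifted(cols[0]))
    for _ in range(2, s + 1):
        t = shifted(cols[-1])
        cols.append([2.0 * ti - pi for ti, pi in zip(t, cols[-2])])
    return cols

def norm_growth(cols):
    """max/min column norm: a lower bound on the basis condition number."""
    norms = [norm(c) for c in cols]
    return max(norms) / min(norms)

n, s = 50, 8
v = [1.0] + [0.0] * (n - 1)            # hypothetical starting vector
mono = monomial_basis(poisson1d_matvec, v, s)
cheb = chebyshev_basis(poisson1d_matvec, v, s, 0.0, 4.0)
growth_mono = norm_growth(mono)
growth_cheb = norm_growth(cheb)
print(growth_mono, growth_cheb)
```

The Chebyshev polynomials stay bounded by 1 on the chosen interval, so their columns cannot grow, while the monomial columns grow roughly geometrically with the degree.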
Residual replacement strategy for CG
• van der Vorst and Ye (1999): Improve accuracy by replacing updated residual $r_m$ by the true residual $b - Ax_m$ in certain iterations, combined with group update.
• Choose when to replace $r_m$ with $b - Ax_m$ to meet two constraints:
1. Replace often enough so that at termination, $\|b - Ax_m - r_m\|$ is small relative to $\varepsilon N \|A\| \|x_m\|$
2. Don’t replace so often that original convergence mechanism of updated residuals is destroyed (avoid large perturbations to finite precision CG recurrence)
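Residual replacement condition for CG
Tong and Ye (2000): In finite precision classical CG, in iteration $m$, the computed residual and search direction vectors satisfy
$\hat{r}_m = \hat{r}_{m-1} - \alpha_{m-1} A \hat{p}_{m-1} + \eta_m$, $\quad \hat{p}_m = \hat{r}_m + \beta_m \hat{p}_{m-1} + \tau_m$
Then in matrix form,
$A Z_m = Z_m T_m - \frac{1}{\alpha'_m} \frac{\hat{r}_{m+1}}{\|\hat{r}_0\|} e_{m+1}^T + F_m$ with $Z_m = \left[ \frac{\hat{r}_0}{\|\hat{r}_0\|}, \dots, \frac{\hat{r}_m}{\|\hat{r}_m\|} \right]$
where $T_m$ is invertible and tridiagonal, and $F_m = [f_0, \dots, f_m]$, with
$f_m = \frac{A \tau_m}{\|\hat{r}_m\|} + \frac{1}{\alpha_m} \frac{\eta_{m+1}}{\|\hat{r}_m\|} - \frac{\beta_m}{\alpha_{m-1}} \frac{\eta_m}{\|\hat{r}_m\|}$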
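Residual replacement condition for CG
Tong and Ye (2000): If sequence $\hat{r}_m$ satisfies the recurrence above, then
$\|\hat{r}_{m+1}\| \le (1 + \mathcal{K}_m) \min_{\rho \in \mathcal{P}_m,\ \rho(0)=1} \|\rho(A + \Delta A_m)\, \hat{r}_1\|$,
where $\mathcal{K}_m$ and $\Delta A_m$ depend on $Z_m$, $T_m$, and $F_m$.
• As long as $\hat{r}_m$ satisfies recurrence, bound on its norm holds regardless of how it is generated
• Can replace $\hat{r}_m$ by the true residual and still expect convergence whenever the resulting perturbation $\eta_m$ is not too large relative to $\|\hat{r}_m\|$ and $\|\hat{r}_{m-1}\|$ (van der Vorst & Ye, 1999).
Residual replacement strategy for CG
• (van der Vorst and Ye, 1999): Use computable bound for $\|b - Ax_m - r_m\|$ to update $d_m$, an estimate of the error in computing $r_m$, in each iteration
• Set threshold $\hat{\varepsilon}$, replace whenever $d_m / \|r_m\|$ reaches threshold
• Assuming a bound on condition number of $A$, if updated residual converges to $O(\varepsilon)\|A\|\|x\|$, true residual reduced to $O(\varepsilon)\|A\|\|x\|$
Pseudo-code for residual replacement with group update for CG:
[Pseudo-code not recovered; the replacement step is guarded by an if ... end test and performs a group update of approximate solution]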
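Since the pseudo-code itself did not survive, here is a simplified sketch of residual replacement with group update for classical CG, written in the spirit of van der Vorst and Ye (1999). Several details are assumptions and should be checked against the paper before use: the threshold $\hat{\varepsilon} = \sqrt{\varepsilon}$, the error-estimate update $d \leftarrow d + \varepsilon (N \|A\| \|x\| + \|r\|)$, the safeguard factor 1.1, and the caller-supplied `norm_A` and `nnz_per_row`. The 1D Poisson test problem is also only illustrative.

```python
import math
import sys

EPS = sys.float_info.epsilon           # machine epsilon for doubles

def poisson1d_matvec(v):
    """Apply tridiag(-1, 2, -1) to v with zero boundary values."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cg_with_replacement(matvec, b, norm_A, nnz_per_row,
                        tol=1e-12, max_iter=1000):
    n = len(b)
    eps_hat = math.sqrt(EPS)           # assumed replacement threshold
    z = [0.0] * n                      # accumulated solution (group update)
    x = [0.0] * n                      # update since the last replacement
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    r_norm = math.sqrt(rr)
    b_norm = r_norm
    d = d_init = EPS * r_norm          # error estimate, x = z = 0 initially
    replaced_at = []
    for m in range(1, max_iter + 1):
        if r_norm <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rr_new = dot(r, r)
        r_norm_new = math.sqrt(rr_new)
        d_new = d + EPS * (nnz_per_row * norm_A * norm(x) + r_norm_new)
        if (d <= eps_hat * r_norm and d_new > eps_hat * r_norm_new
                and d_new > 1.1 * d_init):
            z = [zi + xi for zi, xi in zip(z, x)]      # group update
            x = [0.0] * n
            r = [bi - ai for bi, ai in zip(b, matvec(z))]  # true residual
            rr_new = dot(r, r)
            r_norm_new = math.sqrt(rr_new)
            d_new = d_init = EPS * (nnz_per_row * norm_A * norm(z)
                                    + r_norm_new)
            replaced_at.append(m)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr, r_norm, d = rr_new, r_norm_new, d_new
    return [zi + xi for zi, xi in zip(z, x)], replaced_at

n = 50
b = poisson1d_matvec([1.0] * n)
solution, replaced_at = cg_with_replacement(poisson1d_matvec, b,
                                            norm_A=4.0, nnz_per_row=3)
final_res = norm([bi - ai for bi, ai in zip(b, poisson1d_matvec(solution))])
print(replaced_at, final_res)
```

The group update keeps the accumulated solution `z` separate from the small recent update `x`, so that rounding errors in forming $b - Az$ do not scale with the size of every intermediate iterate.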
Sources of roundoff error in CA-CG
• Computing the $s$-step Krylov basis: error in computing $s$-step basis
• Updating coordinate vectors in the inner loop: error in updating coefficient vectors
• Recovering CG vectors for use in next outer loop: error in basis change
Maximum attainable accuracy of CA-CG
• We can write the deviation of the true and updated residuals in terms of these errors
A computable bound
• The bound is updated one way when $j = s$ (end of the inner loop) and another way otherwise
• Cost annotations on its terms: estimated only once; flops per $s$ iterations, no communication; flops per $s$ iterations, 1 reduction per $s$ iterations
• Extra computation all lower order terms, communication increased by at most factor of 2.
Residual replacement for CA-CG
• We extend van der Vorst and Ye’s residual replacement strategy to CA-CG
• Making use of the bound for the deviation of the true and updated residuals in CA-CG, update error estimate
• Use the same replacement condition as van der Vorst and Ye (1999)
Pseudo-code for residual replacement with group update for CA-CG:
[Pseudo-code not recovered; on replacement (if ... end): break from inner loop and begin new outer loop]
Residual Replacement Indices
         CACG Mono.           CACG Newt.   CACG Cheb.   CG
s=4      354                  353          365          355
s=8      224, 334, 401, 517   340          353
s=12     135, 2119            326          346
(the CG value does not depend on $s$)
• # replacements small compared to total
Total Number of Reductions
         CACG Mono.   CACG Newt.   CACG Cheb.   CG
s=4      203          196          197          669
s=8      157          102          99
(the CG value does not depend on $s$)
• In addition to attainable accuracy,
[Figure: convergence before and after residual replacement. # replacements for $s = 4$: Class. 2, M 1, N 1, C 1; for $s = 8$: Class. 2, M 5, N 2]
Preconditioning for CA-KSMs
• Tradeoff: speed up convergence, but increase time per iteration due to communication!
• For each specific app, must evaluate tradeoff between preconditioner quality and sparsity of the system
• Good news: many preconditioners allow communication-avoiding approach
• Block Jacobi – block diagonals
• Sparse Approx. Inverse (SAI) – same sparsity as $A$; recent work for CA-BICGSTAB by Mehri (2014)
• Polynomial preconditioning (Saad, 1985)
• HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014)
• CA-ILU(0) – recent work by Moufawad and Grigori (2013)
• Deflation for CA-CG (C., Knight, Demmel, 2014), based on Deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014)
Deflated CA-CG, model problem
[Figure: convergence of deflated CA-CG, monomial basis. Matrix: 2D Laplacian(512). Right-hand side chosen to give a prescribed true solution]
Eigenvalue problems: CA-Lanczos
[Figures: CA-Lanczos results. Problem: 2D Poisson with random starting vector]
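Paige’s Lanczos convergence analysis
Classic Lanczos rounding error result of Paige (1976): bounds on the local rounding errors, stated in terms of
$\varepsilon_0 = O(\varepsilon n)$, $\quad \varepsilon_1 = O(\varepsilon N \theta)$
These results form the basis for Paige’s influential results in (Paige, 1980).
CA-Lanczos convergence analysis
For CA-Lanczos, we have bounds of the same form with different $\varepsilon_0$ (vs. $O(\varepsilon n)$ for Lanczos) and $\varepsilon_1$ (vs. $O(\varepsilon N \theta)$ for Lanczos)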
The amplification term
• Our definition of amplification term before was a worst-case one: we want the bound to hold for the computed basis and any coordinate vector in every iteration.
• Better, more descriptive estimate for the amplification term possible w/ tighter bounds; requires some light bookkeeping
• Example: for bounds on $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$, we can use a more specific definition
[Figures: measured value vs. upper bound for $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$; one set of panels uses the upper bound of Paige (1976). Problem: 2D Poisson with random starting vector]
Paige’s results for classical Lanczos
• Using bounds on local rounding errors in Lanczos, Paige showed that
1. The computed Ritz values always lie between the extreme eigenvalues of $A$ to within a small multiple of machine precision.
2. At least one small interval containing an eigenvalue of $A$ is found by the $n$th iteration.
3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
Results for CA-Lanczos
• Do these results carry over to CA-Lanczos? The answer is YES!
• Only if:
• the computed $s$-step basis is numerically full rank in every outer iteration, and
• the amplification term is sufficiently small
• Otherwise, e.g., the Lanczos vectors can lose orthogonality due to computation with rank-deficient basis
• How can we use this bound on the amplification term to design a better algorithm?
Ideas based on CA-Lanczos analysis
• Explicit basis orthogonalization
• Compute TSQR on $Y_k$ in outer loop, use $Q$ factor as basis
• Get an orthonormal basis for only one reduction
• Dynamically changing basis size
• Run incremental condition estimate when computing $s$-step basis; stop when the bound is reached. Use the basis of this size in this outer loop.
• Mixed precision
• For inner products, use precision such that