ICMESeminar_short.pptx

(1)

Communication-avoiding

Krylov subspace methods

in finite precision

Erin Carson, UC Berkeley

Advisor: James Demmel

(2)

Model Problem: 2D Poisson on grid, . Equilibration (diagonal scaling) used. RHS set s.t. elements of true solution .

Roundoff error can cause a decrease in attainable accuracy of Krylov subspace methods.

This effect can be worse for “communication-avoiding” (s-step) Krylov subspace methods, which limits their practical applicability!

Residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods

We can extend this strategy to communication-avoiding variants. Accuracy can be improved for minimal performance cost!

(3)

What is communication?

• Algorithms have two costs: communication and computation

• Communication: moving data between levels of the memory hierarchy (sequential) or between processors (parallel)

sequential comm.

parallel comm.

• On modern computers, communication is expensive, computation is cheap

– Flop time << 1/bandwidth << latency

– Communication bottleneck a barrier to achieving scalability

– Also expensive in terms of energy cost!

• Towards exascale: we must redesign algorithms to avoid communication

(4)

Communication bottleneck in KSMs

• A Krylov Subspace Method is a projection process onto the Krylov subspace

• Linear systems, eigenvalue problems, singular value problems, least squares, etc.

Projection process in each iteration:

1. Add a dimension to the Krylov subspace

• Sparse Matrix-Vector Multiplication (SpMV)

• Parallel: comm. vector entries w/ neighbors

• Sequential: read A/vectors from slow memory

2. Orthogonalize (with respect to some subspace)

• Inner products

• Parallel: global reduction

Dependencies between communication-bound kernels in each iteration limit performance

[Diagram: SpMV, orthogonalize]

(5)

Example: Classical conjugate gradient (CG) for solving Ax = b

SpMVs and inner products require communication in each iteration!

Let

for until convergence do

end for
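A minimal sketch of textbook CG in Python/NumPy, meant only to show where the SpMV and the inner products sit in each iteration. It assumes A is symmetric positive definite; the function names, the stopping rule (relative residual below `tol`) and the 1D Poisson test matrix are illustrative choices, not taken from the slide.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite matrix A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rr = r @ r                     # inner product
    b_norm = np.linalg.norm(b)
    iters = 0
    while iters < max_iter and np.sqrt(rr) > tol * b_norm:
        Ap = A @ p                 # SpMV: neighbor communication in parallel
        alpha = rr / (p @ Ap)      # inner product: global reduction in parallel
        x = x + alpha * p
        r = r - alpha * Ap         # recursively updated residual
        rr_new = r @ r             # inner product: global reduction in parallel
        p = r + (rr_new / rr) * p
        rr = rr_new
        iters += 1
    return x, iters

def poisson_1d(n):
    """Dense 1D Poisson matrix tridiag(-1, 2, -1), used here only as a small test case."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

A = poisson_1d(50)
x_true = np.ones(50)
b = A @ x_true
x, iters = conjugate_gradient(A, b)
```

Each pass through the loop performs one SpMV and two inner products, which is the per-iteration communication pattern the slide refers to.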

(6)

Communication-avoiding KSMs (CA-KSMs)

• Krylov methods can be reorganized to reduce communication cost by a factor of O(s)

• Many CA-KSMs (s-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR

Related references:

(Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), GMRES (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).

• Reduction in communication can translate to speedups on practical problems

• Recent results: x speedup with for CA-BICGSTAB in GMG bottom-solve;

(7)

Benchmark timing breakdown

• Plot: Net time spent on different operations over one MG solve using cores

• Hopper at NERSC (Cray XE6), 4 6-core Opteron chips per node, Gemini network, 3D torus

• 3D Helmholtz equation

(8)

CA-CG overview

Starting at iteration for , it can be shown that to compute the next steps (iterations where ),

.

1. Compute basis matrix

of dimension -by-, where is a polynomial of degree . This gives the recurrence relation

,

where is -by- and is 2x2 block diagonal with upper Hessenberg blocks, and is with columns and set to 0.

• Communication cost: Same latency cost as one SpMV using the matrix powers kernel
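For orientation, the usual s-step CG formulation from the literature can be written as below. This is a sketch under the assumption that the slide follows that standard formulation; the symbols $\rho_j$, $Y_k$, $\underline{Y}_k$ and $B_k$ are the conventional names and are assumptions here.

```latex
\begin{aligned}
p_{sk+j},\; r_{sk+j},\; x_{sk+j}-x_{sk}
  &\in \mathcal{K}_{s+1}(A,p_{sk}) + \mathcal{K}_{s}(A,r_{sk}),
  \qquad 0 \le j \le s, \\
Y_k &= \bigl[\rho_0(A)p_{sk},\dots,\rho_s(A)p_{sk},\;
             \rho_0(A)r_{sk},\dots,\rho_{s-1}(A)r_{sk}\bigr]
  \in \mathbb{R}^{n\times(2s+1)}, \\
A\,\underline{Y}_k &= Y_k B_k,
  \qquad B_k \in \mathbb{R}^{(2s+1)\times(2s+1)},
\end{aligned}
```

Here $\rho_j$ is a polynomial of degree $j$, and $\underline{Y}_k$ denotes $Y_k$ with columns $s+1$ and $2s+1$ set to 0, since multiplying those two columns by $A$ would leave the span of the basis.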

(9)

The matrix powers kernel

Avoids communication:

• In serial, by exploiting temporal locality:

• Reading , reading

• In parallel, by doing only 1 ‘expand’ phase (instead of s).

• Requires sufficiently low ‘surface-to-volume’ ratio

Tridiagonal Example (figure): sequential and parallel computation of v, Av, A^2 v, A^3 v

black = local elements

red = 1-level dependencies

green = 2-level dependencies

blue = 3-level dependencies
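The tridiagonal example can be sketched in Python/NumPy as follows. The idea illustrated is that a processor owning entries `lo..hi-1` can compute its part of v, Av, ..., A^s v after fetching only `s` ghost entries on each side in a single 'expand' phase. All function names and the storage layout (three separate diagonals) are assumptions made for illustration.

```python
import numpy as np

def tridiag_matvec(lower, diag, upper, v):
    """y = A v for a tridiagonal A stored as three diagonals (A is never formed)."""
    y = diag * v
    y[1:] += lower * v[:-1]    # subdiagonal: A[i+1, i] = lower[i]
    y[:-1] += upper * v[1:]    # superdiagonal: A[i, i+1] = upper[i]
    return y

def monomial_basis(lower, diag, upper, v, s):
    """Return the n-by-(s+1) matrix [v, Av, ..., A^s v]."""
    V = np.empty((v.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = tridiag_matvec(lower, diag, upper, V[:, j])
    return V

def local_powers(lower, diag, upper, v, lo, hi, s):
    """Rows lo..hi-1 of [v, Av, ..., A^s v], using only v[lo-s : hi+s]."""
    n = v.shape[0]
    g_lo, g_hi = max(0, lo - s), min(n, hi + s)   # ghost region: one 'expand' phase
    V = monomial_basis(lower[g_lo:g_hi - 1], diag[g_lo:g_hi],
                       upper[g_lo:g_hi - 1], v[g_lo:g_hi], s)
    return V[lo - g_lo:hi - g_lo, :]              # entries near the cut are discarded

n, s = 20, 3
lower = -np.ones(n - 1)
diag = 2.0 * np.ones(n)
upper = -np.ones(n - 1)
v = np.random.default_rng(0).standard_normal(n)

full = monomial_basis(lower, diag, upper, v, s)
local = local_powers(lower, diag, upper, v, lo=5, hi=10, s=s)
```

Errors introduced by cutting the matrix at the ghost boundary move inward by one entry per multiplication, so after `s` multiplications the owned rows are still exact. The price is redundant work on the ghost entries, which is why a low surface-to-volume ratio is needed.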

(10)

2. Orthogonalize: Encode dot products between basis vectors by computing Gram matrix of dimension -by- (or compute Tall-Skinny QR)

• Communication cost: that of one global reduction

3. Perform s iterations of updates

• Using and , this requires no communication

• Represent -vectors by their length- coordinates in

• Perform s iterations of updates on coordinate vectors

4. Basis change to recover CG vectors

• Compute locally, no communication
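Steps 2 and 3 rest on two identities that can be checked numerically: inner products of vectors in the span of the basis reduce to small products with the Gram matrix, and multiplication by A reduces to a small matrix acting on the coordinates. The sketch below uses a single monomial basis [v, Av, ..., A^s v] rather than the full CA-CG basis, and the names `Y`, `G`, `B` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 4
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
v = rng.standard_normal(n)

# Monomial basis Y = [v, Av, ..., A^s v]
Y = np.empty((n, s + 1))
Y[:, 0] = v
for j in range(s):
    Y[:, j + 1] = A @ Y[:, j]

G = Y.T @ Y                          # Gram matrix: the only global reduction
B = np.diag(np.ones(s), k=-1)        # shift matrix: A * (column j) = column j+1

a = rng.standard_normal(s + 1)
a[-1] = 0.0                          # A*(Y a) stays in span(Y) only if the last coordinate is 0
c = rng.standard_normal(s + 1)
u, w = Y @ a, Y @ c                  # length-n vectors represented by coordinates a, c

dot_direct = u @ w                   # length-n inner product
dot_coords = a @ G @ c               # same value from length-(s+1) coordinates
Au_direct = A @ u                    # SpMV on the length-n vector
Au_coords = Y @ (B @ a)              # same vector from a small matrix-vector product
```

In the inner loop only `a`, `c`, `G` and `B` are touched, so no communication is needed until the basis change back to length-n vectors.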

(11)

Compute such that Compute

$[x_{sk+s} - x_{sk},\ r_{sk+s},\ p_{sk+s}] = Y_k\,[x'_{k,s},\ r'_{k,s},\ p'_{k,s}]$

for do for do

end for end for

CG

CA-CG

(12)

Krylov methods in finite precision

• Finite precision errors have effects on algorithm behavior (known to Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for in-depth survey)

• Lanczos:

• Basis vectors lose orthogonality

• Appearance of multiple Ritz approximations to some eigenvalues

• Delays convergence of Ritz values to other eigenvalues

• CG:

• Residual vectors (proportional to Lanczos basis vectors) lose orthogonality

• Appearance of duplicate Ritz values delays convergence of CG approximate solution

• Residual vectors and true residuals deviate

(13)

CA-Krylov methods in finite precision

• CA-KSMs mathematically equivalent to classical KSMs

• But convergence delay and loss of accuracy get worse with increasing s!

• Obstacle to solving practical problems

• Decrease in attainable accuracy → some problems that KSM can solve can’t be solved with CA variant

• Delay of convergence → if # iterations increases more than time per iteration decreases due to CA techniques, no speedup expected!

(14)

This raises the questions…

For CA-KSMs,

• How bad can the effects of roundoff error be?

• And what can we do about it?

For solving linear systems:

• Bound on maximum attainable accuracy for CG and CA-BICG

• Residual replacement strategy: uses bound to improve attainable accuracy

(15)

This raises the questions…

For CA-KSMs,

• How bad can the effects of roundoff error be?

• And what can we do about it?

For eigenvalue problems:

• Extension of Chris Paige’s analysis of Lanczos to CA variants:

• Bounds on local rounding errors in CA-Lanczos

• Bounds on accuracy and convergence of Ritz values

• Loss of orthogonality → convergence of Ritz values

• Some ideas on how to use this new analysis…

• Basis orthogonalization

• Dynamically select/change basis size (s parameter)

• Mixed precision variants

(16)

Maximum attainable accuracy of CG

• In classical CG, iterates are updated by and

• Formulas for and do not depend on each other - rounding errors cause the true residual, , and the updated residual, , to deviate

• The size of the true residual is bounded by

• When , and have similar magnitude

• When , depends on
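The deviation between the true and updated residuals can be observed directly. The sketch below assumes the textbook CG updates x ← x + αp and r ← r − αAp (the usual form; the exact indexing on the slide is not recoverable) and records both ‖r‖ and ‖b − Ax‖ in every iteration. The small 2D Poisson matrix, the random solution and the iteration count are illustrative choices.

```python
import numpy as np

def poisson_2d(m):
    """Dense 2D Poisson matrix on an m-by-m grid (5-point stencil)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

def cg_residual_history(A, b, num_iters):
    """Run CG and record updated vs. true residual norms per iteration."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    updated, true = [], []
    for _ in range(num_iters):
        Ap = A @ p
        pAp = p @ Ap
        if rr == 0.0 or pAp == 0.0:      # guard against underflow after convergence
            break
        alpha = rr / pAp
        x = x + alpha * p                # solution update
        r = r - alpha * Ap               # updated residual: never looks at x again
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
        updated.append(np.linalg.norm(r))
        true.append(np.linalg.norm(b - A @ x))   # true residual, computed explicitly
    return x, r, np.array(updated), np.array(true)

A = poisson_2d(8)
x_true = np.random.default_rng(0).standard_normal(A.shape[0])
b = A @ x_true
x, r, updated_hist, true_hist = cg_residual_history(A, b, num_iters=80)
gap = np.linalg.norm((b - A @ x) - r)    # deviation of true and updated residual
```

Early on the two histories agree closely. Late in the run the updated residual typically keeps shrinking while the true residual levels off near the size of `gap`, which is the attainable-accuracy limit discussed on this slide.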

(17)

Example: Comparison of convergence of true and updated residuals for CG vs. CA-CG using a monomial basis, for various values of s

Model problem (2D Poisson on grid)

(18)

• Better conditioned polynomial bases can be used instead of monomial.

• Two common choices: Newton and Chebyshev - see, e.g., (Philippe and Reichel, 2012).
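A small numerical comparison of basis conditioning, as a sketch only. It builds a monomial basis and a Chebyshev basis (three-term recurrence, shifted and scaled to the spectrum) for the same starting vector and compares condition numbers after column normalization. The exact extreme eigenvalues are used for the scaling here, which is an assumption for simplicity; in practice only estimates would be available. All names are illustrative.

```python
import numpy as np

def poisson_2d(m):
    """Dense 2D Poisson matrix on an m-by-m grid (5-point stencil)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

def normalized_cond(Y):
    """2-norm condition number after scaling every column to unit length."""
    return np.linalg.cond(Y / np.linalg.norm(Y, axis=0))

def monomial_basis(A, v, s):
    """[v, Av, ..., A^s v]"""
    Y = np.empty((v.size, s + 1))
    Y[:, 0] = v
    for j in range(s):
        Y[:, j + 1] = A @ Y[:, j]
    return Y

def chebyshev_basis(A, v, s, lam_min, lam_max):
    """[T_0(Ah) v, ..., T_s(Ah) v] with Ah = (A - c I) / d mapped to [-1, 1]."""
    c = 0.5 * (lam_max + lam_min)
    d = 0.5 * (lam_max - lam_min)
    Y = np.empty((v.size, s + 1))
    Y[:, 0] = v
    if s >= 1:
        Y[:, 1] = (A @ v - c * v) / d
    for j in range(1, s):
        # Chebyshev recurrence: T_{j+1}(t) = 2 t T_j(t) - T_{j-1}(t)
        Y[:, j + 1] = 2.0 * (A @ Y[:, j] - c * Y[:, j]) / d - Y[:, j - 1]
    return Y

A = poisson_2d(16)
eigs = np.linalg.eigvalsh(A)
v = np.random.default_rng(0).standard_normal(A.shape[0])
s = 12

cond_mono = normalized_cond(monomial_basis(A, v, s))
cond_cheb = normalized_cond(chebyshev_basis(A, v, s, eigs[0], eigs[-1]))
print(f"monomial: {cond_mono:.2e}   Chebyshev: {cond_cheb:.2e}")
```

The monomial columns become increasingly aligned as the degree grows, while the Chebyshev columns stay far better separated, which is why the latter tolerates larger s.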

(19)

Residual replacement strategy for CG

• van der Vorst and Ye (1999): Improve accuracy by replacing updated residual by the true residual in certain iterations, combined with group update.

• Choose when to replace with to meet two constraints:

1. Replace often enough so that at termination, is small relative to

2. Don’t replace so often that original convergence mechanism of updated residuals is destroyed (avoid large perturbations to finite precision CG recurrence)

(20)

Residual replacement condition for CG

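Tong and Ye (2000): In finite precision classical CG, in iteration m, the computed residual and search direction vectors satisfy

$\hat{r}_m = \hat{r}_{m-1} - \alpha_{m-1} A \hat{p}_{m-1} + \eta_m, \qquad \hat{p}_m = \hat{r}_m + \beta_m \hat{p}_{m-1} + \tau_m$

Then in matrix form,

$A Z_m = Z_m T_m - \frac{1}{\alpha'_m} \frac{\hat{r}_{m+1}}{\|\hat{r}_0\|} e_{m+1}^T + F_m$ with $Z_m = \left[ \frac{\hat{r}_0}{\|\hat{r}_0\|}, \dots, \frac{\hat{r}_m}{\|\hat{r}_m\|} \right]$

where $T_m$ is invertible and tridiagonal, and $F_m = [f_0, \dots, f_m]$, with

$f_m = \frac{A \tau_m}{\|\hat{r}_m\|} + \frac{1}{\alpha_m} \frac{\eta_{m+1}}{\|\hat{r}_m\|} - \frac{\beta_m}{\alpha_{m-1}} \frac{\eta_m}{\|\hat{r}_m\|}$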

(21)

Residual replacement condition for CG

Tong and Ye (2000): If sequence satisfies , then

$\|\hat{r}_{m+1}\| \le (1 + \mathcal{K}_m) \min_{\rho \in \mathcal{P}_m,\ \rho(0)=1} \|\rho(A + \Delta A_m)\, \hat{r}_1\|$

where and .

• As long as satisfies recurrence, bound on its norm holds regardless of how it is generated

• Can replace by and still expect convergence whenever is not too large relative to and (van der Vorst & Ye, 1999).

(22)

Residual replacement strategy for CG

• (van der Vorst and Ye, 1999): Use computable bound for to update , an estimate of error , in each iteration: ,

• Set threshold , replace whenever reaches threshold

• Assuming a bound on condition number of , if updated residual

Pseudo-code for residual replacement with group update for CG:

if

group update of approximate solution

end
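A simplified sketch of CG with residual replacement and group update, loosely following the strategy of van der Vorst and Ye (1999). The specific choices below are assumptions, since the slide's formulas are not recoverable: the error estimate `d` grows by ε(N‖A‖‖z‖ + ‖r‖) per iteration, the threshold is √ε, and the guard `d > 1.1 * d_init` prevents immediate repeated replacement. `N` is the maximum number of nonzeros per row and ‖A‖ is computed exactly here, whereas an estimate would be used in practice.

```python
import numpy as np

def poisson_2d(m):
    """Dense 2D Poisson matrix on an m-by-m grid (5-point stencil)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

def cg_residual_replacement(A, b, tol=1e-12, max_iter=500):
    eps = np.finfo(float).eps
    eps_hat = np.sqrt(eps)                        # replacement threshold (assumed choice)
    N = int(np.count_nonzero(A, axis=1).max())    # max nonzeros per row
    norm_A = np.linalg.norm(A, 2)
    x = np.zeros_like(b)       # solution accumulated by group updates
    z = np.zeros_like(b)       # correction built up since the last replacement
    r = b.copy()
    p = r.copy()
    rr = r @ r
    b_norm = np.linalg.norm(b)
    d = d_init = eps * np.sqrt(rr)                # error estimate (x starts at zero)
    replaced_at = []
    for m in range(1, max_iter + 1):
        if np.sqrt(rr) <= tol * b_norm:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        z = z + alpha * p
        r = r - alpha * Ap                        # updated residual
        r_prev_norm, r_norm = np.sqrt(rr), np.linalg.norm(r)
        d_prev = d
        d = d_prev + eps * (N * norm_A * np.linalg.norm(z) + r_norm)
        if d_prev <= eps_hat * r_prev_norm and d > eps_hat * r_norm and d > 1.1 * d_init:
            x = x + z                             # group update of approximate solution
            z = np.zeros_like(b)
            r = b - A @ x                         # replace updated residual by true residual
            d = d_init = eps * (N * norm_A * np.linalg.norm(x) + np.linalg.norm(r))
            replaced_at.append(m)
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x + z, replaced_at

A = poisson_2d(8)
x_true = np.random.default_rng(0).standard_normal(A.shape[0])
b = A @ x_true
x, replaced_at = cg_residual_replacement(A, b)
```

Replacement happens only while the residual is still large compared with the estimated error, so the true residual substituted in is accurate and the convergence of the recurrence is not disturbed.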

(23)

Sources of roundoff error in CA-CG

Computing the s-step Krylov basis:

Error in computing s-step basis

Updating coordinate vectors in the inner loop:

with

Error in updating coefficient vectors

Recovering CG vectors for use in next outer loop:

Error in basis change

(24)

Maximum attainable accuracy of CA-CG

• We can write the deviation of the true and updated residuals in terms of these errors:

(25)

A computable bound

• We extend van der Vorst and Ye’s residual replacement strategy to CA-CG

• Making use of the bound for in CA-CG, update error estimate by:

j = s

otherwise

where

Estimated only once

flops per iterations; no communication

flops per iterations; 1 reduction per iterations

Extra computation is all lower-order terms; communication increases by at most a factor of 2.

(26)

Residual replacement for CA-CG

• Use the same replacement condition as van der Vorst and Ye (1999):

Pseudo-code for residual replacement with group update for CA-CG:

if

break from inner loop and begin new outer loop

end

(27)

(28)

Residual Replacement Indices

        CACG Mono.            CACG Newt.    CACG Cheb.    CG
s=4     354                   353           365           355
s=8     224, 334, 401, 517    340           353
s=12    135, 2119             326           346

• # replacements small compared to total

Total Number of Reductions

        CACG Mono.    CACG Newt.    CACG Cheb.    CG
s=4     203           196           197           669
s=8     157           102           99

• In addition to attainable accuracy,

(29)

(30)

[Figure: panels “Before” and “After”, for s = 4 and s = 8. # Replacements: Class. 2, M 1, N 1, C 1; Class. 2, M 5, N 2]

(31)

Preconditioning for CA-KSMs

• Tradeoff: speed up convergence, but increase time per iteration due to communication!

• For each specific app, must evaluate tradeoff between preconditioner quality and sparsity of the system

• Good news: many preconditioners allow communication-avoiding approach

• Block Jacobi – block diagonals

• Sparse Approx. Inverse (SAI) – same sparsity as ; recent work for CA-BICGSTAB by Mehri (2014)

• Polynomial preconditioning (Saad, 1985)

• HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014)

• CA-ILU(0) – recent work by Moufawad and Grigori (2013)

• Deflation for CA-CG (C., Knight, Demmel, 2014), based on Deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014)

(32)

Deflated CA-CG, model problem

Monomial Basis,

Matrix: 2D Laplacian(512), . Right hand side set such that true solution has entries .

(33)

Eigenvalue problems:

CA-Lanczos


Problem: 2D Poisson, , , with random starting vector

(34)

Eigenvalue problems:

CA-Lanczos

(35)

Paige’s Lanczos convergence analysis

Classic Lanczos rounding error result of Paige (1976):

for , where $\varepsilon_0 = O(\varepsilon n)$ and $\varepsilon_1 = O(\varepsilon N \theta)$

These results form the basis for Paige’s influential results in (Paige, 1980).

(36)

CA-Lanczos convergence analysis

for ,

For CA-Lanczos, we have:

(vs. for Lanczos)

,

(37)

The amplification term

• Our definition of amplification term before was

where we want to hold for the computed basis and any coordinate vector in every iteration.

• Better, more descriptive estimate for updated possible w/tighter bounds; requires some light bookkeeping

• Example: for bounds on and , we can use the definition

(38)

[Plot: $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$; measured value vs. upper bound (Paige 1976)]

(39)

[Plot: $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$; measured value vs. upper bound value]

(40)

[Plot: $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$]

(41)

[Plot: $\hat{\beta}_{i+1} |\hat{v}_i^T \hat{v}_{i+1}|$ and $|\hat{v}_{i+1}^T \hat{v}_{i+1} - 1|$]

(42)

Paige’s results for classical Lanczos

• Using bounds on local rounding errors in Lanczos, Paige showed that

1. The computed Ritz values always lie between the extreme eigenvalues of A to within a small multiple of machine precision.

2. At least one small interval containing an eigenvalue of A is found by the n-th iteration.

3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.

4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
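Two of these properties can be observed with a plain Lanczos iteration without reorthogonalization. The sketch below is illustrative only: the diagonal test matrix with eigenvalues 1 to 100, the iteration count and all names are assumptions. It computes the Ritz values, which should stay within the spectrum, and a measure of the loss of orthogonality among the basis vectors.

```python
import numpy as np

def lanczos(A, v0, m):
    """m steps of Lanczos without reorthogonalization; returns basis V and tridiagonal T."""
    n = v0.shape[0]
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m)                  # beta[j] couples basis vectors j and j+1
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return V[:, :m], T

n, m = 100, 60
lam = np.linspace(1.0, 100.0, n)
A = np.diag(lam)                        # known spectrum makes the Ritz bound easy to check
v0 = np.random.default_rng(0).standard_normal(n)

V, T = lanczos(A, v0, m)
ritz = np.linalg.eigvalsh(T)            # Ritz values
loss_of_orthogonality = np.linalg.norm(np.eye(m) - V.T @ V, 2)
print(f"Ritz range: [{ritz.min():.6f}, {ritz.max():.6f}]")
print(f"||I - V^T V||_2 = {loss_of_orthogonality:.2e}")
```

Comparing `loss_of_orthogonality` at different values of `m` against how close the extreme Ritz values are to 1 and 100 gives a hands-on view of item 4: orthogonality tends to degrade as Ritz values converge.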

(43)

Results for CA-Lanczos

• The answer is YES!

• Only if:

• is numerically full rank for and

• i.e.,

• Otherwise, e.g., can lose orthogonality due to computation with rank-deficient basis

• How can we use this bound on to design a better algorithm?

(44)

Ideas based on CA-Lanczos analysis

• Explicit basis orthogonalization

• Compute TSQR on in outer loop, use Q factor as basis

• Get for only one reduction

• Dynamically changing basis size

• Run incremental condition estimate when computing s-step basis; stop when is reached. Use this basis of size in this outer loop.
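The dynamic basis size idea can be sketched as follows. Columns of a (normalized) monomial basis are added one at a time and the process stops before the condition number exceeds a tolerance. This sketch recomputes the condition number from scratch with `np.linalg.cond` as a stand-in for an incremental condition estimator; the function name, the tolerance `cond_tol` and the cap `s_max` are hypothetical.

```python
import numpy as np

def poisson_2d(m):
    """Dense 2D Poisson matrix on an m-by-m grid (5-point stencil)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

def adaptive_monomial_basis(A, v, s_max, cond_tol):
    """Grow [v, Av, ...] (columns normalized) until cond would exceed cond_tol."""
    cols = [v / np.linalg.norm(v)]
    for _ in range(s_max):
        w = A @ cols[-1]
        w = w / np.linalg.norm(w)
        trial = np.column_stack(cols + [w])
        if np.linalg.cond(trial) > cond_tol:   # stand-in for an incremental estimate
            break                              # keep the last acceptable basis
        cols.append(w)
    Y = np.column_stack(cols)
    return Y, Y.shape[1] - 1                   # basis and the s actually used

A = poisson_2d(8)
v = np.random.default_rng(0).standard_normal(A.shape[0])
Y, s_used = adaptive_monomial_basis(A, v, s_max=20, cond_tol=1e6)
print(f"basis size used in this outer loop: s = {s_used}, cond = {np.linalg.cond(Y):.2e}")
```

The returned `s_used` would then be the number of inner iterations for the current outer loop, so a poorly conditioned stretch of the iteration costs extra reductions rather than accuracy.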