
ICMESeminar_short.pptx


Academic year: 2020


(1)

Communication-avoiding Krylov subspace methods in finite precision

Erin Carson, UC Berkeley

Advisor: James Demmel

(2)

Model Problem: 2D Poisson on grid, . Equilibration (diagonal scaling) used. RHS set s.t. elements of true solution .

Roundoff error can cause a decrease in attainable accuracy of Krylov subspace methods.

This effect can be worse for “communication-avoiding” ($s$-step) Krylov subspace methods, which limits practical applicability!

Residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods

We can extend this strategy to communication-avoiding variants. Accuracy can be improved for minimal performance cost!

(3)

What is communication?

• Algorithms have two costs: communication and computation

Communication : moving data between levels of memory hierarchy (sequential), between processors (parallel)

sequential comm.

parallel comm.

On modern computers, communication is expensive and computation is cheap: flop time << 1/bandwidth << latency

The communication bottleneck is a barrier to achieving scalability. It is also expensive in terms of energy cost!

Towards exascale: we must redesign algorithms to avoid communication

(4)

Communication bottleneck in KSMs

• A Krylov Subspace Method is a projection process onto the Krylov subspace $\mathcal{K}_m(A, r_0) = \mathrm{span}\{r_0, A r_0, A^2 r_0, \ldots, A^{m-1} r_0\}$

Linear systems, eigenvalue problems, singular value problems, least squares, etc.

Projection process in each iteration:

1. Add a dimension to $\mathcal{K}_m$

• Sparse Matrix-Vector Multiplication (SpMV)

• Parallel: communicate vector entries with neighbors

• Sequential: read $A$/vectors from slow memory

2. Orthogonalize (with respect to some $\mathcal{L}_m$)

• Inner products

• Parallel: global reduction

Dependencies between communication-bound kernels in each iteration limit performance.

SpMV → orthogonalize

(5)

Example: Classical conjugate gradient (CG)

SpMVs and inner products require communication in each iteration!

for solving

Let

for until convergence do

end for
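As a point of reference for the loop above, here is a minimal sketch of textbook classical CG in Python with NumPy. It is not taken from the slides: the helper `poisson2d`, the dense matrix storage, the stopping tolerance, and the small 8-by-8 grid are all illustrative assumptions. The comments mark the two communication-bound kernels in each iteration.

```python
import numpy as np

def poisson2d(k):
    """Dense 2D Poisson (5-point stencil) matrix on a k-by-k grid (illustrative helper)."""
    T = 2.0 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    return np.kron(np.eye(k), T) + np.kron(T, np.eye(k))

def cg(A, b, tol=1e-10, maxiter=1000):
    """Textbook CG for symmetric positive definite A; returns (x, iterations used)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    norm_b = np.linalg.norm(b)
    for m in range(1, maxiter + 1):
        Ap = A @ p                 # SpMV: neighbor communication in parallel
        alpha = rr / (p @ Ap)      # inner product: global reduction in parallel
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r             # inner product: global reduction in parallel
        if np.sqrt(rr_new) <= tol * norm_b:
            return x, m
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, maxiter

A = poisson2d(8)
x_true = np.ones(A.shape[0])
b = A @ x_true
x, iters = cg(A, b)
```

Each pass through the loop performs one SpMV and two inner products, which is the per-iteration communication pattern that CA variants try to restructure.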

(6)

Communication-avoiding CA-KSMs

Krylov methods can be reorganized to reduce communication cost.

Many CA-KSMs ($s$-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR

Related references:

(Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), GMRES (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).

Reduction in communication can translate to speedups on practical problems. Recent results: speedup for CA-BICGSTAB in GMG bottom-solve.

(7)

Benchmark timing breakdown

Plot: Net time spent on different operations over one MG solve

Hopper at NERSC (Cray XE6), 4 6-core Opteron chips per node, Gemini network, 3D torus

3D Helmholtz equation

(8)

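CA-CG overview

Starting at iteration $sk$ for $k \ge 0$, $s > 0$, it can be shown that to compute the next $s$ steps (iterations $sk + j$ where $j \le s$),

$p_{sk+j},\ r_{sk+j},\ x_{sk+j} - x_{sk} \in \mathcal{K}_{s+1}(A, p_{sk}) + \mathcal{K}_s(A, r_{sk})$.

1. Compute basis matrix

$Y_k = [\rho_0(A) p_{sk}, \ldots, \rho_s(A) p_{sk},\ \rho_0(A) r_{sk}, \ldots, \rho_{s-1}(A) r_{sk}]$

of dimension $n$-by-$(2s+1)$, where $\rho_j$ is a polynomial of degree $j$. This gives the recurrence relation

$A \underline{Y}_k = Y_k B_k$,

where $B_k$ is $(2s+1)$-by-$(2s+1)$ and is 2x2 block diagonal with upper Hessenberg blocks, and $\underline{Y}_k$ is $Y_k$ with columns $s+1$ and $2s+1$ set to 0.

Communication cost: Same latency cost as one SpMV using the matrix powers kernel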

(9)

The matrix powers kernel

Avoids communication:

• In serial, by exploiting temporal locality:

• Reading $A$, reading the vectors

• In parallel, by doing only 1 ‘expand’ phase (instead of $s$).

• Requires sufficiently low ‘surface-to-volume’ ratio

Tridiagonal Example:

[Figure: Sequential and Parallel panels, each showing the rows $v$, $Av$, $A^2v$, $A^3v$]

black = local elements

red = 1-level dependencies

green = 2-level dependencies

blue = 3-level dependencies
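The tridiagonal example can be sketched in a few lines of Python. This is an illustrative assumption-laden sketch, not the kernel used in the talk: it uses the stencil tridiag(-1, 2, -1), a single array standing in for distributed memory, and hypothetical function names. A "processor" owning rows `lo..hi-1` fetches `s` ghost entries per side once, then computes its rows of $v, Av, \ldots, A^s v$ with no further exchange.

```python
import numpy as np

def apply_stencil(w):
    """Apply tridiag(-1, 2, -1) to w, treating entries outside w as zero."""
    out = 2.0 * w
    out[1:] -= w[:-1]
    out[:-1] -= w[1:]
    return out

def local_powers(v, lo, hi, s):
    """Rows lo..hi-1 of [v, Av, ..., A^s v], using one ghost-zone fetch of width s."""
    n = len(v)
    glo, ghi = max(lo - s, 0), min(hi + s, n)   # the single 'expand' phase
    w = v[glo:ghi].copy()
    cols = [w[lo - glo:hi - glo].copy()]
    for _ in range(s):
        # Entries near a truncated window edge become stale, one more per step,
        # but the owned rows sit s entries away from any truncated edge.
        w = apply_stencil(w)
        cols.append(w[lo - glo:hi - glo].copy())
    return np.column_stack(cols)

def global_powers(v, s):
    """Reference: [v, Av, ..., A^s v] computed on the whole vector."""
    cols = [v.copy()]
    for _ in range(s):
        cols.append(apply_stencil(cols[-1]))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
v = rng.standard_normal(40)
s = 3
V = global_powers(v, s)
V_mid = local_powers(v, 10, 20, s)    # interior block
V_edge = local_powers(v, 0, 10, s)    # block touching the physical boundary
```

The ghost zone grows with `s`, which is why the approach needs a low surface-to-volume ratio: redundant work and fetched data scale with the boundary of the owned block.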

(10)

2. Orthogonalize:

Encode dot products between basis vectors by computing Gram matrix $G_k = Y_k^T Y_k$ of dimension $(2s+1)$-by-$(2s+1)$ (or compute Tall-Skinny QR)

Communication cost: one global reduction

3. Perform $s$ iterations of updates

Using $B_k$ and $G_k$, this requires no communication

Represent $n$-vectors by their length-$(2s+1)$ coordinates in $Y_k$

Perform $s$ iterations of updates on coordinate vectors

4. Basis change to recover CG vectors

Compute locally, no communication
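The idea of working with short coordinate vectors can be illustrated with a small NumPy sketch. It assumes a monomial basis built from two arbitrary vectors `p` and `r`, a 1D Laplacian test matrix, and the names `Y`, `G`, `B` for the basis, Gram matrix, and change-of-basis matrix; these choices are illustrative, not the talk's implementation. It checks that a matrix-vector product and an inner product on length-$n$ vectors can be reproduced from length-$(2s+1)$ coordinates alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 50, 3
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # assumed test matrix
p = rng.standard_normal(n)
r = rng.standard_normal(n)

def monomial_basis(A, p, r, s):
    """Columns [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]."""
    cols = [p]
    for _ in range(s):
        cols.append(A @ cols[-1])
    cols.append(r)
    for _ in range(s - 1):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

Y = monomial_basis(A, p, r, s)        # n-by-(2s+1)
G = Y.T @ Y                           # Gram matrix: one global reduction in parallel

# For the monomial basis, multiplying by A shifts a coordinate one slot down
# within each block; the last column of each block has no image in the basis.
B = np.zeros((2 * s + 1, 2 * s + 1))
for j in list(range(s)) + list(range(s + 1, 2 * s)):
    B[j + 1, j] = 1.0

c = np.zeros(2 * s + 1)
c[0], c[s + 1] = 1.0, 0.5             # coordinates of w = p + 0.5 r
d = B @ c                             # coordinates of A w, computed without touching A
w = Y @ c
```

With these pieces, `A @ w` equals `Y @ d` and the inner product `w @ (A @ w)` equals `c @ G @ d`, so the inner loop can run on small local data between outer-loop communication steps.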

(11)

Compute such that Compute

$[x_{sk+s} - x_{sk},\ r_{sk+s},\ p_{sk+s}] = Y_k\,[x'_{k,s},\ r'_{k,s},\ p'_{k,s}]$

for do for do

end for end for

CG

CA-CG

(12)

Krylov methods in finite precision

• Finite precision errors have effects on algorithm behavior (known to Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for an in-depth survey)

• Lanczos:

• Basis vectors lose orthogonality

• Appearance of multiple Ritz approximations to some eigenvalues

• Delays convergence of Ritz values to other eigenvalues

• CG:

• Residual vectors (proportional to Lanczos basis vectors) lose orthogonality

• Appearance of duplicate Ritz values delays convergence of CG approximate solution

• Residual vectors and true residuals deviate

(13)

CA-Krylov methods in finite precision

CA-KSMs are mathematically equivalent to classical KSMs

But convergence delay and loss of accuracy get worse with increasing $s$!

Obstacle to solving practical problems:

Decrease in attainable accuracy → some problems that a KSM can solve can’t be solved with the CA variant

Delay of convergence → if # iterations increases more than time per iteration decreases due to CA techniques, no speedup expected!

(14)

This raises the questions…

For CA-KSMs,

How bad can the effects of roundoff error be?

And what can we do about it?

For solving linear systems:

Bound on maximum attainable accuracy for CG and CA-BICG

Residual replacement strategy: uses bound to improve accuracy

(15)

This raises the questions…

For CA-KSMs,

How bad can the effects of roundoff error be?

And what can we do about it?

For eigenvalue problems:

Extension of Chris Paige’s analysis of Lanczos to CA variants:

Bounds on local rounding errors in CA-Lanczos

Bounds on accuracy and convergence of Ritz values

Loss of orthogonality implies convergence of Ritz values

Some ideas on how to use this new analysis…

Basis orthogonalization

Dynamically select/change basis size ($s$ parameter)

Mixed precision variants

(16)

Maximum attainable accuracy of CG

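• In classical CG, iterates are updated by $x_m = x_{m-1} + \alpha_{m-1} p_{m-1}$ and $r_m = r_{m-1} - \alpha_{m-1} A p_{m-1}$

• Formulas for $x_m$ and $r_m$ do not depend on each other: rounding errors cause the true residual, $b - A x_m$, and the updated residual, $r_m$, to deviate

• The size of the true residual is bounded by $\|b - A x_m\| \le \|r_m\| + \|b - A x_m - r_m\|$

• When $\|r_m\| \gg \|b - A x_m - r_m\|$, $\|r_m\|$ and $\|b - A x_m\|$ have similar magnitude

• When $\|r_m\| \to 0$, $\|b - A x_m\|$ depends on $\|b - A x_m - r_m\|$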
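The deviation between the two residuals is easy to observe numerically. The sketch below is an illustration under stated assumptions (a dense 10-by-10-grid 2D Poisson matrix, a random right-hand side, 200 iterations, hypothetical helper names): it runs plain CG past convergence and records the updated residual norm, the true residual norm, and the norm of their difference.

```python
import numpy as np

def poisson2d(k):
    """Dense 2D Poisson matrix on a k-by-k grid (illustrative helper)."""
    T = 2.0 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    return np.kron(np.eye(k), T) + np.kron(T, np.eye(k))

def cg_residual_history(A, b, iters):
    """Rows of (||updated r||, ||true r||, ||true r - updated r||) per iteration."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rows = []
    for _ in range(iters):
        Ap = A @ p
        rr, pAp = r @ r, p @ Ap
        if rr == 0.0 or pAp == 0.0:     # guard against underflow far past convergence
            break
        alpha = rr / pAp
        x = x + alpha * p
        r = r - alpha * Ap              # updated residual (recurrence)
        p = r + ((r @ r) / rr) * p
        true_r = b - A @ x              # true residual (recomputed)
        rows.append((np.linalg.norm(r),
                     np.linalg.norm(true_r),
                     np.linalg.norm(true_r - r)))
    return np.array(rows)

rng = np.random.default_rng(0)
A = poisson2d(10)
b = A @ rng.standard_normal(A.shape[0])
hist = cg_residual_history(A, b, 200)
updated_norm, true_norm, deviation = hist[-1]
```

In a run like this, the updated residual norm keeps shrinking while the true residual norm stagnates near the size of the accumulated deviation, which is the attainable-accuracy limit the bound describes.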

(17)

Example: Comparison of convergence of true and updated residuals for CG vs. CA-CG using a monomial basis, for various $s$ values

Model problem (2D Poisson on grid)

(18)

• Better conditioned polynomial bases can be used instead of monomial.

• Two common choices: Newton and Chebyshev; see, e.g., (Philippe and Reichel, 2012).
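To make the conditioning claim concrete, the following sketch compares the condition numbers of a monomial and a scaled-and-shifted Chebyshev $s$-step basis. Everything here is an illustrative assumption: the 1D Laplacian test matrix, the spectrum bounds 0 and 4 (from Gershgorin discs for that matrix), the column normalization, and the function name `basis`. The Newton basis is omitted because it needs shift estimates.

```python
import numpy as np

def basis(A, v, s, kind, lam_min=0.0, lam_max=4.0):
    """Normalized columns rho_0(A) v, ..., rho_s(A) v for the chosen polynomial family."""
    cols = [v]
    if kind == "monomial":
        for _ in range(s):
            cols.append(A @ cols[-1])
    elif kind == "chebyshev":
        # Map the assumed spectrum interval [lam_min, lam_max] onto [-1, 1].
        def shifted(w):
            return (2.0 * (A @ w) - (lam_max + lam_min) * w) / (lam_max - lam_min)
        cols.append(shifted(v))
        for _ in range(s - 1):
            # Three-term recurrence T_{j+1}(z) = 2 z T_j(z) - T_{j-1}(z)
            cols.append(2.0 * shifted(cols[-1]) - cols[-2])
    else:
        raise ValueError(kind)
    Y = np.column_stack(cols)
    return Y / np.linalg.norm(Y, axis=0)

rng = np.random.default_rng(2)
n, s = 200, 12
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
v = rng.standard_normal(n)

cond_monomial = np.linalg.cond(basis(A, v, s, "monomial"))
cond_chebyshev = np.linalg.cond(basis(A, v, s, "chebyshev"))
```

Comparing `cond_monomial` with `cond_chebyshev` for increasing `s` shows how quickly the monomial basis approaches numerical rank deficiency on this test problem.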

(19)

Residual replacement strategy for CG

van der Vorst and Ye (1999): Improve accuracy by replacing the updated residual $r_m$ by the true residual $b - A x_m$ in certain iterations, combined with group update.

Choose when to replace $r_m$ with $b - A x_m$ to meet two constraints:

1. Replace often enough that, at termination, the deviation $\|b - A x_m - r_m\|$ between true and updated residuals is small

2. Don’t replace so often that the original convergence mechanism of updated residuals is destroyed (avoid large perturbations to finite precision CG recurrence)

(20)

Residual replacement condition for CG

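Tong and Ye (2000): In finite precision classical CG, in iteration $m$, the computed residual and search direction vectors satisfy

$\hat r_m = \hat r_{m-1} - \alpha_{m-1} A \hat p_{m-1} + \eta_m$, $\quad \hat p_m = \hat r_m + \beta_m \hat p_{m-1} + \tau_m$

Then in matrix form,

$A Z_m = Z_m T_m - \frac{1}{\alpha'_m} \frac{\hat r_{m+1}}{\|\hat r_0\|} e_{m+1}^T + F_m$ with $Z_m = \left[ \frac{\hat r_0}{\|\hat r_0\|}, \ldots, \frac{\hat r_m}{\|\hat r_m\|} \right]$

where $T_m$ is invertible and tridiagonal, $\alpha'_m = \alpha_m \|\hat r_m\| / \|\hat r_0\|$, and $F_m = [f_0, \ldots, f_m]$, with

$f_m = -\frac{A \tau_m}{\|\hat r_m\|} + \frac{1}{\alpha_m} \frac{\eta_{m+1}}{\|\hat r_m\|} - \frac{\beta_m}{\alpha_{m-1}} \frac{\eta_m}{\|\hat r_m\|}$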

(21)

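Residual replacement condition for CG

Tong and Ye (2000): If the sequence $\hat r_m$ satisfies the perturbed recurrence in matrix form, then

$\|\hat r_{m+1}\| \le (1 + \mathcal{K}_m) \min_{\rho \in \mathcal{P}_m,\ \rho(0) = 1} \|\rho(A + \Delta A_m)\, \hat r_1\|$

As long as $\hat r_m$ satisfies the recurrence, the bound on its norm holds regardless of how it is generated

Can replace the updated residual by the computed true residual and still expect convergence whenever the resulting perturbation $\eta_m$ is not too large relative to $\|\hat r_m\|$ and $\|\hat r_{m-1}\|$ (van der Vorst & Ye, 1999).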

(22)

Residual replacement strategy for CG

(van der Vorst and Ye, 1999): Use a computable bound for the deviation between the true and updated residuals to update an estimate of the error in each iteration

• Set a threshold, and replace whenever the estimate reaches the threshold

Assuming a bound on condition number of $A$, if updated residual

Pseudo-code for residual replacement with group update for CG:

if

group update of approximate solution

end
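A simplified sketch of residual replacement with group update is given below. It follows the general shape of the strategy (accumulate an error estimate, replace the updated residual by the true residual when the estimate crosses a threshold relative to the residual norm) but it is not the published algorithm: the estimate update, the use of the matrix 1-norm, the threshold $\sqrt{\varepsilon}$, and all names are assumptions made for illustration.

```python
import numpy as np

def poisson2d(k):
    """Dense 2D Poisson matrix on a k-by-k grid (illustrative helper)."""
    T = 2.0 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    return np.kron(np.eye(k), T) + np.kron(T, np.eye(k))

def cg_with_replacement(A, b, tol=1e-12, maxiter=500):
    eps = np.finfo(float).eps
    threshold = np.sqrt(eps)            # assumed threshold choice
    norm_A = np.linalg.norm(A, 1)       # assumed norm used in the estimate
    norm_b = np.linalg.norm(b)
    x = np.zeros_like(b)                # solution as of the last replacement
    z = np.zeros_like(b)                # group update accumulated since then
    r = b.copy()
    p = r.copy()
    r_norm = np.linalg.norm(r)
    d = eps * r_norm                    # running estimate of ||b - A x - r||
    replaced_at = []
    for m in range(1, maxiter + 1):
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        z += alpha * p
        r = r - alpha * Ap
        r_norm_old, r_norm = r_norm, np.linalg.norm(r)
        d_old = d
        d = d_old + eps * (norm_A * np.linalg.norm(x + z) + r_norm)
        if d_old <= threshold * r_norm_old and d > threshold * r_norm:
            x = x + z                   # group update of approximate solution
            z = np.zeros_like(b)
            r = b - A @ x               # replace updated residual by true residual
            r_norm = np.linalg.norm(r)
            d = eps * (norm_A * np.linalg.norm(x) + r_norm)
            replaced_at.append(m)
        if r_norm <= tol * norm_b:
            break
        p = r + ((r @ r) / rr) * p
    return x + z, replaced_at

rng = np.random.default_rng(3)
A = poisson2d(10)
x_true = rng.standard_normal(A.shape[0])
b = A @ x_true
x_rr, replaced_at = cg_with_replacement(A, b)
```

The two-sided test (estimate below the threshold in the previous iteration, above it now) is what keeps replacements rare, so the convergence of the updated residuals is perturbed only a few times.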

(23)

Sources of roundoff error in CA-CG

Computing the $s$-step Krylov basis: error in computing $s$-step basis

Updating coordinate vectors in the inner loop: error in updating coefficient vectors

Recovering CG vectors for use in next outer loop: error in basis change

(24)

Maximum attainable accuracy of CA-CG

• We can write the deviation of the true and updated residuals in terms of these errors:

(25)

• We extend van der Vorst and Ye’s residual replacement strategy to CA-CG

• Making use of the bound for the deviation of the true and updated residuals in CA-CG, update the error estimate by a computable bound, with one case for $j = s$ and another otherwise

Costs of the terms in the bound:

Estimated only once

flops per $s$ iterations; no communication

flops per $s$ iterations; 1 reduction per $s$ iterations

Extra computation is all lower order terms; communication increased by at most a factor of 2.

(26)

Residual replacement for CA-CG

• Use the same replacement condition as van der Vorst and Ye (1999):

Pseudo-code for residual replacement with group update for CA-CG:

if

break from inner loop and begin new outer loop

end

(27)
(28)

Residual Replacement Indices

         CACG Mono.           CACG Newt.   CACG Cheb.   CG
s=4      354                  353          365          355
s=8      224, 334, 401, 517   340          353
s=12     135, 2119            326          346

# replacements small compared to total

Total Number of Reductions

         CACG Mono.   CACG Newt.   CACG Cheb.   CG
s=4      203          196          197          669
s=8      157          102          99

In addition to attainable accuracy,

(29)
(30)

[Figure: Before and After panels for $s = 4$ and $s = 8$]

# Replacements ($s = 4$): Class. 2, M 1, N 1, C 1

# Replacements ($s = 8$): Class. 2, M 5, N 2

(31)

Preconditioning for CA-KSMs

Tradeoff: speed up convergence, but increase time per iteration due to communication!

• For each specific app, must evaluate tradeoff between preconditioner quality and sparsity of the system

• Good news: many preconditioners allow communication-avoiding approach

• Block Jacobi: block diagonals

• Sparse Approx. Inverse (SAI): same sparsity as $A$; recent work for CA-BICGSTAB by Mehri (2014)

• Polynomial preconditioning (Saad, 1985)

• HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014)

• CA-ILU(0): recent work by Moufawad and Grigori (2013)

• Deflation for CA-CG (C., Knight, Demmel, 2014), based on Deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014)

(32)

Deflated CA-CG, model problem

Monomial Basis,

Matrix: 2D Laplacian(512), . Right hand side set such that true solution has entries .

(33)

Eigenvalue problems:

CA-Lanczos


Problem: 2D Poisson, , , with random starting vector

(34)

Eigenvalue problems:

CA-Lanczos

Problem: 2D Poisson, , , with random starting vector

(35)

Paige’s Lanczos convergence analysis

Classic Lanczos rounding error result of Paige (1976), with $\varepsilon_0 = O(\varepsilon n)$ and $\varepsilon_1 = O(\varepsilon N \theta)$.

These results form the basis for Paige’s influential results in (Paige, 1980).

(36)

CA-Lanczos convergence analysis

for ,

For CA-Lanczos, we have:

(vs. for Lanczos)

(vs. for Lanczos)

,

(37)

The amplification term

Our definition of amplification term before was

where we want to hold for the computed basis and any coordinate

vector in every iteration.

A better, more descriptive estimate for the amplification term is possible with tighter bounds; this requires some light bookkeeping

Example: for bounds on and , we can use the definition

(38)

[Figure: panels $\hat\beta_{i+1} |\hat v_i^T \hat v_{i+1}|$ and $|\hat v_{i+1}^T \hat v_{i+1} - 1|$; measured value vs. upper bound (Paige 1976)]

(39)

[Figure: panels $\hat\beta_{i+1} |\hat v_i^T \hat v_{i+1}|$ and $|\hat v_{i+1}^T \hat v_{i+1} - 1|$; measured value vs. upper bound value]

Problem: 2D Poisson with random starting vector

(40)

[Figure: panels $\hat\beta_{i+1} |\hat v_i^T \hat v_{i+1}|$ and $|\hat v_{i+1}^T \hat v_{i+1} - 1|$; measured value vs. upper bound value]

(41)

[Figure: panels $\hat\beta_{i+1} |\hat v_i^T \hat v_{i+1}|$ and $|\hat v_{i+1}^T \hat v_{i+1} - 1|$; measured value vs. upper bound value]

Problem: 2D Poisson with random starting vector

(42)

Paige’s results for classical Lanczos

Using bounds on local rounding errors in Lanczos, Paige showed that

1. The computed Ritz values always lie between the extreme eigenvalues of $A$ to within a small multiple of machine precision.

2. At least one small interval containing an eigenvalue of $A$ is found by the $n$th iteration.

3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.

4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
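The first and fourth points can be observed with a short experiment. The sketch below runs classical Lanczos without reorthogonalization on a diagonal test matrix; the matrix, its logarithmically spaced eigenvalues, the number of steps, and the function name are all illustrative assumptions. It measures the loss of orthogonality $\|I - V^T V\|$ and computes the Ritz values from the tridiagonal matrix.

```python
import numpy as np

def lanczos(A, v0, m):
    """Classical Lanczos without reorthogonalization; returns basis V and tridiagonal T."""
    n = len(v0)
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    v = v0 / np.linalg.norm(v0)
    v_prev = np.zeros(n)
    beta_prev = 0.0
    for j in range(m):
        V[:, j] = v
        w = A @ v - beta_prev * v_prev
        alpha[j] = v @ w
        w = w - alpha[j] * v
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            v_prev, v = v, w / beta[j]
            beta_prev = beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return V, T

rng = np.random.default_rng(4)
eigs = np.logspace(0, 3, 80)          # assumed spectrum of the test matrix
A = np.diag(eigs)
m = 50
V, T = lanczos(A, rng.standard_normal(len(eigs)), m)

ritz = np.linalg.eigvalsh(T)
loss_of_orthogonality = np.linalg.norm(np.eye(m) - V.T @ V)
```

Inspecting `loss_of_orthogonality` alongside how closely the largest entries of `ritz` match the largest entries of `eigs` gives a feel for the link between lost orthogonality and converged Ritz values.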

(43)

Results for CA-Lanczos

The answer is YES!

Only if the basis is numerically full rank.

Otherwise, e.g., can lose orthogonality due to computation with rank-deficient basis

How can we use this bound to design a better algorithm?

(44)

Ideas based on CA-Lanczos analysis

• Explicit basis orthogonalization

• Compute TSQR on the basis in outer loop, use $Q$ factor as basis

• Costs only one reduction

• Dynamically changing basis size

• Run incremental condition estimate when computing $s$-step basis; stop when the chosen limit is reached. Use this basis size in this outer loop.

• Mixed precision

• For inner products, use precision such that

(45)

Summary

For high performance iterative methods, both time per iteration and number of iterations required are important.

CA-KSMs offer asymptotic reductions in time per iteration, but can have the effect of increasing number of iterations required (or causing instability)

For CA-KSMs to be practical, must better understand finite precision behavior (theoretically and empirically), and develop ways to improve the methods

Ongoing efforts in: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA preconditioners, other new techniques.

(46)

Thank you!

contact: [email protected]
