• No se han encontrado resultados

BASCD2013Talk.pptx

N/A
N/A
Protected

Academic year: 2020

Share "BASCD2013Talk.pptx"

Copied!
36
0
0

Texto completo

(1)

Erin Carson and James Demmel

U.C. Berkeley

Bay Area Scientific Computing Day December 11, 2013

(2)

What is Communication?

Algorithms have two costs: communication and computation

Communication : moving data between levels of memory

hierarchy (sequential), between processors (parallel)

2

sequential comm.

parallel comm.

On modern computers, communication is expensive, computation is cheapFlop time << 1/bandwidth << latency

Communication bottleneck a barrier to achieving scalabilityCommunication is also expensive in terms of energy cost

(3)

Krylov Subspace Iterative Methods

Methods for solving linear systems, eigenvalue problems, singular value problems, least squares, etc.

Best choice when is very large and very sparse, stored implicitly, or only approximate answer needed

Based on projection onto the expanding Krylov subspace

Used in numerous application areas

Image/contour detection, modal analysis, solving PDEs - combustion, climate, astrophysics, etc…

• Named among “Top 10” algorithmic ideas of the 20th century

Along with FFT, fast multipole, quicksort, Monte Carlo, simplex method, etc. (SIAM News, Vol.33, No. 4, 2000)

(4)

Communication Bottleneck in KSMs

2. Orthogonalize

Vector operations (inner products, axpys)

Parallel: global reduction

Sequential: multiple reads/writes to

slow memory

4

To carry out the projection process, in each iteration, we

1. Add a dimension to the Krylov subspace

Requires Sparse Matrix-Vector Multiplication (SpMV)

Parallel: communicate vector entries with neighbors

Sequential: read (and -vectors) from slow memory

Dependencies between communication-bound kernels in each

iteration limit performance!

SpMV

(5)

5

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products

require communication in

each iteration

for solving

Let

for until convergence do

(6)

• CA-KSMs, based on -step formulations, reduce communication by

– First instance: s-step CG by Van Rosendale (1983); for thorough overview see thesis of Hoemmen (2010)

• Main idea: Unroll iteration loop by a factor of . By induction, for

Communication-Avoiding KSMs

6

1. Compute basis for upfront

• If is well partitioned, requires reading /communicating vectors only once (via matrix powers kernel)

2. Orthogonalize: Encode dot products between basis vectors with Gram matrix (or compute Tall-Skinny QR)

• Communication cost of one global reduction

3. Perform s iterations of updates

Using and , this requires no communication

Represent -vectors by their coordinates in

(7)

via CA Matrix Powers Kernel

Global reduction to compute G

7

Example: CA-Conjugate Gradient

Local computations within inner loop require

no communication!

for solving Let

for until convergence do

Compute such that , compute Let

for do

end for

(8)

Behavior of CA-KSMs in Finite Precision

Runtime = (time per iteration) x (number of iterations until convergence)

Communication-avoiding approach can reduce time per iteration by O(s)But how do convergence rates compare to classical methods?

CA-KSMs are mathematically equivalent to classical KSMs, but can behave much differently in finite precision!

Roundoff errors have two discernable effects:

1. Delay of convergence if # iterations increases more than time per iteration decreases due to CA techniques, no speedup expected 2. Decrease in attainable accuracy due to deviation of the true

residual, , and the updated residual, some problems that KSM can solve can’t be solved with CA variant

– Roundoff error bounds generally grow with increasing • Obstacles to the practical use of CA-KSMs

(9)

CA-CG Convergence, s = 4 CA-CG Convergence, s = 8

0

CA-CG Convergence, s = 16

Slower

convergence due to roundoff

Loss of accuracy due to roundoff

At s = 16, monomial basis is rank deficient! Method breaks down! CG true

CG updated

CA-CG (monomial) true CA-CG (monomial) updated

(10)

Improving Convergence Rate and Accuracy

We will discuss two techniques for alleviating the effects of

roundoff:

1. Improve convergence by using

better-conditioned

polynomial bases

2. Use a

residual replacement strategy

to improve

attainable accuracy

(11)

Choosing a Polynomial Basis

Recall: in each outer loop of CA-CG, we compute bases for some Krylov

subspace,

Simple loop unrolling leads to the choice of monomials

Monomial basis condition number can grow exponentially with -

expected (near) linear dependence of basis vectors for modest values

Recognized early on that this negatively affects convergence

(Chronopoulous & Swanson, 1995)

Improve basis condition number to improve convergence: Use different

polynomials to compute a basis for the same subspace.

Two choices based on spectral information that usually lead to

well-conditioned bases:

Newton polynomials Chebyshev polynomials

(12)

Newton Basis

12

The Newton basis can be written

where

are chosen to be approximate eigenvalues of , ordered

according to Leja ordering:

where .

In practice: recover Ritz (Petrov) values from the first few iterations,

iteratively refine eigenvalue estimates to improve basis

(13)

Chebyshev Basis

Chebyshev polynomials frequently used because of minimax property:

Over all real polynomials of degree on interval , the Chebyshev

polynomial of degree minimizes the maximum absolute value on

Given ellipse enclosing spectrum of with foci at , we can generate the

scaled and shifted Chebyshev polynomials as:

where are the Chebyshev polynomials of the first kind

In practice, we can estimate and parameters from Ritz values

recovered from the first few iterations

Used by many to improve s-step variants: e.g., de Sturler (1991),

Joubert and Carey (1992), de Sturler and van der Vorst (2005)

(14)

CA-CG Convergence, s = 4 CA-CG Convergence, s = 8

0

CA-CG Convergence, s = 16

CG true CG updated

CA-CG (monomial) true CA-CG (monomial) updated

(15)

Better basis choice allows higher s values

CA-CG Convergence, s = 4 CA-CG Convergence, s = 8

CA-CG Convergence, s = 16

But can still see loss of accuracy CG true

CG updated

CA-CG (monomial) true CA-CG (monomial) updated

CA-CG (Newton) true CA-CG (Newton) updated CA-CG (Chebyshev) true CA-CG (Chebyshev) updated

(16)

Explicit Orthogonalization

Use communication-avoiding TSQR to orthogonalize matrix powers output • Matrix powers kernel outputs basis

such that

Use TSQR to compute (one reduction)Use orthogonal basis instead of : with

Gram matrix computation unnecessary since

Example: small model problem, 2D Poisson with , CA-CG with monomial basis and

Monomial Basis, s=8

Can improve convergence rate of

updated residual

(17)

Accuracy in Finite Precision

17

In classical CG, iterates are updated by

and

Formulas for and do not depend on each other - rounding errors cause the

true residual, , and the updated residual, , to deviate

The size of the deviation of residuals, , determines the maximum attainable

accuracy

Write the true residual as

Then the size of the true residual is bounded by

When , and have similar magnitudeBut when , depends on

First bound on maximum attainable accuracy for classical KSMs given by

Greenbaum (1992)

We have applied a similar analysis to upper bound the maximum attainable

(18)

Computing the -step Krylov basis:

Updating coordinate vectors in the inner loop:

Recovering CG vectors for use in next outer loop:

18

Sources of Roundoff in CA-KSMs

Error in computing -step basis

Error in updating coefficient vectors

(19)

We can write the deviation of the true and updated residuals in

terms of these errors:

19

Bound on Maximum Attainable Accuracy

This can easily be translated into a computable upper bound

(20)

Residual Replacement Strategy

Based on Van der Vorst and Ye (1999) for classical KSMs

Idea:

Replace updated residual with true residual

at certain

iterations to reduce size of their deviation (thus improving

attainable accuracy)

Result: When the updated residual converges to level , a small

number of replacement steps can

reduce true residual to level

We devise an analogous strategy for CA-CG and CA-BICG based on

derived bound on deviation of residuals

Computing estimates for quantities in bound has

negligible cost

→ residual replacement strategy does not asymptotically

increase communication or computation!

To appear in SIMAX

(21)

CA-CG Convergence, s = 4 CA-CG Convergence, s = 8

CA-CG Convergence, s = 16

CG true CG updated

CA-CG (monomial) true CA-CG (monomial) updated CA-CG (Newton) true

CA-CG (Newton) updated CA-CG (Chebyshev) true CA-CG (Chebyshev) updated

(22)

Model Problem: 2D Poisson (5 pt stencil), n = 262K, nnz = 1.3M, cond(A) 10^4

CA-CG Convergence, s = 4 CA-CG Convergence, s = 8

CA-CG Convergence, s = 16

CG-RR true CG-RR updated

CA-CG-RR (monomial) true CA-CG-RR (monomial) updated CA-CG-RR (Newton) true

CA-CG-RR (Newton) updated CA-CG-RR (Chebyshev) true CA-CG-RR (Chebyshev) updated

Residual Replacement can improve accuracy

orders of magnitude for negligible cost Maximum

replacement steps (extra reductions)

(23)

Ongoing Work: Using Extended Precision

23

Precision we choose affects both stability and convergenceErrors in affect convergence rate (Lanczos process)

Errors in computing and affect accuracy (deviation of residuals)

Example: diag(linspace(1,1e6, 100)), random RHS

Precision:

10 decimal

places

Precision:

15 decimal

places

Precision:

20 decimal

places

CA-CG convergence (s = 2, monomial basis)

(24)

Summary and Open Problems

For high performance iterative methods, both time per iteration and

number of iterations required are important.

• CA-KSMs offer asymptotic reductions in time per iteration, but can have the effect of increasing number of iterations required (or causing

instability)

We discuss two ways for reducing roundoff effects for linear system solvers: improving basis conditioning and residual replacement techniques

For CA-KSMs to be practical, must better understand finite precision

behavior (theoretically and empirically), and develop ways to improve the methods

Ongoing efforts in: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA

(25)
(26)

Authors

KSM

Basis

Precond? Mtx Pwrs?

TSQR?

Van Rosendale,

1983 CG Monomial Polynomial No No

Leland, 1989 CG Monomial Polynomial No No

Walker, 1988 GMRES Monomial None No No

Chronopoulos and

Gear, 1989 CG Monomial None No No

Chronopoulos and

Kim, 1990 Orthomin, GMRES Monomial None No No

Chronopoulos,

1991 MINRES Monomial None No No

Kim and Chronopoulos,

1991

Symm. Lanczos,

Arnoldi

Monomial None No No

Sturler, 1991 GMRES Chebyshev None No No

Related Work: -step methods

(27)

Authors

KSM

Basis

Precond?

Mtx Pwrs?

TSQR?

Joubert and

Carey, 1992 GMRES Chebyshev No Yes* No

Chronopoulos

and Kim, 1992 Nonsymm. Lanczos Monomial No No No Bai, Hu, and

Reichel, 1991 GMRES Newton No No No

Erhel, 1995 GMRES Newton No No No

de Sturler and van

der Vorst, 2005 GMRES Chebyshev General No No

Toledo, 1995 CG Monomial Polynomial Yes* No

Chronopoulos and Swanson,

1990

CGR,

Orthomin Monomial No No No

Chronopoulos

and Kinkaid, 2001 Orthodir Monomial No No No

Related Work: -step methods

(28)

Current Work: Convergence Analysis of

Finite Precision CA-KSMs

28

𝐴 𝑍

𝑘, 𝑗

=

𝑍

𝑘, 𝑗

𝒯

𝑘, 𝑗

1

𝛼

𝑠𝑘+ 𝑗

𝑉

𝑘

𝑟

𝑘, 𝑗+1

𝑟

0

𝑒

𝑇𝑠𝑘+𝑗+1

+

𝚫

𝑘, 𝑗

where and

Based on analysis of Tong and Ye (2000) for classical BICG

Goal: bound the residual in terms of minimum polynomial of a perturbed matrix

multiplied by amplification factor

• Finite precision error analysis gives us Lanczos-type recurrence relation:

• Using this, we can obtain the bound

(29)

How can we use this bound?

Given matrix and basis parameters,

heuristically estimate

maximum

such that convergence rate is not significantly different

than classical KSM

Use known bounds on polynomial condition numbers

Experiments to give insight on where

extended precision

should

be used

Which quantities dominant bound, which are most important

to compute accurately?

Explain convergence despite rank-deficient basis

computed by

matrix powers (i.e., columns of are linearly dependent)

Tong and Ye show how to extend analysis for case where has

linearly dependent columns, explains convergence despite

rank-deficient Krylov subspace

29

(30)

30

Residual Replacement Strategy

In finite precision classical CG, in iteration , the computed residual and search direction vectors satisfy

^

𝑟𝑛= ^𝑟𝑛1−𝛼𝑛1 𝐴𝑝^𝑛1+𝜂𝑛 𝑝^𝑛= ^𝑟𝑛+ 𝛽𝑛𝑝^𝑛1+𝜏𝑛

Then in matrix form,

𝐴 𝑍𝑛=𝑍𝑛𝑇𝑛 1

𝛼𝑛

^

𝑟𝑛+1

𝑟^0

𝑒𝑛+1

𝑇

+𝐹𝑛 𝑍𝑛=

[

𝑟^0

𝑟^0

, … , ^

𝑟𝑛

𝑟^𝑛

]

with

where is invertible and tridiagonal, and , with

𝑓 𝑛= 𝐴 𝜏𝑛

𝑟^𝑛

+

1

𝛼𝑛

𝜂𝑛+1

𝑟^𝑛

𝛽𝑛 𝛼𝑛1

𝜂𝑛

𝑟^𝑛

Using the term derived for the CG recurrence with residual replacement, we know we can replace the residual without affecting convergence when is not too large relative to and (Van der Vorst and Ye, 1999).

(31)

31

Residual Replacement Strategy

In iteration of finite precision CA-CG, where was the last residual replacement step, we have

where

Assuming we have an estimate for , we can compute all quantities in the bound without asymptotically increasing comm. or comp.

^

(32)

Residual Replacement Algorithm

Based on the perturbed Lanczos recurrence (see, e.g. Van der Vorst and Ye, 1999), we should replace the residual when

where is a tolerance parameter, chosen as

We can write the pseudocode for residual replacement with group update:

if

break from inner loop end

where we set at the start of the algorithm.

(33)

Avoids communication:

• In serial, by exploiting temporal locality:

– Reading A

– Reading V := [x, Ax, A2x, …, Asx] • In parallel, by doing only 1 ‘expand’

phase (instead of s).

• Tridiagonal Example:

The Matrix Powers Kernel

33

Sequential

Parallel A3v

A2v

Av v

A3v

A2v

Av v

black = local elements

red = 1-level dependencies green = 2-level dependencies blue = 3-level dependencies Also works for

(34)

Current: Multigrid Bottom-Solve Application

AMR applications that use geometric multigrid method

Issues with coarsening require switch from MG to BICGSTAB at bottom of V-cycle

At scale, MPI_AllReduce() can dominate the runtime; required BICGSTAB iterations grows quickly, causing bottleneck

Proposed: replace with CA-BICGSTAB

to improve performance

Collaboration w/ Nick Knight and Sam Williams (FTG at LBL)

CA-BICGSTAB

bottom-solve

Currently ported to BoxLib, next: CHOMBO (both LBL AMR projects)

BoxLib applications: combustion (LMC), astrophysics (supernovae/black hole formation), cosmology (dark matter simulations), etc.

(35)

Current: Multigrid Bottom-Solve Application

35

Challenges

Need only orders reduction in residual (is monomial basis sufficient?)How do we handle breakdown conditions in CA-KSMs?

What optimizations are appropriate for matrix-free method?

Some solves are hard (100 iterations), some are easy (5 iterations)

Use “telescoping” start with s=1 in first outer iteration, increase up to max each outer iteration

– Cost of extra point-to-point amortized for hard problems

Preliminary results:

Hopper (1K^3, 24K cores): 3.5x in bottom-solve (1.3x total) Edison (512^3, 4K cores): 1.6x bottom-solve (1.1x total)

Proposed:

Demonstrate CA-KSM speedups in other 2+ other HPC applications

Want apps which demonstrate successful use of strategies for stability,

(36)

Referencias

Documento similar

In TREBOL, the upstream tree (i.e., the tree used in the forwarding of data packets from the vehicle to the Internet) is built and updated when each node learns about its parent

With the entry into force of the Web Accessibility Directive (2016), the updated Audiovisual Media Services Directive (2018), and the European Accessibility Act (2019), national

Variation of the residual WFE at the LGS Objective image plane (after removal of tip-tilt and focus) as a function of Zenith angle.. The temporal average of the

The meta-exon stores the coordinates of the exons and splice junctions helping HPG Aligner SA to tackle these difficult alignments. In order to do so, the meta- exon is updated at

In this section we analyse the robustness of bagging ensembles in the presence of different rates of label noise with respect to the bootstrap sampling proportion.. Two datasets

Thus, the creation of an unlearning context at time (t 0 ) will result in the absorption of new knowledge through the organizational learning process at a later time (t 1 ).. Thus,

Faculty/Department Faculty of Global Communication Specialized subject at Department of Global Communication Subject number 281201 Course title Japanese Popular Culture..

According to sieve analysis, it was determined that residual molding sand has a grain size that is within technical specifications for direct use in the manufacture