Thoughts on Numerical Integration

(1)

Thoughts on Numerical Integration

Thomas Hahn

Max-Planck-Institut für Physik München

(2)

Restrictions

Integration is a wide field. We will concentrate here on Riemann integrals of the form

I f :_“

ż ₁

0

d^dx f _p~x _q w_p~x _q .

The Weight Function w_p~x _q is generally known analytically and is used to absorb characteristics of the integrand which are difficult to treat otherwise, e.g. peaks or oscillatory behaviour.

(3)

Quadrature Formulas

Task: Find a Quadrature Formula

Q_n f :_“

n

ÿ

i_“1

w_i f_p~x_i_q

with Nodes (sampling points) ~x_i and Weights w_i.

Q_n f should approximate I f for a large class of functions with as small an Error E_n f as possible:

I f _“ Q_n f _` E_n f , E_n f “small.”

But: For a given Q_n, it is always possible to construct an f such that E_n f becomes arbitrarily large!

(4)

Interpolatory Formulas (1 D)

Idea: Approximate f by a polynomial p^pⁿ^´¹^q and integrate the latter. Works as far as f is well approximated by polynomials.

We impose that p^pⁿ^´¹^q interpolates f at n given points x_i: p^pⁿ^´¹^q_px_i_q _“^! f_px_i_q , i _“ 1, . . . , n .

The polynomial thus specified is unique and can explicitly be given in terms of Lagrange Polynomials ℓ_n_´_1,i^:

p^pⁿ^´¹^q_px_{q “}

n

ÿ

i_“1

f _px_i_qℓ^p_iⁿ^´¹^q_px_q , where ℓ^p_iⁿ^´¹^q_px_q _“ ^ź

j_‰i

x _´ x_j x_i _´ x_j . By construction, ℓ^p_iⁿ^´¹^q_px_j_q _“ δ_{i j} .

(5)

Interpolatory Formulas (1 D)

Weights:

I p^pⁿ^´¹^q _“

ż ₁

0

dx w_px_q p^pⁿ^´¹^q_px_q

“

n

ÿ

i_“1

f_px_i_q ż ₁

0

dx w_px_q ℓ^p_iⁿ^´¹^q_px_q loooooooooooomoooooooooooon

w_i

.

From the practical point of view, the weight function w_px_q must be chosen such that these integrals can be computed.

The Degree of Q_n is the degree of the highest polynomial integrated exactly by Q_n.

(6)

Newton–Cotes (Rectangle) Rules (1 D)

The simplest case: take equidistant x_i. For w_px_{q “} 1 the lowest-order rules are:

Open rules: x_i “ _n_`ⁱ ₁, e.g.

Q₁ f “ fp¹₂q, Q₂ f “ ¹₂`

fp¹₃q ` fp²₃q˘ , Q₃ f “ ¹₃`

2fp¹₄q ´ fp¹₂q ` 2 fp³₄q˘ .

Closed rules: x_i “ _nⁱ^´_´¹₁, e.g.

Q₂ f “ ¹₂`

fp0q ` fp1q˘ , Q₃ f “ ¹₆`

fp0q ` 4fp¹₂q ` fp1q˘ , Q₄ f “ ¹₈`

fp0q ` 3fp¹₃q ` 3fp²₃q ` fp1q˘ .

By construction, deg Q_n _ě n _´ 1, but cannot generally be

expected to be larger than n _´ 1 because the x_i are prescribed and only the n w_i have been determined.

The Newton–Cotes rules are not as powerful as the Gauss rules, but can be compounded easily, e.g. in Romberg Rules.

Also, extrapolation algorithms exist.

(7)

Gauss Rules (1 D)

Choose the x_i such that Q_n has the highest possible degree.

With 2n degrees of freedom (n x_i and n w_i), we ought to achieve deg Q_n _“ 2n _´ 1.

By Euclid’s GCD algorithm we write

p^p²ⁿ^´¹^q _“ q^pⁿ^q r^pⁿ^´¹^q _` s^pⁿ^´¹^q , q^pⁿ^q_px_{q “}

n

ź

i_“1

px _´ x_i_q I p^p²ⁿ^´¹^q _“

ż

q^pⁿ^q r^pⁿ^´¹^q _` ż

s^pⁿ^´¹^q

“

n

ÿ

i_“1

w_i q^pⁿ^q_px_i_q lo omo on

“0

r^pⁿ^´¹^q_px_i_{q `}

n

ÿ

i_“1

w_i s^pⁿ^´¹^q_px_i_{q `} E_np^p²ⁿ^´¹^q

If the boxed term were zero, the error term would vanish!

(8)

Gauss Rules (1 D)

ż

q^pⁿ^q r^pⁿ^´¹^q _{“ x}q^pⁿ^q_|r^pⁿ^´¹^q_{y “} 0 _ô q^pⁿ^q _K r^pⁿ^´¹^q

”Functional scalar product: xa|by “ ż ₁

0

dx wpxq apxqbpxq.^ı

Choose Orthogonal Polynomials for the q^pⁿ^q, then

r^pⁿ^´¹^q _“

n_´1

ÿ

i_“0

xq^pⁱ^q_|r^pⁿ^´¹^q_y q^pⁱ^q and _xq^pⁿ^q_|r^pⁿ^´¹^q_{y “} 0 because _xq^pⁿ^q_|q^pⁱ^ăⁿ^q_{y “} 0 . Since q^pⁿ^q_px_{q “} ^śⁿ_i_“₁_px _´ x_i_q, the x_i are just the zeros of the orthogonal polynomials q^pⁿ^q !

(9)

Gauss Rules (1 D)

Gauss rules for particular weight functions have special names, hinting at the orthogonal polynomials used:

w_px_{q “} 1 Gauss–Legendre Rules w_px_{q “} 1_{^a1 _´ x² Gauss–Chebychev Rules w_px_{q “} x^αe^´^x Gauss–Laguerre Rules w_px_{q “} e^´^x² Gauss–Hermite Rules w_px_{q “ p}1 _´ x_q^α_p1 _` x_q^β Gauss–Jacobi Rules

(10)

Error Estimate and Embedded Rules (1 D)

So far we have no error estimate for our integration rule.

Idea: Compare the results of two rules, Q_n and Q_n_`_m.

Use Embedded Rules with _tx_i_u_iⁿ_“₁ _{Ă t}x_i_uⁿ_i_“^`₁^m for economy.

But: e.g. Gauss rules of different n have no common x_i. Seek to add new points x_n_`₁, . . . , x_n_`_m such that Q_n_`_m is

exact for polynomial of maximum degree expectable from #df, p^pⁿ^`^2m^´¹^q. This leads to

ż ₁

0

dx w_px_q v^pⁿ^q_px_q q^p^m^q_px_q r^p^m^´¹^q_px_{q “} 0

with v^pⁿ^q_px_{q “} ^śⁿ_i_“₁_px _´ x_i_q , q^p^m^q_px_{q “} ^ś_i^m_“₁_px _´ x_n_`_i_q . Not all w_px_q have solutions since a scalar product with weight function w_px_q v^pⁿ^q_px_q cannot in general be defined.

(11)

Embedded Rules and Null Rules

Some common Embedded Rules in d _“ 1 are:

‚ Gauss–Kronrod Rules add one point between every two points of a Gauss rule R_n so that R_n _Ă R_n_`_n_`₁, e.g.

0 1

‚ Patterson Rules add points on top of the Gauss–Kronrod rules, e.g. there exists a set of rules

R₁ _Ă R₃ _Ă R₅ _Ă R₇ _Ă R₁₅ _Ă R₃₁ _Ă R₆₃ _Ă R₁₂₇ _Ă R₂₅₅. A Null Rule N_m is associated with an integration rule Q_n

(usually m _ă n) and is engineered to give zero for all functions that are integrated exactly by Q_n.

N_m measures the “higher terms” of the integrand that are not exactly integrated by Q_n and is also used for error estimation.

(12)

Adaptive Algorithms

With an error estimate available, adaptiveness can easily be implemented:

‚ Integrate the entire region: I_tot _˘ E_tot.

‚ while E_tot _ą max_pε_relI_tot,ε_abs_q

‚ Find the region r with the largest error.

‚ Bisect (or otherwise cut up) r.

‚ Integrate each subregion of r separately.

‚ I_tot _“ ^ř I_i, E_tot _“ ^b^ř E²_i .

‚ end while

Examples: QUADPACK’s QAG, CUBA’s Cuhre, Suave.

(13)

Extrapolation

Extrapolation can be used to accelerate convergence if the functional behaviour of the error term is known.

Example: for the Newton–Cotes formulas the error can be shown to vary with the spacing h of the nodes as

Q_n f _´ I f _“ c₂h² _` c₄h⁴ _` . . . , h _“ 1 n .

Idea: Compute Q_n f for different h and extrapolate to h _“ 0. Use Richardson’s Extrapolation to eliminate k powers of h:

T_m^k :_“ 4^kT_m^k^´_`¹₁ _´ T_m^k^´¹

4^k _´ 1 , T_m⁰ :_“ Q₂_m_´₁ f .

T₁⁰ Œ T₂⁰ Ñ T₁¹

Œ Œ

T₃⁰ Ñ T₂¹ Ñ T₁²

Œ Œ

... . ..

These are known as Romberg Formulas.

(14)

Curse of Dimension

Imagine computing the volume of the d-dim.

sphere S_d by integrating its characteristic function χ _“ θ_p1 _{´ }}x_}₂_q inside the surrounding hy- percube C_d _{“ r´}1, 1_s^d.

χ “ 1

χ “ 0

The following table gives the ratio of the volumes:

d 2 5 10 50 100

Vol S_d

Vol C_d .785 .164 .0025 1.5 _ˆ 10^´²⁸ 1.9 _ˆ 10^´⁷⁰ This ratio can in a sense be thought of as the chance that a general-purpose integrator will find the sphere at all!

(15)

Product Formulas

Easiest method: Iterate one-dimensional rules, e.g.

ż ₁

0

d³x f_p~x _{q “}

n_x

ÿ

i_“1 n_y

ÿ

j_“1 n_z

ÿ

k_“1

w^p_i^x^qw^p_j^y^qw^p_k^z^q f_px^p_i^x^q, x^p_j^y^q, x^p_k^z^q_q .

But: the number of samples increases as N _“ ^ś_i^d_“₁ n_i _„ n^d. Consider e.g. the Newton–Cotes rules, where the error term is O_p_h²_{q “} O_p_n^´²_q. Convergence is thus:

Q_n f _´ I f _“ O_p_N^´²^{^d_q _.

Even for moderate dimensions (say d _Á 5), this convergence rate is usually too slow to be useful.

(16)

Construction of Polynomial Rules

Select orthogonal basis of functions _tb₁, . . . , b_m_u (usually monomials) with which most f can (hopefully) be

approximated sufficiently and impose that each b_i be integrated exactly by Q_n:

I b_i _“^! Q_nb_i _ô

n

ÿ

k_“1

w_kb_i_p~x_k_{q “}

ż ₁

0

d^dx w_p~x _q b_i_p~x _q .

These are m Moment Equations for nd _` n unknowns ~x_i, w_i, and a formidable, in general nonlinear, system of equations.

Additional assumptions (e.g. Symmetries) are often necessary to solve this system. If a unique solution exists, Q_n is an

Interpolatory Rule.

Example: the Genz–Malik rules used in CUBA’s Cuhre.

(17)

Monte Carlo Methods

Idea: Interpret f as a Random Variable and estimate I f by the Statistical Average over independent, identically

distributed samples ~x_i _{P r}0, 1_s^d I f _« M_n f _“ 1

n

ÿ

i_“1

f_p~x_i_q _pw_p~x _{q “} 1 here_q . The Standard Deviation is a probabilistic estimate of the integration error:

σ_pM_n f_{q “} b

M_n_pf ²_{q ´ p}M_n f_q² .

From σ_pM_n f _{q “} ^σ^?^p_n^f^q, convergence is M_n f _´ I f _“ O_p_n^´¹^{²_q_. Not particularly fast, but independent of the dimension d !

(18)

Variance Reduction

Variance Reduction = Methods for accelerating convergence.

In Importance Sampling one introduces a weight function:

I f _“

ż ₁

0

d^dx w_p~x _q f _p~x _q

w_p~x _q , w_p~x _{q ą} 0 , I w _“ 1 .

‚ One must be able to sample from the distribution w_p~x _q,

‚ f_{w should be “smooth,” such that σ_w_p f_{w_{q ă} σ_p f_q, e.g. w and f should have the same peak structure.

The ideal choice is w_p~x _{q “ |}f _p~x _q|{If which has σ_w_p f_{w_{q “} 0. Example: Vegas uses a piecewise constant weight function which is successively refined, thus coming closer to _| f_p~x _q|{I f .

(19)

Variance Reduction

Stratified Sampling works by sampling subregions. Consider:

n samples in n_a “ n{2 samples in r_a, total region r_a ` r_b n_b “ n{2 samples in r_b Integral I f « M_n f I f « ¹₂pM_n^a

{2 f ` M^b_n

{2 f q

Variance ^σ_n² ^f ¹₄

ˆσ_a²_f

n{2 ` ^σ_n^b²_{₂^f

˙

“ _2n¹ `

σ_a² f `σ_b² f˘

` “ _2n¹ `

σ_a² f `σ_b² f˘

4n1

`I_a f ´ I_b f ˘₂

The optimal reduction of variance is for n_a_{n_b _“ σ_a f _{σ_b f.

But: splitting each dimension causes a 2^d increase in regions!

Example: Miser uses Recursive Stratified Sampling.

(20)

Variance Reduction

Importance Sampling and Stratified Sampling are complementary:

‚ Importance Sampling puts most points where the magnitude of the integrand _| f_| is largest,

‚ Stratified Sampling puts most points where the variance of f is largest.

(21)

Number-Theoretic Methods

The basis for the number-theoretical formulas is the Koksma–Hlawka Inequality:

The error of every Q_n f _“ ¹_n ^ř_iⁿ_“₁ f _p~x_i_q is bounded by

|Q_n f _´ I f _{| ď} V_p f_q D^˚_p~x₁, . . . ,~x_n_q .

where V is the “Variation in the sense of Hardy and Krause”

and D^˚ is the Discrepancy of the sequence ~x₁, . . . ,~x_n,

D^˚_p~x₁, . . . ,~x_n_{q “} sup

r _{P r}0,1_s^d

ˇ ˇ ˇ ˇ

ν_pr_q

n ^´ Vol r ˇ ˇ ˇ ˇ ,

where ν_pr_q counts the ~x_i that fall into r. For an Equidistributed Sequence, ν_pr_q should be proportional to Vol r.

(22)

Low-Discrepancy Sequences and Quasi-Monte Carlo

Cannot do much about V_pf _q, but can sample with

Low-Discrepancy Sequences a.k.a. Quasi-Random Numbers which have discrepancies significantly below the

pseudo-random numbers used in ordinary Monte Carlo, e.g.

‚ Halton Sequences,

‚ Sobol Sequences,

‚ Faure Sequences.

These Quasi-Monte Carlo Methods typically achieve

convergence rates of O_p_log^d^´¹ _n_{_n_q which are much better than the O_p₁_{^?_n_q of ordinary Monte Carlo.

Example: CUBA’s Vegas and Suave use Sobol sequences.

(23)

Comparison of Sequences

Mersenne Twister Sobol

Pseudo-Random Numbers Quasi-Random Numbers

n = 3000 n = 4000

n = 1000 n = 2000

n = 3000 n = 4000

n = 1000 n = 2000

O_p₁_{^?_n_q O_p_log^d^´¹ _n_{_n_q

(24)

Lattice Methods

Lattice Methods require a periodic integrand, usually obtained by applying a Periodizing Transformation (e.g. x _Ñ 3x² _´ 2x³).

Sampling is done on an Integration Lattice L spanned by a carefully selected integer vector ~z:

Q_n f _“ 1 n

n_´1

ÿ

i_“0

f ^`_t_nⁱ~z _u^˘ , _tx_{u “} fractional part of x . Construction principle for ~z: knock out as many low-order

“Bragg reflections” as possible in the error term:

Q_n f _´ I f _“ ^ÿ

~_k_PZ^d

f˜_p~_k _q _Q_n_e²^πⁱ^~^k^¨^~^x _´ f˜_p~₀ _{q “} ^ÿ

~_k_P_LK,~_k_‰~₀

f˜_p~_k _q _,

where L^K _{“ t}~_k _P Z^d ^:~_k _¨~z _“ 0 _pmod n_qu is the Reciprocal Lattice. Method: extensive computer searches.

(25)

Machine Learning Methods

From arXiv:1707.00028:

Algorithm # of Func. Evals σ^w/ < w > σI/I (2e6 add. evts)

VEGAS 300,000 2.820 ±2.0×10⁻³

Foam 3,855,289 0.319 ±2.3×10⁻⁴

Generative BDT 300,000 0.082 ±5.8×10⁻⁵

Generative BDT (staged) 300,000 0.077 ±5.4×10⁻⁵

Generative DNN 294,912 0.083 ±5.9×10⁻⁵

Generative DNN (staged) 294,912 0.030 ±2.1×10⁻⁵

‚ gains impressive, but: only one

integrand, maximally unsuited for Vegas,

‚ speedup only for variance reduction.

E

N variance reduction

gain statistics

<rant> “Machine Learning” = ignorantly applying a black box (preferably named TensorFlow) to one’s problem in the hope of getting a solution

without really understanding how. </rant>

(26)

Routines in the C

^UBA

Library

Routine Basic method Type Variance reduction Vegas Sobol sample Monte Carlo importance sampling

Suave Sobol sample Monte Carlo globally adaptive subdivision

Divonne Korobov sample Monte Carlo stratified sampling,

or Sobol sample Monte Carlo aided by methods from or cubature rules deterministic numerical optimization

Cuhre cubature rules deterministic globally adaptive subdivision

‚ Very similar invocation (easily interchangeable)

‚ Fortran, C/C++, Mathematica interface provided

‚ Can integrate vector integrands

(27)

Vegas Overview

‚ Monte Carlo algorithm.

‚ Variance reduction: importance sampling.

Ź Iteratively build up a piecewise constant weight function on a rectangular grid.

Margin Sums

Grid minus

linear progression

The grid shows the progression along the respective axis. Progression is slow (i.e. many points are sampled) where the grid’s value is small.

Ź Each iteration consists of a sampling step followed by a refinement of the grid.

(28)

Suave Overview

‚ Monte Carlo algorithm.

‚ Variance reduction: Vegas-style importance sampling combined with globally adaptive subdivision.

Ź Until the requested accuracy is reached, bisect the region with the largest error along the axis in which the fluctuations of the integrand are reduced most.

Ź Prorate the number of new samples in each half for its fluctuation.

(29)

Divonne Overview

‚ Monte Carlo algorithm (+ cubature rules for comparison).

‚ Variance reduction: Stratified sampling.

Ź PHASE 1 – Partitioning

Ź For each subregion, ‘actively’ determine sup f and inf f using methods from numerical optimization.

‚ Move ‘dividers’ around until all subregions have approximately equal spread, defined as

Spreadprq “ 1

2 Volprq´ sup

~xPr

fp~xq ´inf

~xPr fp~xq¯ .

Ź PHASE 2 – Sampling: Sample the subregions independently with the same number of points each. The latter is extrapolated from the results of Phase 1.

Ź PHASE 3 – Refinement: Further subdivide or sample again if results from Phase 1 and 2 do not agree within their error.

(30)

Cuhre Overview

‚ Deterministic algorithm (uses cubature rules of polynomial degree).

‚ Variance reduction: Globally adaptive subdivision.

Ź Until the requested accuracy is reached, bisect the region with the largest error along the axis with the largest fourth difference.

(31)

Mathematica Interface

‚ Used almost like NIntegrate.

‚ The integrand is evaluated completely in Mathematica.

Can do things like

Cuhre[Zeta[x y], {x,2,3}, {y,4,5}]

Mathematica Vegas[ f^,...]

integrand f

(compiled function)

C

void Vegas(...)

request samples MathLink

t~x₁,~x₂, . . . _u tf₁, f₂, . . . _u

(32)

Brainstorming (in lieu of a summary)

1. Physics

‚ How high is “high precision”?

‚ Need in final result? Need in subtractions?

‚ Perhaps try different subtraction scheme (N-jettiness?).

(33)

Brainstorming (in lieu of a summary)

2. Methodology

‚ Maybe only one (or few) dimensions of integration ‘hard’:

Try to do those by 1D methods, e.g.

‚ Weight function absorbs main characteristics (Gauss–XY rules).

‚ Extrapolation methods.

‚ Special algorithms for e.g. oscillatory integrands.

‚ Point out extremal points of the integrand, e.g. in Divonne.

‚ Combine methods, e.g. Lattice Rules “on top of” other methods.

‚ Explore machine-learning methods further.

(34)

Brainstorming (in lieu of a summary)

3. Technology

‚ Parallelize/distribute (CPU, GPU, clusters).

‚ Gauge precision based on (estimate of) central value of each Feynman diagram, i.e. do not waste time on

unimportant parts.

‚ Eventually compute a grid to be used in actual calculations (Monte Carlos etc.).