Optimisation and Performance - Numerical solutions of the general relativistic equations for bl

In any large parallel computer code, it is important to ensure that the best performance possible is being extracted from the hardware. Various tools are available for profiling codes in order to see where most processing time is being consumed. In this project, much use has been made of Valgrind [79], which has the facility to pinpoint the specific lines of source code that are taking the most time.

We note that “Premature optimisation is the root of all evil” [64], but in this case it was necessary to optimize the code before it was fully tested in order that the tests themselves might be completed sufficiently quickly.

Some initial optimizations included using a few lines of code written specially to invert a symmetric 3-by-3 matrix, instead of relying on a more versatile, but more computationally expensive, library routine. It was also necessary to reduce the amount of inter-processor communication being done by the A++/P++ library. This was done by using direct memory-access pointers made available by the library, rather than other more robust, but slower, routines, to access elements of an array of variables distributed across many processors. We also confirmed that each processor could perform operations such as a flux-evolution update on its local grid segment without communicating with another processor, which improved the parallel efficiency of the code. For example, before performing a flux-update calculation, we first compute all relevant derivatives of the grid mappings, which may require inter-process communication; the geometry information is then stored on the processor where it will be needed, so that no further inter-processor communication is required for the flux update.
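As an illustration of the kind of special-purpose routine described above, the following is a minimal sketch of a closed-form inverse for a symmetric 3-by-3 matrix using cofactors. It is written in Python for brevity and is not the code used in this work; the function name `invert_symmetric_3x3` and the example matrix are hypothetical.

```python
def invert_symmetric_3x3(a11, a12, a13, a22, a23, a33):
    """Invert a symmetric 3x3 matrix given its six independent entries.

    Returns the six independent entries of the inverse (also symmetric),
    in the same order: (b11, b12, b13, b22, b23, b33).
    """
    # Cofactors of the matrix; symmetry means only six are distinct.
    c11 = a22 * a33 - a23 * a23
    c12 = a13 * a23 - a12 * a33
    c13 = a12 * a23 - a13 * a22
    c22 = a11 * a33 - a13 * a13
    c23 = a12 * a13 - a11 * a23
    c33 = a11 * a22 - a12 * a12

    # Determinant by expansion along the first row.
    det = a11 * c11 + a12 * c12 + a13 * c13
    if det == 0.0:
        raise ZeroDivisionError("matrix is singular")

    return tuple(c / det for c in (c11, c12, c13, c22, c23, c33))


# Hypothetical example: the matrix [[2, 1, 0], [1, 2, 0], [0, 0, 1]].
inverse = invert_symmetric_3x3(2.0, 1.0, 0.0, 2.0, 0.0, 1.0)
print(inverse)
```

Exploiting symmetry means only six cofactors are evaluated, and no pivoting or general-purpose factorisation is involved, which is where the saving over a generic library routine would be expected to come from.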

| Procedure | Time (s) | Proportion of one time-step |
|---|---|---|
| Initial grid generation | 17.63 | 79.0% |
| Initial overlap calculation | 7.90 | 35.4% |
| Initial data | 4.75 | 21.3% |
| Solution output | 0.78 | 3.5% |
| Single time-step | 22.32 | 100% |
| Calculate ∆t | 2.84 | 12.7% |
| Advance solution - total time | 19.48 | 87.3% |
| Boundary conditions | 2.39 | 10.7% |
| RK2 update | 0.95 | 4.3% |
| Flux update | 14.77 | 66.2% |
| Interpolate | 1.34 | 6.0% |

Table 6.1: Profiling run for a three-dimensional spherical grid with medium resolution. The run was done on two nodes of the supercomputer, using four CPUs per node. Caching and communication overheads may therefore be significant. Note that the first few operations are not performed every time-step, and so their times are not included in the total for a single time-step.

A small improvement could perhaps have been made if the A++/P++ arrays had been stored in memory using row-major rather than column-major ordering, since extracting the variables associated with a single grid point required accessing widely separated memory locations, with the substantial number of cache misses that implies. Profiling finally suggested that there were no obvious places left to reduce run-time, and at this point more effort was put into perfecting the algorithm rather than speeding up its implementation.
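The effect of storage order on memory locality can be sketched with a short calculation of linear offsets. The example assumes, purely for illustration, an array indexed as (i, j, k, variable) with a hypothetical shape of 4 × 4 × 4 cells and 5 variables; the actual index order and sizes used by the library are not specified here.

```python
def column_major_offset(index, shape):
    """Linear offset when the FIRST index varies fastest (Fortran order)."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


def row_major_offset(index, shape):
    """Linear offset when the LAST index varies fastest (C order)."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


# Hypothetical layout: (i, j, k, variable) with 4x4x4 cells and 5 variables.
shape = (4, 4, 4, 5)
cell = (1, 2, 3)

# Distance in memory between variable 0 and variable 1 of the same cell.
column_major_stride = (column_major_offset(cell + (1,), shape)
                       - column_major_offset(cell + (0,), shape))
row_major_stride = (row_major_offset(cell + (1,), shape)
                    - row_major_offset(cell + (0,), shape))

print(column_major_stride, row_major_stride)
```

Under these assumptions, consecutive variables of one cell are a whole grid's worth of elements apart in column-major order but adjacent in row-major order, which is the locality difference the text refers to.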

We used the Intel® C/C++ compiler for our final runs, using the compile parameters:

-O3 -ipo -xhost -fp-model precise -fp-model source,

these having been chosen to provide the best speed and accuracy. We note that, presumably due to expression rearrangements performed by the optimizer, the -fp-model options were necessary to retain the numerical accuracy of an unoptimized build when using -O3 optimization. This was evident in some tests of recovering primitive variables from conserved variables for velocities close to the speed of light. We have used double-precision arithmetic throughout, as it is likely that the numerical methods, particularly the primitive variable recovery, will need this level of accuracy.
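The sensitivity to expression rearrangement arises because floating-point addition is not associative. The following minimal sketch, with values chosen purely for illustration and unrelated to the actual primitive-variable recovery, shows two mathematically equivalent groupings giving different double-precision results.

```python
# Illustrative values only: two large terms that cancel, plus a small one.
a = 1.0e16
b = -1.0e16
c = 0.5

# Grouping as written: the large terms cancel exactly, then c is added.
as_written = (a + b) + c

# Regrouped: c is absorbed when added to b (it is below the spacing of
# doubles near 1e16), so the information is lost before the cancellation.
regrouped = a + (b + c)

print(as_written, regrouped)
```

An optimizer that is allowed to reassociate terms may pick either grouping, so results can differ between optimization levels unless value-safe floating-point semantics are requested.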

We performed a profiling run of our code to see where most of the computational time was spent. This run was done for the base model UB1 on a medium-resolution grid, as described in Table 7.2. The timings are shown in Table 6.1.

We see that the flux calculations are by far the most computationally expensive part of the code. The next most expensive part is the calculation of the time-step, which requires a change of coordinate basis to obtain the correct time-step on a curvilinear grid; this accounts for its relatively large contribution to the CPU time.

| Nodes × processors per node | Time (s) | Speed-up | Efficiency |
|---|---|---|---|
| 1×1 | 31254.9 | 1.00 | 1.000 |
| 1×2 | 17441.9 | 1.79 | 0.896 |
| 2×1 | 17262.6 | 1.81 | 0.905 |
| 1×4 | 11247.6 | 2.78 | 0.695 |
| 2×2 | 10257.8 | 3.05 | 0.762 |
| 2×4 | 7130.31 | 4.38 | 0.550 |

Table 6.2: Speed-up statistics for the code. This was a three-dimensional run using the low-resolution grid over 1000 time steps. The speed-up is the ratio of the time taken for a serial run to that taken to run the code in parallel. The efficiency is the speed-up divided by the number of processors.

The interpolation is also fairly time consuming, because we use fifth-order Lagrange interpolation, which requires information from 125 cells (a 5 × 5 × 5 stencil) per interpolation point. Due to the column-major ordering used by Overture, there will be substantial caching overheads involved here. We have also interpolated the metric values as well as the fluid values, which adds some unnecessary overhead. The RK2 update is entirely local to each cell and requires no geometric data to evaluate, and so takes the least time.
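To illustrate where the figure of 125 cells comes from, the following sketch builds a tensor-product Lagrange interpolant from five nodes in each of three dimensions. It assumes a uniform, axis-aligned stencil and uses hypothetical helper names; it is not the interpolation routine used in this work.

```python
from itertools import product


def lagrange_weights(nodes, x):
    """One-dimensional Lagrange basis weights at x for the given nodes."""
    weights = []
    for j, xj in enumerate(nodes):
        w = 1.0
        for m, xm in enumerate(nodes):
            if m != j:
                w *= (x - xm) / (xj - xm)
        weights.append(w)
    return weights


def interpolate_3d(values, nodes, point):
    """Tensor-product interpolation using the same nodes in each direction."""
    wx, wy, wz = (lagrange_weights(nodes, p) for p in point)
    total = 0.0
    for i, j, k in product(range(len(nodes)), repeat=3):
        total += wx[i] * wy[j] * wz[k] * values[i][j][k]
    return total


# Hypothetical stencil: five equally spaced nodes per direction.
nodes = [0.0, 1.0, 2.0, 3.0, 4.0]
stencil_size = len(nodes) ** 3  # cells read per interpolation point

# Sample data from a simple test function f(x, y, z) = x + 2y + 3z.
values = [[[x + 2.0 * y + 3.0 * z for z in nodes] for y in nodes] for x in nodes]
result = interpolate_3d(values, nodes, (1.5, 2.25, 0.75))

print(stencil_size, result)
```

Every interpolation point touches all of the stencil values, so the cost per point scales with the cube of the number of nodes per direction, and the memory accesses span several grid planes.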

We also tested our code to ascertain how well it had been parallelised. We ran exactly the same model parameters in each case on various combinations of processors. The super-computer setup² is such that there are 4 CPUs per node, with 8GB of RAM per node, so that a reasonably sized problem could be run even on only 1 processor, since it had 8GB available to it. We ran our simulation for 1000 time steps, and recorded the total time for the program to run. The timings can be seen in Table 6.2.
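For reference, the speed-up and efficiency figures can be reproduced from the raw timings with a few lines of Python. The sketch below takes the speed-up as the serial time divided by the parallel time, and the efficiency as the speed-up divided by the total number of processes; the function name `scaling` is hypothetical.

```python
# Wall-clock times in seconds, keyed by (nodes, processes per node).
timings = {
    (1, 1): 31254.9,
    (1, 2): 17441.9,
    (2, 1): 17262.6,
    (1, 4): 11247.6,
    (2, 2): 10257.8,
    (2, 4): 7130.31,
}


def scaling(timings):
    """Return {(nodes, per_node): (speed_up, efficiency)}."""
    serial_time = timings[(1, 1)]
    results = {}
    for (nodes, per_node), elapsed in timings.items():
        processes = nodes * per_node
        speed_up = serial_time / elapsed
        results[(nodes, per_node)] = (speed_up, speed_up / processes)
    return results


results = scaling(timings)
for (nodes, per_node), (speed_up, efficiency) in results.items():
    print(f"{nodes}x{per_node}: speed-up {speed_up:.2f}, efficiency {efficiency:.3f}")
```

With these inputs the 2×4 case evaluates to an efficiency of roughly 0.548, marginally below the tabulated 0.550, presumably a rounding difference.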

The speed-up from one to two processes is encouraging, and suggests a reasonable level of parallelisation. However, the subsequent speed-up for four processes on one node is less so. It is likely that this is due to the supercomputer's architecture. Each node consists of two dual-core processors, so that, when we go from one to two processes, each process sits on a different processor, with separate caches. However, four processes require that two processes sit on each processor, which results in a smaller performance gain due to clashes between the cache requirements of each core. This is particularly noticeable when comparing four processes on one node (efficiency 0.695) with two processes on each of two nodes (efficiency 0.762): the caching problems outweigh any communication overhead between nodes.

² We used the Darwin supercomputer of the University of Cambridge High Performance Computing Service. Details of its architecture can be found at http://www.hpc.cam.ac.uk/services/darwin.html.

However, for the grid resolutions we have used, the efficiencies and timings demonstrated here are sufficient to allow us to perform the validation and studies we need in an acceptable length of time, and therefore we do not seek any further optimizations.

6.3 AMR

Our AMR routines work in parallel, and maintain the convergence and accuracy shown in §5.2.2. However, the efficiency is not sufficient to give any appreciable performance gain over running on a uni-grid resolution. We therefore have not used AMR in any of our subsequent testing.

The times taken for a run including AMR are shown in Table 6.3. These were generated for a simulation of UB1 on a grid of slightly lower resolution than the lowest given in Table 7.2, with n = 30, and with a single level of refinement by a factor of two, making it equivalent to the high-resolution grid with n = 60. The code was run on two nodes with four processes on each, so caching overheads are probably significant here.

For the time-step that we used for these timings, the coarsest level contained 470 862 cells and the refined level contained 988 828 cells. Since the refined level is advanced through two sub-steps for every step of the coarse level, this would suggest that advancing the refined level would take approximately

\[
\frac{2 \times 988\,828}{2 \times 988\,828 + 470\,862} \approx 80.8\% \tag{6.1}
\]

of the time, which is very close to the actual value found (80.3%; see Table 6.3), suggesting that AMR does not add overhead disproportionately to different refinement levels.
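The estimate in (6.1) can be checked with a short calculation. The sketch below assumes that the work per level is proportional to the number of cells multiplied by the number of sub-steps taken, with the refined level taking two sub-steps per coarse step; the function name is hypothetical.

```python
def refined_level_fraction(coarse_cells, fine_cells, sub_steps=2):
    """Estimated share of a time-step spent advancing the refined level.

    Assumes cost is proportional to (cells) x (sub-steps), with the
    coarse level taking one step and the refined level `sub_steps`.
    """
    fine_work = sub_steps * fine_cells
    return fine_work / (fine_work + coarse_cells)


fraction = refined_level_fraction(coarse_cells=470862, fine_cells=988828)
print(f"{100.0 * fraction:.1f}%")
```

This simple cell-count model ignores regridding and inter-level interpolation, so close agreement with the measured share indicates that the per-cell cost is similar on both levels.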

We see that the interpolation at level 1 is very time consuming, taking almost as long as the flux update. The interpolation contributes much more to the CPU time than for a uni-grid run; this is presumably due to the extra effort required in checking the interpolation points for refined grid levels.

| Procedure | Time (s) | Proportion of one time-step |
|---|---|---|
| Initial grid generation | 3.30 | 8.0% |
| Initial overlap calculation | 1.96 | 4.7% |
| Initial data | 1.42 | 3.4% |
| Solution output | 1.08 | 2.6% |
| AMR regrid | 31.57 | 76.3% |
| Single time-step | 41.36 | 100% |
| Calculate ∆t | 1.81 | 4.4% |
| Level 0 - total time | 6.07 | 14.7% |
| Boundary conditions | 0.91 | 2.2% |
| RK2 update | 0.35 | 0.8% |
| Flux update | 3.93 | 9.5% |
| Interpolate level 0 | 0.50 | 1.2% |
| Store level 1 boundary cells | 0.38 | 0.9% |
| Level 1 - total time | 33.21 | 80.3% |
| Boundary conditions | 0.15 | 0.4% |
| RK2 update | 1.16 | 2.8% |
| Flux update | 18.23 | 44.1% |
| Interpolate level 1 | 13.67 | 33.1% |
| Interpolate level 0 cells from level 1 | 0.27 | 0.7% |

Table 6.3: Profiling statistics for a three-dimensional run including AMR. The first few operations are not performed every time-step and so their times are not included in that for a single step. Also, although we do two RK2 updates per level, these are all included in a single time, and both level 1 sub-steps are included in the level 1 update time.

In document Numerical solutions of the general relativistic equations for black hole fluid dynamics (pages 131-135)