The multi-threaded placement penalties defined by Equation 5.8 require a scaling factor fn to be determined. These values are obtained from specific thread and memory placement timing experiments in which the requisite data are local to every thread, for both serial and parallel computation. This required a modified version of Gaussian in which each thread uses a node-local copy of both the Density and Fock matrices. (The Density matrix is normally shared by all threads.)
Results for these timing experiments are given in Table 5.9. This table gives the average speedup ‘Sn’ and its standard deviation (σ) for calculations with 2, 4 and 8 threads. Results for PRISM, PRISMC and CALDFT for Exp 1 – 4 are given. The results were obtained for computations run on Nodes 2, 3, 4 and 5 (cf. Figure 5.2) of the SunFire X4600 M2, with a single core per node. For the 8 thread results, all 8 nodes were used, but jobs were executed twice. The first time this was done using threads 1 – 4 running on Nodes 2 – 5. The second time, threads 5 – 8 were run on Nodes 2 – 5. Speedups were obtained by measuring the time taken for each thread. By default, dynamic load balancing of work is used in PRISM, PRISMC and CALDFT. For the purpose of this work a static load balancing scheme was used in order to avoid skewing fn due to work shifting between threads.

Table 5.9: The average speedup Sn and its standard deviation (σ) for n-thread calculations with ideal local access to both Density and Fock matrices/blocks.

n-thread        PRISM        PRISMC       CALDFT       PRISM        PRISMC       CALDFT
                from Exp 1   from Exp 2   from Exp 2   from Exp 3   from Exp 4   from Exp 4
n = 2    S2     1.941        1.848        2.004        1.979        1.906        2.022
         σ      0.001        0.001        0.001        0.001        0.006        0.220
n = 4    S4     3.863        3.367        4.008        3.929        3.700        3.979
         σ      0.022        0.010        0.009        0.005        0.027        0.018
n = 8    S8     7.548        6.088        7.477        7.728        7.176        7.799
         σ      0.002        0.073        0.030        0.009        0.079        0.018
For the PRISM and PRISMC subroutines, which compute ERIs, the speedups given in Table 5.9 are less than n. This is a result of replicated work being done in these routines, e.g. all threads need to calculate ERI quantities relating to pairs of basis functions. The standard deviation (σ) is small, indicating that static load balancing is good. The CALDFT routine in some cases gives a slight superlinear speedup, and for the 2 thread case in Exp 4 there is a large standard deviation (σ = 0.220 in Table 5.9). The latter indicates that static load balancing is poor, and this was confirmed to be the case. (Recall that dynamic load balancing was deliberately disabled, as noted previously.)
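One way the Sn and σ entries of Table 5.9 could be formed from per-thread timings is sketched below. It rests on assumptions not stated in the text: that each thread's speedup is the serial time divided by that thread's own elapsed time, and that σ is the population standard deviation over the threads. The function name and the timings are hypothetical and are not measurements from this work.

```python
from statistics import mean, pstdev


def speedup_stats(t_serial, thread_times):
    """Return (average speedup, standard deviation) over the threads.

    Assumption: the speedup seen by a thread is t_serial / t_thread, so a
    perfectly balanced n-thread run gives n for every thread and sigma = 0.
    """
    speedups = [t_serial / t for t in thread_times]
    return mean(speedups), pstdev(speedups)


# Hypothetical timings (arbitrary units), not values from Table 5.9.
balanced = speedup_stats(1000.0, [500.0, 500.0])
imbalanced = speedup_stats(100.0, [50.0, 40.0])
print(balanced, imbalanced)
```

Under these assumptions a large σ signals that threads finish at different times, which is how the text interprets the CALDFT 2 thread result.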
5.5.3.1 Two Threads, Single Core Thread Assignment
Using the values of ‘fn’ presented in Table 5.9 we can now derive execution times for multi-threaded Gaussian calculations with specific thread and memory placement by using Equation 5.8. In this section we consider the case where only a single core is used at each node.
We consider first the case of using two threads executing on Nodes 2 and 5 with 1 thread per node. Timing predictions for various threads are given in Table 5.10.
The table is divided into two sections corresponding to thread 1 and thread 2. Variations in hops are given in columns hD and hF. The measured execution time for each thread is given for the 4 experiments and different routines. In addition, the percentage error of the extended LPM prediction is reported. Where a placement experiment could be performed in more than one way, each variant was timed and the lowest value is reported.

Table 5.10: Modelling error in percent for 2-thread calculations performed at each NUMA level using fn, for single core thread assignment

                 Exp 1             Exp 2                               Exp 3             Exp 4
                 PRISM             PRISMC            CALDFT            PRISM             PRISMC            CALDFT
Thread  hD  hF   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%
1        0   0         973  -0.2         316  -0.1         485  -0.2         543  -0.2         257  -0.2         255  -8.7
         0   1         993  -0.3         324  -1.0         520  -0.1         552  -0.2         261  -0.9         273  -8.6
         0   2        1016  -0.6         332  -1.9         544  -0.8         563   0.2         265  -1.5         286  -9.0
2        1   0        1001  -0.5         320  -0.6         482   0.7         557  -0.6         257  -0.2         220   6.2
         2   0        1042  -1.0         324  -1.0         484   1.9         573  -0.5         258  -0.4         221   6.3
         1   1        1020  -0.5         328  -1.6         516   0.9         565  -0.4         261  -0.7         234   7.1
         2   1        1058  -0.7         331  -1.8         517   1.4         582  -0.5         262  -0.8         234   7.6
         1   2        1045  -0.9         336  -2.3         538   0.8         575  -0.2         265  -1.4         244   7.0
         2   2        1087  -1.5         342  -3.4         546  -0.2         595  -0.7         268  -1.7         247   6.1
In all experiments thread 1 was bound to Node 2 and the shared Density matrix was placed in MEM2. Thus the values for hD are always zero for thread 1. The Fock matrices for the two threads were allocated on any of the following Nodes: 2, 3, 4, 5. For thread 1 the Fock matrix is varied to be 1 or 2 hops away by allocating it on Node 5 or on Node 3 or 4, respectively. This corresponds to entries (0,1) and (0,2) for thread 1. For thread 2, the Density matrix can only be 1 or 2 hops away, depending on whether thread 2 is running on Node 3, 4 or 5. This gives rise to six possible entries in Table 5.10.
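The mapping from Fock matrix placement to the thread 1 entries can be sketched as follows. The hop counts from Node 2 are those stated above (Node 5 at 1 hop, Nodes 3 and 4 at 2 hops). The dictionary and function names are illustrative only, and distances between other node pairs are left out because they are not given here.

```python
# Hop distance from Node 2 (where thread 1 runs) to each candidate memory node,
# as described in the text: Node 5 is 1 hop away, Nodes 3 and 4 are 2 hops away.
HOPS_FROM_NODE2 = {2: 0, 5: 1, 3: 2, 4: 2}


def thread1_numa_levels(density_node=2):
    """Group Fock-matrix placements for thread 1 by NUMA level (hD, hF)."""
    h_d = HOPS_FROM_NODE2[density_node]  # 0 when the Density matrix is in MEM2
    levels = {}
    for fock_node, h_f in HOPS_FROM_NODE2.items():
        levels.setdefault((h_d, h_f), []).append(fock_node)
    return levels


levels = thread1_numa_levels()
for level, nodes in levels.items():
    print(level, "Fock matrix on node(s)", nodes)
```

Level (0,2) can be reached with the Fock matrix on either Node 3 or Node 4, which is an instance of a placement that can be performed in more than one way.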
Examining the errors across all the experiments we find that, apart from CALDFT in Exp 4, which is known to have load balancing issues, the errors obtained by use of the extended LPM and fn are below 2% in magnitude in all but two cases (-2.3% and -3.4%, both for thread 2 of PRISMC in Exp 2).
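The Err% columns of Table 5.10 can be summarised per routine with a few lines of Python. The error values below are copied from the table. The helper `predicted_time` is a hypothetical addition that assumes the sign convention Err% = 100 (TPredicted - TMeasured) / TMeasured, which is not stated explicitly in the text.

```python
# Err% values from Table 5.10, in row order: thread 1 (3 rows), thread 2 (6 rows).
ERR = {
    "PRISM Exp 1":  [-0.2, -0.3, -0.6, -0.5, -1.0, -0.5, -0.7, -0.9, -1.5],
    "PRISMC Exp 2": [-0.1, -1.0, -1.9, -0.6, -1.0, -1.6, -1.8, -2.3, -3.4],
    "CALDFT Exp 2": [-0.2, -0.1, -0.8, 0.7, 1.9, 0.9, 1.4, 0.8, -0.2],
    "PRISM Exp 3":  [-0.2, -0.2, 0.2, -0.6, -0.5, -0.4, -0.5, -0.2, -0.7],
    "PRISMC Exp 4": [-0.2, -0.9, -1.5, -0.2, -0.4, -0.7, -0.8, -1.4, -1.7],
    "CALDFT Exp 4": [-8.7, -8.6, -9.0, 6.2, 6.3, 7.1, 7.6, 7.0, 6.1],
}


def max_abs_error(errors):
    """Largest modelling error in magnitude, in percent."""
    return max(abs(e) for e in errors)


def predicted_time(t_measured, err_percent):
    """Recover the predicted time under the assumed sign convention."""
    return t_measured * (1.0 + err_percent / 100.0)


worst = {name: max_abs_error(values) for name, values in ERR.items()}
for name, value in worst.items():
    print(f"{name}: {value:.1f}%")
```

Listing the worst case per routine separates the CALDFT Exp 4 outlier from the other five columns.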
5.5.3.2 Four and Eight Threads, Single Core Thread Assignment
In this sub-section we extend the 2 thread experiments detailed above to 4 and 8 threads but still use single core thread assignment.
This time the modelling errors are presented graphically in Figure 5.5. This figure is composed of 6 plots, with results for PRISM, PRISMC and CALDFT ordered by row. The figures are labelled (a) to (f), where each corresponds to the following: (a) PRISM from Exp 1; (b) PRISMC from Exp 2; (c) CALDFT from Exp 2; (d) PRISM from Exp 3; (e) PRISMC from Exp 4 and (f) CALDFT from Exp 4.
Each sub-plot in Figure 5.5 has error bars to indicate the difference in thread timings as a result of load imbalance between threads. The x-axis is the number of hops for the Density and Fock matrices (hD,hF) and the y-axis denotes modelling error (expressed as a percentage).
Figure 5.5: Modelling error for 4, 8 thread calculations at each NUMA level corresponding to: (a) PRISM from Exp 1; (b) PRISMC from Exp 2; (c) CALDFT from Exp 2; (d) PRISM from Exp 3; (e) PRISMC from Exp 4 and (f) CALDFT from Exp 4
[Figure 5.5 plots, panels (a) to (f). In each panel the x-axis is NUMA Level (hD, hF), running over the nine levels from (0,0) to (2,2), and the y-axis is Modelling Error (%), running from -25 to 5. Each panel shows two series: 4 Threads and 8 Threads.]
Table 5.11: Modelling error in percent for 2 thread calculations using fn and dual-core thread assignment

                 Exp 1             Exp 2                               Exp 3             Exp 4
                 PRISM             PRISMC            CALDFT            PRISM             PRISMC            CALDFT
Thread  hD  hF   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%   TMeasured  Err%
1        0   0         981  -1.0         320  -1.3         320  -3.1         545  -0.5         258  -0.7         261  -8.9
         0   1        1004  -1.4         329  -2.3         329  -2.9         555  -0.7         263  -1.4         280  -8.8
         0   2        1030  -1.9         340  -4.3         340  -5.7         568  -0.9         268  -2.4         299 -11.1
2        0   0         982  -1.5         321  -1.6         321  -2.1         546  -0.7         257  -0.5         226   5.2
         0   1        1002  -1.2         327  -1.9         327  -0.5         554  -0.5         261  -0.9         238   7.3
         0   2        1024  -1.4         336  -3.0         336  -1.9         564  -0.3         266  -1.5         250   6.2

Three common characteristics are evident in the plots. First, the majority of predicted differences are negative. This is due to interconnect contention not being explicitly included in Equation 5.8. Second, the modelling error is much smaller for 4 threads than for 8 threads. This reflects the fact that interconnect contention increases with the number of threads. Third, the modelling errors increase slightly with larger NUMA levels. This is to be expected, as the model attempts to predict performance at greater hop counts using execution times obtained at lower hop counts.
Overall, the majority of predictions reproduce elapsed times to within 5%. The largest differences are seen for the PRISMC 8 thread results from Exp 2, indicating that interconnect contention is limiting parallel performance.