We next measured the reduction in total computation time and component times for PGAS-FMM when running on different numbers of threads of a single shared-memory node. Figure 5.10(a) shows multithreaded scaling of the different components of PGAS-FMM on a single quad-core Sandy Bridge node for the largest system used in §5.3.2.1 (10⁶ particles) with p = 6 (RMS force error ≈ 1×10⁻⁴). Figure 5.10(b) presents the same data in terms of parallel efficiency, showing the total thread time (elapsed time × number of threads).
The total time reduces from 19.2 s on a single thread to 5.07 s on 8 threads. On a single thread, the largest components are the ‘downward pass’, which includes near-field interactions, and the multipole-to-local transformations for the V-list. For these components, Figure 5.10(a) shows a significant reduction in computation time from 1 to 4 threads, while Figure 5.10(b) shows a slight increase in total thread time, reflecting imperfect load balancing due to differences in task size. There is a further reduction in computation time for 8 threads, where hyper-threading is used to schedule eight threads on four physical cores. Hyper-threading does, however, increase the total thread time, as threads compete for resources. Further experiments (not shown) found no additional performance improvement above 8 threads.
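The quantities discussed above can be reproduced from the two total times quoted in the text. The following is a minimal sketch, not part of PGAS-FMM; the function name `scaling_metrics` is introduced here purely for illustration, and efficiency is computed per thread rather than per physical core.

```python
def scaling_metrics(t1, tn, n_threads):
    """Return (speedup, parallel efficiency, total thread time) for one run.

    t1 is the elapsed time on one thread, tn the elapsed time on n_threads.
    """
    speedup = t1 / tn
    efficiency = speedup / n_threads
    total_thread_time = tn * n_threads  # elapsed time x number of threads
    return speedup, efficiency, total_thread_time


# Total times quoted in the text: 19.2 s on 1 thread, 5.07 s on 8 threads.
speedup, efficiency, thread_time = scaling_metrics(19.2, 5.07, 8)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}, "
      f"total thread time {thread_time:.2f} s")
# prints "speedup 3.79x, efficiency 47%, total thread time 40.56 s"
```

Because the 8 threads share four physical cores, the per-thread efficiency understates how well the physical cores are used; the total thread time is the quantity plotted in Figure 5.10(b).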
The locality of work stealing applied to activities over the FMM tree can be visualized using the approach that was presented in §3.1.2. Figure 5.11 shows the mapping from activity to worker thread for a slice through the simulation space at
§5.3 Fast Multipole Method
[Figure 5.10, panel (a) Scaling: time (s) against number of threads (1, 2, 4, 8), with curves for linear scaling, total, downward, m2l, upward and tree. Panel (b) Efficiency: total thread time (s) against number of threads (1, 2, 4, 8), with series for downward, m2l, upward and tree.]
Figure 5.10: Multithreaded component scaling and efficiency of PGAS-FMM on Core i7-2600 (1-8 threads, n = 10⁶, p = 6, e = 10⁻⁴).
the lowest level of a uniform tree of Dmax = 6 levels. There is one activity for each box, so this slice of 64 × 64 boxes represents 4096 activities. Each worker processes a few contiguous regions of boxes, with the minimum extent of a region in any dimension being 4. Therefore, although work stealing permits fine-grained load balancing of activities between workers, in this application the overall effect is a coarse-grained division of the simulation space, with good locality between the activities processed by each worker.
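The activity count for the slice follows directly from the tree geometry. As a small sketch, assuming a uniform octree in which each level halves the box side length in every dimension (the helper names below are illustrative and are not taken from PGAS-FMM):

```python
def boxes_per_side(level):
    """Number of boxes along one dimension at the given tree level."""
    return 2 ** level


def boxes_in_slice(level):
    """Number of boxes in a planar slice one box thick (e.g. fixed x index)."""
    return boxes_per_side(level) ** 2


def boxes_at_level(level):
    """Total number of boxes at the given level of a uniform octree."""
    return 8 ** level


# Dmax = 6, as given in the caption of Figure 5.11.
level = 6
print(boxes_per_side(level), boxes_in_slice(level), boxes_at_level(level))
# prints "64 4096 262144"
```

With one activity per leaf box, the slice shown in Figure 5.11 therefore covers 4096 of the 262,144 leaf-level activities.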
5.3.2.4 Distributed-Memory Scaling
To evaluate distributed scaling, we measured the time for a PGAS-FMM force calculation for 1,000,000 particles using different numbers of nodes of Raijin. The maximum tree depth for this problem size is 5 (32,768 boxes at the lowest level), and p = 6 terms were used in expansions for a force error of approximately 10⁻⁴. Strong scaling experiments were also conducted on the Watson 2Q Blue Gene/Q system. Blue Gene/Q represents a different system balance to the Raijin Sandy Bridge/IB cluster, with a greater relative performance of the communication subsystem compared to floating-point computation [Haring et al., 2012]. It also has substantially greater levels of parallelism; a single BG/Q compute node may execute up to 64 hardware threads (on 16 4-way SMP cores).
Figure 5.12 shows the strong scaling measured on Raijin.
Total computation time is shown along with the time for each of the major components for 1 to 128 places. Total time reduces from 3.6 s on a single place (8 cores)
[Figure 5.11: map of leaf boxes in the slice; axes are box z index and box y index, each running from 0 to 64.]
Figure 5.11: Locality of activity-worker mapping for FMM force evaluation on Core i7-2600 (leaf boxes at x = 3, n = 10⁶, Dmax = 6, X10_NTHREADS=4). Activities executed by each worker thread are shown in a different color.
[Figure 5.12: time (s) against number of places (1 to 128), with curves for linear scaling, total, downward, m2l, upward, prefetch and tree.]
Figure 5.12: Strong scaling of FMM force calculation on Raijin (8 cores per place, n = 10⁶, p = 6).
to 0.19 s on 128 places (1024 cores). Parallel efficiency reduces gradually due to poor scaling of the upward pass, which requires communication and synchronization between each place and its neighbors, including the time to send multipole expansions to neighboring places. Figure 5.12 shows an additional component: the time to prefetch the particle data required for near-field interactions at each place. This is insignificant below 32 places but grows to become the second-largest component of the runtime on 128 places.
Figure 5.13 shows the strong scaling measured on Watson 2Q.
[Figure 5.13: time (s) against number of places (1 to 512), with curves for linear scaling, total, downward, m2l, upward, prefetch and tree.]
Figure 5.13: Strong scaling of FMM force calculation on Watson 2Q (16 cores per place, n = 10⁶, p = 6).
For one place (16 cores), Watson 2Q takes 19.5 s in total, which reduces to 0.28 s on 512 places (8192 cores). The time for a single place (16 cores) is about 4.8 times as long as for a single place on Raijin (8 cores). Per core, Watson 2Q is therefore more than 9 times slower than Raijin. The computation overall scales better on Watson 2Q than it does on Raijin. This reflects the relatively higher performance of communication over the torus network, which makes communication-intensive components like the upward pass relatively cheaper on BG/Q. In addition, key collective operations used in tree construction and the upward pass (see §5.3.1.3) are hardware accelerated.
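The end-to-end speedup and parallel efficiency implied by the quoted totals can be worked out as follows. This is an illustrative sketch only: the function `strong_scaling` is a hypothetical helper, and the inputs are the rounded total times given in the text rather than the full data behind Figures 5.12 and 5.13.

```python
def strong_scaling(t_base, t_scaled, places_base, places_scaled):
    """Return (speedup, parallel efficiency) relative to the base run."""
    speedup = t_base / t_scaled
    ideal_speedup = places_scaled / places_base
    return speedup, speedup / ideal_speedup


# (time on 1 place, time on N places, N), using the totals quoted in the text.
runs = {
    "Raijin": (3.6, 0.19, 128),
    "Watson 2Q": (19.5, 0.28, 512),
}
results = {}
for system, (t1, tn, places) in runs.items():
    results[system] = strong_scaling(t1, tn, 1, places)
    speedup, efficiency = results[system]
    print(f"{system}: speedup {speedup:.1f}x on {places} places, "
          f"efficiency {efficiency:.1%}")

# Per-core comparison: one Watson 2Q place (16 cores) takes about 4.8 times as
# long as one Raijin place (8 cores), so per core the factor doubles.
per_core_factor = 4.8 * 16 / 8
print(f"per-core slowdown factor: {per_core_factor:.1f}")
```

Note that the two efficiencies are taken at different place counts (128 and 512), so they are not directly comparable; comparing the systems at equal place counts requires the per-point data plotted in the figures.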
As previously mentioned, an attractive feature of FMM in comparison to particle-mesh methods is that it uses localized as opposed to all-to-all communication patterns. Figure 5.14(a) is a heatmap of the MPI pairwise communications with 64 processes on Raijin for a single FMM force calculation, profiled using IPM. Darker areas on the map indicate larger volumes of communications between processes. The heatmap demonstrates locality in the communication pattern, with large amounts of data exchanged between neighboring processes and very little between more distant processes. A noticeable feature of the communication topology is the two strong off-diagonal lines. These represent global tree-structured collective communications (broadcast and all-reduce), which are used in tree construction.⁶ Figure 5.14(b) shows the communication topology for tree construction alone. Comparing the two figures, it is apparent that the communication pattern of FMM evaluation excluding tree construction is fractal-structured and mostly on-diagonal (between neighboring processes).
(a) Complete FMM force calculation (b) Tree construction only
Figure 5.14: Map of MPI communications between 64 processes for FMM force calculation on Raijin.
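One simple way to quantify the locality visible in such a heatmap is the fraction of communication volume that falls within a narrow band around the diagonal of the pairwise communication matrix. The sketch below is an illustration under stated assumptions, not the analysis used to produce Figure 5.14: the function `near_diagonal_fraction` is hypothetical, the 6-process matrix is made-up example data, and the measure only reflects spatial locality if process ranks are assigned so that nearby ranks own nearby regions of the simulation space.

```python
def near_diagonal_fraction(volume, bandwidth):
    """Fraction of off-diagonal volume exchanged between ranks i and j
    with 0 < |i - j| <= bandwidth.

    volume[i][j] is the amount of data sent from rank i to rank j.
    """
    total = 0
    near = 0
    for i, row in enumerate(volume):
        for j, v in enumerate(row):
            if i == j:
                continue  # ignore self-communication
            total += v
            if abs(i - j) <= bandwidth:
                near += v
    return near / total if total else 0.0


# Made-up example (not measured data): heavy traffic between adjacent ranks,
# light traffic between all other pairs.
n = 6
volume = [[0 if i == j else (100 if abs(i - j) == 1 else 1)
           for j in range(n)]
          for i in range(n)]
fraction = near_diagonal_fraction(volume, bandwidth=1)
print(f"{fraction:.3f}")
```

A value close to 1 indicates a predominantly nearest-neighbor pattern, whereas a uniform all-to-all exchange over n processes would give only 2(n - 1) / (n(n - 1)) = 2/n for a bandwidth of 1.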