3. LA PROPUESTA
3.3. Desarrollo de los Objetivos
3.3.1. Analizar la Balanza Comercial Ecuador – Los Países Bajos en el
This section analyses the performance results obtained during the first phase of the PGAS experiments.
Communications Bu↵er & Array-sections Approaches
The results from the experiments with the PGAS versions, which employ either the communications bu↵er or array-sections data exchange approaches, are shown in Figure 5.1. These charts show the positive e↵ect which employing communications bu↵ers can have, particularly at high node counts, on both the Spruce and Archer platforms. In the 2,048 node experiments on the Spruce plat-
form the OpenSHMEM version, which employs communication bu↵ers, achieved an average of 278.14 iterations/sec. An improvement of 1.2-1.3⇥ over the equivalent array-section based approaches, which achieved 224.31 and 209.33 iterations/sec. The OpenSHMEM and CAF results from Archer also exhibit a similar pattern, at 2,048 nodes (49,152 cores) the communications bu↵er based OpenSHMEM version achieved 197.49 iterations/sec. Compared to the equiv- alent array-section based approaches which achieved only 159.23 and 163.24 respectively, an improvement of up to 1.24⇥. The CAF-based versions exhibit a significantly larger performance disparity, with the communication bu↵ers approach achieving 3.4⇥the performance of the array-section based approach, the results show that these achieved 68.04 and 19.95 iterations/sec respectively in these experiments.
This demonstrates that the performance of applications, implemented within either the OpenSHMEM or CAF PGAS models, can be significantly improved through the aggregation of communication data into larger transmission bu↵ers, rather than moving data directly from its original memory locations using considerably larger volumes of smaller messages and potentially strided memory operations.
Co-array Object Selection Options
These results also show (Figure 5.1) the performance improvement delivered by moving the data field definitions from within the original Fortranderived data type, which was originally defined as aco-array, to be individual top-level data structures, each separately defined as co-array objects. This optimisation (la- beledFTL) improves the performance of the CAF array-section based approach by 3.39⇥ (from 19.95 to 67.70 iterations/sec) at 2,048 nodes on Archer. It also enabled the array-section based approach to deliver equivalent performance to the communications-bu↵er based approach in the 1,024 and 2,048 node experiments, and to slightly exceed it in the 64 to 512 nodes cases.
To ascertain the cause of this performance disparity a detailed inspection of the intermediate code representations, produced by the Cray compiler, was conducted. This indicated that this disparity is due to the compiler having to make conservative assumptions regarding the calculation of the remote addresses of theco-array objects on the remote images. For each remote“put” operation within theFTL version of the code, the compiler produces a single loop block containing one pgas memput nb and one pgas sync nb operation. In the original array-exchange version, however, the compiler generates three addi- tional pgas get nband pgas sync nboperations prior to the loop containing the“put”operation, with a further set of these operations within the actual loop
64 128 256 512 1024 2048 0 50 100 150 200 nodes iterations / sec MPI
Shmem buffers shmemwait CAF buffers Archer (Cray XC30) 64 128 256 512 1024 2048 0 100 200 300 nodes iterations / sec MPI
Shmem buffers shmemwait
Spruce (SGI ICE-X)
Figure 5.2: Equivalent MPI, OpenSHMEM and CAF performance
and an additional nested loop block containing a pgas put nbioperation. Unfortunately the precise functionality of each of these operations is not clear, as Cray does not publish this information. This analysis, however, appears to indicate that the compiler is forced to insert additional“get” operations due to the extra complexity (i.e. the additional levels of indirection involved) of the original data structures. These additional operations are required to retrieve the memory addresses from the remote images, to which a particular image should write the required data to, despite these addresses remaining constant throughout the execution of the program. The creation of an additional compiler directive may therefore prove to be useful here, as it would enable developers to inform the compiler that the original data structure remains constant, and therefore allow it to make less conservative decisions during code generation. PGAS and MPI Performance Comparison
Figure 5.2 presents the results from the experiments conducted to assess the performance of the PGAS implementations relative to equivalent MPI-based versions. These charts document a significantly di↵erent performance trend on the two system architectures examined here. The performance recorded on Spruce from both the OpenSHMEM and MPI implementations is virtually identical at all the node counts examined (64 to 2,048 nodes), reaching 278.14 and 276.49 iterations/sec respectively on 2,048 nodes. On Archer, however, the
64 128 256 512 1024 2048 0 50 100 150 200 nodes iterations /
sec Shmem buffers global Shmem buffers shmemwait CAF buffers global CAF buffers Archer (Cray XC30) 64 128 256 512 1024 2048 0 100 200 300 nodes iterations / sec
Shmem buffers global Shmem buffers shmemwait
Spruce (SGI ICE-X)
Figure 5.3: Local & global synchronisation approaches
performance of the two PGAS versions is not able to match that of the equivalent MPI implementation, with the performance disparity widening as the scale of the experiments is increased. The OpenSHMEM implementation delivers the closest levels of performance to the MPI implementation and also significantly outperforms the CAF-based implementation. The results show it achieving 197.49 iterations/sec on 2,048 nodes compared to 230.08 iterations/sec for the MPI implementation, an improvement of 1.17⇥. The CAF implementation, however, only delivers 68.04 iterations/sec on 2,048 nodes a slowdown of 2.9⇥
relative to the equivalent OpenSHMEM implementation.
Synchronisation Approaches
To assess the e↵ect of employing either theglobalorpoint-to-point synchronisa- tion constructs on the performance of the PGAS versions, the results obtained from experiments on both platforms involving versions which employed the com- munications bu↵er data exchange approach together with either synchronisation construct, were analysed. The OpenSHMEM versions examined here utilised the shmemwait approach to implement the point-to-point synchronisation op- erations. Figure 5.3 provides a performance comparison of the results obtained from the experiments with each of these versions.
On both platforms it is clear that employingpoint-to-point synchronisation can deliver significant performance benefits, particularly as the scale of the
64 128 256 512 1024 2048 0 50 100 150 200 nodes iterations /
sec buffers shmemwait dc mf buffers shmemwait dc mf quiet buffers volatilevars dc mf buffers volatilevars dc mf quiet
Archer (Cray XC30) 64 128 256 512 1024 2048 0 100 200 300 nodes iterations /
sec buffers shmemwait dc mf buffers shmemwait dc mf quiet buffers volatilevars dc mf buffers volatilevars dc mf quiet
Spruce (SGI ICE-X)
Figure 5.4: SHMEM volatile variables & fence/quiet optimisations
experiments is increased. At 64 nodes there is relatively little di↵erence between the performance of each version. On Spruce (1,280 cores) both OpenSHMEM implementations achieve 9.13 and 8.90 iterations/sec respectively, whilst on Archer (1,536 cores) the point-to-point synchronisation versions of the Open- SHMEM and CAF implementations achieve 9.84 and 7.73 iterations/sec respec- tively. Compared to the equivalent global synchronisation versions which each achieve 9.34 and 7.31 iterations/sec respectively. At 2,048 nodes (40,960 cores) on Spruce the performance disparity between the two OpenSHMEM versions increases to 278.13 and 159.97 iterations/sec respectively, a di↵erence of ap- proximately 1.74⇥. On Archer, however, the performance disparity between the OpenSHMEM versions is even greater reaching 2.10⇥in the 2,048 node (49,152 cores) experiments, 197.49 and 93.91 iterations/sec were recorded respectively. Interestingly the CAF-based versions do not exhibit the same performance di↵erences, with the point-to-point synchronisation version achieving only a 1.29⇥improvement (68.04 and 52.84 iterations/sec respectively). This indicates that the performance of the CAF-based versions—which is significantly less than the OpenSHMEM-based versions—is limited by another factor and therefore the choice of synchronisation construct has a reduced, but still significant, e↵ect on overall application performance.
64 128 256 512 1024 2048 20 40 60 80 nodes iterations /
sec CAF buffers dc mf CAF buffers dc mf defer CAF buffers dc mf overlap CAF buffers dc mf overlap defer
Archer (Cray XC30)
Figure 5.5: CAFpgas defer syncconstruct & communication overlap
Remote Memory Operation Ordering Constructs
The performance results obtained from several alternative versions of the Open- SHMEM implementation, on both the Cray and SGI platforms, are presented in Figure 5.4. These charts compare versions which employ either the shmemwait
or volatile variables synchronisation techniques and either the quiet or fence
remote memory operation ordering constructs. All versions examined in this chart employ a communications bu↵er-based approach to data exchange, as well as implementing diagonal communications (Section 4.2.1) between logical neighbouring processes (denoted by the acronymdc within their descriptions). They also exchange multiple data fields simultaneously (Section 4.2.1), which is indicated by the acronym mf within their descriptions. It is evident from the charts that in these experiments the choice of each of these implementation approaches has no significant e↵ect on overall performance. The results from both platforms show very little variation in the number of application iterations achieved per second as the scales of the experiments are increased. Although the Cray results do show some small variations in the higher node count experi- ments, this is likely to be due to the e↵ects of system noise arising from the use of a non-dedicated system.
Figure 5.5 documents the results obtained on the Archer platform from experiments with the CAF versions which employ the proprietary Cray pgas
defer sync directives and the optimisations to enable communication oper-
ations to be overlapped with computation. The chart presents these results together with an equivalent CAF-based version which does not utilise any of these constructs. This shows that in these experiments the overall performance of CloverLeaf is not significantly a↵ected (beneficially or detrimentally) by either of these potential optimisations techniques, as the performance of all four versions is virtually identical in each of the examined cases.
128 256 512 1024 2048 0 50 100 150 200 250 nodes iterations / sec MPI
Shmem buffers shmemwait Shmem buffers shmemwait nb Shmem buffers shmemwait 4MHP CAF buffers
CAF buffers FTL
Archer (Cray XC30)
Figure 5.6: SHMEM non-blocking, huge-pages & CAF FTL