Following the analysis documented in Section 6.3.1, the most performant OpenCL-based version of CloverLeaf on each architecture was used to conduct an inter-architecture performance comparison on single-node instances of each processor type. This enabled the performance of the OpenCL programming model to be objectively assessed across multiple different architectures and also relative to the native programming models for those platforms. In these experiments, optimised OpenMP and CUDA versions of the application served as the native programming models on the CPU and Nvidia GPU architectures respectively.
Device                               OpenCL (s)   Native (s)   Speedup (%)
Tesla K20X                                14.95        13.77         -7.89
Xeon E3-2620 ×2                           60.66        52.67        -13.17
Xeon Phi 7120P (2 threads/core)           58.80        57.03         -3.10
Xeon Phi 7120P (3 threads/core)           58.80        58.79         -0.01
Xeon Phi 7120P (4 threads/core)           58.80        66.45         11.51
Opteron 6272                             179.78       233.97         30.14

Table 6.6: Runtime of the OpenCL implementation for the 3,840² problem
These experiments examined the performance (total application wall-time) of the codebase on the Nvidia Tesla K20X, Intel Xeon E3-2620, Intel Xeon Phi 7120P, AMD Opteron 6272, AMD A10-5800K and AMD HD-7660D architectures. The Shannon, Tuck, Chilean Pine and Teller platforms were utilised to achieve this architectural coverage (see Section A.1 for more details). The 960² and 3,840² cell problems from the standard CloverLeaf benchmarking suite were again utilised and executed for 2,955 and 87 timesteps respectively. Tables 6.6 and 6.7 present the results obtained from the experiments with the 3,840² and 960² cell problem classes respectively. The approximate memory usage of the 960² cell problem is 500MB, which means that it fits within the available memory on all of the devices employed in this study. The 3,840² problem class, however, consumes approximately 5GB of main memory, preventing it from being examined on the AMD A10-5800K and AMD HD-7660D architectures.
The native programming model experiments on the Xeon Phi 7120P platform utilised OpenMP in the “offloading” mode configuration and examined the effect on performance of varying the total number of threads as well as the number of threads employed per processing core. The results obtained from the Opteron 6272 architecture were derived from experiments which employed 8 OpenMP threads, i.e. they utilised one thread per floating-point unit within the CPU. Similarly, the experiments on the Xeon E3-2620 architecture utilised OpenMP across both processor sockets and employed one thread per processor core (i.e. the Intel Hyper-Threads within the CPU were not utilised).
The results show that for the 3,840² cell problem class, the OpenCL implementation on the Nvidia K20X architecture is not able to match the optimised CUDA version, delivering a 7.89% slowdown in relative performance. In the experiments with the 960² cell problem class, however, the OpenCL version actually delivered a performance improvement of 1.64% over the native CUDA implementation. This performance discrepancy is likely due to the fact that the local work-group size auto-tuning optimisations were not implemented within the native CUDA version. Collectively, however, both results demonstrate that the OpenCL programming model is able to provide broadly equivalent performance to CUDA on processing architectures of this type.

Device                               OpenCL (s)   Native (s)   Speedup (%)
Tesla K20X                                35.88        36.48          1.64
Xeon E3-2620 ×2                          166.68       132.77        -20.34
Xeon Phi 7120P (2 threads/core)          224.47       664.63         66.22
Opteron 6272                              16.47        13.76        -16.42
Trinity A10-5800K                        947.08       627.06        -51.03
Trinity HD-7660D                         678.26            -             -

Table 6.7: Runtime of the OpenCL implementation for the 960² problem
On the Intel Xeon E3-2620 dual CPU architecture, the OpenCL implementation is 13.17% and 20.34% slower than the optimised OpenMP version for the 3,840² and 960² cell problem classes respectively. In the experiments on the AMD Opteron 6272 CPU architecture, however, the OpenCL implementation delivered superior performance to the OpenMP programming model for the 3,840² cell problem class, achieving a speedup of 30.14%. For the 960² cell problem class, by contrast, the OpenCL implementation is approximately 16.42% slower than the native OpenMP implementation.
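The percentages quoted for the Xeon E3-2620 and for the Opteron 6272 on the 3,840² problem can be checked with a short calculation. The following is a minimal Python sketch, not the method used to produce the tables: it assumes the relative difference is normalised by the OpenCL runtime, which reproduces these particular values but is not stated in the text, and the function name `relative_speedup` is hypothetical.

```python
def relative_speedup(opencl_s: float, native_s: float) -> float:
    """Percentage difference between the native and OpenCL wall-times.

    Assumption (not stated in the text): the difference is normalised by
    the OpenCL runtime. A positive value means OpenCL is the faster version.
    """
    return (native_s - opencl_s) / opencl_s * 100.0


# (OpenCL seconds, native seconds) pairs copied from Tables 6.6 and 6.7.
cases = {
    "Xeon E3-2620 x2, 3,840^2": (60.66, 52.67),
    "Xeon E3-2620 x2, 960^2": (166.68, 132.77),
    "Opteron 6272, 3,840^2": (179.78, 233.97),
}

results = {name: relative_speedup(o, n) for name, (o, n) in cases.items()}

for name, pct in results.items():
    print(f"{name}: {pct:+.2f}%")
```

Under this assumed normalisation, a negative result corresponds to the OpenCL version being slower than the native OpenMP version.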
The experimental results from the Xeon Phi 7120P platform show significant variations when different numbers of OpenMP threads are utilised per processing core. In the experiments with the 3,840² cell problem class, utilising two threads per processor core was the most performant configuration, delivering performance improvements of 14.17% and 2.99% relative to the four and three threads per core configurations respectively. On this platform the OpenCL implementation was able to broadly match the performance of the OpenMP version for this problem class. Its performance was only 3.10% slower than that of the OpenMP version in the two threads per core experiment, and the performance of both versions was almost identical (within 0.01%) in the three threads per core case. Relative to the OpenMP version with four threads per core, however, the OpenCL implementation delivered a performance improvement of 11.51%. It is not clear how many hardware threads the OpenCL implementation actually utilises; however, these results demonstrate that significant performance benefits could potentially be obtained by restricting their use. In the experiments with the 960² cell problem class, the OpenCL implementation delivered a significant performance advantage of 66.22% (2.96×) relative to the OpenMP version. This result, together with the observation that the Xeon Phi trails the K20X architecture by a wider margin for the smaller 960² cell problem class (6.3×) than for the larger 3,840² cell problem class (3.9×), indicates that the Xeon Phi is less effective at processing problem configurations with smaller mesh sizes.

[Figure 6.5: Buffer packing strong scaling performance (960² cell problem). Platform: Titan (Cray XK7). Axes: nodes (1 to 128) against wall-time (secs). Series: Explicit Buffer Packing, Native Functions.]
The OpenCL implementation was the only version able to execute on the HD-7660D part of the AMD Trinity APU. Although the performance of the 960² cell problem class on this architecture was 1.4× better than on the CPU component of the Trinity APU, it was still 18.9× slower than the Nvidia K20X architecture.
Overall, the Nvidia K20X GPU platform proved to be the most performant architecture for this class of application. In the experiments with the 3,840² cell problem class and the OpenCL implementation of CloverLeaf, the K20X outperformed the Xeon Phi by 3.93×, the dual socket Xeon E3-2620 platform by 4.1×, and the single socket Opteron 6272 by 12.0×.
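The cross-architecture factors quoted above follow from dividing each device's OpenCL wall-time by that of the K20X. A minimal Python sketch of this arithmetic, using the OpenCL column of Table 6.6, is given below; the helper name `slowdown_vs` and the dictionary layout are illustrative assumptions rather than part of the original study.

```python
# OpenCL wall-times in seconds for the 3,840^2 problem, copied from Table 6.6.
opencl_runtime_s = {
    "Tesla K20X": 14.95,
    "Xeon Phi 7120P": 58.80,
    "Xeon E3-2620 x2": 60.66,
    "Opteron 6272": 179.78,
}


def slowdown_vs(baseline: str, runtimes: dict) -> dict:
    """Return each device's wall-time as a multiple of the baseline's.

    Hypothetical helper: a value of 4.0 means the device took four times
    as long as the baseline device on the same problem.
    """
    base = runtimes[baseline]
    return {
        device: seconds / base
        for device, seconds in runtimes.items()
        if device != baseline
    }


ratios = slowdown_vs("Tesla K20X", opencl_runtime_s)

for device, ratio in ratios.items():
    print(f"{device}: {ratio:.2f}x the K20X wall-time")
```

The same division applied to the 960² column of Table 6.7 gives the 6.3× and 18.9× factors quoted for the Xeon Phi and the HD-7660D.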