The actual Vegas version we developed is presented in appendix C.6. We will refer to it as ‘finael Vegas’. A special feature of finael Vegas is that it is able to perform the
4 Parallel generation of random numbers is not trivial. We will treat this problem in subsection 2.4.2.
System       Host only & GTX 680             GTX TITAN
CPU          Intel Core i5-4460 (Haswell)    Intel Xeon E5-2609
Clock rate   3.20 GHz                        2.50 GHz
Memory       8 GB                            32 GB
Cache        6 MB                            10 MB
OS           openSUSE Leap 42.2              openSUSE 13.2
Table 2.2: Selection of technical specifications of used hosts.
exact same integration on the host as well as on the device. This not only simplifies debugging of the integrand, which is especially difficult in parallel applications, but also guarantees that finael Vegas can be used even if no GPU or specialized compiler is available. Any standard C++ compiler should be sufficient. As an independent reference for comparison we have chosen the freely available and widely used Vegas implementation of the GNU Scientific Library (GSL) [149].
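The idea of compiling one and the same integrand for host and device can be illustrated with a minimal sketch. It is not the finael interface from appendix C.6: the macro name HOST_DEVICE and the functor GaussianIntegrand are hypothetical and chosen only for illustration. The macro expands to the CUDA execution space specifiers when the file is compiled with nvcc and to nothing otherwise, so a plain C++ compiler accepts the same source.

```cpp
#include <cmath>
#include <cstddef>

// With nvcc (__CUDACC__ defined) the functor is callable from host and
// device; with any other C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical integrand: a Gaussian centred in the unit hypercube.
// The template parameter selects single or double precision.
template <typename Real>
struct GaussianIntegrand {
    Real width;

    HOST_DEVICE Real operator()(const Real* x, std::size_t dim) const {
        Real sum = 0;
        for (std::size_t i = 0; i < dim; ++i) {
            const Real d = x[i] - Real(0.5);
            sum += d * d;
        }
        return std::exp(-sum / (width * width));
    }
};
```

Because the functor is an ordinary C++ object, it can be evaluated and debugged on the host first, for example as GaussianIntegrand<double>{0.1}, before the identical code is handed to a device kernel.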
To compare the performance of the different Vegas versions, we use the time per integration sample as the measure. The actual tool is part of finael and presented in appendix C.
Note that in computing communities two other measures are often favored. The first is floating point operations per second (FLOPS). It relies on the idea that a processing unit has a theoretical limit, so that for a given algorithm one can deduce how well it utilizes the unit. This number is also quite stable across different processing units.
The second measure is the memory bandwidth, which is especially useful for problems that are memory bound rather than compute bound, as is often the case for algorithms using a GPU. However, both suffer from the problem that one can increase the measure by adding superfluous instructions. More importantly for us, they require knowledge of the number of FLOPs⁵ or memory transfers. As Vegas is a general purpose integrator, it is impossible to know how many FLOPs will be executed in the kernel or how much data is copied along with the functor that is integrated. Thus, although time is in general a less stable measure, as it depends heavily on the underlying hardware, it will give us a clear picture of the potential of Vegas on a GPU without claiming to hold for all possible integrands.
The underlying hardware specifications of the host systems are given in table 2.2.
The technical data of the GPUs has already been reported in table 2.1. The compilers and their settings are given in table 2.3.
As a first test case we choose the inverse Mellin transform of PDF splines, which are presented in chapter 5. In fact, the results in figure 5.5 were obtained on the GPU in order to use enough samples to reduce the variance to the point where the spline fluctuations are not obscured by the noise of Vegas. The integrand seems to be a good representative for a large class of integrands, where a parallelization should
5 Note that the ‘s’ is lowercase: FLOPs is the plural of FLOP (floating point operation), in contrast to FLOPS (floating point operations per second).
System               Host only    GTX 680             GTX TITAN
Host compiler        g++ 6.2.1    g++ 5.3.1           g++ 4.8.3
CUDA compiler        n/a          nvcc release 8.0    nvcc release 7.0
Optimization level   3            3                   3
std                  C++14        C++11               C++11
arch                 n/a          sm_30               sm_35
Table 2.3: Compiler versions and flags.
[Bar chart. Categories: GSL, finael host, finael device GTX 680, finael device GTX TITAN. Axis: speedup normalized to the GSL implementation in double precision (ticks from 0 to 320). Series: single precision, double precision.]
Figure 2.6: Speedup against the GSL vegas function of the naive implementation of finael Vegas.
See text for details of the measurement and peculiarities of the different setups.
be considered, as it contains a lot of FLOPs, which concentrates the total execution time of a serial application in this function. In the actual measurement we use 10^5 sample points per integration, which is enough to utilize the device but is not chosen to favor it, as would be the case if the number of samples were a power of the warp size, which is 32. To measure the times we used a simple tool, presented in appendix C.2, that is based on std::chrono::steady_clock. The measurement consists of a warm-up integration that is ignored, followed by 20 integrations of eight iterations each. For every integration, the measured time is divided by the total number of evaluated samples. The averaged results are compared to the times achieved by the GSL routine in double precision, giving the speedup shown in figure 2.6 for single as well as double precision. For scientific applications, in most cases only double precision is relevant. However, since the devices in our tests are mainly optimized for single precision, it is worthwhile to consider single precision too, in order to demonstrate the full potential of GPUs.
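The measurement procedure described above can be sketched in a few lines. This is not the tool from appendix C.2 but a minimal illustration under stated assumptions: the names seconds_per_sample and dummy_integration are hypothetical, and the callable passed in is assumed to run one complete integration and return the number of samples it evaluated.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average time per sample over several integrations.
// `integrate` runs one full integration and returns its sample count.
template <typename Integrate>
double seconds_per_sample(Integrate&& integrate, int repetitions = 20) {
    using clock = std::chrono::steady_clock;

    integrate();  // warm-up integration, deliberately not timed

    double sum = 0.0;
    for (int r = 0; r < repetitions; ++r) {
        const auto start = clock::now();
        const std::size_t samples = integrate();
        const std::chrono::duration<double> elapsed = clock::now() - start;
        // Normalize each integration by the samples it evaluated.
        sum += elapsed.count() / static_cast<double>(samples);
    }
    return sum / repetitions;  // average over all timed integrations
}

// Stand-in workload so the sketch can be run without an integrator.
inline std::size_t dummy_integration() {
    const std::size_t samples = 100000;
    volatile double accumulator = 0.0;  // volatile keeps the loop alive
    for (std::size_t i = 0; i < samples; ++i) {
        accumulator = accumulator + static_cast<double>(i % 7) * 1e-3;
    }
    return samples;
}
```

A call such as seconds_per_sample(dummy_integration) returns the averaged time per sample in seconds. When timing device code this way, the callable has to wait for the device to finish (for example via cudaDeviceSynchronize) before it returns, since kernel launches are asynchronous with respect to the host.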
The first observation in figure 2.6 is that the host version of finael shows the same performance as the GSL implementation for double precision. For single precision, however, GSL seems to be faster by roughly a factor of two. This is surprising for several reasons. First, it is not possible to use the GSL routine in single precision at all.
To perform the measurement we had to insert casts between the GSL Vegas routine and the integrand function, so that at least the integrand is evaluated in single precision. The routine itself is therefore unchanged. Since the integrand function is identical for all measurements, we would not expect any improvement from the Vegas routine but only from the integrand, which in turn should also show up in the finael host version.
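How such a precision-mixing setup might look can be sketched as follows. The wrapper has the callback shape that GSL expects in the f member of gsl_monte_function, namely double (*)(double*, size_t, void*); everything else is an assumption made for illustration: the names integrand_single and integrand_mixed are hypothetical, and the source does not show how the cast was actually implemented.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-precision integrand: sum of squared coordinates.
inline float integrand_single(const float* x, std::size_t dim) {
    float s = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        s += x[i] * x[i];
    }
    return s;
}

// Wrapper with the callback signature of gsl_monte_function::f.
// The integrator keeps working in double precision; only the
// integrand itself is evaluated in single precision.
inline double integrand_mixed(double* x, std::size_t dim, void* /*params*/) {
    // A fixed-size buffer would avoid this allocation on every call.
    std::vector<float> xf(dim);
    for (std::size_t i = 0; i < dim; ++i) {
        xf[i] = static_cast<float>(x[i]);  // double -> float on the way in
    }
    // float -> double on the way out
    return static_cast<double>(integrand_single(xf.data(), dim));
}
```

In this arrangement the sampling grid, the random numbers and the accumulation of the estimate all remain in double precision, which is why no speedup would be expected from the Vegas routine itself.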
However, for some reason the compiler seems to be able to perform much more efficient optimizations in the mixed-precision version of GSL than in the purely single precision executions with finael.
Now let us turn to the speedup achieved by the GPUs. For the GTX 680 the computation is more than 60 times faster for double and more than 250 times faster for single precision. The large difference is due to the preference of the device for single precision, as discussed before. The GTX TITAN contains many more processors designed for double precision, which results in a large gain in this respect.
For double precision it is more than 150 times faster than the host version, and for single precision almost 300 times faster. However, we had to change the setup slightly for the GTX TITAN in double precision. It turned out that the block dimension of 512, which we used for all other time measurements with finael, is not optimal for the given integrand.
Therefore we used a block dimension of 32 instead.
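To make the two block dimensions concrete, the following sketch computes the resulting launch configuration for 10^5 samples. It assumes one thread per sample, which the text does not state, and the names LaunchConfig and launch_config are hypothetical. The arithmetic only shows how the grid changes; it does not by itself explain the measured difference in run time.

```cpp
#include <cstddef>

// Hypothetical description of a one-dimensional kernel launch.
struct LaunchConfig {
    std::size_t block_dim;     // threads per block
    std::size_t grid_dim;      // number of blocks
    std::size_t idle_threads;  // launched threads without a sample
};

// Assumes one thread per sample; the grid is rounded up so that
// every sample is covered by some thread.
inline LaunchConfig launch_config(std::size_t samples, std::size_t block_dim) {
    const std::size_t grid_dim = (samples + block_dim - 1) / block_dim;
    return LaunchConfig{block_dim, grid_dim, grid_dim * block_dim - samples};
}
```

Under this assumption, launch_config(100000, 512) yields 196 blocks with 352 idle threads in the last block, while launch_config(100000, 32) yields 3125 blocks of exactly one warp each and no idle threads.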