Optimizing the PRNG is certainly the most important step in reducing the memory usage of Vegas. Once this is done, the program is left with only two large blocks of memory: the results of the samples (N times the size of the floating point type) and their positions with respect to the segments for the adaption algorithm (N · d times the size of an integer). Clearly the first cannot be avoided, while the second can, at least for stratified sampling. As shown in figure 2.5, all bins are aligned with the segments. Because we know the parent bin of every sample, we can also deduce its parent segment. Note that this is not possible for the two other internal strategies, as a bin may be part of several segments or even contain them completely. For stratified sampling, however, the concrete increment can be derived from the sample index, which makes storing the segment indices superfluous. In addition, reduction is, besides matrix-matrix multiplication, perhaps the classic example of a GPU application, which suggests possible speed improvements as well. For us this is even more interesting, as the samples are stored on the device anyway. We therefore benefit not only from the faster reduction but also from the reduced amount of memory that has to be copied to the host after the reduction (one floating point value per segment).
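The deduction of the parent segment from the sample index can be illustrated with a minimal Python sketch. It assumes the classic processing order, in which the bins are enumerated dimension by dimension, a fixed number of samples per bin, and the same number of bins per segment in every dimension. The function name and all parameter names are hypothetical and not taken from the implementation.

```python
def segment_of_sample(sample_index, samples_per_bin, bins_per_dim,
                      bins_per_segment_dim, dim):
    """Recover the per-dimension segment indices of a sample from its index.

    Assumptions (illustrative only): samples are numbered bin by bin, and the
    bins are enumerated with the first dimension running fastest.
    """
    # Every bin holds the same number of samples, so the bin follows directly.
    bin_index = sample_index // samples_per_bin
    segments = []
    for _ in range(dim):
        # Coordinate of the bin along the current dimension.
        bin_coord = bin_index % bins_per_dim
        # Bins are aligned with the segments, so an integer division suffices.
        segments.append(bin_coord // bins_per_segment_dim)
        bin_index //= bins_per_dim
    return tuple(segments)
```

With two samples per bin, four bins per dimension and two bins per segment in each of two dimensions, `segment_of_sample(5, 2, 4, 2, 2)` yields `(1, 0)`. No per-sample segment index has to be stored, at the price of a few integer divisions per sample.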
[Plot: # bins per increment against # increments, logarithmic axes, one series per dimension d = 1, ..., 6]
Figure 2.10: Selection of Vegas configurations of bins and increments for d ∈ {1, ..., 6}.
Unfortunately it is not that easy. First, the classic processing of the bins is not compatible with typical reduction algorithms, which require the data to lie sequentially in memory for maximal efficiency. Hence we changed the processing as shown in figure 2.9 for two dimensions. Instead of processing all bins in one dimension while holding all other bin indices constant, the processing is split into two levels: the segments are traversed in the same manner as the bins were before, and the bins are traversed within each segment. As a result, all data that belong to a segment reside next to each other in memory. The penalty of the refined processing sequence is the higher computational cost of calculating the bin index, which should however be negligible for integrands with high computational costs.
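The two-level traversal can be sketched in Python as follows. This is only an illustration of the idea, assuming that a segment is a d-dimensional cell with the same number of bins along every dimension and that the first dimension runs fastest. The function and parameter names are hypothetical.

```python
def refined_bin_coordinates(index, segments_per_dim, bins_per_segment_dim, dim):
    """Map a linear position in the refined order to segment and bin coordinates.

    Returns (segment coordinates, global bin coordinates). All positions that
    belong to one segment form a contiguous index range.
    """
    bins_per_segment = bins_per_segment_dim ** dim
    # Outer level: which segment; inner level: which bin inside that segment.
    segment_index, local_index = divmod(index, bins_per_segment)
    segment, bin_coords = [], []
    for _ in range(dim):
        # These additional divisions are the penalty of the refined order.
        segment_index, s = divmod(segment_index, segments_per_dim)
        local_index, b = divmod(local_index, bins_per_segment_dim)
        segment.append(s)
        bin_coords.append(s * bins_per_segment_dim + b)
    return tuple(segment), tuple(bin_coords)
```

For two segments per dimension with two bins each in two dimensions, the positions 4 to 7 all map to segment `(1, 0)`, so a reduction over that segment reads one contiguous block of memory.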
[Bar chart: process time per sample in ps on a logarithmic axis, classic versus refined reduction, for finael host, finael device GTX 680 and finael device GTX TITAN, each in single and double precision]
Figure 2.11: Execution time of the classic reduction and the refined version. Bin processing is included in the measurement. Further details are provided in the text.
The second point to consider when reducing the data on the device is the utilization of the device and the actual benefit in terms of saved memory transfer compared with copying the results of the samples. For this purpose we investigate the number of bins per increment depending on the dimension, as shown in figure 2.10. Each point represents a possible configuration in which stratified sampling can be applied. We selected them by dividing the interval of requested sample points Ñ between two and the limit of unsigned integers on our machine logarithmically. For most configurations every bin contains two samples. For some the number may be larger, especially as long as there are very few bins per increment. As a consequence, several points in figure 2.10 are indistinguishable; large gaps between points are a quite reliable indication of this. However, for the following analysis the exact number of samples N does not add any value. Still, the product of both axes in figure 2.10 is a good approximation for N. First, we observe that the number of increments in our current implementation is limited to M_max = 50 per dimension. For few samples the number is reduced to align with the bins, approaching the maximal value as the number of samples increases. For all these configurations every increment contains only one bin. Assuming two samples per bin, the amount of data that has to be copied to the host is not reduced at all, as the two function values are replaced by a Monte Carlo estimate and its variance. Furthermore, to be efficient, the device has to be fully utilized, which means that all configurations with fewer than a few thousand bins per increment are likely to be doomed in terms of speed. Hence only for d ≤ 3 do there seem to be configurations that may provide an additional speed improvement besides the memory benefit.
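The transfer argument can be made concrete with a small Python calculation. It is a sketch under the assumptions stated in the text: the classic reduction copies one function value per sample, the refined reduction copies an estimate and a variance per increment, and "bins per increment" counts all bins inside one d-dimensional increment. The function name and parameters are hypothetical.

```python
def values_copied_to_host(increments_per_dim, bins_per_increment, dim,
                          samples_per_bin=2):
    """Return (classic, refined) counts of floating point values copied."""
    increments = increments_per_dim ** dim
    # Classic reduction: every function value is transferred to the host.
    classic = increments * bins_per_increment * samples_per_bin
    # Refined reduction: one estimate and one variance per increment.
    refined = 2 * increments
    return classic, refined
```

With one bin per increment and two samples per bin both counts coincide, for example `values_copied_to_host(50, 1, 4)` gives the same number twice. In one dimension with 50 increments and 1000 bins per increment, the refined reduction copies 100 values instead of 10^5.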
In our first attempt we implemented the reducer very naively, iterating over all increments and reducing one after another. This is obviously a bad idea for all but the points in the top half of figure 2.10, as it causes at least M^d (most likely inefficient) kernel calls, each followed by a small copy instruction. Hence we improved the approach by using the smallest possible multiple of the warp size for a single increment and pooling several increments in one kernel call. This shrinks the region in which the device is inefficient due to insufficient utilization to the bottom left corner of figure 2.10.
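The pooling idea can be emulated on the host with a short Python sketch. It is not the device code, only an illustration under stated assumptions: a warp size of 32, an equal number of samples per increment, zero padding up to the next multiple of the warp size, and a tree reduction over the 32 lanes. All names are hypothetical.

```python
WARP_SIZE = 32  # assumed warp size

def padded_length(n, warp_size=WARP_SIZE):
    """Smallest multiple of the warp size that can hold n values."""
    return ((n + warp_size - 1) // warp_size) * warp_size

def pooled_increment_sums(values, samples_per_increment, warp_size=WARP_SIZE):
    """Sum every increment of a contiguous buffer in one pass over all increments."""
    padded = padded_length(samples_per_increment, warp_size)
    num_increments = len(values) // samples_per_increment
    sums = []
    # On a device, the iterations of this loop would run side by side
    # within a single kernel call instead of one call per increment.
    for s in range(num_increments):
        start = s * samples_per_increment
        chunk = list(values[start:start + samples_per_increment])
        chunk += [0.0] * (padded - samples_per_increment)  # zero padding
        # Each lane first accumulates its strided share of the chunk.
        lanes = [sum(chunk[lane::warp_size]) for lane in range(warp_size)]
        # Tree reduction over the lanes, halving the active lanes each step.
        stride = warp_size // 2
        while stride > 0:
            for lane in range(stride):
                lanes[lane] += lanes[lane + stride]
            stride //= 2
        sums.append(lanes[0])
    return sums
```

For a buffer holding the values 1 to 80 and 40 samples per increment, each increment is padded to 64 entries and the function returns `[820.0, 2420.0]`.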
To demonstrate that the refined reduction can, under certain circumstances, also pay off in terms of speed, we choose the following setup: the time measurement includes the evaluation of the sample points as well as the reduction. The former is included because we also have to account for the more complex bin processing. Apart from that, the integrand is f(x) = x in one dimension (equivalent to the measurement in the last subsection), such that the computation time of the PRNG is reduced as well. To focus on the reduction, we increase the number of sample points to 10^8. The results of the measurement are shown in figure 2.11.
[Bar chart: speedup normalized to the GSL implementation in double precision, for GSL, finael host, finael device GTX 680 and finael device GTX TITAN, each in single and double precision; reduction: refined]
Figure 2.12: Same as figure 2.8, but using the refined reduction for the finael implementations.
The host system becomes measurably slower, which can, as expected, be ascribed to the more complicated bin processing, which uses several additional divisions with an especially large computational cost. On the GPUs, however, the refined reduction pays off, being more than one order of magnitude faster despite the bin processing penalty.
As we had already seen for the random number generation, the otherwise superior GTX TITAN system is slower than the GTX 680 system for the classic reduction due to its slower host processor.
In summary, despite the clear benefit of saved memory, the situation is not that simple for the reduction. Its effect on the speed depends not only on the system but most notably on the specific parameters N and d. If memory is not the limiting factor, the best policy is perhaps to explicitly test all configurations for every use case to determine the fastest possible setup. In all cases, however, one benefits from a more precise result, as the reduction adds values that should potentially be of equal size. In the classic approach it is likely that, in a configuration with many samples, the values that are added last do not contribute due to the finite precision of floating point numbers.
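The loss of late contributions in a sequential sum can be reproduced with a few lines of Python. To keep the example small, the sketch emulates IEEE 754 half precision through the `struct` module, which is an assumption made purely for illustration: the measurements in the text use single and double precision, where the same stagnation sets in at 2^24 and 2^53 respectively when adding ones. The helper names are hypothetical.

```python
import struct

def to_half(x):
    """Round a Python float to IEEE 754 half precision (emulated)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def sequential_sum(values):
    """Classic accumulation: add one value after another to a running total."""
    total = 0.0
    for v in values:
        total = to_half(total + v)
    return total

def pairwise_sum(values):
    """Tree-like reduction: always add partial sums of similar size."""
    if len(values) == 1:
        return to_half(values[0])
    mid = len(values) // 2
    return to_half(pairwise_sum(values[:mid]) + pairwise_sum(values[mid:]))

ones = [1.0] * 4096
stalled = sequential_sum(ones)  # stops growing once the total reaches 2048
exact = pairwise_sum(ones)      # all partial sums stay exactly representable
```

Here `stalled` ends at 2048.0 because 2049 is not representable in half precision and the sum rounds back to 2048, while `exact` is 4096.0. A reduction that combines partial results of equal size behaves like the pairwise variant.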
In conclusion, we examine once more the speedup for the setup of the measurements in figures 2.6 and 2.8, now using the refined reduction. The results are shown in figure 2.12.
Not visible due to the enormous speedup for the GPUs, but expected from the previous discussion, is the only slight slowdown of the host versions compared to the ones using the classic reduction, which is approximately half a percent. This basically means that nothing changes: the speed of the host version is identical (up to the sub-permil level) to that of the GSL implementation. For the GTX 680 we obtain an improvement of about two percent for double and 16 percent for single precision compared to the classic reduction. This means that the speedup compared to the GSL implementation increases to more than 65 for double and more than 325 for single precision. For the GTX TITAN the effects are more pronounced, giving an improvement of about eight percent for double and 27 percent for single precision compared to the versions using the classic reduction. This results in a total speedup over the GSL routine of more than 190 for double and more than 460 for single precision. As we now know, this large improvement is due to two effects. First, the reduction is much faster on the device. Second, less memory has to be copied from the device to the host (100 floating point numbers compared to 10^5 for the classic reduction). Note that the number of samples is still quite small, and we expect an even larger effect if it is increased. Note also that this again shows the shift from a completely compute-bound problem on the CPU to a partly memory-bound one on a GPU.