

Reduction operations are required by the algorithm in two locations: the calculation of the minimum timestep and the generation of intermediate results. Since the timestep value is calculated frequently, it is crucial that a high performance reduction implementation is utilised. As a general optimised reduction operator written in OpenCL is not, at present, readily available, an optimised reduction function was developed as part of this research.

[Figure 6.2: OpenCL Reduction Implementation for GPUs. Stage 1: the input in global memory is reduced by Core/SMX 0 to Core/SMX n, each using its local memory, into intermediate results in global memory. Stage 2: Core/SMX 0 reduces the intermediate results, using its local memory, to the output in global memory.]

Due to the architectural differences between CPU and GPU devices, separate OpenCL reduction functions were developed, and specifically optimised, for each particular architecture. These were implemented as separate OpenCL kernels and their operation differs significantly from their Fortran and C based equivalents, which either use nested loops to iterate over the entire source array, or OpenMP reduction primitives. Whilst the performance of these kernels is not portable across architectures, it makes sense to specialise them, as reduction operations are fundamental to scientific applications and the kernels can be reused within other applications. Ultimately, reduction operations should be provided by a library, and therefore specialising these kernels should not affect the portability of the actual application code.
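For comparison, the kind of C based equivalent mentioned above can be sketched as a single loop with an optional OpenMP reduction clause. This is only an illustrative baseline: the function name `min_timestep` and the use of `DBL_MAX` as the starting value are assumptions, not taken from the original code.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical baseline: returns the smallest value in dt[0..n-1],
   or DBL_MAX (the identity for min) when n is zero. */
double min_timestep(const double *dt, size_t n)
{
    assert(dt != NULL || n == 0);
    double result = DBL_MAX;
    long i;

#ifdef _OPENMP
    /* Each thread keeps a private minimum; OpenMP combines them at the end.
       The min reduction operator requires OpenMP 3.1 or later. */
    #pragma omp parallel for reduction(min : result)
#endif
    for (i = 0; i < (long)n; ++i) {
        if (dt[i] < result)
            result = dt[i];
    }
    return result;
}
```

Without OpenMP enabled at compile time the pragma is skipped and the loop runs sequentially, which corresponds to the plain loop variant.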

GPU Reduction Kernel

The reduction kernel that targets GPU devices (Figure 6.2) is based on work presented by Harris, although his method is generalised as part of this research to handle arbitrary sized arrays [75]. A multi-level tree-based approach is employed in which kernel launches are used as synchronisation points between different levels of the tree. The tree continues until the input to a particular level is small enough to fit within one work-group on the current target device. In the final level a single work-group is launched on one compute unit of the associated device, which subsequently calculates the final result. At each stage work-items initially read two values from global memory, apply the binary reduction operator to them and store the result within local memory. To enable memory operations to be coalesced and to ensure efficient bandwidth utilisation, these global memory operations are aligned to the preferred vector width of the device.

[Figure 6.3: OpenCL Reduction Implementation for CPUs. Stage 1: the input in global memory is reduced by Core 0 to Core n, each running a work-group of 1 work-item, into intermediate results in global memory. Stage 2: Core 0, running a work-group of 1 work-item, reduces the intermediate results to the output in global memory.]
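The multi-level GPU scheme can be illustrated with a host-side simulation in plain C, in which each call to a hypothetical `reduce_level_min` stands in for one kernel launch. This is a sketch under stated assumptions rather than the actual kernels: the operator is fixed to `min`, the two buffers emulate intermediate results in global memory, and the work inside each work-group is collapsed into a simple loop.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* One simulated kernel launch: every work-group of group_size work-items
   consumes up to 2 * group_size inputs (two reads per work-item) and
   writes one partial result. Returns the number of partial results. */
static size_t reduce_level_min(const double *in, size_t n, double *out,
                               size_t group_size)
{
    size_t per_group = 2 * group_size;
    size_t groups = (n + per_group - 1) / per_group;
    for (size_t g = 0; g < groups; ++g) {
        size_t begin = g * per_group;
        size_t end = (begin + per_group < n) ? begin + per_group : n;
        double acc = DBL_MAX;
        for (size_t i = begin; i < end; ++i)
            if (in[i] < acc)
                acc = in[i];
        out[g] = acc;
    }
    return groups;
}

/* Hypothetical driver: launches levels until a single work-group suffices.
   Stores the number of simulated launches in *launches if it is non-NULL. */
double reduce_min_multilevel(const double *input, size_t n,
                             size_t group_size, int *launches)
{
    assert(input != NULL && n > 0 && group_size > 0);
    size_t per_group = 2 * group_size;
    size_t capacity = (n + per_group - 1) / per_group;
    double *buf_a = malloc(capacity * sizeof *buf_a);
    double *buf_b = malloc(capacity * sizeof *buf_b);
    if (buf_a == NULL || buf_b == NULL)
        abort();

    const double *src = input;
    double *dst = buf_a;
    size_t count = n;
    int levels = 0;
    for (;;) {
        /* Returning from the call plays the role of the synchronisation
           point provided by a kernel launch boundary. */
        count = reduce_level_min(src, count, dst, group_size);
        ++levels;
        if (count == 1)
            break;
        src = dst;                              /* output becomes next input */
        dst = (dst == buf_a) ? buf_b : buf_a;   /* swap intermediate buffers */
    }
    double result = dst[0];
    free(buf_a);
    free(buf_b);
    if (launches != NULL)
        *launches = levels;
    return result;
}
```

With 1000 inputs and a work-group size of 4, this sketch needs four launches (1000 → 125 → 16 → 2 → 1), which shows why larger work-groups reduce the depth of the tree.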

A tree-based reduction is then initiated on the partial results stored within the local memories. In this phase the number of active threads is halved in each successive iteration, until all of the partial results have been reduced to a single value. To ensure efficient bandwidth utilisation, the local memory references are also arranged to avoid memory bank conflicts. The derived single value is then written by one thread back to global memory for the next level of the reduction tree to operate on.
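The halving pattern of this phase can be sketched for a single work-group as follows. The code is a sequential C simulation, not OpenCL: the inner loop stands in for the active work-items, and the addressing scheme (each active work-item `lid` combining its slot with slot `lid + active`) is an assumption modelled on the sequential addressing used in Harris-style reductions to avoid bank conflicts. The function name and the `min` operator are likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates the local-memory phase of one work-group. local[] holds one
   partial result per work-item; group_size must be a power of 2. */
double local_tree_reduce_min(double *local, size_t group_size)
{
    assert(local != NULL);
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);

    /* The number of active work-items is halved in every iteration. */
    for (size_t active = group_size / 2; active > 0; active /= 2) {
        /* On a device these iterations would run concurrently, one per
           active work-item, separated from the next step by a
           barrier(CLK_LOCAL_MEM_FENCE). */
        for (size_t lid = 0; lid < active; ++lid) {
            double other = local[lid + active];
            if (other < local[lid])
                local[lid] = other;
        }
    }
    /* A single work-item would write this value back to global memory. */
    return local[0];
}
```

Because the slots read (`lid + active`) and the slots written (`lid`) never overlap within one step, the sequential loop produces the same result as the concurrent execution it imitates.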

To reduce the number of levels within the tree (and thus the number of kernel launches) the number of work-items launched within each particular work-group is maximised. Thus, for each work-group, the number of input values read from global memory into the local memories is also maximised, relative to the single value written back to global memory. The implementation ensures that the number of work-items launched for the reduction kernels is always a power of 2, and an exact multiple of the preferred vector width of the device. [...] the number of data values read from global memory by the final initiated work-group. Instead, work-items beyond this limit insert dummy values into their corresponding local memory locations, ensuring that the tree-based part of the reduction is always balanced.
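The handling of the final, partially filled work-group can be illustrated with a small C simulation. It assumes a `min` reduction, for which `DBL_MAX` is a suitable dummy value because it never changes the result; the function name, the `MAX_GROUP_SIZE` bound and the exact index mapping are hypothetical choices made for this sketch.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define MAX_GROUP_SIZE 256  /* hypothetical upper bound for this sketch */

/* Simulates one work-group reducing its share of global[0..n-1].
   Work-items whose reads fall beyond n insert a dummy value instead. */
double reduce_group_min_padded(const double *global, size_t n,
                               size_t group_id, size_t group_size)
{
    assert(group_size > 0 && group_size <= MAX_GROUP_SIZE);
    assert((group_size & (group_size - 1)) == 0);  /* power of 2 */

    double local[MAX_GROUP_SIZE];
    size_t base = group_id * 2 * group_size;

    /* Load phase: each work-item reads up to two values. */
    for (size_t lid = 0; lid < group_size; ++lid) {
        size_t first = base + lid;
        size_t second = first + group_size;
        double a = (first < n) ? global[first] : DBL_MAX;    /* dummy */
        double b = (second < n) ? global[second] : DBL_MAX;  /* dummy */
        local[lid] = (a < b) ? a : b;
    }

    /* Balanced tree phase: every slot holds a valid value or a dummy. */
    for (size_t active = group_size / 2; active > 0; active /= 2)
        for (size_t lid = 0; lid < active; ++lid)
            if (local[lid + active] < local[lid])
                local[lid] = local[lid + active];

    return local[0];
}
```

Padding with the identity element means the tree phase needs no bounds checks, at the cost of a few redundant comparisons in the last work-group.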

CPU Reduction Kernel

The reduction kernel that targets CPU devices (Figure 6.3) operates in a similar manner using a two-level hierarchical approach, in which kernel launches are used to provide synchronisation between the levels. In the first level, the input array is partitioned such that it is distributed as evenly as possible across all of the available CPU cores. If required, the last work-group is again limited to handle uneven distributions of arbitrary sized arrays. Only one work-item is launched for each core of the associated CPU and all work-groups contain only one work-item. Each work-item then sequentially reduces the data values within the portion of the input array which is assigned to it, and stores the resultant value back into global memory. The number of partial results output from this phase is therefore equal to the number of cores available on the CPU device.

In the second stage of the reduction only one work-item is launched on one core of the associated CPU. This work-item operates on the array of partial results produced from the previous stage, reducing them sequentially, before outputting the final result. No local memory constructs are employed at any stage of this implementation, as these are generally mapped to the same memory address space as global memory objects on CPU architectures, and their use would therefore potentially result in additional memory operations for no performance benefit.
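The two stages of the CPU scheme can be sketched in plain C, with the loop over `core` standing in for the single-work-item work-groups of the first stage. The function names, the `min` operator and the ceiling-based chunk size (which leaves the last core with a shorter or empty portion) are assumptions for illustration; the original partitioning may differ in detail.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Stage 1: one work-item per core sequentially reduces its portion of the
   input and stores one partial result. partial[] needs `cores` elements. */
void cpu_stage1_min(const double *input, size_t n, size_t cores,
                    double *partial)
{
    assert(input != NULL && partial != NULL && cores > 0);
    size_t chunk = (n + cores - 1) / cores;  /* portion size, rounded up */
    for (size_t core = 0; core < cores; ++core) {
        size_t begin = core * chunk;
        /* The last portion is limited so it never reads past the array. */
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        double acc = DBL_MAX;
        for (size_t i = begin; i < end; ++i)
            if (input[i] < acc)
                acc = input[i];
        partial[core] = acc;
    }
}

/* Stage 2: a single work-item reduces the partial results sequentially. */
double cpu_stage2_min(const double *partial, size_t cores)
{
    assert(partial != NULL);
    double result = DBL_MAX;
    for (size_t i = 0; i < cores; ++i)
        if (partial[i] < result)
            result = partial[i];
    return result;
}
```

Both stages read and write ordinary arrays only, mirroring the absence of local memory in the CPU variant.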