For correctness and good GPU throughput, both the Reduce and Scan primitives must do the following:

• Map their parallel patterns onto the GPU's 2-level CTA hierarchy
• Support consecutive access for non-commutative summation
• Support both instruction-level parallelism (ILP) and thread-level parallelism (TLP)
• Mitigate the GPU issues of converting between warp and sequential views, bank conflicts, and constraints on occupancy

At a high level, my GPU Reduce primitive is based on the Run-Reduce pattern and is implemented using two GPU kernels: GPU_Reduce and GPU_SumRows. My GPU Scan primitive is based on the Reduce-then-Scan pattern and is implemented using three GPU kernels: GPU_Reduce, GPU_SumsToStarts, and GPU_Scan. The GPU_Reduce and GPU_Scan kernels do most of the work for each primitive.

The basic idea behind these two implementations, as with so many algorithms, is to divide the large Reduce and Scan problems into smaller instances and apply recursion. This division, however, is complicated by the need to work with the GPU's 2-level CTA hierarchy (a grid of thread blocks, and threads within each thread block). To support sequential access within the 2-level CTA hierarchy, I implement the following two-part solution:

CTA Level 1 (grid of thread blocks):

For consecutive data access at the first CTA level, both GPU_Reduce and GPU_Scan use my Row DASk, described in Chapter 5. This DASk partitions the data into fixed-size data blocks, which it then distributes across a fixed number r of rows, resulting in c columns. It then assigns each of the r data rows to a thread block in the grid. Each thread block subsequently works along its assigned data row sequentially, one data block at a time.
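The row partitioning described above can be illustrated with a small sequential sketch. It is not the Row DASk itself, whose details are given in Chapter 5; the function name `row_layout`, its parameters, and the row-major assignment of blocks to rows are assumptions made only for illustration.

```python
def row_layout(n, block_size, r):
    """Return, for each of r rows, the (start, stop) index range of every
    data block assigned to that row.

    Assumption: blocks are assigned to rows in row-major order, so each
    row covers one contiguous stretch of the input.
    """
    num_blocks = -(-n // block_size)   # ceil(n / block_size)
    c = -(-num_blocks // r)            # columns = blocks per row, rounded up
    rows = []
    for row in range(r):
        blocks = []
        for col in range(c):
            b = row * c + col          # global block index (assumed row-major)
            start = b * block_size
            if start >= n:             # past the end of the input
                break
            blocks.append((start, min(start + block_size, n)))
        rows.append(blocks)
    return rows


# 20 elements, blocks of 4, 2 rows -> 5 blocks arranged in 3 columns.
layout = row_layout(n=20, block_size=4, r=2)
```

In this model each entry of `layout` stands in for the work list of one thread block, visited left to right.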

CTA Level 2 (threads within a thread block):

Consecutive access within the data block by each thread in the thread block is handled by the BlockReduce and BlockScan methods. Both methods take as input a full fixed-size data block. The 3-level nested BlockReduce method reduces the entire data block to a single block-sum, which is then accumulated into the current row-sum. The 3-level nested BlockScan method sequentially[6] scans the entire input block into a scanned prefix sum (inclusive or exclusive as requested), which is updated with the current row-start (the missing row prefix). The block of scanned results is then output.

Reduce Overview:

With the GPU_Reduce kernel, each thread block initializes a row-sum to identity and then marches along its assigned data row, block by block, calling BlockReduce on each data block.

[6] I would have liked to have skeletonized these methods as sequential block access skeletons (BASks), but because both methods depend on pass-through parameters from my Row DASk as well as parameters for ILP, TLP, occupancy, and bank conflict mitigation, both methods are too unwieldy to generalize.

After each BlockReduce call, each thread block accumulates the resulting block-sum into the current row-sum. The GPU_Reduce kernel generates r row-sums as output that still need to be reduced to the final total-sum.

Since the GPU_Reduce kernel outputs r row-sums to be consumed as input by the GPU_SumRows kernel, I deliberately choose my fixed number of rows (r) to be small (r ≤ 1,000). This allows GPU_SumRows to reduce the r row-sums to the final sum as output using a single instance of the BlockReduce method with a single thread block (GridSize = 1). Calling the GPU_Reduce kernel followed by the matching GPU_SumRows kernel implements the full GPU Reduce primitive.
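The two-kernel data flow above can be modeled sequentially as follows. This is a CPU sketch under stated assumptions, not the GPU implementation: the names `block_reduce`, `gpu_reduce_model`, and `gpu_sum_rows_model` are hypothetical, each loop iteration over `row` stands in for one thread block, and rows are assumed to be contiguous stretches of the input. String concatenation is used as the operator because it is associative but not commutative, so any reordering of the input would be visible in the result.

```python
from functools import reduce as fold


def block_reduce(block, op, identity):
    """Stand-in for BlockReduce: fold one data block into a block-sum."""
    return fold(op, block, identity)


def gpu_reduce_model(data, op, identity, block_size, r):
    """Stand-in for GPU_Reduce: produce one row-sum per row."""
    num_blocks = -(-len(data) // block_size)   # ceil division
    c = -(-num_blocks // r)                    # blocks per row
    row_len = c * block_size                   # assumed contiguous rows
    row_sums = []
    for row in range(r):                       # one "thread block" per row
        row_sum = identity
        row_data = data[row * row_len:(row + 1) * row_len]
        for start in range(0, len(row_data), block_size):
            block = row_data[start:start + block_size]
            # Accumulate on the right so the original order is preserved.
            row_sum = op(row_sum, block_reduce(block, op, identity))
        row_sums.append(row_sum)
    return row_sums


def gpu_sum_rows_model(row_sums, op, identity):
    """Stand-in for GPU_SumRows: one BlockReduce over the r row-sums."""
    return block_reduce(row_sums, op, identity)


def concat(x, y):
    return x + y


letters = list("abcdefghij")
row_sums = gpu_reduce_model(letters, concat, "", block_size=2, r=3)
total = gpu_sum_rows_model(row_sums, concat, "")
```

Here `row_sums` holds the three partial strings and `total` is the full concatenation in input order, which is the property that consecutive access is meant to guarantee.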

Efficient I/O access for Reduce:

Given r rows and n input values, and assuming r is much less than n (r ≪ n), the Reduce primitive needs only a little over one global memory transfer per data warp on average. The GPU_Reduce kernel reads each input exactly once from global memory to reduce n inputs to r row-sums; GPU_SumRows then reduces those r row-sums into the final sum. Coalescence is respected by using the warp-by-warp BASk to load input. This arrangement means that I only need one transfer per data warp (32 data elements), which results in $\lceil n/32 \rceil$ total data transfers. Combining transfers from both kernels results in $\lceil n/32 \rceil + r + \lceil r/32 \rceil + 1 = O(n + r)$ total I/Os for the GPU Reduce primitive.
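As a quick numeric check of the transfer count for Reduce, the sketch below evaluates the formula for one example size. The function name `reduce_transfers` and the attribution of each term to a particular read or write are my reading of the formula, not something the text states explicitly.

```python
import math

WARP = 32  # data elements per coalesced transfer


def reduce_transfers(n, r):
    """Evaluate ceil(n/32) + r + ceil(r/32) + 1 for the Reduce primitive."""
    read_inputs = math.ceil(n / WARP)     # GPU_Reduce: one transfer per data warp
    write_row_sums = r                    # assumed: one write per thread block
    read_row_sums = math.ceil(r / WARP)   # assumed: GPU_SumRows reads the row-sums
    write_total = 1                       # assumed: final sum written once
    return read_inputs + write_row_sums + read_row_sums + write_total


n, r = 1 << 20, 1000
total = reduce_transfers(n, r)
per_warp = total / math.ceil(n / WARP)    # average transfers per data warp
```

For roughly a million inputs and 1,000 rows, `per_warp` comes out only slightly above 1, consistent with the "little over one transfer per data warp" claim when r ≪ n.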

Scan Overview:

For the GPU Scan primitive, I must generate the missing row prefixes for each data row before locally scanning each data block along each data row. The GPU_Reduce kernel generates r row-sums. The GPU_SumsToStarts kernel exclusively scans the r row-sums as input into r row-starts as output using a single instance of the BlockScan method with a single thread block (GridSize = 1). These r row-starts provide the missing row prefixes for globally correct scan results along each data row. Finally, with the GPU_Scan kernel, each thread block loads its missing row-start (row prefix) from the row-starts array and then marches along its assigned data row, block by block, locally scanning each data block by calling BlockScan. The current row-start is also accumulated as a prefix into the local scanned results to generate global scanned results that are then output. After each data block is scanned, the resulting block-sum is accumulated into the current row-start to prepare for the next data block along the row. Calling all three scan kernels in order (GPU_Reduce, GPU_SumsToStarts, and GPU_Scan) implements the full GPU Scan primitive.
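The three-kernel sequence above can be modeled sequentially as a Reduce-then-Scan sketch. This is a CPU illustration only: the function names (`row_slices`, `reduce_rows`, `sums_to_starts`, `scan_rows`, `gpu_scan_model`) are hypothetical, rows are assumed to be contiguous stretches of the input, and only the inclusive variant is shown.

```python
from functools import reduce as fold
from itertools import accumulate


def row_slices(n, block_size, r):
    """(lo, hi) index range of each row, assuming contiguous rows."""
    num_blocks = -(-n // block_size)           # ceil division
    c = -(-num_blocks // r)                    # blocks per row
    row_len = c * block_size
    return [(min(row * row_len, n), min((row + 1) * row_len, n))
            for row in range(r)]


def reduce_rows(data, op, identity, block_size, r):
    """Stand-in for GPU_Reduce: one row-sum per row."""
    return [fold(op, data[lo:hi], identity)
            for lo, hi in row_slices(len(data), block_size, r)]


def sums_to_starts(row_sums, op, identity):
    """Stand-in for GPU_SumsToStarts: exclusive scan of the row-sums."""
    starts, running = [], identity
    for s in row_sums:
        starts.append(running)
        running = op(running, s)
    return starts


def scan_rows(data, row_starts, op, block_size, r):
    """Stand-in for GPU_Scan: scan each row block by block."""
    out = []
    slices = row_slices(len(data), block_size, r)
    for (lo, hi), row_start in zip(slices, row_starts):
        for start in range(lo, hi, block_size):
            block = data[start:min(start + block_size, hi)]
            local = list(accumulate(block, op))          # local block scan
            out.extend(op(row_start, v) for v in local)  # apply row prefix
            row_start = op(row_start, local[-1])         # add the block-sum
    return out


def gpu_scan_model(data, op, identity, block_size, r):
    row_sums = reduce_rows(data, op, identity, block_size, r)
    row_starts = sums_to_starts(row_sums, op, identity)
    return scan_rows(data, row_starts, op, block_size, r)


letters = list("abcdefg")
result = gpu_scan_model(letters, lambda x, y: x + y, "", block_size=2, r=3)
```

With string concatenation as the operator, `result` should match a plain left-to-right inclusive scan of `letters`, showing that the row-starts supply exactly the prefix each row is missing.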

Efficient I/O access for Scan:

Given r rows and n input values, and assuming r is much less than n (r ≪ n), the GPU Scan primitive needs a little over three global memory transfers per data warp on average. GPU_Reduce reads each input exactly once to reduce n inputs to r row-sums; GPU_SumsToStarts exclusively scans these r row-sums into r row-starts; and GPU_Scan reads each row-start. GPU_Scan then reads a data block, combines the row-start with the scan of the data block, and writes out the final prefix sum. I respect coalescence in my GPU_Reduce and GPU_Scan kernels by loading input and storing output using a warp-by-warp view of global memory. This approach means that I only need one transfer per data warp (32 data elements) and results in $\lceil 3n/32 \rceil$ total data transfers across both kernels (GPU_Reduce and GPU_Scan). Combining transfers from all three kernels results in $\lceil 3n/32 \rceil + 2r + \lceil 2r/32 \rceil = O(n + r)$ total I/Os for the GPU Scan primitive.
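The Scan transfer count can be checked numerically in the same spirit. The function name `scan_transfers` is hypothetical, and the comments attributing each term to specific kernels are my interpretation of the formula rather than something spelled out in the text.

```python
import math

WARP = 32  # data elements per coalesced transfer


def scan_transfers(n, r):
    """Evaluate ceil(3n/32) + 2r + ceil(2r/32) for the Scan primitive."""
    # One read in GPU_Reduce, one read and one write in GPU_Scan.
    data_transfers = math.ceil(3 * n / WARP)
    # Assumed: r row-sum writes (GPU_Reduce) plus r row-start reads (GPU_Scan).
    per_row_transfers = 2 * r
    # Assumed: GPU_SumsToStarts reads r values and writes r values, coalesced.
    sums_to_starts_transfers = math.ceil(2 * r / WARP)
    return data_transfers + per_row_transfers + sums_to_starts_transfers


n, r = 1 << 20, 1000
total = scan_transfers(n, r)
per_warp = total / math.ceil(n / WARP)   # average transfers per data warp
```

For this example size `per_warp` lands just above 3, matching the "little over three transfers per data warp" estimate; the excess shrinks as n grows relative to r.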
