The OpenACC code used for these tests is based on a serial CPU version that uses the same techniques, algorithm, and data layout as the CUDA code. This keeps the OpenACC and CUDA versions as similar as possible and allows a more direct comparison. The serial code works in the same way as the CUDA code: the relaxation is applied as a fused operation, with one function iterating over all the lattice sites. Communication is also handled in the same way, with values packed into contiguous buffers before being communicated. Since the GPU is a discrete device with its own memory space, separate from that of the host CPU, data that needs to be communicated must be moved to and from the device. Figure 4.1 shows the basic program flow together with the necessary additional data transfers between the host and the device.
The idea behind OpenACC is to express which parts of the code are offloaded to the accelerator using simple compiler directives. As with OpenMP, the code the compiler should work with is placed in a structured block, and this block is annotated with a directive that tells the compiler what to do with it. The directives include hints as to how the compiler should parallelize the code and which parts should be executed on the device.

    #pragma acc data copy(..) copyin(..) create(..)
    for timeSteps
        Propagate and collide fluid sites      #pragma acc kernels loop independent
        Gather border data                     #pragma acc kernels loop independent
        Transfer border data to host           #pragma acc update host
        MPI communication with other hosts
        Transfer border data from host         #pragma acc update device
        Scatter border data                    #pragma acc kernels loop independent

Figure 4.2: OpenACC directives added to the LB solver. These directives control the data movement and what parts are offloaded to the accelerator.
Data transfers between host and device are also handled using compiler directives. In C code, data movement is specified either with a data directive on a structured block or with data clauses added to the directives that control how the code is offloaded to the GPU. When the clauses are attached to an offload directive, the data transfers are handled when execution of the parallel segment starts. When applied to a structured block, the data transfers happen as execution enters and exits the block. Data blocks therefore allow the data to be kept on the GPU for longer than a single parallel loop.
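The two placements of the data clauses can be sketched as follows. This is a minimal illustration rather than code from the LB solver; the function names and the array x are hypothetical. Without an OpenACC compiler the pragmas are ignored and the code runs serially on the host.

```c
/* Variant A: the data clause is attached to the offload directive, so x is
   transferred for this one loop each time the function is called. */
void scale_per_loop(double *x, int n, double s)
{
    #pragma acc kernels loop independent copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}

/* Variant B: a structured data block. x is copied to the device on entry,
   stays there across both loops, and is copied back on exit. */
void scale_then_shift(double *x, int n, double s, double d)
{
    #pragma acc data copy(x[0:n])
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; ++i)
            x[i] *= s;

        #pragma acc kernels loop independent
        for (int i = 0; i < n; ++i)
            x[i] += d;
    }
}
```

In variant B the array crosses the host-device boundary once in each direction, regardless of how many loops the block contains.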
Figure 4.2 shows the practical application of the OpenACC directives to the LB program flow. To minimize the data movement between the host and the device, the entire time-stepping part of the code is placed in a #pragma acc data block. At the beginning of the block, the indexing data is copied to the device using the copyin clause, while the fluid data is moved using the copy clause. These clauses ensure that the data is copied to the device when execution enters the block, and that the fluid data is copied back to the host when execution exits the block. Lastly, the additional buffers needed for the communication and the extra lattice data needed during execution are created using the create clause. The create clause allocates space on the device upon entering the block and deallocates it at the end of the block, without transferring any data.
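A data block of this kind might look like the sketch below. The arrays are hypothetical stand-ins (f for the fluid data, idx for the indexing data, tmp for a device-side scratch array), and the loop bodies are placeholders rather than the actual LB update.

```c
/* Time stepping inside a single data region:
   f   is copied in on entry and back out on exit (copy),
   idx is only copied in (copyin),
   tmp is only allocated on the device (create). */
void run_steps(double *f, const int *idx, double *tmp, int n, int steps)
{
    #pragma acc data copy(f[0:n]) copyin(idx[0:n]) create(tmp[0:n])
    {
        for (int t = 0; t < steps; ++t) {
            /* Placeholder update: read f through the index array. */
            #pragma acc kernels loop independent
            for (int i = 0; i < n; ++i)
                tmp[i] = f[idx[i]];

            /* Write the result back into f for the next step. */
            #pragma acc kernels loop independent
            for (int i = 0; i < n; ++i)
                f[i] = tmp[i];
        }
    }
}
```

No host-device transfer takes place between time steps; only the final contents of f are returned to the host.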
The main loop of an LB solver is the one that iterates over all lattice sites and updates them. This loop can be offloaded to the GPU using the #pragma acc kernels loop independent directive. The kernels keyword signifies that the following section should be executed on the accelerator as a sequence of kernel operations. The loop construct is added to describe the type of accelerator parallelism to use when executing the iterations of the loop. Finally, independent overrides the compiler's loop dependency analysis, asserting that the data accesses in the loop iterations are independent of each other. The independent clause is needed in this case because the loop contains indirect data accesses. These indirect accesses restrict the compiler's ability to parallelize the code, since the compiler can no longer guarantee that they will not cause data races.
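The effect of indirect accesses can be seen in a small sketch. The function and the neighbor array are hypothetical and much simpler than a real LB kernel: each site pushes its value to the site listed in neighbor[i].

```c
/* The write dst[neighbor[i]] is race-free only if neighbor contains no
   duplicate entries. The compiler cannot verify this at compile time, so
   the programmer asserts it with the independent clause. */
void propagate(const double *src, double *dst, const int *neighbor, int n)
{
    #pragma acc kernels loop independent copyin(src[0:n], neighbor[0:n]) copyout(dst[0:n])
    for (int i = 0; i < n; ++i)
        dst[neighbor[i]] = src[i];
}
```

If neighbor did contain duplicates, the assertion would be false and the result of a parallel run would be undefined, so independent shifts the responsibility for correctness to the programmer.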
The loops responsible for the gather and scatter of the data needed for the communication are offloaded to the GPU in the same way as the main loop. Again, since the gather and scatter operations involve indirect data accesses, the independent clause is needed for them to be executed in parallel on the device.
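A gather and scatter pair of this form could be written as below. The index arrays border and halo, as well as the function names, are assumptions made for illustration; the data clauses are spelled out so that each function also works outside a surrounding data block.

```c
/* Gather: pack the border sites of f into the contiguous buffer buf. */
void gather_border(const double *f, int n, const int *border,
                   double *buf, int nb)
{
    #pragma acc kernels loop independent copyin(f[0:n], border[0:nb]) copyout(buf[0:nb])
    for (int k = 0; k < nb; ++k)
        buf[k] = f[border[k]];
}

/* Scatter: unpack buf into the halo sites of f. Race-free only if halo
   holds no duplicate indices, which independent asserts. */
void scatter_halo(double *f, int n, const int *halo,
                  const double *buf, int nb)
{
    #pragma acc kernels loop independent copy(f[0:n]) copyin(halo[0:nb], buf[0:nb])
    for (int k = 0; k < nb; ++k)
        f[halo[k]] = buf[k];
}
```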
The transfers of the communication data to and from the device occur in the middle of the data segment used to keep all the simulation data on the device. These communication buffers therefore cannot be transferred using the data clauses, which only move data on entry to and exit from the segment. Instead, OpenACC provides the #pragma acc update directive for updating either host or device data from within a data segment. The update directive allows the programmer to specify an array included in a data segment that is to be copied to either the device or the host in the middle of that segment. This allows the communication data to be moved to the host, handed off to MPI for the transfer between the different ranks, and then moved back to the device once the communication is done.
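The pattern can be sketched as follows. To keep the example self-contained, the MPI exchange is replaced by a hypothetical host-only function, exchange_on_host, that simply reverses the buffer; the buffer names and the choice of border and halo sites are also assumptions.

```c
/* Hypothetical stand-in for the MPI exchange. Runs on the host only. */
static void exchange_on_host(const double *send, double *recv, int nb)
{
    for (int k = 0; k < nb; ++k)
        recv[k] = send[nb - 1 - k];
}

/* One communication step inside a data region (assumes n >= 2 * nb). */
void step_with_exchange(double *f, double *send, double *recv, int n, int nb)
{
    #pragma acc data copy(f[0:n]) create(send[0:nb], recv[0:nb])
    {
        /* Gather the first nb sites into the send buffer on the device. */
        #pragma acc kernels loop independent
        for (int k = 0; k < nb; ++k)
            send[k] = f[k];

        /* Device -> host, exchange on the host, host -> device. */
        #pragma acc update host(send[0:nb])
        exchange_on_host(send, recv, nb);
        #pragma acc update device(recv[0:nb])

        /* Scatter the received values into the last nb sites. */
        #pragma acc kernels loop independent
        for (int k = 0; k < nb; ++k)
            f[n - nb + k] = recv[k];
    }
}
```

Only the two small buffers cross the host-device boundary during the step, while f remains on the device until the data region ends.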