7. Other results derived from the study of bright points
7.2. Center-to-limb variation of the surface fraction
Figure 1.10: Multithreaded computation with OpenMP. To execute a parallel task, a master thread spawns a number of worker threads. [7]
GPGPU, or general-purpose computing on graphics processing units, is a relatively recent way of using graphics cards to solve problems that may be unrelated to graphics and have traditionally been handled by the CPU. One characteristic of GPGPU is the ability to transfer data in both directions between CPU and GPU with high throughput.
While early attempts to solve general-purpose problems with GPUs had to reformulate the problem in terms of graphics primitives, general-purpose parallel programming frameworks that greatly simplify GPU programming have since been developed.
The leading open GPGPU computing framework is OpenCL, an open standard maintained by the non-profit Khronos Group consortium, while the most widely adopted proprietary platform is Nvidia CUDA.
OpenCL is not limited to GPUs but can also be used to write code that needs to be executed on heterogeneous platforms that may include CPUs, GPUs, digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other accelerators.
OpenCL is composed of a C-like programming language and an Application Programming Interface (API) which allow developers to take advantage of the various CPUs and accelerators present in the system.
The CUDA Toolkit is highly optimized but can only be used with Nvidia GPUs. Compilers, libraries and developer tools are included in the toolkit distribution. With CUDA, it is possible to program GPUs in C, C++, Fortran and a few other languages.
Nvidia Kepler Architecture
A distinctive characteristic of CUDA-capable GPUs, compared to CPUs, is the very large proportion of die area dedicated to computation rather than to control logic and caching (Figure 1.11). CPUs are optimized for serial computation: complex control logic is needed to execute instructions in parallel or out of order while maintaining the appearance of serial execution, and caching is needed to reduce memory access latency [30]. GPUs instead focus on maximizing floating-point throughput. To keep control logic to a minimum, the scheduler draws from a large pool of threads and assigns work to those that are ready while others wait for memory accesses. Small caches avoid repeated accesses to global memory when multiple threads need the same memory location.
Figure 1.11: Die occupation difference between CPUs and GPUs.
The Nvidia GTX Titan Black is based on the Kepler GK110 architecture, which includes 15 streaming multiprocessors (SMX) and six 64-bit memory controllers. Compared to the previous generation, the Fermi architecture, it is much more power efficient, with 3x the performance per watt. Kepler implements fully IEEE 754-2008 compliant single and double precision arithmetic.
Each SMX contains 192 single precision CUDA cores, 64 double precision units, 32 special function units (SFU) for fast approximate transcendental operations, and 32 load/store units (LD/ST). Threads are scheduled in groups of 32 called warps, and each SMX includes four warp schedulers that allow four warps to execute concurrently. Developers do not need to reason about warp execution to obtain correct output, but performance improves significantly when threads in the same warp follow the same code path and access contiguous memory.
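The warp arithmetic described above can be illustrated with a small CPU-side sketch in plain C++. It does not run on a GPU and is not taken from this work: the helper names (`warps_per_block`, `warp_of_thread`, `lane_of_thread`, `count_divergent_warps`) are hypothetical, and only the constants (32 threads per warp, 15 SMX, 192 cores per SMX) come from the text.

```cpp
#include <cassert>

// Figures quoted in the text for the Kepler GK110 (GTX Titan Black).
constexpr int kWarpSize = 32;      // threads scheduled together as one warp
constexpr int kSmxCount = 15;      // streaming multiprocessors per GPU
constexpr int kCoresPerSmx = 192;  // single precision CUDA cores per SMX

// Hypothetical helper: warps needed for a block of the given size.
// A partially filled last warp still occupies a whole warp.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helpers: position of a thread inside its block.
constexpr int warp_of_thread(int thread_in_block) { return thread_in_block / kWarpSize; }
constexpr int lane_of_thread(int thread_in_block) { return thread_in_block % kWarpSize; }

// Counts the warps of one block in which threads disagree on a branch.
// `takes_branch(t)` is an illustrative predicate on the thread index t.
template <typename Pred>
int count_divergent_warps(int threads_per_block, Pred takes_branch) {
    assert(threads_per_block > 0);
    int divergent = 0;
    for (int w = 0; w < warps_per_block(threads_per_block); ++w) {
        bool any_taken = false;
        bool any_not_taken = false;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            const int t = w * kWarpSize + lane;
            if (t >= threads_per_block) break;  // last warp may be partial
            if (takes_branch(t)) any_taken = true; else any_not_taken = true;
        }
        if (any_taken && any_not_taken) ++divergent;
    }
    return divergent;
}
```

With a block of 256 threads, a branch on `t % 2 == 0` splits every one of the 8 warps, whereas a branch on `(t / 32) % 2 == 0` keeps each warp on a single code path. The constants also give the total core count, 15 x 192 = 2880.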
CUDA Programming Model
The portion of code that runs in parallel on the GPU cores is called a kernel. Kernels invoked by CUDA code execute on a generally large number of parallel threads and each thread executes an instance of the kernel. Threads must be organized by the developer into thread blocks, which compose a grid of threads as illustrated in Figure 1.12.
Figure 1.12: Thread hierarchy. Threads are grouped in blocks, which compose a grid.
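As a rough illustration of the hierarchy, the following plain C++ sketch emulates on the CPU how each thread of a one-dimensional grid derives a unique global index from its block and thread indices. In real CUDA code these values come from the built-in variables `blockIdx`, `blockDim` and `threadIdx`; the struct and function names below are hypothetical stand-ins, and the nested loops only imitate what the GPU does in parallel.

```cpp
#include <cassert>

// Hypothetical stand-in for a 1D CUDA launch configuration.
struct LaunchConfig {
    int grid_dim;   // number of blocks in the grid
    int block_dim;  // number of threads in each block
};

// Same arithmetic a 1D kernel typically uses:
// blockIdx.x * blockDim.x + threadIdx.x
constexpr int global_thread_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Chooses enough blocks to cover n elements (ceiling division).
LaunchConfig make_launch_config(int n, int block_dim) {
    assert(n > 0 && block_dim > 0);
    return LaunchConfig{(n + block_dim - 1) / block_dim, block_dim};
}

// Sequential emulation of a kernel launch: calls kernel(i) once per
// in-range thread and returns how many threads did useful work.
template <typename Kernel>
int emulate_launch(int n, int block_dim, Kernel kernel) {
    const LaunchConfig cfg = make_launch_config(n, block_dim);
    int active = 0;
    for (int b = 0; b < cfg.grid_dim; ++b) {
        for (int t = 0; t < cfg.block_dim; ++t) {
            const int i = global_thread_index(b, cfg.block_dim, t);
            if (i < n) {  // bounds guard: the last block may overshoot n
                kernel(i);
                ++active;
            }
        }
    }
    return active;
}
```

For 1000 elements and blocks of 256 threads, this yields a grid of 4 blocks (1024 threads), of which 24 are idle because of the bounds guard.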
Threads have their own private memory and registers but also have access to a per-block shared memory region that can be used for inter-thread communication or data sharing, a global device memory region generally used to read inputs and write results, and a constant memory region that is read-only for code running on the GPU (Figure 1.13).
Figure 1.13: CUDA memory structure from a developer's point of view. Different threads have access to different memory regions. While all threads have access to global memory and to the shared memory of the block to which they belong, each thread also has access to its own small portion of private memory and registers.
To perform computations on the GPU, the following steps are required:
• Copy input data from host memory (RAM) to device (global) memory;
• Launch the kernel;
• Copy back the results from device memory to host memory.
Since CUDA 6, a unified memory space that includes both RAM and device memory is available, but manual management of device memory is still more efficient and is therefore adopted throughout this work.
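The three steps listed above can be mirrored in a CPU-only C++ sketch. This is an emulation under stated assumptions, not the code used in this work: the "device" buffers are ordinary `std::vector` objects standing in for memory that real CUDA code would allocate with `cudaMalloc` and fill with `cudaMemcpy`, and the names `scale_kernel` and `scale_on_emulated_device` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a kernel body: "thread" i scales one element.
void scale_kernel(const float* in, float* out, std::size_t n,
                  float factor, std::size_t i) {
    if (i < n) out[i] = factor * in[i];  // bounds guard, as in a real kernel
}

std::vector<float> scale_on_emulated_device(const std::vector<float>& host_in,
                                            float factor) {
    const std::size_t n = host_in.size();

    // Step 1: copy input from host memory to (emulated) device memory.
    std::vector<float> dev_in(host_in);
    std::vector<float> dev_out(n, 0.0f);

    // Step 2: "launch" the kernel over a grid of blocks. The loops run
    // sequentially here; on a GPU the threads would run in parallel.
    const std::size_t block_dim = 256;
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            scale_kernel(dev_in.data(), dev_out.data(), n, factor,
                         b * block_dim + t);
        }
    }

    // Step 3: copy the results back from device memory to host memory.
    std::vector<float> host_out(dev_out);
    return host_out;
}
```

Calling `scale_on_emulated_device({1.0f, 2.0f, 3.0f}, 2.0f)` returns `{2.0f, 4.0f, 6.0f}`. The explicit copies make visible the host/device transfers that manual memory management requires.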
Threads are then mapped to physical processors on the GPU. Each GPU can execute more than one kernel grid (if enough resources are available), a streaming multiprocessor (SMX) can execute one or more blocks, and individual CUDA cores execute thread instructions. The different speeds of the various memories on the GPU can be exploited to optimize CUDA code. Specifically, registers and shared memory are much faster than global device memory.
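One common way to exploit the faster memories is to accumulate intermediate results per block and write to global memory only once per block. The plain C++ sketch below emulates this idea on the CPU for a sum reduction. It is an assumption-laden illustration rather than the optimization used in this work: a local variable plays the role of per-block shared memory, a `std::vector` plays the role of global memory, and `block_reduce` and `ReductionResult` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct ReductionResult {
    long long sum;      // total of all input elements
    int global_writes;  // writes to the emulated global memory
};

ReductionResult block_reduce(const std::vector<int>& data, int block_dim) {
    assert(block_dim > 0);
    const int n = static_cast<int>(data.size());
    const int grid_dim = (n + block_dim - 1) / block_dim;

    std::vector<long long> partial(grid_dim, 0);  // emulated global memory
    int global_writes = 0;

    for (int b = 0; b < grid_dim; ++b) {
        long long shared_acc = 0;  // stands in for fast per-block shared memory
        for (int t = 0; t < block_dim; ++t) {
            const int i = b * block_dim + t;
            if (i < n) shared_acc += data[i];
        }
        partial[b] = shared_acc;  // a single global write per block
        ++global_writes;
    }

    // Final pass over the per-block partial sums.
    long long sum = 0;
    for (long long p : partial) sum += p;
    return ReductionResult{sum, global_writes};
}
```

For 1000 elements and blocks of 256 threads, the partial sums are written to global memory 4 times, compared with 1000 writes if every thread updated a global accumulator directly.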