Rendering and displaying graphics is by nature a problem that, if it is to be solved efficiently, requires parallelization and multithreading [101]. For this reason, Graphics Processing Units (GPUs) are designed to handle many calculations at the same time, not only by employing multiple processor cores but also by enabling the different cores to process multiple threads at the same time. GPUs are therefore commonly put to use on parallelizable problems other than graphics rendering [4, 35, 84, 90]. When used this way, they are referred to as general-purpose graphics processing units (GPGPUs). NVIDIA provides a C framework called CUDA [2] for easier programming of its GPU devices, and also offers slightly less advertised support for OpenCL [63], a similar but more flexible framework for parallelization on different devices [64]. AMD's graphics cards support OpenCL [6]. This section presents the GPU architecture as exposed by the CUDA framework and points out some major optimization possibilities.

The GPU follows the SIMD (single instruction, multiple data) principle: it is designed to apply the same processor instruction to different data sets [2]. Each multiprocessor on the GPU executes threads in groups of 32, and such a group is referred to as a warp. A warp runs fully in parallel only if each of its threads performs the same instructions in the same sequence on different data. There are therefore limits to how well a GPU can parallelize a given problem. All threads must also be independent, so that the GPU can switch between threads at any time to hide memory access latency [2].
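As a rough illustration of the lockstep requirement, the following CUDA sketch contrasts a kernel whose branch depends on the thread index with one that produces the same result without a branch. The names `scale_divergent`, `scale_uniform` and `run_scale` are hypothetical and not taken from [2], and error checking is omitted. For a branch this small the compiler may well emit predicated instructions, so the sketch shows the pattern rather than a guaranteed slowdown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent pattern: even and odd threads of the same warp take different
// branches, so the two paths may be executed one after the other.
__global__ void scale_divergent(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] *= 3.0f;
    }
}

// Uniform pattern: every thread executes the same instructions; only the
// data (the factor) differs between threads.
__global__ void scale_uniform(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float factor = 2.0f + (float)(i % 2);  // 2 for even i, 3 for odd i
    data[i] *= factor;
}

// Hypothetical host helper: fills an array with ones, runs one of the two
// kernels on it and returns the result. Assumes n > 0; no error checking.
std::vector<float> run_scale(bool use_divergent, int n) {
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;                       // a multiple of the warp size
    const int blocks = (n + threads - 1) / threads;
    if (use_divergent) {
        scale_divergent<<<blocks, threads>>>(dev, n);
    } else {
        scale_uniform<<<blocks, threads>>>(dev, n);
    }

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Both variants compute the same values; they differ only in whether the threads of a warp follow one instruction stream or two.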

CUDA organizes its threads in blocks, as shown in figure 2.6. Each block is assigned to an available GPU multiprocessor, as shown in figure 2.7.

Each multiprocessor runs the threads of its assigned block to completion, in warps of 32 threads each [2]. A necessary requirement for maximum multiprocessor utilization is hence that the number of threads per block be divisible by 32. The blocks are organized in a larger grid. Both the blocks and the grid can be one- or two-dimensional, but this has no performance benefits and is merely a convenience when indexing array data [2].
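The block and grid organization can be sketched with a one-dimensional launch. The example below is a minimal sketch, not code from [2]: the names `add_vectors`, `blocks_for` and `add_on_gpu` are hypothetical, the block size of 256 is an arbitrary multiple of 32, the input is assumed to be non-empty, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element, selected by its block and thread index.
__global__ void add_vectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads beyond the array end
        c[i] = a[i] + b[i];
    }
}

// Number of blocks needed to cover n elements (rounded up).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical host wrapper; a and b are assumed to be non-empty and of
// equal length.
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b) {
    const int n = (int)a.size();
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads_per_block = 256;  // 8 warps of 32 threads each
    add_vectors<<<blocks_for(n, threads_per_block), threads_per_block>>>(
        d_a, d_b, d_c, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

Since the array length is rarely a multiple of the block size, the block count is rounded up and the bounds check inside the kernel keeps the surplus threads of the last block idle.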

Figure 2.6: The relation between CUDA blocks, the grid and each thread. The figure is taken from [2].

Figure 2.7: Distribution of blocks over the multiprocessors. Here, they are called streaming multiprocessors. The figure is taken from [2].

Each thread runs its own instance of a kernel [2]. This is a C-like function with some CUDA extensions, compiled for the GPU processor. The kernel is launched for a grid of blocks, and each thread in each block executes the instructions in the kernel on different data, accessed by using thread and block indices. Should threads within a warp need to execute different instructions, the warp diverges and the multiprocessor runs the differing threads sequentially instead of in parallel in groups of 32 [2].

GPUs have different kinds of memory [2]. The GPU has some DRAM, called global memory, and some multiprocessor-associated caches, registers and shared memory. Compared to the last three, the DRAM is slow, and transfers to and from it will be the bottleneck [1]. Each multiprocessor has a scarce set of registers into which local variables within the scope of each thread are allocated. Any intermediate results in a larger calculation are also temporarily saved here. This puts an upper limit on the allowable number of threads per block, depending on the number of local variables needed per thread. In addition to the registers, there is shared memory. CUDA may allocate arrays or variables in shared memory, which are shared across the threads inside a given block. There are some limits to how this shared memory may be accessed. It is divided into banks. If two threads access the same bank (but not the exact same memory position), there will be a bank conflict, requiring the GPU to perform the memory accesses in sequence [2].
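A minimal sketch of how shared memory is typically used is a block-wise sum, in which the threads of a block cooperate through a `__shared__` array. The names `block_sums`, `sum_on_gpu` and `kThreads` are hypothetical and not taken from [2]. The kernel assumes that it is launched with exactly `kThreads` threads per block, the input is assumed to be non-empty, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Each block adds up its slice of the input in shared memory and writes one
// partial sum to global memory.
__global__ void block_sums(const float* in, float* out, int n) {
    __shared__ float partial[kThreads];  // visible to all threads of a block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // wait until every thread has stored its value

    // Halve the number of active threads in each step. Neighbouring threads
    // touch neighbouring words, which lie in different banks.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        out[blockIdx.x] = partial[0];
    }
}

// Hypothetical host wrapper: the per-block sums are added up on the host.
// Assumes a non-empty input.
float sum_on_gpu(const std::vector<float>& values) {
    const int n = (int)values.size();
    const int blocks = (n + kThreads - 1) / kThreads;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sums<<<blocks, kThreads>>>(d_in, d_out, n);

    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float s : sums) total += s;
    return total;
}
```

The addressing scheme in the loop is chosen so that the threads of a warp read consecutive shared-memory words; an interleaved scheme, in which thread `tid` reads `partial[2 * stride * tid]`, would map several threads onto the same bank.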

There are some requirements for optimal memory access [1]. Each multiprocessor has a cache that can load a given number of bytes. When one thread reads four bytes from memory into a floating-point variable, the cache also loads the subsequent bytes from memory in one chunk, as far as there is space left in the cache (the principle of spatial locality [69]). The threads in a warp all benefit from the same cached chunk, provided that consecutive threads access consecutive memory locations. The number of requests to the slow global memory is thereby reduced. If, on the other hand, data from consecutive memory locations is not needed, or if the needed memory is not properly aligned with the cache lines, the parallel access breaks down. Reading memory into the cache will in these cases have to be done sequentially, once for each thread [1]. Memory access is said to be coalesced when it can be done in parallel [2].
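The difference between coalesced and scattered reads can be sketched with a single kernel whose read pattern is controlled by a stride. This is an illustrative sketch only: `gather_strided` and `gather_on_gpu` are hypothetical names, the input is assumed to be non-empty, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread i reads element (i * stride) mod n. With stride == 1, consecutive
// threads read consecutive addresses, so the reads of a warp can be served
// together. With a larger stride, the reads of a warp are spread over many
// separate chunks of global memory.
__global__ void gather_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Widen before multiplying so that i * stride cannot overflow int.
        int j = (int)(((long long)i * stride) % n);
        out[i] = in[j];
    }
}

// Hypothetical host wrapper returning the gathered array. Assumes a
// non-empty input.
std::vector<float> gather_on_gpu(const std::vector<float>& in, int stride) {
    const int n = (int)in.size();
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    gather_strided<<<blocks, threads>>>(d_in, d_out, n, stride);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `gather_on_gpu(data, 1)` gives the coalesced case and, for example, `gather_on_gpu(data, 33)` a scattered one. Timing the two launches with a profiler is one way to observe the effect on a given device.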

The computer to which the graphics card is attached will be referred to as the host.

Some difference between the results produced by similar GPU and CPU code is to be expected [2]. The GPU may flush very small numbers to zero in specific settings. Because floating-point results depend on the order of the operations, it is important to perform the calculations in the same sequence in order to produce similar results on CPU and GPU.
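The dependence on operation order can be illustrated with three single-precision numbers added in two different groupings. The sketch below is an assumption-laden illustration, not code from [2]: `sum_left`, `sum_right`, `sums_on_device` and `device_sums` are hypothetical names, the code assumes IEEE single-precision arithmetic without fast-math reassociation, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The same three numbers, added in two different orders. The functions are
// compiled for both the host and the device.
__host__ __device__ float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

__host__ __device__ float sum_right(float a, float b, float c) {
    return a + (b + c);
}

// Evaluates both orders on the GPU.
__global__ void sums_on_device(float a, float b, float c, float* out) {
    out[0] = sum_left(a, b, c);
    out[1] = sum_right(a, b, c);
}

// Hypothetical host wrapper returning {left-to-right sum, right-to-left sum}
// as computed on the device.
std::vector<float> device_sums(float a, float b, float c) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));
    sums_on_device<<<1, 1>>>(a, b, c, d_out);

    std::vector<float> out(2);
    cudaMemcpy(out.data(), d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

With the inputs `1e8f`, `-1e8f` and `1.0f`, the left-to-right order gives 1, whereas the other order gives 0, because `-1e8f + 1.0f` rounds back to `-1e8f` in single precision. A parallel reduction on the GPU adds its terms in a different order than a sequential loop on the host, so small deviations of this kind are to be expected.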

Each NVIDIA GPU has a designated compute capability, which describes which features the GPU supports. A GPU program must be compiled for a specific compute capability [2].
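The compute capability of an installed device can be queried at run time through the CUDA runtime API. The following is a minimal sketch; the helper names `query_compute_capability` and `print_compute_capability` are hypothetical, while `cudaGetDeviceProperties` and the `major` and `minor` fields of `cudaDeviceProp` are part of the runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stores the compute capability of the given device in *major and *minor.
// Returns false if the device cannot be queried (for example, no GPU present).
bool query_compute_capability(int device, int* major, int* minor) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    *major = prop.major;
    *minor = prop.minor;
    return true;
}

// Prints the compute capability of the given device, if available.
void print_compute_capability(int device) {
    int major = 0, minor = 0;
    if (query_compute_capability(device, &major, &minor)) {
        std::printf("Device %d: compute capability %d.%d\n",
                    device, major, minor);
    } else {
        std::printf("Device %d could not be queried\n", device);
    }
}
```

The reported version corresponds to the architecture given to the compiler, for instance `nvcc -arch=sm_70` for a device of compute capability 7.0.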

2.3 Real-time systems and general concurrency