
The GPU algorithms we present in Chapter 7 and Chapter 8 were implemented using the NVIDIA CUDA API¹. CUDA expresses data-parallel problems as so-called kernels, which are executed by up to millions of GPU threads. According to Flynn's taxonomy [Flynn 1972], a GPU can be categorized at a high level as a multiple-instruction-multiple-data (MIMD) device, as different groups of GPU threads can process different kernel instructions on different data per clock. The groups themselves operate in a single-instruction-multiple-data (SIMD) fashion. Kernels are implemented in CUDA C, which is the C language with some extensions for GPU programming. We give a brief introduction to the basic aspects of GPU programming with CUDA [NVIDIA 2017a]. A non-proprietary, platform-independent alternative for GPU programming is OpenCL². The Intel SPMD Program Compiler³ takes a very similar approach to CUDA but is intended for exploiting the vector units of CPUs.

3.1 Kernels, Grids, and Blocks

We start our introduction with kernels, grids, blocks, multiprocessors, and their relationships, which are also depicted in Figure 3.1. A CUDA kernel is executed by a grid of up to millions of threads and operates on data stored in device memory (GPU memory). The user side, which invokes the kernel, is called the host. A grid is decomposed into equally sized blocks of a user-specified size of at most 1024 threads. CUDA C code can access the global

¹ https://developer.nvidia.com/cuda-downloads
² https://www.khronos.org/opencl/

    __global__ void VecAdd(float* A, float* B, float* C, int N) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

[Figure 3.1 diagram: the kernel in CUDA C shown above; a grid of blocks B0 to B15, each block with threads 0 to N − 1; the blocks of the launched grid executed over time t on a GPU with 2 SMs (SM 0, SM 1) and on a GPU with 4 SMs (SM 0 to SM 3).]

Figure 3.1: Depiction of the relationship between kernels, grids, blocks, and multiprocessors. To execute a kernel, a grid of threads is launched. A grid is decomposed into smaller units called blocks (B), which are waiting (red) to be executed. Each block has the same programmer-specified number of threads N. Blocks are executed (green) in parallel by the streaming multiprocessors (SM) of GPUs. The number of blocks a multiprocessor can process at a time depends on different factors. In our example this resulted in two blocks at a time. GPUs can have different numbers of multiprocessors. The example shows two GPUs with 2 and 4 multiprocessors, respectively. A multiprocessor replaces terminated blocks with new blocks waiting for execution. As the runtime of blocks can vary, new blocks are scheduled dynamically. (Partially based on [NVIDIA 2017a], Figure 5)

index of a block and the block-relative index of threads in a block via built-in variables. Together, both indices make it possible to compute unique global indices that identify the portion of the input data each thread has to process. A GPU possesses N_SM ∈ ℕ so-called multiprocessors, which process blocks. A block is mapped to exactly one multiprocessor, which can only process a GPU-dependent maximum number of blocks at a time. Thus, only a subset of the millions of threads of a grid is actually active at a time. There is neither a guaranteed processing order for the whole set of blocks nor a predetermined mapping of blocks to multiprocessors. Threads in blocks of different multiprocessors can process different instructions on different data per clock. The maximum number of active threads per multiprocessor is also limited and GPU dependent. This maximum cannot be reached if the block size is not a divisor of this limit or is so small that the active block count limit is reached first. In NVIDIA terminology, the ratio of active threads to the maximum possible number of active threads is called occupancy. In practice, occupancy is mainly further limited by the register and shared memory usage of a kernel. Depending on the complexity of a kernel, the compiler determines the maximum number of registers needed per thread for execution of the kernel. Shared memory is a resource that can be allocated per block, which we will briefly introduce in Section 3.4. Only a small amount of both resources is available per multiprocessor. At times it can be beneficial to partition kernel code into several smaller kernels, which might have higher occupancy and thus possibly higher performance. Using streams, several kernels can be launched in parallel if resource usage allows. Streams also make it possible to transfer data to and from device memory asynchronously while kernels are running.
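The index arithmetic and the occupancy limits described above can be illustrated with a small host-side model. The following is a minimal sketch in plain C++ that runs on the host and is not a CUDA kernel. The names `blocksForElements`, `globalIndex`, `SmLimits`, and `occupancy` are hypothetical, and the model only covers the thread and block limits, not register or shared memory usage.

```cpp
#include <algorithm>
#include <cassert>

// Number of blocks needed so that blocks * blockSize >= n (ceiling division).
inline int blocksForElements(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Global index of a thread, mirroring
// threadIdx.x + blockDim.x * blockIdx.x in the VecAdd kernel.
inline int globalIndex(int threadIdx, int blockDim, int blockIdx) {
    return threadIdx + blockDim * blockIdx;
}

// Hypothetical per-multiprocessor limits; real values depend on the GPU.
struct SmLimits {
    int maxThreads;  // maximum number of active threads
    int maxBlocks;   // maximum number of active blocks
};

// Occupancy when only the thread and block limits matter
// (register and shared memory usage are ignored in this model).
inline double occupancy(int blockSize, SmLimits sm) {
    assert(blockSize > 0 && sm.maxThreads > 0 && sm.maxBlocks > 0);
    int blocksByThreads = sm.maxThreads / blockSize;  // whole blocks that fit
    int activeBlocks = std::min(blocksByThreads, sm.maxBlocks);
    return static_cast<double>(activeBlocks * blockSize) / sm.maxThreads;
}
```

With assumed limits of 2048 threads and 32 blocks per multiprocessor, a block size of 768 yields an occupancy of 0.75, because only two whole blocks fit into the thread limit, and a block size of 32 yields 0.5, because the block limit is reached first.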

3.2 Warps

CUDA implicitly partitions blocks into groups of 32 threads, which are called warps. Warps are processed by different single-instruction-multiple-data (SIMD) units of a multiprocessor. The actual SIMD width of a SIMD unit varies from GPU to GPU and also depends on the type of instruction. To the programmer, however, warps appear to be of SIMD nature with a SIMD width of 32, as threads in a warp are implicitly synchronized after each instruction. Starting with the NVIDIA Fermi GPU architecture, each multiprocessor has at least two warp schedulers, each of which can issue instructions to a different warp at a time. This means that a multiprocessor is a MIMD device itself. As the active blocks usually consist of more warps than there are warp schedulers, only a subset of the active threads per multiprocessor is issuing instructions at a time. The larger number of active threads is still necessary, though. GPUs are optimized for high instruction throughput at the cost of different kinds of high latencies. SIMD units rely on switching between warps in the pool of active threads to hide those latencies by issuing instructions of other ready warps. If enough instruction-level parallelism is available, warps have to be switched less often and full performance can be achieved with lower occupancy.
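The partitioning of a block into warps can be sketched as simple integer arithmetic. The following host-side C++ model assumes one-dimensional blocks, in which consecutive thread indices are grouped into the same warp. The names `kWarpSize`, `warpsPerBlock`, `warpId`, and `laneId` are hypothetical.

```cpp
#include <cassert>

const int kWarpSize = 32;  // warp size stated in the text

// Number of warps a block of blockSize threads is split into.
// A partially filled last warp still counts as a whole warp.
inline int warpsPerBlock(int blockSize) {
    assert(blockSize > 0);
    return (blockSize + kWarpSize - 1) / kWarpSize;
}

// Warp a thread belongs to within its block (one-dimensional blocks).
inline int warpId(int threadIdxInBlock) {
    assert(threadIdxInBlock >= 0);
    return threadIdxInBlock / kWarpSize;
}

// Position (lane) of a thread inside its warp.
inline int laneId(int threadIdxInBlock) {
    assert(threadIdxInBlock >= 0);
    return threadIdxInBlock % kWarpSize;
}
```

A block of 100 threads, for example, occupies four warps in this model, the last of which is only partially filled. This is one reason to prefer block sizes that are multiples of 32.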

3.3 SIMD and SIMT

While multiprocessors essentially possess SIMD units, NVIDIA coined the term SIMT for single-instruction-multiple-threads, as there are some differences to traditional SIMD in terms of programming and hardware. On the programming side, the CUDA C programming language, in which kernels are programmed, mainly expresses kernels from the point of view of a single scalar thread, which has a global ID to identify the data it has to process. Thus, code is independent of the SIMD width. In contrast, SIMD programming directly exposes the underlying SIMD width in its separate sets of instructions (or intrinsic functions). The hardware implementations of Intel's SSE, AVX, and AVX-512 technologies have 4-, 8-, and 16-wide SIMD units for 32-bit elements, respectively [Intel 2017]. Code has to be specifically adapted to the SIMD width of the technology used. On the hardware side, each thread or lane in an active warp has its own set of registers and, more importantly, its own program counter. This already enables a lane to define a single thread of execution, justifying the term thread. Reading from (gather) or writing to (scatter) individual memory addresses per lane is also directly supported in hardware and, in contrast to SIMD programming, does not require explicit handling in software.

[Figure 3.2 diagram: the memory spaces (local, global, texture, constant, and shared memory), their scopes (thread, block, grid), and the underlying hardware (L1 cache, texture cache, constant cache, on-chip scratch memory, L2 cache, device memory).]

Figure 3.2: Diagram of the different memory spaces (rounded boxes) and their scopes (white boxes). The underlying hierarchy of caches and device memory is depicted as well. Arrows indicate read/write or read-only access. The presence of the L1 cache varies between GPU series and also between models in a series. (Partially based on [NVIDIA 2017a], Figure 7)

What SIMT has in common with SIMD is potentially lower SIMD efficiency with conditional code. As the name implies, SIMT can only execute a single instruction on multiple threads at a time. Execution diverges when different threads in a warp take different control flow paths. The separate branches have to be executed one after the other. Deeply nested conditional statements can lead to complete serialization of the execution of a warp. While with traditional SIMD units the programmer has to create lane masks manually to deactivate SIMD lanes, SIMT handles masking automatically in hardware. If all threads decide on the same branch of a conditional statement, no divergence occurs. But there is a key difference to SIMD. As every SIMT lane has its own program counter, execution automatically continues after the conditional statement once the branch has been processed. The SIMD programmer has to branch explicitly past the untaken branch to prevent unnecessary issuing of instructions for which all lanes are deactivated.
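The cost of divergence described above can be approximated by counting the distinct control flow paths taken within a warp. The following host-side C++ sketch is a simplified model, not a description of the actual hardware scheduling. It assumes that every path costs the same number of instructions, and the names `serializedPasses` and `simdEfficiency` are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Number of passes a warp needs for one conditional statement when
// branchOfLane[i] names the control flow path taken by lane i.
// Each distinct path is executed once, with all other lanes masked out.
inline int serializedPasses(const std::vector<int>& branchOfLane) {
    assert(!branchOfLane.empty());
    std::set<int> distinctPaths(branchOfLane.begin(), branchOfLane.end());
    return static_cast<int>(distinctPaths.size());
}

// Fraction of lane slots doing useful work over all passes, under the
// simplifying assumption that every path has the same instruction count.
inline double simdEfficiency(const std::vector<int>& branchOfLane) {
    return 1.0 / serializedPasses(branchOfLane);
}
```

If all lanes take the same path, the model reports a single pass and full efficiency. If even and odd lanes take different paths, two passes are needed and the efficiency drops to one half.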

3.4 Memory Spaces

CUDA provides access to different memory spaces, which differ in purpose, scope, and access characteristics. We give a brief introduction to each memory space. Figure 3.2 gives an overview of the scope of the memory spaces and their hierarchical relationship with caches and device memory.

Device and Global Memory. Device memory is the off-chip main memory of a GPU. Essentially every CUDA kernel works at least on device memory, as it is the only memory area through which data can be exchanged with the host. While it is possible to read from and write to system memory directly from the GPU, this is not advisable, as the bandwidth of the PCI Express bus is one to two orders of magnitude lower than that of device memory. From the scope of a grid, device memory is also called global memory, as it is globally readable and writable for all threads in a grid.

The first generation of CUDA-enabled GPUs had very strict so-called coalescing rules for efficient global memory access, which we will not discuss here. All following generations


[Figure 3.3 diagram: a warp with threads 0 to 31 accessing global memory; address axis with ticks at 128, 160, 192, 224, 256, and 288.]

Figure 3.3: Depiction of a warp accessing global memory via the L1 cache (red) or the L2 cache (green). The L1 cache access results in two 128-byte transactions, as the warp accesses two 128-byte L1 cache lines. On a miss, 256 bytes would have to be loaded from the L2 cache. The L2 cache access results in four 32-byte transactions, as the warp accesses four 32-byte L2 cache lines. (Based on [NVIDIA 2017a], Figure 16)

have an L2 cache, and some have an L1 cache, both of which have much simpler coalescing rules. The presence of the L1 cache varies between GPU series and also between models in a series. Global memory is divided into segments of 32 bytes for the L2 cache and 128 bytes for the L1 cache. When threads in a warp access global memory, the access is simply split into as many memory transactions as there are different segments accessed. Figure 3.3 depicts this for L1 and L2 accesses. In the worst case, each thread in a warp accesses a different segment, which results in 32 transactions. Thus, to keep the number of transactions low, threads in a warp should access nearby addresses, or related data should be kept closer together.
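The transaction count described above is the number of distinct segments touched by a warp. The following host-side C++ sketch models this rule only. It assumes that every per-thread access is naturally aligned, so that no single access straddles a segment boundary. The names `countTransactions` and `stridedWarp` are hypothetical, and the start address 160 in the usage note is an assumption chosen to reproduce the counts stated in the caption of Figure 3.3.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of memory transactions for one warp-wide access:
// one transaction per distinct segment that contains an accessed address.
inline int countTransactions(const std::vector<std::uint64_t>& addresses,
                             std::uint64_t segmentSize) {
    assert(segmentSize > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t address : addresses)
        segments.insert(address / segmentSize);  // index of the segment hit
    return static_cast<int>(segments.size());
}

// Byte addresses of a 32-thread warp in which thread i accesses
// base + i * strideBytes.
inline std::vector<std::uint64_t> stridedWarp(std::uint64_t base,
                                              std::uint64_t strideBytes) {
    std::vector<std::uint64_t> addresses(32);
    for (std::uint64_t i = 0; i < 32; ++i)
        addresses[i] = base + i * strideBytes;
    return addresses;
}
```

For consecutive 4-byte accesses starting at byte 160, the model reports two transactions for 128-byte segments and four for 32-byte segments. A stride of 128 bytes places every thread in its own segment and gives the worst case of 32 transactions.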

Local and Constant Memory. Local memory and constant memory are two additional memory types that reside in device memory. Local memory derives its name from the fact that it has thread-local scope. It is mainly used for register spilling if a kernel uses too many registers, and for thread-scope arrays that have no static access pattern. The thread-local traversal stack used in the GPU ray tracing kernels from Aila and Laine [2009], for example, ends up in local memory, as it is accessed in an unpredictable manner. Constant memory has a designated cache, which is optimized for multiple simultaneous 4-byte accesses to the same address. Thus, it is meant for constant data that is needed by several threads at the same time. Simultaneous accesses to multiple addresses are serialized into as many accesses as there are different addresses.

Texture Memory. Texture memory is the last type of memory that resides in device memory. It allows 1-, 2-, or 3-dimensional indices for addressing, with optimized performance of lookups in a 2D or 3D neighborhood. For the last two variants, the input data first has to be converted into a CUDA Array, which stores the data in an optimized, opaque, proprietary memory layout. All NVIDIA GPUs access texture memory via an additional dedicated read-only L1 cache. The CUDA programming guide is unspecific regarding optimal texture memory access patterns. The only hint is that “[if] the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches” [NVIDIA 2017a]. As we will see in Chapter 7, Section 7.2.1, our experiments with certain access patterns, which are bad for either global or both global and shared memory,

[Figure 3.4 diagram: a simplified warp with 16 threads (0 to 15) accessing shared memory words 0 to 47, which are distributed over 16 banks; conflicts per bank: 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 0.]

Figure 3.4: Depiction of the organization of shared memory into banks and access patterns in a warp which cause bank conflicts. For clarity, the warp size and the number of banks are reduced to 16. Consecutive 4-byte words are periodically assigned to the memory banks. Threads 3 and 4 cause a two-way bank conflict, as they access two different words in the same bank. Threads 7, 8, 9, and 10 cause a three-way bank conflict, as they access three different words in the same bank. (Based on [NVIDIA 2017a], Figure 18)

reveal an almost equal performance for texture memory with either access pattern. Owing to its computer graphics origins, texture memory provides some additional hardware features such as different addressing modes, value interpolation, and unpacking of specially stored data. These features are not important in the context of this dissertation.

Shared Memory. Shared memory is located on-chip and as such has “much higher bandwidth and much lower latency than local or global memory” [NVIDIA 2017a]. It has block scope and is intended to be shared between threads in a block in a cooperative way. The amount of available shared memory is only a few tens of kilobytes per multiprocessor. It is organized in 32 banks, which can be accessed simultaneously. Consecutive 4-byte words are periodically assigned to the 32 banks. That is, shared memory addresses which are 128 bytes apart are assigned to the same bank. The bandwidth of each bank is one 4-byte word per clock. Multiple threads may access different banks, or the same word in one bank, without penalty. When several threads access different words which are assigned to the same bank, a bank conflict occurs. In this case the accesses to the different words have to be serialized, which effectively reduces the instruction throughput. Thus, efficient shared memory usage aims at reducing the number of bank conflicts. Figure 3.4 depicts the shared memory organization and conflicting access patterns. Common strided access patterns of the form threadIdx*stride cause bank conflicts if stride shares a common divisor greater than one with the number of banks, and should be avoided if possible. The worst case is a stride equal to the number of banks itself, in which case all 32 threads access different words in the same bank and the conflict is as many ways as there are banks.
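The effect of the stride on bank conflicts can be checked with a small host-side model. The following C++ sketch assumes the organization described above, with 32 banks and consecutive 4-byte words assigned to consecutive banks. The names `kNumBanks`, `conflictDegree`, and `stridedAccess` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

const int kNumBanks = 32;  // number of shared memory banks stated in the text

// Degree of the worst bank conflict for one warp-wide shared memory access.
// wordIndex[i] is the index of the 4-byte word accessed by thread i.
// Returns 1 for a conflict-free access and k for a k-way conflict.
inline int conflictDegree(const std::vector<int>& wordIndex) {
    std::map<int, std::set<int> > wordsPerBank;  // bank -> distinct words
    for (int word : wordIndex) {
        assert(word >= 0);
        wordsPerBank[word % kNumBanks].insert(word);
    }
    int degree = 1;
    for (const auto& entry : wordsPerBank)
        degree = std::max(degree, static_cast<int>(entry.second.size()));
    return degree;
}

// Word indices of a 32-thread warp for the pattern threadIdx * stride.
inline std::vector<int> stridedAccess(int stride) {
    assert(stride >= 0);
    std::vector<int> wordIndex(32);
    for (int t = 0; t < 32; ++t) wordIndex[t] = t * stride;
    return wordIndex;
}
```

In this model an odd stride such as 3 is conflict free, a stride of 2 causes a two-way conflict, and a stride of 32 causes a 32-way conflict. For positive strides the reported degree equals the greatest common divisor of the stride and the number of banks.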

3.5 Block Cooperation and Synchronization

Threads in a block can work cooperatively by exchanging data via slow global memory or the fast on-chip shared memory. To avoid data hazards between threads, special synchronization instructions have to be inserted explicitly into the kernel code. While choosing larger block sizes allows more threads to cooperate, block synchronization time increases. Also, the occupancy of a multiprocessor temporarily decreases the more warps arrive at the synchronization barrier. This can reduce the latency hiding efficiency of a multiprocessor. Section 3.2 mentioned that threads in a warp are implicitly synchronized after each instruction. This implicit synchronization can be used to avoid explicit synchronization if a problem can be partitioned into warps in a meaningful way. The downside is that kernel code is no longer oblivious to the SIMT width and might break if future GPUs have a different SIMT width. The implementations of our GPU algorithms in Chapter 8 exploit implicit warp-level synchronicity to achieve higher performance. CUDA provides special hardware-supported warp voting and warp shuffle functions, which make it possible to exchange data efficiently between threads in a warp without additional memory and explicit synchronization, but they are out of the scope of this introduction. According to [NVIDIA 2017a], Section H.6.2, future NVIDIA GPU generations will remove the implicit warp synchronization. Instead, special warp synchronization functions, which also allow sub-warp synchronization, have to be used.
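Why explicit barriers are needed for block cooperation can be seen in a block-wide sum over shared memory. The following C++ sketch emulates such a reduction sequentially on the host. It is an illustration under stated assumptions and not one of the algorithms of this dissertation. The function name `blockSum` is hypothetical, the block size is assumed to be a power of two, and the vector stands in for a shared memory array with one element per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side emulation of a block-wide sum in shared memory.
// Every iteration of the outer loop stands for one phase between two
// block-wide barriers (__syncthreads() in CUDA C). The inner loop plays
// the role of the threads that are still active in that phase.
inline float blockSum(std::vector<float> shared) {
    const std::size_t blockSize = shared.size();
    // The halving scheme below requires a power-of-two block size.
    assert(blockSize > 0 && (blockSize & (blockSize - 1)) == 0);
    for (std::size_t active = blockSize / 2; active > 0; active /= 2) {
        for (std::size_t t = 0; t < active; ++t)  // "threads" 0 .. active - 1
            shared[t] += shared[t + active];
        // On the GPU a barrier is required at this point: the next phase
        // reads values that other threads have written in this phase.
    }
    return shared[0];
}
```

A block of 256 threads needs eight such phases. Once 32 or fewer threads remain active, all of them belong to a single warp, which is where implementations relying on implicit warp synchronization would drop the barrier.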

Chapter 4

On the Geometric Probability Function of the Surface Area Metric

Contents
