As a result of the previous approaches, we propose the dynamic placement approach. This approach assigns the execution location according to the available CUs, the operator, and the data used. The operator implementation itself is not optimized for any specific CU.

To evaluate this approach, we use the group-by operator from Section 3.2. We port the group-by operator to OpenCL so that it can be executed on multiple CUs. As for the optimizations from Approach II, we do not adjust any parameters according to the group count; instead, we choose default parameters that differ slightly for each CU. All executions are fixed to use Murmur3 instead of FNV-1a, as this is a general software optimization (Section 3.2.4). Additionally, the input data is fixed to 1GB (≈268M values), always located on the host side, and the fill factor is fixed to 0.5. Before execution on a new CU, we run a test execution with 50k groups, evaluate different configurations, and choose the best-performing one. The configurations consist of the number of elements per thread, #elem (directly influencing the number of threads), and the memory access type, i.e., coalesced or block-wise. The different memory access types are shown in Algorithm 2.
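The two configuration dimensions and the calibration run can be illustrated with a minimal Python sketch. It is not the OpenCL kernel or Algorithm 2 itself: all function names are hypothetical, `run_once` stands in for a real test execution (for example with 50k groups) that returns a runtime in seconds, and bounds checks for a ragged tail of the input are omitted.

```python
import math


def coalesced_indices(thread_id, num_threads, elems_per_thread):
    """Input positions read by one thread under coalesced access.

    In every step, neighboring threads touch neighboring elements,
    so consecutive reads of one thread are num_threads apart.
    """
    return [step * num_threads + thread_id for step in range(elems_per_thread)]


def blockwise_indices(thread_id, elems_per_thread):
    """Input positions read by one thread under block-wise access.

    Each thread reads one contiguous block of the input.
    """
    start = thread_id * elems_per_thread
    return list(range(start, start + elems_per_thread))


def num_threads_for(num_values, elems_per_thread):
    """#elem directly determines how many threads are needed."""
    return math.ceil(num_values / elems_per_thread)


def pick_config(run_once, elem_choices, access_choices):
    """Return the (elems_per_thread, access) pair with the lowest runtime.

    run_once is a hypothetical callable: it executes the test run for
    one configuration and returns the measured runtime in seconds.
    """
    candidates = [(e, a) for e in elem_choices for a in access_choices]
    return min(candidates, key=lambda cfg: run_once(*cfg))
```

With 4 threads and 2 elements per thread, thread 1 reads positions 1 and 5 under coalesced access, but positions 2 and 3 under block-wise access. For the 268,435,456 input values used here, 64 elements per thread correspond to 4,194,304 threads.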

Coalesced memory access ensures that neighboring threads load neighboring data at the same time, allowing the memory accesses to be combined if the CU supports this kind of memory access (supported by most GPUs). Block-wise memory access ensures that one thread reads neighboring input data, leading to improved cache usage for small numbers of cores and per-core caches (mostly beneficial for CPUs). These two memory access patterns are important when working with different architectures like CPUs and GPUs, because the wrong memory access pattern can severely harm performance.

[Figure 3.24: Testing the hash-based group-by on different computing units, showing different effects with varying group sizes. Each panel plots runtime (sec) against the number of groups (M). Panels and selected configurations: (a) Nvidia K80: 64 elements per thread, coalesced; (b) Nvidia GT640: 8 elements per thread, coalesced; (c) Intel iGPU: 64 elements per thread, coalesced; (d) AMD iGPU: 1 element per thread; (e) AMD CPU: 32 elements per thread, block; (f) AMD Tahiti GPU: 128 elements per thread, coalesced; (g) Intel Xeon CPU: 16 elements per thread, block; (h) Intel Xeon Phi: 4 elements per thread, block.]

After evaluating the ideal memory access pattern and the ideal number of elements per thread (#elem) for each CU, we can execute the operator with different numbers of groups. Figure 3.24 shows the selected configuration and the performance results for eight different CUs, including different GPUs from Nvidia, AMD, and Intel, different CPUs from Intel and AMD, and Intel's Xeon Phi. The hardware properties of the different CUs are presented in Table 2.3. All CPUs prefer block-wise memory access, while the GPUs prefer coalesced access. Only the AMD iGPU works best with one element per thread, in which case the choice between coalesced and block-wise memory access does not matter. Instead, the GPU-internal scheduler defines the access pattern by scheduling the individual threads. The different CUs show different effects and limitations when executing the group-by operator. We describe these effects in the following:

Global Memory: Each CU stores the hash table in its global memory, which limits the maximum number of groups that can be supported. The largest hash tables can be stored on the K80 (12GB), the Xeon Phi (16GB), the AMD CPU (32GB), and the Xeon CPU (64GB), while the other CUs only support smaller hash tables.

Host Memory Access: The presented CUs also differ in how they access host memory. While CPUs and integrated GPUs can access the host memory directly, the other CUs use direct memory access but have to transfer the data through the PCIe bus (generation 2 or 3). The PCIe2 bus in particular limits the transfer to a maximum of 6GB/s, which can be seen for the AMD Tahiti GPU (Figure 3.24f). There, the straight line at 0.2 s (1GB of input data at 5GB/s) indicates that the runtime cannot be better than that.

Atomic Contention: Another heterogeneous effect is the impact of atomic contention. Each CU shows a slowdown for small numbers of groups, but some CUs show better performance or a steeper slope of improvement. All Intel-based CUs show a significant impact of atomic contention, while the AMD Tahiti GPU shows the best results.

Caches: As seen earlier, caches have a high impact on performance, depending on the size of the hash table. For our test cases in Figure 3.24, the impact is clearly visible. All CUs have different cache sizes and, hence, different hash table sizes at which the runtime increases. In general, we can see that CPUs and the Xeon Phi have larger caches than GPUs and, therefore, can show good performance even for larger hash tables.

Performance: As a result of these differences, the CUs differ in performance, sometimes showing surprising effects such as the Intel iGPU (used in a low-power laptop) being faster than the high-end Xeon Phi accelerator for 10 to 20k groups.
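Two of the limits described above reduce to simple arithmetic, sketched below in Python. The function names are hypothetical, and the slot size of 8 bytes (for example a 4-byte key plus a 4-byte aggregate) is an assumption for illustration only; the estimate also ignores input buffers and any other allocations that share the global memory.

```python
def transfer_lower_bound_s(data_bytes, bandwidth_bytes_per_s):
    """Lower bound on runtime when all input must cross the bus once."""
    return data_bytes / bandwidth_bytes_per_s


def max_groups(global_mem_bytes, bytes_per_slot, fill_factor=0.5):
    """Rough upper bound on the number of groups a hash table can hold.

    bytes_per_slot is an assumed slot layout, not taken from the text.
    With a fill factor of 0.5, only half of the slots hold a group.
    """
    slots = global_mem_bytes // bytes_per_slot
    return int(slots * fill_factor)
```

For 1GB of input at an effective 5GB/s, `transfer_lower_bound_s(1e9, 5e9)` gives the 0.2 s floor mentioned for the Tahiti GPU. Under the assumed 8-byte slots, 12GB of global memory would bound the table at roughly 800M groups.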

As there are many factors that differ for the given CUs when executing the group-by operator, our hope is that one CU can mitigate the limitations of others if we switch the execution assignment to whichever CU performs best. To test this idea, we simulate having some of the presented CUs in one system and switch the execution according to the measured performance. Figure 3.25a shows the resulting execution behavior.

[Figure 3.25: Choosing the best CUs for different numbers of groups. Both panels plot runtime (sec) against the number of groups (M). (a) Tahiti GPU, K80, Xeon CPU and Xeon Phi; legend A: Tahiti GPU, B: K80, C: Xeon Phi, D: Xeon; segment order A, B, A, C, D. (b) GT640, Intel iGPU, and AMD CPU; legend A: GT640, B: Intel iGPU, C: AMD CPU; segment order A, B, A, C.]

For Figure 3.25a, we assume that the Tahiti GPU, K80, Xeon Phi, and Xeon CPU are in one system. We have to switch the execution five times to achieve the best performance for the whole range. The K80 can be used for a large range of group numbers, while the Tahiti GPU can be used to hide the atomic contention and the L2 TLB cache problems of the K80. The Xeon Phi and the Xeon CPU help to overcome the limited memory space of the K80 and Tahiti GPU. For Figure 3.25b, we assume a system consisting of the GT640, the Intel iGPU, and the AMD CPU. In this scenario, three switching points are needed. The GT640 shows the best performance under atomic contention for small numbers of groups, while the Intel iGPU shows good performance before cache boundaries are reached. Then the GT640 can hide these cache effects again, while the AMD CPU is used for large numbers of groups due to its larger memory space.
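The simulated switching can be sketched as picking, for each measured group count, the CU with the lowest runtime and then collapsing the choices into segments. This Python sketch is only an illustration under assumptions: the function names are hypothetical, the runtimes are made-up placeholders rather than the measurements from Figure 3.24, and `None` marks a group count that a CU cannot handle (for example because the hash table does not fit into its memory).

```python
def best_cu_per_point(runtimes):
    """Pick the fastest feasible CU for every measured group count.

    runtimes maps a CU name to a list of runtimes in seconds, one per
    group count; None means the CU cannot run that group count.
    Assumes at least one CU is feasible at every point.
    """
    num_points = len(next(iter(runtimes.values())))
    choice = []
    for i in range(num_points):
        feasible = {cu: t[i] for cu, t in runtimes.items() if t[i] is not None}
        choice.append(min(feasible, key=feasible.get))
    return choice


def segments(choice):
    """Collapse consecutive identical choices into execution segments."""
    out = []
    for cu in choice:
        if not out or out[-1] != cu:
            out.append(cu)
    return out


# Made-up placeholder runtimes for three hypothetical CUs.
example = {
    "gpu_a": [0.5, 0.2, 0.4, None],
    "gpu_b": [0.9, 0.1, 0.6, None],
    "cpu": [1.0, 0.8, 0.7, 0.9],
}
plan = segments(best_cu_per_point(example))
switching_points = len(plan) - 1
```

For the placeholder data, the plan alternates between the two GPUs and falls back to the CPU once neither GPU can hold the hash table, mirroring the kind of behavior shown in Figure 3.25.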

These two examples show the potential of heterogeneous placement, where the execution is switched between different CUs to hide each other's limitations. Two questions need to be answered: (1) Is this dynamic placement approach more beneficial than the static approach with a highly optimized implementation? (2) How can we achieve this dynamic placement automatically in a database system? We discuss both questions in the following.