A major factor that can limit the performance of SIMD architectures such as the one in Figure 6.2 is control flow divergence. In a MIMD architecture, each compute core can follow a different execution path with no loss of efficiency. In the SIMD architecture, on the other hand, the entire cluster is capable of executing only one control flow path at a time. If different threads need to execute different code paths, their execution on the cluster must be serialized.
We refer to a group of scalar threads executing on the parallel processing elements of a single SIMD cluster as a warp. From the programmer’s perspective, each thread in the warp is capable of executing an independent control flow path.
However, since the hardware is only capable of executing one path at a time, we must serialize threads whenever the control flow diverges, executing the subset of threads following one control path followed by the threads executing a different path.
A simple mechanism for handling control flow divergence is predication with guarded execution. Predication, which was implemented in traditional vector processors, handles simple cases of control divergence such as code hammocks. However, predication requires every thread to execute every control flow path, committing only the results from the taken paths. For complex control flow, predication is insufficient.
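As a minimal sketch of predicated execution on a single code hammock, the following Python models one warp in which every lane computes both sides of an if/else and a per-lane predicate selects which result is committed. The function name `run_hammock_predicated` and the hammock itself (`y = 2*x if x > 0 else -x`) are illustrative assumptions, not taken from the study.

```python
def run_hammock_predicated(warp_inputs):
    """Model `y = 2*x if x > 0 else -x` for one warp using predication.

    Hypothetical example: list comprehensions stand in for lock-step lanes.
    Returns the per-lane results and the fraction of issued work that was useful.
    """
    # One predicate bit per lane.
    mask = [x > 0 for x in warp_inputs]
    # Both sides of the hammock are issued to every lane.
    then_vals = [2 * x for x in warp_inputs]
    else_vals = [-x for x in warp_inputs]
    # Guarded commit: each lane keeps only the result of its taken path.
    out = [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
    issued = 2 * len(warp_inputs)   # lane-operations actually executed
    useful = len(warp_inputs)       # lane-operations whose result was kept
    return out, useful / issued


results, useful_fraction = run_hammock_predicated([3, -1, 4, -5])
```

In this sketch half of the issued lane-operations are discarded no matter how the lanes split, which illustrates why predication becomes costly once the divergent paths grow long or nest.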
A more general mechanism for handling control divergence in SIMD architectures is a stack-based approach [11]. In this approach, a PC is maintained for each thread. As long as all the threads in a warp are executing the same PC, all the threads are executed. When there is a branch where threads diverge, a sub-warp of threads going in one direction is executed while the stalled threads are pushed onto a stack. When the sub-warp reaches the reconvergence point, the threads in it are stalled while the previously stalled threads are popped from the stack and executed.
Figure 6.3: Example code for stack-based mechanism for handling control flow divergence. (Listing columns: PC, code.)
An example of the operation of the stack-based mechanism is shown in Table 6.2. (The code is shown in Figure 6.3.) In this example, there are two branches (instructions 1 and 2) and two reconvergence points (instructions 6 and 9). When threads diverge (cycles 2 and 3), the threads following one of the paths are executed and the remaining threads are pushed onto the stack. When the executing threads reach the reconvergence point, they are swapped with the threads on the top of the stack.
When the remaining threads reach the reconvergence point, the top of the stack is popped and the reconverged threads can continue executing.
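The mechanism can be sketched as a small simulator. The version below is one common formulation of a reconvergence stack (each entry holds a next PC, an active mask, and a reconvergence PC), not necessarily the exact bookkeeping of [11]. The mini instruction set, the function name `run_warp`, and the example program are hypothetical; the program is not the code of Figure 6.3.

```python
def run_warp(program, lane_data):
    """Simulate one warp; return a trace of (pc, active_lanes), one per issue cycle.

    Hypothetical mini-ISA (tuples):
      ("op",)                    ordinary instruction, falls through
      ("br", pred, target, rpc)  lanes where pred(data) holds jump to target;
                                 rpc is this branch's reconvergence PC
      ("jmp", target)            unconditional jump
      ("exit",)                  end of thread
    """
    full = frozenset(range(len(lane_data)))
    stack = [[0, full, None]]            # entries: [next_pc, mask, reconvergence_pc]
    trace = []
    while stack:
        pc, mask, rpc = stack[-1]
        if pc == rpc:                    # sub-warp reached its reconvergence point
            stack.pop()
            continue
        instr = program[pc]
        trace.append((pc, sorted(mask)))
        kind = instr[0]
        if kind == "exit":
            stack.pop()
        elif kind == "jmp":
            stack[-1][0] = instr[1]
        elif kind == "br":
            _, pred, target, branch_rpc = instr
            taken = frozenset(l for l in mask if pred(lane_data[l]))
            not_taken = mask - taken
            if not taken:                # uniform: nobody takes the branch
                stack[-1][0] = pc + 1
            elif not not_taken:          # uniform: everybody takes the branch
                stack[-1][0] = target
            else:                        # divergence
                stack[-1][0] = branch_rpc            # parent resumes at reconvergence
                stack.append([target, taken, branch_rpc])
                stack.append([pc + 1, not_taken, branch_rpc])
        else:
            stack[-1][0] = pc + 1
    return trace


# Hypothetical if/else program; the branch at PC 1 reconverges at PC 5.
program = [
    ("op",),                             # 0
    ("br", lambda x: x < 0, 4, 5),       # 1
    ("op",),                             # 2  fall-through path
    ("jmp", 5),                          # 3
    ("op",),                             # 4  taken path
    ("op",),                             # 5  reconverged
    ("exit",),                           # 6
]
trace = run_warp(program, [1, -2, 3, -4])
```

With these inputs, lanes 0 and 2 execute PCs 2 and 3 while lanes 1 and 3 wait, then lanes 1 and 3 execute PC 4, and all four lanes resume together at PC 5.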
The reconvergence points can be determined statically, by analyzing the control flow graph of the program. The naive approach is to use the end of the thread as the reconvergence point. This performs poorly on code that reconverges long before the end of the thread, since the diverged threads remain serialized for the rest of their execution. A much better choice for the reconvergence point is the immediate postdominator of each branch. This approach is near optimal on most code, including VISBench, though there exist pathological cases for which it works poorly [27].
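As an illustration of the static analysis, the sketch below computes immediate postdominators with a simple iterative set-based dataflow pass. It is a textbook-style formulation rather than the tool used in this study, and it assumes a single exit node that every block can reach. The function name and the example graph are hypothetical.

```python
def immediate_postdominators(succ, exit_node):
    """Return {node: immediate postdominator} for every non-exit node.

    succ maps each basic block to its list of successors. Assumes every
    non-exit block has at least one successor and can reach exit_node.
    """
    nodes = set(succ)
    # Start from "everything postdominates everything" and shrink to a fixed point.
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    ipdom = {}
    for n in nodes - {exit_node}:
        strict = pdom[n] - {n}
        # The strict postdominators form a chain; the closest one to n is the
        # one with the most postdominators of its own.
        ipdom[n] = max(strict, key=lambda p: len(pdom[p]))
    return ipdom


# Hypothetical CFG with a nested branch: A branches to B or F, B to C or D.
cfg = {
    "A": ["B", "F"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
}
ipdom = immediate_postdominators(cfg, "F")
```

Here the branch in block B reconverges at E and the branch in block A reconverges at F, so the inner divergence is resolved well before the end of the thread.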
Table 6.1: Example of stack-based mechanism for handling control flow divergence. Threads marked with an x are executing in the given cycle. (Table columns: thread, PC.)
To determine the relative performance of the SIMD architecture from Figure 6.2 versus the MIMD architecture from Figure 6.1, we simulate the execution of the VISBench applications on a SIMD architecture assuming the stack-based reconvergence mechanism and the immediate postdominator as the reconvergence point. We idealize the memory system and assume unit latencies, and compare the instruction throughput per pipeline of the SIMD and MIMD architectures. To minimize the divergence within thread warps, we group together threads that operate on adjacent data elements.
Figure 6.4 plots the relative IPC of the SIMD architecture (per pipeline, relative to MIMD) for varying warp sizes. This number is the fraction of instruction issue slots that are successfully filled, excluding those which cannot be filled due to control flow divergence.
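The metric can be written down directly. The sketch below assumes that each issue cycle of a warp offers one slot per lane and that, with the idealized memory system and unit latencies, a MIMD pipeline fills every slot, so the filled fraction equals the relative IPC. The function name and the sample counts are hypothetical.

```python
def simd_efficiency(active_lanes_per_cycle, warp_size):
    """Fraction of issue slots filled: active lanes summed over all cycles,
    divided by warp_size slots offered in each cycle."""
    cycles = len(active_lanes_per_cycle)
    if cycles == 0:
        raise ValueError("need at least one issue cycle")
    return sum(active_lanes_per_cycle) / (warp_size * cycles)


# Hypothetical 4-wide warp: 7 issue cycles, 3 of them with only 2 active lanes.
efficiency = simd_efficiency([4, 4, 2, 2, 2, 4, 4], warp_size=4)
```

In this made-up trace 22 of the 28 slots are filled, so each pipeline delivers roughly 79% of the throughput it would achieve without divergence.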
All of the applications have either level or declining SIMD efficiency as the warp size grows. This trend is unsurprising, as a larger warp will naturally tend to have a greater incidence of control flow divergence than a smaller warp. However, the degree of this trend varies by application. For some applications (H.264 and MRI), the SIMD efficiency is very high, even for large warp sizes. In MRI, which is basically a convolution, all of the threads execute exactly the same code path, resulting in SIMD efficiency remaining perfect for all warp sizes within the range of the study. In the H.264 SAD kernel, most threads execute the same path, except those at the edges where the image needs to be padded. As padding tends to occur in adjacent threads, the SIMD efficiency remains high. The remaining applications, on the other hand, show a pronounced decrease in SIMD efficiency as the warp size grows. These applications all have data-dependent branching. At small warp sizes, these applications are still able to fill most of the issue slots, but with larger warp sizes the efficiency trends towards zero.
While SIMD has a cost in IPC, MIMD has a cost in chip area. MIMD requires all the control flow logic of the core to be replicated for each pipeline, whereas SIMD only requires one set of control logic per cluster. However, SIMD still requires the register files, functional units, caches and cache ports, and data bits of the pipeline latches to be replicated. We find that roughly 40% of the pipeline area does not need to be replicated. Hence, a 2-way SIMD cluster has 1.6 times the area of a scalar pipeline, while a 4-way has 2.8 times the area.
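The area figures follow from a linear model in which the shared 40% is paid once per cluster and the remaining 60% once per lane. The sketch below reproduces that arithmetic and adds a performance-per-area figure; the function names, and the normalization of performance per area to a scalar MIMD pipeline, are assumptions made for illustration.

```python
SHARED_FRACTION = 0.4  # share of pipeline area that is not replicated (from the text)


def relative_area(width, shared=SHARED_FRACTION):
    """Area of a width-way SIMD cluster in units of one scalar pipeline."""
    return shared + (1.0 - shared) * width


def perf_per_area(width, efficiency, shared=SHARED_FRACTION):
    """Throughput per unit area relative to a scalar pipeline (assumed
    normalization): width * efficiency instructions per cycle over cluster area."""
    return width * efficiency / relative_area(width, shared)


def break_even_efficiency(width, shared=SHARED_FRACTION):
    """SIMD efficiency at which the cluster matches MIMD performance per area."""
    return relative_area(width, shared) / width


for width in (2, 4):
    print(width, round(relative_area(width), 2), round(break_even_efficiency(width), 2))
```

Under this model a 2-way cluster needs a SIMD efficiency above 0.8 to beat MIMD on performance per area, and a 4-way cluster needs an efficiency above 0.7.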
Figure 6.5 shows the ratio between area and IPC per cluster for varying warp sizes. The benchmarks in VISBench separate clearly into three groups. The first group (H.264 and MRI) has nearly perfect SIMD efficiency. The second group (Blender and ODE) shows a modest performance benefit for small warp sizes, but then exponentially decreasing performance as the warp size grows large. The third group (POVRay and Facedetect) shows performance loss with any level of SIMD, and exponential performance loss with large warp sizes.
Note that we are assuming perfect memory, so this result is an upper bound on the SIMD efficiency.
Figure 6.4: SIMD efficiency versus warp size.
One thing we have not considered in this study is algorithmic changes to improve SIMD efficiency. Blender and H.264 required a moderate amount of hand tuning in order to achieve their level of SIMD performance, but were not altered significantly at the algorithm level. With additional programmer effort, one may be able to reclaim much more of the performance loss from SIMD, but this optimization comes at a substantial cost in development time. For instance, work has shown that the SIMD efficiency of ray tracing can be improved, though even with major algorithmic changes it remains well below 100% on complex scenes [28].
This result indicates that SIMD, while a substantial constant factor win for some applications, is a much larger performance loss for others. It also illustrates one of the limitations of GPUs as they expand into more general purpose application domains. For numerical applications and the traditional applications for GPUs, SIMD is the right design choice. However, the performance potential of SIMD architectures is limited as the space of applications expands.
Figure 6.5: SIMD performance per area versus warp size.