[Figure 6.23 plot: "Complex Operation Distribution". Categories: division, multiplication, type conversion, memory access. Axis: percentage of SFU states (ticks from 20% to 100%). Legend: states with no operation, exactly 1 operation, exactly 2 operations, and more than 2 operations of the given type.]

Figure 6.23: Distribution of Non-Combinational Operations within SFU State Machines

“Thus, it is reasonable to reduce the complexity of the reconfigurable array, as a full-fledged homogeneous array structure may not be necessary. Hence, the chip-size of the array would shrink. Nonetheless, this would also decrease the gained speedup. The following subsection shows the influence of such a limitation on the runtime and speedup, with the help of small modifications to the constraints of our measurements.” [170]

6.5 Saving Hardware With Heterogeneous CGRAs

“The results in the preceding subsections suggest the use of a heterogeneous array, as more than 90% of the contexts that were created by our synthesis algorithm used two or less complex operators. Single operators of this array would not provide the full functionality from the preceding measurements, but a specific subset. Thus, the functionality would be distributed all over the array while reducing the operators chip size and resource consumption significantly.” [170]

As combinational operations are by far the largest group of operations, most operators would provide combinational functionality only. The remaining operation types should be distributed among the other operators.

The distribution shown in figure 6.23 indicates that division, multiplication and type conversion operations occur only sparsely in the benchmark set.

104 CHAPTER 6. HARDWARE SYNTHESIS

[Figure 6.24, first panel: "Configuration of a Coarse Grain Reconfigurable Array with Universal Functionality". Each CGRA operator (IDs 0 to n) provides combinational, multiplication, division, type conversion and memory request functionality.]

[Figure 6.24, second panel: "Configuration of a Coarse Grain Reconfigurable Array with Specialized Functionality". CGRA operators 0 to 2 provide multiplication, type conversion and memory request; operator 3 provides division; operators 4 to n provide combinational functionality only.]

Figure 6.24: Specialization of a Coarse Grained Reconfigurable Array [170]

On an array with four operators, only 8.8% of all contexts contain one of the complex operation types. Thus, it may not be necessary to provide more than a single operator for each of those three operation types. As division is the operation with the highest latency, it is moved to a dedicated operator. Multiplication and type conversion may be processed by the same operator, as type conversions occur rarely.

Memory requests are the only instructions besides combinational operations that occur in a significant number of contexts. Thus, it may be useful to provide more than a single operator for this functionality. As multiplication and type conversion occur in only 7.6% of all contexts, a combined operator for those two operations and memory requests should be sufficient.

The structure of such a heterogeneous array is displayed in figure 6.24. It provides significantly reduced functionality, but also a slimmed-down hardware footprint compared to the homogeneous CGRA shown in the same figure. The effect of this reduction on the benchmark applications is shown in the following subsections.
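The two array configurations of figure 6.24 can be written down as one capability set per operator. The following Python sketch is only an illustration under stated assumptions: the names `OpType`, `homogeneous_array`, `specialized_array` and `capability_counts` are hypothetical and do not come from the synthesis tool, and the number of combined multiplication/type conversion/memory request operators is left as a parameter.

```python
from enum import Enum, auto


class OpType(Enum):
    """Operation types named in the text (identifiers are illustrative)."""
    COMBINATIONAL = auto()
    MULTIPLICATION = auto()
    DIVISION = auto()
    TYPE_CONVERSION = auto()
    MEMORY_REQUEST = auto()


def homogeneous_array(size):
    """Universal configuration: every operator supports every operation type."""
    return [frozenset(OpType) for _ in range(size)]


def specialized_array(size, combined_operators, dividers=1):
    """Specialized configuration, assumed to follow figure 6.24:
    combined multiplication/type conversion/memory request operators,
    dedicated dividers, and combinational-only operators for the rest."""
    special = combined_operators + dividers
    if special > size:
        raise ValueError("array too small for the requested special operators")
    combined = frozenset(
        {OpType.MULTIPLICATION, OpType.TYPE_CONVERSION, OpType.MEMORY_REQUEST}
    )
    layout = [combined] * combined_operators
    layout += [frozenset({OpType.DIVISION})] * dividers
    layout += [frozenset({OpType.COMBINATIONAL})] * (size - special)
    return layout


def capability_counts(layout):
    """Number of operators that support each operation type."""
    return {op: sum(op in caps for caps in layout) for op in OpType}


if __name__ == "__main__":
    for name, layout in [
        ("homogeneous", homogeneous_array(8)),
        ("specialized", specialized_array(8, combined_operators=3)),
    ]:
        counts = capability_counts(layout)
        print(name, {op.name: n for op, n in counts.items()})
```

In this representation the hardware saving shows up as smaller capability sets: the homogeneous array instantiates all five operation types on all eight operators, whereas the specialized array instantiates division only once.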

6.5.1 Runtime Impact of Specialized Operator Sets

“In order to analyze the effects of the aforementioned specialization, we reconfigured our array to meet the given constraints. Firstly, we measured the runtime of our benchmarks on an array with a single division operator and a dedicated multiplication/type conversion operator.” [170]


“In a second evaluation iteration, we increased the number of multiplication/type conversion operators to three. Considering the resulting number of four non-combinational operators in this specific setup, it is not possible to evaluate an array of size four, as it does not contain any combinational operators. Thus, none of the benchmarks can be scheduled successfully.” [170]
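The size restriction in the quoted passage follows from a simple count. The sketch below assumes, as in figure 6.24, that the specialized operators offer no combinational functionality; the function name and its parameters are illustrative only.

```python
def combinational_operators(array_size, dividers, combined_operators):
    """Operators left for combinational work after placing the special ones.

    Assumption: dividers and combined multiplication/type conversion
    operators cannot execute combinational operations.
    """
    remaining = array_size - dividers - combined_operators
    if remaining < 0:
        raise ValueError("more special operators than array slots")
    return remaining


if __name__ == "__main__":
    for size in (4, 8):
        left = combinational_operators(size, dividers=1, combined_operators=3)
        status = "schedulable" if left > 0 else "not schedulable"
        print(f"array size {size}: {left} combinational operators, {status}")
```

With one divider and three combined operators, an array of size four has no operator left for combinational operations, while an array of size eight keeps four.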

The resulting measurements are presented in figure 6.25. The specialization of the operator set influences the runtime of more than half of all benchmarks on an array with four processing elements. The average speedup drops from 7.38 on a homogeneous array with four operators to 5.89 on the heterogeneous array, a reduction of just over 20%.

The corresponding measurements on an array with eight processing elements show a slightly smaller impact; nonetheless, almost the same number of benchmarks suffer from a decreased speedup. Here, the average speedup drops by 19%, from 7.77 to 6.29.

Increasing the number of operators capable of executing multiplication, type conversion and memory access operations introduces more slack for the scheduling algorithm, which results in much improved runtimes. On an array with these characteristics, the average speedup of 7.32 lies just 6% below the result achieved on a homogeneous eight-operator array.
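The percentages quoted for the three heterogeneous configurations are relative reductions of the average speedup against the homogeneous array of the same size. A short Python check using only the averages stated in the text (the dictionary labels are informal shorthand, not names from the thesis):

```python
def relative_drop(baseline, specialized):
    """Relative reduction of the average speedup against the baseline."""
    return (baseline - specialized) / baseline


# (homogeneous average speedup, heterogeneous average speedup) from the text
cases = {
    "4 PEs, 1 combined operator": (7.38, 5.89),
    "8 PEs, 1 combined operator": (7.77, 6.29),
    "8 PEs, 3 combined operators": (7.77, 7.32),
}

if __name__ == "__main__":
    for label, (baseline, specialized) in cases.items():
        print(f"{label}: {relative_drop(baseline, specialized):.1%}")
    # prints 20.2%, 19.0% and 5.8% respectively
```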

It has to be noted that some benchmarks achieve better results on a restricted reconfigurable array than on a homogeneous platform. These improved results are side effects of the scheduling of operations. As list scheduling produces sub-optimal results by design, a change of the given resource constraints, even a more restrictive one, may result in better performance of the scheduled datapath. This is a normal side effect of heuristic algorithms. Despite those artifacts, an eight-operator array with three multiplication/type conversion/memory access operators and a single division operator seems to be the best-fitting configuration for the AMIDAR-coupled CGRA with list scheduling.
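To illustrate how operator capabilities enter list scheduling, the following is a minimal Python sketch of a greedy, resource-constrained list scheduler. It is not the scheduler used in this work: unit latency for every operation, listing order as priority, and the names `list_schedule`, `ops`, `deps` and `layout` are all simplifying assumptions. It does not reproduce the anomalies described above; it only shows that the schedule depends on which operator types are available in each cycle.

```python
def list_schedule(ops, deps, layout):
    """Greedy list scheduling with unit latency (illustrative assumptions).

    ops:    dict mapping operation name -> operation type (str),
            in priority order
    deps:   dict mapping operation name -> iterable of predecessor names
    layout: list of sets, one per operator, with the supported types
    Returns a dict mapping operation name -> issue cycle.
    """
    supported = set().union(*layout) if layout else set()
    for name, kind in ops.items():
        if kind not in supported:
            raise ValueError(f"no operator supports {kind!r} (needed by {name})")

    issue_cycle = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        free = list(range(len(layout)))  # every operator is free each cycle
        issued = []
        for name in remaining:
            # Ready if all predecessors were issued in an earlier cycle.
            ready = all(
                p in issue_cycle and issue_cycle[p] < cycle
                for p in deps.get(name, ())
            )
            if not ready:
                continue
            # Take the first free operator that supports this type.
            slot = next((i for i in free if ops[name] in layout[i]), None)
            if slot is not None:
                free.remove(slot)
                issued.append(name)
        if not issued:
            raise ValueError("dependency cycle: no operation can be issued")
        for name in issued:
            issue_cycle[name] = cycle
            remaining.remove(name)
        cycle += 1
    return issue_cycle


if __name__ == "__main__":
    ops = {"load_a": "mem", "load_b": "mem", "add": "comb"}
    deps = {"add": ["load_a", "load_b"]}
    one_mem = [{"mem"}, {"comb"}]
    two_mem = [{"mem"}, {"mem"}, {"comb"}]
    print(list_schedule(ops, deps, one_mem))  # loads are serialized
    print(list_schedule(ops, deps, two_mem))  # loads issue in the same cycle
```

In this toy example the two loads are serialized when only one memory-capable operator exists, so the dependent addition issues in cycle 2 instead of cycle 1. Because the operator choice is greedy, a different set of constraints can steer the heuristic toward a different, occasionally shorter, schedule.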

6.5.2 Tackling the Memory Bottleneck

[Figure 6.25 plot: speedup per benchmark (axis ticks 4 to 20) in four panels. Panel 1: Rijndael RKG, Skipjack RKG, 3DES RKG, IDEA RKG, RC6 RKG, Serpent RKG, Twofish RKG, XTEA RKG. Panel 2: Rijndael SBE, Skipjack SBE, 3DES SBE, IDEA SBE, RC6 SBE, Serpent SBE, Twofish SBE, XTEA SBE. Panel 3: BLAKE, CubeHash, ECOH, MD5, SIMD, SHA1, SHA256, RadioGatun. Panel 4: ContrastFilter, GrayscaleFilter, SobelFilter, SwizzleFilter, JpegEncoder, CST, 2-D DCT, Quantization. Series: CGRA with 4 processing elements including 1 divider and 1 multiplier/type conversion/memory access element; CGRA with 8 processing elements including 1 divider and 1 multiplier/type conversion/memory access element; CGRA with 8 processing elements including 1 divider and 3 multiplier/type conversion/memory access elements; speedup on homogeneous CGRA with equal number of processing elements.]

Figure 6.25: Changes in Application Speedup Through Specialized CGRAs

“The previously shown characteristics of the benchmark applications have shown that most operations are executed parallel to others. As many of our benchmarks rely on array operations, it seems reasonable to allow more than one operation at a time to access the object/array memory. This can be achieved by using a dual ported memory inside the [CGRA].” [170]

The effects of dual-ported memory access have been evaluated on the basis of an eight-operator array with three multiplication/type conversion/memory access operators and a single division operator. The results are displayed in figure 6.26.
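The motivation for a second memory port can be illustrated with a back-of-the-envelope bound. The sketch below is a simplified model, not a result from the measurements: it assumes that each memory access occupies one memory-capable operator and one memory port for exactly one cycle and that the accesses are independent of each other. The function name and its parameters are hypothetical.

```python
import math


def min_memory_cycles(accesses, memory_operators, ports):
    """Lower bound on the cycles needed for independent memory accesses.

    Assumption: one access occupies one memory-capable operator and one
    memory port for a single cycle.
    """
    if accesses < 0 or memory_operators < 1 or ports < 1:
        raise ValueError("invalid configuration")
    per_cycle = min(memory_operators, ports)  # the scarcer resource limits
    return math.ceil(accesses / per_cycle)


if __name__ == "__main__":
    for ports in (1, 2):
        cycles = min_memory_cycles(6, memory_operators=3, ports=ports)
        print(f"{ports} port(s): at least {cycles} cycles for 6 accesses")
```

Under these assumptions, three memory-capable operators cannot work in parallel on a single-ported memory, and a second port halves the lower bound for independent accesses.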
