6.4
Evaluation and Results of Hardware Synthesis
The evaluation of the presented synthesis approach focusses on different charac- teristics. The most obvious one is the impact on the runtime of the benchmark applications. Furthermore, different sizes of the CGRA are evaluated. As the parallelism within a given code sequence is limited, it may not be necessary to implement a large array of processing elements, but a small or medium sized array could be sufficient.
Additionally, examinations on the specific state machines that drive the synthe- sized functional unit will be presented. They will show a very lopsided distribu- tion of instructions within the benchmark application. Thus, a specialization of the CGRAs functionality, regarding these distributions, will be discussed. Finally, the application of a dual ported access to the CGRAs internal memory is evaluated.
“ The reference value for all measurements is the software execution of the benchmarks without synthesized functional units. Note: the mean execution time of a bytecode in our processor is ≈ 4 clock cycles. This is in the same
order as JIT-compiled code on IA32 machines.” [170]
6.4.1 Benchmark Applications
“ We chose applications of four different domains to test our synthesis algo- rithm. Firstly, we benchmarked several cryptographic ciphers as the impor- tance of security in embedded systems increases steadily. Additionally, we chose hash algorithms and message digests as a second group of appropriate applications, and furthermore evaluated the runtime behavior of image pro- cessing kernels. All of these benchmark applications are pure computation kernels. Regularly, they are part of a surrounding application. Thus, we se- lected the encoding of a bitmap image into a JPEG image as a benchmark ap- plication. This application contains several computation kernels, such as color space transformation, 2-D forward DCT, and quantization. Nonetheless, it also contains a substantial amount of code that utilizes those kernels, in order to encode a whole image.” [170]
98 CHAPTER 6. HARDWARE SYNTHESIS
The group of cryptographic ciphers contains 3DES, IDEA, RC6, Rijndael, Serpent, Skipjack, Twofish and XTEA. All of those ciphers are block ciphers. The generation of the round keys out of a master key, and the encryption of a single block of data of the native block size of the corresponding cipher have been evaluated.
Furthermore, the digests and hash functions BLAKE, CubeHash, ECOH, MD5, Radio Gatun, SHA1, SHA256 and SIMD have been chosen as benchmarks. Here, the execution of the digest/hash function on a block of the native block size of the function has been chosen as application kernel.
The group of image processing filters contains kernels for the application of grayscale and contrast filtering. Furthermore, the Sobel filter for edge detection and a swizzle filter for channel mixing have been evaluated.
Finally, a JPEG encoder has been chosen as a whole application benchmark. In- deed, the different kernels like color space transformation, discrete cosine trans- formation and quantization have been evaluated as well. Nonetheless, the atten- tion lies on the performance of the encoder as a whole. The kernel of interest encodes a whole 8x8 RGB block into a stream of JPEG coefficients. In order to avoid side effects caused by I/O, the resulting coefficients are written to an array, instead of an output stream or device.
A more detailed overview of the actual parameters and kernel characteristics of the benchmark applications can be found in appendix A.
6.4.2 Runtime Acceleration
As already mentioned, the synthesis does not support method invocations and access operations to multi-dimensional arrays yet. As most benchmarks contain such instructions, all invoked methods have been inlined, and all multi-dimensional arrays have been flattened to a single dimension. The measurement values for the presented diagrams can be found in appendix B.2.
The runtime evaluation has shown sophisticated results. The benchmarks gained speedups between 1.5 and 18.7. The achieved speedups are displayed in figure 6.20. Obviously, most benchmarks already perform at their maximum on an array of four processing elements. Nonetheless, few benchmarks benefit from dou- bling the arrays size to eight elements, while the application of 16 processing elements has almost no positive effects.
6.4. EVALUATION AND RESULTS OF HARDWARE SYNTHESIS 99
Average Speedup on Homogenous CGRA with 4 / 8 / 16 Processing Elements
4 8 12 16 20
Rijndael RKGSkipjack RKG3DES RKG IDEA RKG RC6 RKGSerpent RKGTwofish RKGXTEA RKGRijndael SBESkipjack SBE3DES SBE IDEA SBE RC6 SBESerpent SBETwofish SBEXTEA SBE
Speedup 4 8 12 16 20 BLAKE CubeHash
ECOH MD5 SIMD SHA1 SHA256 RadioGatunContrastFilter GrayscaleFilter SobelFilter SwizzleFilterJpegEncoder CST 2-D DCT Quantization Speedup
Figure 6.20: Speedup of Benchmarks Through Runtime Dynamic Hardware Synthesis
The average speedup across all benchmarks is 7.38 on an array with four opera- tors and 7.78 on an array with eight processing elements. The Skipjack and SIMD benchmarks are outliers, which are caused by a large communication overhead of the CGRA. Similarly, the results of benchmarks like the swizzle filter or SHA256 can be attributed to an advantageous ratio of communication and calculation. In order to eliminate the outliers from the speedup numbers, the three best and worst performing benchmarks are eliminated from the mean value. This gives a more realistic evaluation of the speedups. Thereby, the average acceleration factors change to 7.05 – 7.39. Thus, it can be said that the hardware synthesis for homogeneous arrays delivers an average speedup of seven.
The separated calculation of speedups for the four application domains shows that the graphics filter benchmarks perform best. They achieve mean speed- ups from 9.96 to 10.34. The hash/digest functions are accelerated by average factors between 9.05 and 10.00. The cryptographic cipher benchmarks perform significantly worse. This holds especially for the single block encryption. While the round key generation is accelerated by average factors of 6.41 to 7.02, the encryption steps only reach speedups of 4.93 to 5.01.
100 CHAPTER 6. HARDWARE SYNTHESIS
Average Utilization of Processing Elements on Homogenous CGRA with 16 / 8 / 4 Processing Elements
0.2 0.4 0.6 0.8 1.0
Rijndael RKGSkipjack RKG3DES RKG IDEA RKG RC6 RKGSerpent RKGTwofish RKGXTEA RKGRijndael SBESkipjack SBE3DES SBE IDEA SBE RC6 SBESerpent SBETwofish SBEXTEA SBE
Average Utilization 0.2 0.4 0.6 0.8 1.0 BLAKE CubeHash
ECOH MD5 SIMD SHA1 SHA256 RadioGatunContrastFilter GrayscaleFilter SobelFilter SwizzleFilterJpegEncoder CST 2-D DCT Quantization Average Utilization
Figure 6.21: Average Utilization of Processing Elements
6.4.3 Utilization Ratio of CGRA
Another interesting measurement is the utilization of the CGRA. It gives an indica- tion if the CGRA is large enough for the mapped applications and larger speedups are not possible because of missing parallelism, or because the CGRA does not have enough processing elements.
In figure 6.21 the utilization of the CGRA for the different numbers of process- ing elements is displayed. Even on an array with four processing elements a utilization of 60% is almost never exceeded. An increasing number of process- ing elements typically leads to a bisection of the utilization. This correlates with the speedup numbers shown in figure 6.20, which have shown that most bench- marks do not profit from additional hardware beyond four processing elements. As the parallelism contained in the code sequences is the obvious limitation to the synthesis approach, it is not necessary to evaluate arrays with 16 processing elements any further. Only a single benchmark took profit from this additional hardware effort, and that is why it is unjustifiable to use such a large array. Thus, only arrays with four and eight elements are evaluated in further examinations.
6.4. EVALUATION AND RESULTS OF HARDWARE SYNTHESIS 101
Total Number of Synthesized Functional Unit Contexts for a Given Benchmark on a CGRA with 4 / 8 PEs Number of Different Synthesized Functional Unit Contexts for a Given Benchmark on a CGRA with 4 / 8 PEs
128 256 384 512 640
Rijndael RKG Skipjack RKG 3DES RKG IDEA RKG RC6 RKG Serpent RKG Twofish RKG XTEA RKG
SFU Contexts 128 256 384 512 640
Rijndael SBE Skipjack SBE 3DES SBE IDEA SBE RC6 SBE Serpent SBE Twofish SBE XTEA SBE
SFU Contexts 128 256 384 512 640
BLAKE CubeHash ECOH MD5 SIMD SHA1 SHA256 RadioGatun
SFU Contexts 128 256 384 512 640
ContrastFilter GrayscaleFilter SobelFilter SwizzleFilter JpegEncoder CST 2-D DCT Quantization
SFU Contexts
Figure 6.22: Average Complexity of Controller Finite State Machines
6.4.4 Complexity of Finite State Machines
“ In a next step, we evaluated the complexity of the controlling units that were created by the synthesis. Therefore, we measured the size of the finite state machines that are controlling every synthesized functional unit. Every state is related to a specific configuration of the reconfigurable array. In the worst case, all of those contexts would be different. Thus, the size of a controlling state machine is the upper bound for the number of different contexts.
102 CHAPTER 6. HARDWARE SYNTHESIS
Afterwards, we created a configuration profile for every context, which reflects every operation that is executed within the related state. Accordingly, we re- moved all duplicates from the set of configurations. The number of remaining elements is a lower bound for the number of contexts that are necessary to drive the functional unit. The actual number of necessary configurations lies between those two bounds, as it depends on the place-and-route results of the affected operations.” [170]
The information regarding the different contexts of the finite state machines are shown in table 6.22. The diagram displays the number of states of the controlling state machines, as well as the number of actually different contexts for each of the benchmarks.
Obviously, only one third of the benchmarks (eleven of thirty-two) has controlling state machines with more than 128 states. Furthermore, only the synthesized configuration for the Jpeg-encoder benchmark on a CGRA with four processing elements contains more than 128 different contexts.
Thus, 128 entries seems to be a sufficient size for the configuration memory of a processing element in case only a single application kernel is mapped at a time. In order to map more applications at the same time, a larger memory, e.g. with 256 entries, is recommended.
6.4.5 Analysis of Executed Operations on CGRA
“ Another characteristic of the synthesized control units, is the distribution of multicycle operations like multiplication, type conversion, or division (complex operations) within the created contexts.” [170]
Figure 6.23 displays the distribution of complex operations within the datapath of the functional units, and thus within the state machines that actually control the CGRA. The distribution numbers account for a CGRA with four processing elements. More than 90% of all contexts do not contain any complex operations like division, multiplication or type conversion. Only memory access operations occur on regular basis in about a third of all contexts.