There are many good survey papers on high-level synthesis (HLS), e.g., McFarland et al. [74], Gajski et al. [75], and Coussy et al. [76], followed by some recent surveys focusing on the current state of specific HLS tools [78, 79]. These give a good overview of the tasks HLS has to perform and some insight into its evolution. The task of HLS tools is to transform a behavioural description into a register-transfer level (RTL) design [74,76]. According to McFarland [74], a behavioural description specifies the way the system or its components interact with the environment, i.e., the mapping from the inputs to the outputs. The behavioural description is entered using a high-level language (HLL), for example C, C++, extensions such as SystemC, or even MATLAB. HLLs are unable to capture timing concepts such as cycle-to-cycle behaviour, so the tools require designer-specified constraints (e.g., a timing constraint) and an optimization goal (e.g., timing-driven optimization).
The tasks of the HLS tools [74,75,76] include:
• compilation and modelling
• resource allocation
• scheduling of operations to clock cycles
• binding
• generation of the RTL architecture.
Compilation starts with operation decomposition and the identification of data and control dependencies, and transforms the behavioural description into a dataflow graph (DFG) or a control and dataflow graph (CDFG). A DFG can capture parallelism, but does not support loops. In a DFG, nodes represent the operations and edges their inputs and outputs. In a CDFG, nodes are the basic blocks, which contain data dependencies but do not include any branches, and edges, which can be conditional, capture the control flow between them. The only parallelism explicit in a CDFG is within the basic blocks; further analysis is needed to find parallelism between basic blocks. This is accomplished using techniques such as loop unrolling, loop pipelining, loop merging and loop tiling. To summarize, during this process the HLS tools extract parallelism, find common subexpressions, perform loop unrolling, etc.
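The compilation step for a single basic block can be sketched as follows. This is a minimal illustration, not the method of any particular HLS tool: the three-address list OPS, the function names, and the assumption that every name is assigned exactly once are all hypothetical. The sketch removes a common subexpression, builds the DFG as a predecessor map, and computes for each operation the earliest level at which it can execute; operations on the same level are independent and can run in parallel.

```python
# Hypothetical three-address form of: y = (a + b) * (c - d) + (a + b)
# Each entry is (result, operator, operand1, operand2); every result
# name is assumed to be assigned exactly once.
OPS = [
    ("t1", "+", "a", "b"),
    ("t2", "-", "c", "d"),
    ("t3", "*", "t1", "t2"),
    ("t4", "+", "a", "b"),   # same expression as t1
    ("y",  "+", "t3", "t4"),
]

def eliminate_common_subexpressions(ops):
    """Drop operations that recompute an already available value.

    Operand order is compared literally, so commutativity is ignored.
    """
    seen = {}    # (operator, lhs, rhs) -> name holding that value
    alias = {}   # eliminated name -> surviving name
    kept = []
    for res, op, lhs, rhs in ops:
        lhs, rhs = alias.get(lhs, lhs), alias.get(rhs, rhs)
        key = (op, lhs, rhs)
        if key in seen:
            alias[res] = seen[key]
        else:
            seen[key] = res
            kept.append((res, op, lhs, rhs))
    return kept

def build_dfg(ops):
    """Map each operation to the operations that produce its operands."""
    produced = {res for res, _, _, _ in ops}
    return {res: [x for x in (lhs, rhs) if x in produced]
            for res, _, lhs, rhs in ops}

def asap_levels(ops, preds):
    """Earliest level per operation; primary inputs are available at level 0."""
    level = {}
    for res, _, _, _ in ops:  # program order: predecessors come first
        level[res] = 1 + max((level[p] for p in preds[res]), default=0)
    return level

kept = eliminate_common_subexpressions(OPS)
preds = build_dfg(kept)
levels = asap_levels(kept, preds)
print(levels)  # {'t1': 1, 't2': 1, 't3': 2, 'y': 3}
```

In this example t4 is eliminated in favour of t1, and t1 and t2 share level 1, which is the parallelism a DFG exposes directly.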
Allocation, scheduling and binding have access to a library of RTL components, i.e., the available hardware resources. The components are annotated with characteristics such as area and delay. Datapath (and control) allocation determines the type and number of components, generates the interconnects and performs hardware minimization. Scheduling cuts the DFG or CDFG into clock cycles and schedules operations in such a way that the functionality is preserved. This process is aware of the available resources, and an operation can also be scheduled over more than one clock cycle. Binding follows: operations are bound to components, variables to storage elements, and transfers to interconnects. Optimizations such as register reuse for variables that have nonoverlapping lifetimes are possible at this stage. However, different levels of binding are possible, where less binding delegates more tasks to logic and physical synthesis, which have more room for optimization as they have more accurate timing estimates and access to placement and routing. Different levels of binding are captured with code annotations in the RTL design generated by the HLS tool. Allocation, scheduling and binding can be done in a different order; for example, if the optimization goal is to minimize the total area, including interconnect length, while meeting the timing constraints, HLS tools start with scheduling and perform allocation during scheduling. HLS tools draw on many different approaches, including graph theory, game theory, genetic algorithms, and integer linear programming.
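Scheduling under a fixed allocation and the subsequent register reuse can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the surveyed tools: the DFG and the allocation in RESOURCES are hypothetical, every operation takes exactly one clock cycle, the scheduler is a simple greedy one (a basic form of list scheduling without priorities), outputs are held for one cycle, and registers are assigned first-fit over lifetimes sorted by start cycle, in the spirit of the left-edge algorithm.

```python
# Hypothetical DFG: operation -> (component type, predecessor operations)
DFG = {
    "t1": ("add", []),
    "t2": ("add", []),
    "t3": ("mul", ["t1", "t2"]),
    "t4": ("add", []),
    "y":  ("add", ["t3", "t4"]),
}
RESOURCES = {"add": 1, "mul": 1}  # assumed allocation: one adder, one multiplier

def list_schedule(dfg, resources):
    """Greedily assign each single-cycle operation to a clock cycle."""
    start = {}
    cycle = 0
    while len(start) < len(dfg):
        cycle += 1
        free = dict(resources)  # components still unused in this cycle
        for op, (kind, preds) in dfg.items():
            if op in start:
                continue
            # Ready only if all operands were produced in an earlier cycle.
            ready = all(p in start and start[p] < cycle for p in preds)
            if ready and free[kind] > 0:
                start[op] = cycle
                free[kind] -= 1
    return start

def lifetimes(dfg, start):
    """Value -> (cycle that produces it, last cycle that reads it)."""
    last_use = {op: start[op] + 1 for op in dfg}  # outputs held one cycle
    for op, (_, preds) in dfg.items():
        for p in preds:
            last_use[p] = max(last_use[p], start[op])
    return {op: (start[op], last_use[op]) for op in dfg}

def bind_registers(life):
    """First-fit binding: reuse a register once its value is dead."""
    free_at = []   # free_at[r] = last cycle in which register r is read
    binding = {}
    for value, (born, dies) in sorted(life.items(), key=lambda kv: kv[1]):
        for reg, last_read in enumerate(free_at):
            if last_read <= born:  # old value is dead when the new one is written
                free_at[reg] = dies
                binding[value] = reg
                break
        else:
            free_at.append(dies)
            binding[value] = len(free_at) - 1
    return binding

schedule = list_schedule(DFG, RESOURCES)
registers = bind_registers(lifetimes(DFG, schedule))
print(schedule)   # {'t1': 1, 't2': 2, 't3': 3, 't4': 3, 'y': 4}
print(registers)  # {'t1': 0, 't2': 1, 't3': 0, 't4': 1, 'y': 0}
```

With a single adder the three independent additions are serialized, and the five values fit into two registers because their lifetimes do not all overlap. Operation binding is trivial here since there is only one component of each type.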
Another frequently mentioned theme is the distinction (and mix) of top-down and bottom-up approaches. McFarland [77] describes the evolution of HLS from the experience of human designers, who rely on their knowledge of low-level characteristics of the structures used in the implementation to guide high-level decisions. This information is included as a library and used to evaluate different RTL structures for the same behavioural description. Many authors also discuss partitioning, clustering and even place and route information used to improve RTL designs [74,77,81]. An early attempt is the BUD program (bottom-up design), which performs global allocation and scheduling by evaluating different design decisions based on their effect on the design, i.e., a generate-and-test algorithm.
The brief discussion above on the evaluation of different RTL structures leads to the concept of design space exploration (DSE). According to [80], one of the advantages of the top-down approach is the flexibility in exploring possible designs, e.g., in how much of the available parallelism is exploited. A brief discussion of design space exploration, viewed through two metrics, namely area and delay, is also included in the survey paper by McFarland et al. [74]. It touches on the problem of complexity arising from the extremely large number of possibilities and the difficulty of evaluating designs in the early stages of the design flow. Numerous approaches to automated DSE have been proposed for different levels of abstraction, always in conjunction with other (lower-level) synthesis tools [82, 83, 84]. In its most basic form, the design space is evaluated using area and delay metrics only, while other approaches also include power [84]. As there are many parameters that affect the design in numerous ways, many iterations are needed and design space reduction techniques are used.
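A two-metric exploration of this kind can be sketched as a small generate-and-test loop. Everything in the sketch is an illustrative assumption rather than data from the cited works: the DFG, the relative component areas in AREA, the use of schedule length in cycles as a stand-in for delay, and the greedy single-cycle scheduler. Each candidate allocation is scheduled and evaluated, and only the designs that are not dominated in both area and delay are kept.

```python
from itertools import product

# Hypothetical DFG: operation -> (component type, predecessor operations)
DFG = {
    "t1": ("add", []),
    "t2": ("add", []),
    "t3": ("mul", ["t1", "t2"]),
    "t4": ("mul", []),
    "y":  ("add", ["t3", "t4"]),
}
AREA = {"add": 1, "mul": 4}  # assumed relative areas, not real library data

def latency(dfg, resources):
    """Cycles needed by a greedy schedule of single-cycle operations."""
    start, cycle = {}, 0
    while len(start) < len(dfg):
        cycle += 1
        free = dict(resources)
        for op, (kind, preds) in dfg.items():
            ready = all(p in start and start[p] < cycle for p in preds)
            if op not in start and ready and free[kind] > 0:
                start[op] = cycle
                free[kind] -= 1
    return cycle

def explore(dfg, max_units=2):
    """Evaluate every allocation with 1..max_units components per type."""
    kinds = sorted({kind for kind, _ in dfg.values()})
    points = []
    for counts in product(range(1, max_units + 1), repeat=len(kinds)):
        alloc = dict(zip(kinds, counts))
        area = sum(AREA[k] * n for k, n in alloc.items())
        points.append((alloc, area, latency(dfg, alloc)))
    return points

def pareto_front(points):
    """Keep designs for which no other design is at least as good in both
    metrics and strictly better in one (smaller is better)."""
    front = []
    for alloc, area, delay in points:
        dominated = any(
            a <= area and d <= delay and (a < area or d < delay)
            for _, a, d in points
        )
        if not dominated:
            front.append((alloc, area, delay))
    return front

points = explore(DFG)
for alloc, area, delay in pareto_front(points):
    print(alloc, "area:", area, "cycles:", delay)
# {'add': 1, 'mul': 1} area: 5 cycles: 4
# {'add': 2, 'mul': 1} area: 6 cycles: 3
```

Even in this tiny example the number of candidates grows exponentially with the number of component types, which is why practical DSE relies on design space reduction rather than exhaustive enumeration.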
Chapter 4
Related work
4.1 Hardware implementations of selected cryptographic schemes
4.2 Cryptographic hardware and complexities
4.3 Hardware design automation and synthesis tools
4.1 Hardware implementations of selected cryptographic schemes
To outline lightweight cryptography and to show the applicability of FSR-based systems, this section begins with two relevant topics, namely a brief historical discussion of lightweight cryptography and its applications (Subsection 4.1.1), and a collection of hardware implementation results for Grain and Trivium, two FSR-based ciphers (Subsection 4.1.2). These are followed by the hardware implementations of WG stream ciphers (Subsection 4.1.3). The WG stream ciphers are also FSR-based, but operate over an extension field; for larger WG instances, tower field implementations are beneficial and are hence presented in Subsection 4.1.4.
In this thesis, FSR-based systems are presented in Chapter 6, the WG cipher is used as a case study throughout many chapters, and the WG-permutation-based cipher WAGE is used as a case study in Part VI. As this thesis focuses on ASIC implementations, many FPGA results from the literature are omitted.