
Typically, application source code is preprocessed by a compiler frontend to translate the application(s) into an appropriate IR that is easier to analyze. The ISE procedure, which is pertinent to this thesis (cf. Chapter 7), can be roughly structured into three steps: subgraph enumeration, detection of isomorphic subgraphs and graph covering.

Subgraph enumeration computes every possible CI (regardless of its applicability) for a given IR-representation of an application. CIs in general encapsulate the computation of frequently executed subsets of the IR. Since compilers most often apply DFG structures as IR format, CIs consequently represent arbitrary subgraphs of these DFGs, which have to be convex at the same time.

Figure 3.6– Examples of subgraphs.

3.16. Definition (subgraph). A graph S = (V′, E′) is said to be a subgraph of another graph G = (V, E), iff its node set V′ is a subset of that of G and its adjacency relation is a subset of that of G restricted to this subset:

V′ ⊆ V ∧ E′ ⊆ E

3.17. Definition (convex subgraph). Given a DFG G = (V, E), a subgraph S is called convex, iff no path exists from a vertex v ∈ S to another vertex w ∈ S that contains a vertex u ∉ S.
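Definition 3.17 can be checked mechanically. The following is a minimal sketch, assuming the DFG is given as a Python dictionary that maps each vertex to a list of its successors (this representation and the names `reachable` and `is_convex` are illustrative assumptions, not part of the thesis). A subgraph S is non-convex exactly when some vertex outside S is both reachable from S and able to reach S.

```python
def reachable(adj, sources):
    """Return all vertices reachable from `sources` via at least one edge."""
    seen, stack = set(), list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(succ, subset):
    """Check Definition 3.17 for the vertex set `subset` of the DFG `succ`.

    `succ` maps a vertex to its successors (assumed representation).
    """
    subset = set(subset)
    # Build the predecessor relation by reversing every edge.
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, set()).add(u)
    below = reachable(succ, subset) - subset  # outside vertices fed by S
    above = reachable(pred, subset) - subset  # outside vertices feeding S
    # A vertex in both sets lies on a path that leaves S and re-enters it.
    return not (below & above)


# Diamond-shaped DFG: a -> b -> d and a -> c -> d
dfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(is_convex(dfg, {"a", "b", "d"}))  # False: path a -> c -> d leaves S
print(is_convex(dfg, {"b", "d"}))       # True
```

In the diamond example, {a, b, d} is rejected because the path a → c → d passes through c, which is not part of the subgraph.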

Such subgraphs can be classified according to their number of input and output operands (single/multiple input/output operands) as well as according to their connectivity, i.e. whether or not a subgraph comprises disconnected patterns. Furthermore, the microarchitecture may impose several additional constraints on the subgraphs that can be considered valid. First of all, the maximum number of input and output operands (Nin, Nout) may be limited due to limited encoding space or the number of read/write ports in the register file. Secondly, some vertices may be forbidden as they represent undesirable operations for CIs, e.g. loads and stores, if the planned functional unit is not going to have memory ports. In addition, other vertices are considered invalid implicitly, because their contents are computed outside the current basic block. Thus, given a DFG G, the maximum number of input and output operands Nin and Nout and the set of forbidden vertices F, the objective is to find all convex subgraphs S = (V′, E′) under the constraints that

|I(S)| ≤ Nin ∧ |O(S)| ≤ Nout ∧ V′ ∩ F = ∅,

where I(S) and O(S) denote the set of input and output nodes of the subgraph S, respectively. The enumerated subgraphs are furthermore very often evaluated by a merit function, which typically reflects for each pattern the number of saved clock cycles under the assumption that the equivalent hardware instruction is executed within a single clock cycle of the underlying processor architecture. The result of this phase is a set of subgraphs for a certain DFG, representing all possible ISE instances ranked by a merit function metric.
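The enumeration problem stated above can be illustrated by an exhaustive sketch. It is not one of the published algorithms. It rests on several labeled assumptions: the DFG is a dictionary mapping vertices to successor lists; I(S) is taken to be the set of vertices outside S with an edge into S; O(S) is taken to be the set of vertices in S that either feed a vertex outside S or have no successor at all; and the merit is |V′| − 1, i.e. every primitive operation is assumed to take one cycle. The function name `enumerate_candidates` is hypothetical.

```python
from itertools import combinations


def reachable(adj, sources):
    """Return all vertices reachable from `sources` via at least one edge."""
    seen, stack = set(), list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def enumerate_candidates(succ, n_in, n_out, forbidden=frozenset()):
    """Brute-force enumeration of convex subgraphs under the I/O constraints.

    Returns a list of (merit, vertices) sorted by descending merit.
    The runtime is exponential in the number of allowed vertices.
    """
    nodes = sorted(set(succ) | {v for vs in succ.values() for v in vs})
    pred = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    allowed = [n for n in nodes if n not in forbidden]  # V' and F are disjoint
    results = []
    for size in range(1, len(allowed) + 1):
        for combo in combinations(allowed, size):
            s = set(combo)
            below = reachable(succ, s) - s
            above = reachable(pred, s) - s
            if below & above:
                continue  # violates convexity
            inputs = {p for n in s for p in pred[n]} - s
            outputs = {n for n in s if not succ.get(n) or set(succ[n]) - s}
            if len(inputs) <= n_in and len(outputs) <= n_out:
                merit = len(s) - 1  # assumed: one cycle per primitive operation
                results.append((merit, combo))
    results.sort(key=lambda item: (-item[0], item[1]))
    return results


# add = mul(ld, x) + y; the load and the block inputs x, y are forbidden
dfg = {"ld": ["mul"], "x": ["mul"], "mul": ["add"], "y": ["add"]}
forbidden = {"ld", "x", "y"}
print(enumerate_candidates(dfg, n_in=3, n_out=1, forbidden=forbidden))
print(enumerate_candidates(dfg, n_in=2, n_out=1, forbidden=forbidden))
```

With three input operands allowed, the fused multiply-add pattern {add, mul} is found and ranked first; with only two, it is rejected and only the single-operation patterns remain.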

Several methodologies have been developed in the past, which tackle subgraph enumeration from different perspectives. First of all, a limited number of allowed input and output operands is the basis for several efficient approaches towards subgraph enumeration. Because exhaustive enumeration of arbitrary subgraphs features exponential runtime complexity [223], earlier approaches concentrated on Multiple-Input-Single-Output (MISO) subgraphs [49, 92], which can be enumerated in linear time [224]. Furthermore, many approaches are restricted to connected subgraphs only [49, 56, 64, 89, 92, 224, 280], although including multiple disconnected components in a subgraph increases the potential to exploit parallelism on the level of IR-operations, which is particularly attractive for single-issue architectures [50, 70, 129, 184, 223, 281].

Isomorphic Subgraph Detection or graph matching in general is the process of finding a correspondence between vertices and edges of two (sub)graphs, which satisfy certain constraints, such that equivalent substructures of two (sub)graphs are matched together.

3.18. Definition (graph isomorphism). A graph isomorphism is a bijective graph homomorphism between two graphs Gα = (Vα, Eα) and Gβ = (Vβ, Eβ), such that

∃f : Vα → Vβ with ∀v, w ∈ Vα : (v, w) ∈ Eα ⇔ (f(v), f(w)) ∈ Eβ (3.2)

Probably the most prominent method for isomorphism detection is the approach described by Ullmann [257]. The underlying idea of [257] is to describe graphs Gα = (Vα, Eα) and Gβ = (Vβ, Eβ) as adjacency matrices A and B, respectively. In addition, a |Vα|×|Vβ| matrix M′ with mij ∈ {0, 1} is constructed, such that every row contains exactly one 1 and every column contains not more than one 1. The algorithm's objective now is to construct a matrix C = M′(M′B)ᵀ, such that

$$(\forall i\, \forall j)_{1 \le i \le |V_\alpha|,\; 1 \le j \le |V_\alpha|} \quad (a_{ij} = 1) \Rightarrow (c_{ij} = 1)$$

holds and an isomorphism is found. The algorithm iteratively refines the matrix M′ starting from a matrix

$$M^0 = (m^0_{ij}), \qquad m^0_{ij} = \begin{cases} 1 & : \deg(v_j) \ge \deg(v_i),\ v_i \in V_\alpha \wedge v_j \in V_\beta \\ 0 & : \text{else} \end{cases}$$

and systematically changing the elements of Mi in each iteration, such that all possible matrices M′ in accordance with Equation 3.2 are generated and evaluated. The algorithm features a runtime complexity between Θ(n³) in the best and Θ(n!·n²) in the worst case.
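The matrix condition can be made concrete with a small sketch. It does not reproduce the refinement procedure of [257]; instead, it simply enumerates every admissible matrix M′ (one 1 per row, at most one 1 per column, and only where the degree-based start matrix has a 1) and tests whether a_ij = 1 implies c_ij = 1 for C = M′(M′B)ᵀ. The sketch assumes undirected graphs, i.e. symmetric adjacency matrices given as lists of rows, and the helper names are illustrative only.

```python
from itertools import permutations


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]


def find_mapping(a, b):
    """Return f with f[i] = vertex of G_beta assigned to vertex i of G_alpha.

    Returns None if no admissible matrix M' satisfies the condition.
    Brute force: up to n! candidate matrices, each checked via matrix products.
    """
    n_a, n_b = len(a), len(b)
    deg_a = [sum(row) for row in a]
    deg_b = [sum(row) for row in b]
    # Start matrix: m0[i][j] = 1 iff deg(v_j in G_beta) >= deg(v_i in G_alpha)
    m0 = [[1 if deg_b[j] >= deg_a[i] else 0 for j in range(n_b)] for i in range(n_a)]
    for cols in permutations(range(n_b), n_a):
        if any(m0[i][j] == 0 for i, j in enumerate(cols)):
            continue  # excluded by the degree condition
        m = [[1 if j == cols[i] else 0 for j in range(n_b)] for i in range(n_a)]
        c = matmul(m, transpose(matmul(m, b)))
        if all(c[i][j] == 1 for i in range(n_a) for j in range(n_a) if a[i][j] == 1):
            return list(cols)
    return None


path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # path 0 - 1 - 2
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]    # path 1 - 0 - 2
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(find_mapping(path_a, path_b))    # [1, 0, 2]
print(find_mapping(triangle, path_b))  # None
```

The degree filter already removes most candidates in the example: the middle vertex of the first path can only be assigned to vertex 0 of the second, and the triangle has no admissible assignment at all.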

Graph isomorphism has spawned a wealth of literature in the past, which is beyond the scope of this thesis. The interested reader may refer to [4], which provides an enumeration of existing literature. The final result of this phase is a partition of the set of subgraphs of a certain DFG into equivalence classes according to the isomorphism information, i.e. all elements of an equivalence class are isomorphic to each other.

Graph Covering finally completes ISE, based on the results of the preceding phases, by selecting the most beneficial set of subgraphs to be implemented into an architecture. The benefit of a CI can herein be computed as the number of saved cycles compared to an implementation with primitive operations. Covering has gained wide attention in the past. For simple tree-shaped patterns [36], optimal results can be obtained in linear time as already described in Section 3.1.3. However, this is mostly too restrictive as CIs are usually represented by Multiple-Input-Multiple-Output (MIMO) patterns. Such patterns are not matchable within a single DFT. Therefore graph-based covering methodologies have to be applied, which naturally feature exponential runtime complexity (if optimal) due to the NP-completeness of the problem.
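A strongly simplified view of the selection step can be sketched as follows. The sketch assumes that each candidate is given as a pair of merit (saved cycles) and vertex set, that selected patterns must not overlap, and that uncovered vertices are implemented by primitive operations with merit 0. It ignores scheduling effects and dependencies between selected patterns, so it is an illustration of the exponential search space rather than a covering algorithm from the literature. The name `best_cover` is hypothetical.

```python
def best_cover(candidates):
    """Select non-overlapping candidates with maximum total merit.

    `candidates` is a list of (merit, frozenset_of_vertices).
    Exhaustive search: every candidate is either skipped or taken,
    which gives up to 2^n combinations for n candidates.
    """
    def search(index, used):
        if index == len(candidates):
            return 0, []
        # Option 1: skip the current candidate.
        best_merit, best_choice = search(index + 1, used)
        # Option 2: take it, if it does not overlap with earlier choices.
        merit, nodes = candidates[index]
        if not (nodes & used):
            sub_merit, sub_choice = search(index + 1, used | nodes)
            if merit + sub_merit > best_merit:
                best_merit = merit + sub_merit
                best_choice = [candidates[index]] + sub_choice
        return best_merit, best_choice

    return search(0, frozenset())


candidates = [
    (2, frozenset("abc")),
    (1, frozenset("cd")),
    (1, frozenset("de")),
    (1, frozenset("ab")),
]
total, chosen = best_cover(candidates)
print(total)                             # 3
print([sorted(s) for _, s in chosen])    # [['a', 'b', 'c'], ['d', 'e']]
```

Here the pattern over {a, b, c} combined with the pattern over {d, e} saves three cycles, whereas any selection that starts with {a, b} saves at most two.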

3.3 Concluding Remarks

Although orthogonal in general, automatic ISE and compilation of high-level languages share an essential commonality: both processes identify a mapping from a given program representation to hardware instructions of a processor's ISA. It is exactly this commonality that motivates a combined treatment of compilation and ISE. Typical approaches of ISE are restricted to only a small number of basic blocks of an application, which have been identified in advance as hotspots by some profiler. Based on these basic blocks, instruction patterns are identified under the premise of maximizing the number of contained operations inside each pattern. Such patterns naturally bear a high degree of complexity and are therefore not easily applicable for compilation. In particular, if the IR-patterns of identified hardware instructions feature a fan-out larger than one, DFT-based pattern matching algorithms are not capable of handling them. Complex hardware instructions are therefore usually ignored by the code-selection phase and instead handled as Compiler Known Functions (CKF) or intrinsics. Basically, CKFs make assembly instructions accessible within high-level code, where the compiler expands a CKF call like a macro. The procedure implies a manual modification of given applications, which is time-consuming, error-prone and furthermore restricts the utilization of hardware instructions to a small number of selected hotspots. To overcome this problem, ISE has to identify small reusable instructions, whose effectiveness is based on a high number of occurrences instead of on a high number of contained operations, while the code-selection phase of a compiler has to incorporate a graph-based pattern matching algorithm in order to handle arbitrary instruction patterns including those with a fan-out larger than one.

Chapter 4

Case study: Compiler-Agnostic Architecture Exploration

[Figure: stacked columns of clock cycles (axis from 0 to 45000) over packet size in bytes (64, 128, 256, 512, 1024, 1536), with one SW and one HW column per packet size. For each packet size, the left column represents a full software implementation and the right column represents a full hardware implementation. Legend: Normal Rx Ops, SPD Look Up, SA Look Up, Driver Overhead, Encryption + Authentication, Interrupt Servicing, SA Update, Normal Routing, Reformatting, Normal Tx Ops.]

Figure 4.1– Break-up of tasks in typical VPN traffic.

Integrating security guarantees into the IP stack inevitably influences overall IP processing performance (cf. Chapter 2.2.1). Figure 4.1¹ shows break-ups of VPN-related implementation tasks and their execution time in correlation to packet sizes. The columns alternately represent implementations of VPN (via IPSec) in full software and hardware, starting with a software implementation processing an incoming packet size of 64 bytes. The figure identifies data encryption as the most computation-intensive task in IPSec (especially for large IP packets). For the design of application-specific hardware, it is therefore one of the most promising candidates to increase overall packet processing performance through dedicated SFUs. Nonetheless, encryption algorithms are subject to continuous change. They are regularly cracked or replaced by newer ones, which is why reuse opportunities of SFUs towards newer algorithms have to be considered. The implementation of such algorithms in hardware (i.e. as a separate ASIC) indeed offers the best performance, yet it forfeits reusability with respect to different algorithms. For this reason a solution based on a programmable core is preferable.

¹ Based on a presentation slide of the “Stay Smart” road show by Motorola in March 2004.

This chapter showcases the development of a programmable coprocessor for efficient IPSec encryption. The case study aims at illustrating the methodology of iterative architecture exploration using the tool suite of the Synopsys Processor Designer. Through the design of a programmable coprocessor featuring a customized ISA for the symmetric-key block cipher algorithm Blowfish, a representative example of the efficiency of customized ISE in the domain of protocol processing is given. Here, a coprocessor design provides the loosest coupling (e.g. via shared memory) towards different (main-)processor architectures and hence, increases reusability of encryption-specific CIs as well. The Blowfish algorithm is representative of a vast spectrum of block cipher algorithms due to its simple and common structure. Block cipher algorithms are widely used in the area of encrypting communication channels as found in the Internet. This case study omits the development of an optimized compiler for automatic utilization of encryption-specific CIs to stress the requirement for it (Figure 4.2).

Figure 4.2– Overview of compiler-agnostic architecture exploration.

In fact, encryption-specific CIs are manually utilized through CKFs, which implies the manual modification of targeted applications.

The remainder of this chapter is organized as follows: First, Section 4.1 surveys the applied architecture exploration framework and its methodology. The following Section 4.2 gives an illustration of the target application while focusing on the encryption functionality. This is followed by a detailed presentation of the successive refinement flow for the joint processor/coprocessor optimizations in Section 4.3 as well as the obtained results. Section 4.4 concludes the chapter.

4.1 System Overview

In order to design an efficient NPU, like any other ASIP, DSE (Figure 4.3) at the processor architecture level needs to be performed [147, 152]. It is usually an iterative process beginning with an initial architectural prototype and software implementations of appropriate target applications. The applications are executed and profiled on this prototype to detect performance bottlenecks. Based on profiling results, the designer refines the basic architecture step by step (e.g. by adding CIs or by fine-tuning the architecture) until it is sufficiently tailored to the targeted set of applications.

[Figure: exploration loop diagram. Labels: ADL Description, Generator, Specification. Exploration: C-Compiler, Assembler, Linker, Simulator & Profiler, with evaluation results (profile information, application performance). Implementation: HDL Description, Gate Level Synthesis, with evaluation results (clock speed, chip area, power consumption).]

Figure 4.3– Tool based processor architecture exploration loop.

This iterative exploration approach requires very flexible retargetable software development tools (C-compiler, assembler, co-simulator/debugger etc.) that can be quickly adapted to varying target processor/coprocessor configurations, and a methodology for efficient MP-SoC exploration on the system level. Retargetable tools make it possible to explore many alternative design points in the exploration space within a short time, i.e. without the need for a tedious complete tool re-design. Such development tools are usually derived from a processor model given in a dedicated specification language.