
In the integer programs of the SPEC92 benchmark suite, on average 19% of the executed instructions are branches [Patterson96 (page 105)] and on average 62% of them change the control flow [Patterson96 (page 166)]. This means that about 12% of the instructions executed in these programs change the control flow, almost one in eight. Current Superscalar machines fetch up to four instructions each clock cycle. For the next generation, it is going to be possible to fetch eight or more instructions each cycle. This means that, on average, almost every fetch will contain a branch that changes the control flow. Since these branches are distributed evenly throughout the address space, many of these fetch cycles (almost half for an 8-wide fetch) will be only partially effective. In addition, instead of incrementing the program counter to obtain the next fetch address, Superscalar machines capable of fetching eight instructions per cycle will have to find the target address of a branch (possibly more than one) almost every cycle.
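The arithmetic behind these figures can be checked with a short sketch. The two percentages come from the text. The function names and the independence assumption in the second function are illustrative only and are not part of the original analysis.

```python
# Figures quoted in the text from [Patterson96] for SPEC92 integer programs.
BRANCH_FRACTION = 0.19  # executed instructions that are branches
TAKEN_FRACTION = 0.62   # branches that change the control flow

# Fraction of executed instructions that change the control flow.
P_TAKEN = BRANCH_FRACTION * TAKEN_FRACTION


def expected_taken_branches(width, p=P_TAKEN):
    """Average number of control-flow-changing branches in a fetch of `width` instructions."""
    return width * p


def p_fetch_has_taken_branch(width, p=P_TAKEN):
    """Probability that a fetch holds at least one such branch.

    Assumption (not from the text): each instruction slot is an
    independent trial with probability `p`.
    """
    return 1.0 - (1.0 - p) ** width


if __name__ == "__main__":
    print(f"taken-branch fraction: {P_TAKEN:.4f}")
    for width in (4, 8):
        print(width, expected_taken_branches(width), p_fetch_has_taken_branch(width))
```

For an 8-wide fetch the expected count is about 0.94 branches per fetch, which is the sense in which almost every fetch contains one. Under the simplified independence model, the probability of at least one such branch in a given fetch is about 0.63.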

Several high bandwidth fetch mechanisms based on the conventional instruction cache have been proposed [Conte95a, Seznec96, Yeh93b]. In such mechanisms, on every cycle instructions from non-contiguous locations in the instruction cache are fetched and assembled into dynamic sequences using information collected by dynamic branch predictors. To do this, branch target tables are inspected and pointers are generated to all non-contiguous instruction blocks. A moderately to highly interleaved instruction cache is accessed and provides multiple lines simultaneously. These lines are aligned by an alignment network, which then sends the instructions to the decode stage of the Superscalar processor.

The disadvantage of these high bandwidth fetch mechanisms is their complexity. Sophisticated dynamic branch predictors, interleaved multiport instruction caches, and complex alignment networks are required to make them work. The Trace Cache architecture, on the other hand, avoids this complexity by caching dynamic instruction sequences, rather than only the information for constructing them [Rotenberg96]. A diagram of the Trace Cache architecture is shown in Figure 3.1.

Machines employing the Trace Cache architecture take advantage of code execution locality to achieve performance. A machine that follows this architecture fetches instructions from the instruction cache and attempts to schedule them across multiple functional units using, for example, Tomasulo's algorithm [Tomasulo67]. These instructions are then grouped by a Fill Unit [Melvin88] and placed in a trace cache, which stores them in execution order, as opposed to the static order determined by the compiler. On an instruction fetch, the trace cache will provide a line of instructions if one is available. This line can encompass more than one line from the instruction cache, because the lines affected by partial fetches caused by taken branches are merged; this increases instruction bandwidth and throughput.
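The fill-and-fetch behaviour described above can be illustrated with a deliberately simplified sketch. Every name here (`FillUnit`, `fetch`, `LINE_WIDTH`, the word-indexed addresses) is hypothetical. Trace lines are indexed by their starting address only, whereas the proposals cited in the text also use branch prediction information to select a trace.

```python
from dataclasses import dataclass, field

LINE_WIDTH = 8  # hypothetical trace line capacity, in instructions


@dataclass
class FillUnit:
    """Groups executed instructions into trace lines, in execution order."""
    trace_cache: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def retire(self, pc):
        """Record the address of one executed instruction."""
        self.pending.append(pc)
        if len(self.pending) == LINE_WIDTH:
            # Index the finished line by the address of its first instruction.
            self.trace_cache[self.pending[0]] = tuple(self.pending)
            self.pending = []


def fetch(pc, trace_cache, icache_line_size=4):
    """Return the instruction addresses supplied by one fetch starting at `pc`."""
    line = trace_cache.get(pc)
    if line is not None:
        # Trace cache hit: a dynamic sequence that may span taken branches.
        return list(line)
    # Miss: sequential fetch up to the end of the instruction cache line.
    end = (pc // icache_line_size + 1) * icache_line_size
    return list(range(pc, end))


if __name__ == "__main__":
    fill_unit = FillUnit()
    # Executed path: 0, 1, 2, then a taken branch at 2 to address 10.
    for pc in [0, 1, 2, 10, 11, 12, 13, 14]:
        fill_unit.retire(pc)
    print(fetch(0, {}))                     # conventional fetch only
    print(fetch(0, fill_unit.trace_cache))  # fetch served by the trace cache
```

In this toy model a conventional fetch at address 0 supplies only the sequential instructions of one instruction cache line, while the trace cache supplies the full eight-instruction dynamic sequence across the taken branch.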

The Trace Cache architecture has attracted significant research interest due to its potential for supplying enough instructions to make aggressively parallel Superscalar machines viable, and several aspects of it have been studied recently [Friendly97, Jacobson97, Patel97, Patt97, Rotenberg97, Smith_JE97, Vajapeyam97, Friendly98, Patel98, Patel99, Rotenberg99].

The Trace Cache architecture is an enhanced Superscalar architecture. Therefore, it has the same dynamic scheduling overheads as Superscalars (Subsection 2.1.2). These dynamic scheduling overheads can be particularly severe in aggressively parallel Superscalars, and may substantially lengthen their clock cycle time.

According to Hara et al. [Hara96], logic fan-out and wire delays are the most important scheduling overheads of aggressively parallel Superscalar-like machines. The main fan-out overheads are caused by the bypass logic, that is, the logic that forwards the functional units' results to all instructions in the instruction window or reservation stations of the machine. The main wire delay overheads, the bypass wire delays, are caused by the long wires necessary to connect these functional units to the various instructions in the instruction window (or reservation stations). In the near future, wire delays are expected to dominate the clock cycle time of Superscalar-like machines [Matzke97]. VLIW and DTSVLIW machines do not need hardware mechanisms equivalent to instruction windows or reservation stations in their main data path and do not suffer from their characteristic bypass logic and bypass wire delay overheads. Bypass logic and bypass wires are of course necessary in VLIW and DTSVLIW machines. However, they connect functional units' outputs to functional units' inputs only, and not functional units' outputs to several reservation stations at the input of each functional unit, or to all instructions of a large instruction window. Therefore, VLIW and DTSVLIW machines can have a faster clock than Superscalar-like machines even considering wire delays [Hara96].

Palacharla, Jouppi, and Smith have studied the impact of the complexity of the instruction dispatch and instruction issue hardware, and the impact of the bypass logic (fan-out) and wire delay, on the performance of future Superscalars [Palacharla97]. Like Hara and his colleagues [Hara96], they have concluded that wire delays are going to dominate the clock cycle time of Superscalar machines. They have also concluded that the delays incurred by the wakeup logic and selection logic may also impact the clock cycle time of Superscalars. The wakeup logic is responsible for matching the results produced by functional units with the source operands of instructions waiting in the instruction window or reservation stations and for setting the instructions as ready. The selection logic is responsible for selecting instructions for execution from the pool of ready instructions. To reduce the impact of wire delays and wakeup and selection logic overheads, Palacharla, Jouppi, and Smith have suggested dividing the Superscalar core into several smaller clusters of functional units. (A similar proposal specifically tailored to Trace Cache architectures is reported in [Vajapeyam97].) The resulting architecture has been named the Dependence-Based architecture. This architecture groups dependent instructions and sends them to the same cluster. This grouping of dependent instructions in clusters simplifies the wakeup and selection logic and helps mitigate the wire delays to some extent by using short local connections more frequently than long inter-cluster connections. The performance of the Dependence-Based architecture is inferior to that of a non-clustered Superscalar, however. Therefore, the DTSVLIW can compete in performance with this variant of Superscalar as well. Moreover, we believe that clustering can also be employed in the DTSVLIW, although we do not examine clustered DTSVLIWs in this thesis.
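The roles of the wakeup and selection logic described above can be made concrete with a small software sketch. It is only an illustration: the `Entry` type, the tag strings, and the oldest-first selection policy are assumptions, and real hardware performs these steps in parallel rather than in loops.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One instruction waiting in the instruction window (hypothetical model)."""
    name: str
    sources: set        # tags of source operands not yet produced
    ready: bool = False


def wakeup(window, result_tags):
    """Match broadcast result tags against the sources of every waiting entry."""
    for entry in window:  # every entry is examined: cost grows with window size
        entry.sources -= result_tags
        if not entry.sources:
            entry.ready = True


def select(window, issue_width):
    """Choose up to `issue_width` ready entries (oldest first) and remove them."""
    chosen = [entry for entry in window if entry.ready][:issue_width]
    for entry in chosen:
        window.remove(entry)
    return chosen


if __name__ == "__main__":
    window = [
        Entry("i1", {"r1"}),
        Entry("i2", {"r1", "r2"}),
        Entry("i3", set()),
    ]
    wakeup(window, {"r1"})  # a functional unit produces the value tagged r1
    issued = select(window, issue_width=2)
    print([entry.name for entry in issued], [entry.name for entry in window])
```

The loop in `wakeup` visits every window entry for each broadcast result, which is the software analogue of the fan-out that grows with the size of the instruction window.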

[Figure 3.1 diagram. Blocks shown: Instruction Cache, Dispatch Hardware, IW & Issue Hardware, five FUs, Fill Unit, Trace Cache, Data Cache, Main Memory.]

Figure 3.1: Trace Cache Architecture. IW stands for instruction window and FU for functional unit.