
Loop pipelining and loop unrolling are TRICs that can be applied when algorithmic loops are the bottleneck of a system. I illustrate the challenge to be overcome in pipelining loops with the simple code example shown in Figure 5.11. This example represents a generic code fragment that iterates on every data set using a for loop. In the example, the symbols s1 through s8 represent statements.

A direct translation of this code into data-driven hardware typically yields the schematic structure shown in Figure 5.12. Each statement in the code—s1 through s8—becomes a pipeline stage. Statements s3 through s6 along with the for statement compose the loop block. During operation, the environment supplies data sets to the system via the read block and receives the outputs as they are completed via the write block.

A key observation is that the loop block has the same external interface as an individual pipeline stage, even though internally it contains an entire ring for iterative computation. Specifically, the loop block accepts one data set from the predecessor stage, performs a calculation, and passes the computed result on to a successor stage. It does not accept new data until the results of the calculation have been accepted by the successor stage.

Interestingly, the presence of a loop in the compute body has the same effect on performance as a single pipeline stage with long latency: it limits the throughput of the pipeline as a whole. Figure 5.12 illustrates this bottleneck scenario by indicating the presence of data with a dot. Stages downstream from the loop block are idle while the stages upstream from the loop block are stalled.

    func compute(input)
      s1;
      s2;
      for i = 1 to N
        s3;
        s4;
        s5;
        s6;
      end
      s7;
      s8;
      return(result)

Figure 5.11: Sample code for a for loop

[Diagram: read(), s1, s2, loop block (for N; s3, s4, s5, s6), s7, s8, write()]

Figure 5.12: A simple implementation of the compute function

[Diagram: s1, s2, interface, ring of s3, s4, s5, s6 with i < N and i++, s7, s8]

Figure 5.13: A loop pipelined implementation of the compute function

Loop Pipelining Approach

The key difference between my approach and previous work is that it allows multiple data sets into the ring concurrently. Allowing multiple data sets to enter reduces pipeline stalls in the stages before the loop and reduces starvation in the stages after the loop. The net effect is increased throughput for the system as a whole. Figure 5.13 shows the basic concept behind transforming the code of Figure 5.11 into a pipelined loop that can hold multiple data sets. In this example, the loop body holds three data sets, thereby reducing stalling before the loop and starvation after the loop.

Allowing multiple data sets within the ring concurrently presents three challenges: (i) flow control, that is, managing the entry and exit of data sets in the ring and preventing collisions; (ii) congestion prevention, that is, preventing the ring from becoming overfull, which leads to slowing down or stalling; and (iii) data hazard prevention, that is, allowing concurrency while preventing separate data sets from overwriting each other's values.

Figure 5.14: My method of loop flow control.

Figure 5.15: Congestion prevention for pipelined loops.

Flow Control. I resolve the challenge of flow control by adding two stages to the interface of the loop: a conditional split (i.e., a pipelined if construct) and an arbitration stage. The arbiter prevents newly entering data from colliding with data already cycling in the ring. The if stage checks the boolean loop condition and sends the data either out of the ring or back in through the arbiter. Together, these two stages of the ring interface manage the entry and exit of data sets and prevent collisions. Figure 5.14 shows the use of these two stages.
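To make the interplay of the two interface stages concrete, the following is a minimal Python sketch of a ring stepped in discrete time. It is a software toy, not the asynchronous hardware: the function name `run_ring`, the one-slot-per-step timing, and the policy of giving recirculating data priority over new data at the arbiter are all assumptions made for illustration.

```python
from collections import deque


def run_ring(inputs, n_iters, ring_slots):
    """Toy model of the 'if' split and the arbiter (hypothetical, for illustration).

    Each ring slot holds at most one data set, stored as (name, iterations_done).
    Returns the order in which data sets leave the ring and the number of steps.
    """
    ring = [None] * ring_slots      # ring[0] is the slot fed by the arbiter
    waiting = deque(inputs)         # data sets queued before the loop
    done = []
    steps = 0
    while waiting or any(slot is not None for slot in ring):
        steps += 1
        # Conditional split: the data set leaving the last slot has finished
        # one more iteration and either exits or is sent back to the arbiter.
        recirculating = None
        tail = ring[-1]
        if tail is not None:
            name, iterations_done = tail
            iterations_done += 1
            if iterations_done < n_iters:
                recirculating = (name, iterations_done)
            else:
                done.append(name)
        # Every remaining data set advances by one slot.
        ring = [None] + ring[:-1]
        # Arbiter: only one source may fill the first slot in a given step.
        # Assumed policy: recirculating data wins over newly arriving data.
        if recirculating is not None:
            ring[0] = recirculating
        elif waiting:
            ring[0] = (waiting.popleft(), 0)
    return done, steps
```

For example, `run_ring(["A", "B", "C"], n_iters=2, ring_slots=3)` lets all three data sets share the ring; each one exits after its second pass through the if stage, and the arbiter never writes two data sets into the same slot.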

Ring Congestion. Addressing the challenge of ring congestion requires some information about the performance of the ring. In particular, as described in Section 2.2.2, every ring has some ideal occupancy (i.e., number of data sets) at which it attains maximum throughput. This ideal occupancy can be found through timing analysis of the ring or through simulation.

Two stages in the loop interface are used to enforce the ideal occupancy: the bouncer and the counter. The counter dynamically keeps track of the number of data sets within the ring; it is notified whenever a data set enters and whenever one leaves. The bouncer uses the information from the counter to determine whether the ring is congested, and it allows a new data set to enter only when the ring is not congested. Figure 5.15 shows these two stages added to the interface, with arrows to indicate communication between them.
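The counter and bouncer can be sketched in the same discrete-time style. This is a hedged illustration under stated assumptions rather than the hardware design: the class names `Counter` and `Bouncer`, the function `steps_to_drain`, and the instantaneous counter update are hypothetical (a real counter has nonzero latency), and this simple slot model does not reproduce the slowdown of an overfull asynchronous ring.

```python
from collections import deque


class Counter:
    """Tracks how many data sets are currently inside the ring."""

    def __init__(self):
        self.count = 0

    def entered(self):
        self.count += 1

    def left(self):
        self.count -= 1


class Bouncer:
    """Admits new data only while the ring is below its ideal occupancy."""

    def __init__(self, counter, ideal_occupancy):
        self.counter = counter
        self.ideal_occupancy = ideal_occupancy

    def admits(self):
        return self.counter.count < self.ideal_occupancy


def steps_to_drain(num_sets, n_iters, ring_slots, ideal_occupancy):
    """Steps needed to push num_sets data sets through the toy ring."""
    counter = Counter()
    bouncer = Bouncer(counter, ideal_occupancy)
    ring = [None] * ring_slots
    waiting = deque(range(num_sets))
    steps = 0
    while waiting or counter.count > 0:
        steps += 1
        recirculating = None
        tail = ring[-1]
        if tail is not None:
            ident, iterations_done = tail
            if iterations_done + 1 < n_iters:
                recirculating = (ident, iterations_done + 1)
            else:
                counter.left()          # the counter is notified of the exit
        ring = [None] + ring[:-1]
        if recirculating is not None:
            ring[0] = recirculating
        elif waiting and bouncer.admits():
            ring[0] = (waiting.popleft(), 0)
            counter.entered()           # the counter is notified of the entry
    return steps
```

Under these assumptions, draining six data sets of four iterations each through a three-slot ring takes fewer steps with `ideal_occupancy=3` than with `ideal_occupancy=1`, which mirrors the reduced stalling and starvation described above.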

Data Hazards. To prevent data sets from interfering with each other (i.e., data hazards), I replicate some of the variables to ensure sufficient storage when operating on many concurrent data sets. In particular, each stage must store its own copy of the context, that is, the collection of all variables needed by that stage or one of its successor stages. For example, the code in Figure 5.16 contains a loop that uses four different pieces of data: a, b, c, and i. Figure 5.17 shows a loop pipelined implementation of this code. Notice that since the counting variable, i, can have a different value for each data set, the value is stored in every stage to prevent collisions. Notice also that the value a is not stored in every stage; it is produced in stage 2, consumed in stage 3, and not needed in any subsequent stage.

    func compute (b,c)
      var a, i;
      for i = 1 to N
        a = b*2;
        b = a+c;
      end
      return t6

Figure 5.16: Sample code with feedback

Figure 5.17: Preventing data hazards in pipelined loops.

More formally, I solve the problem of data hazards by storing the context for each data set locally in each stage's storage element, and not in any centralized memory location. I determine which variables need to be latched in each stage using live variable data flow analysis, a method for determining which values may be used later and which can be discarded. In particular, the IN set of each stage is the set of data values that are required to go into a code fragment, either because they may be used inside it or because they may need to be relayed to a successor of that fragment. Each stage must store its IN set, so that every variable that is live at that stage has a storage space.
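As an illustration of the IN-set computation, the following Python sketch runs a standard backward live-variable analysis to a fixed point over a ring of stages. The decomposition of the loop of Figure 5.16 into the four stages `mul`, `add`, `inc`, and `cmp`, their ordering, the treatment of N as a constant, and the assumption that the stage after the loop reads b are all hypothetical choices made for this example; the stages of Figure 5.17 may be arranged differently.

```python
def live_in_sets(stages, successors, exit_live):
    """Compute IN(stage) = USE(stage) | (OUT(stage) - DEF(stage)) to a fixed point.

    stages:     dict mapping stage name -> (use_set, def_set)
    successors: dict mapping stage name -> list of successor names;
                the special name "exit" stands for the stage after the loop
    exit_live:  variables assumed to be needed after the loop
    """
    live_in = {name: set() for name in stages}
    changed = True
    while changed:                      # iterate because the ring is cyclic
        changed = False
        for name, (use, define) in stages.items():
            out = set()
            for succ in successors[name]:
                out |= exit_live if succ == "exit" else live_in[succ]
            new_in = use | (out - define)
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


# Hypothetical staging of the loop body "a = b*2; b = a+c;" plus loop control.
stages = {
    "mul": ({"b"}, {"a"}),          # a = b*2
    "add": ({"a", "c"}, {"b"}),     # b = a+c
    "inc": ({"i"}, {"i"}),          # i = i+1
    "cmp": ({"i"}, set()),          # i < N, with N treated as a constant
}
successors = {
    "mul": ["add"],
    "add": ["inc"],
    "inc": ["cmp"],
    "cmp": ["mul", "exit"],         # back into the ring, or out of it
}
in_sets = live_in_sets(stages, successors, exit_live={"b"})
```

With this staging, `i` appears in the IN set of every stage, while `a` appears only in the IN set of the stage that consumes it, which matches the observation about Figure 5.17.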

Loop Unrolling

Unrolling the loop body to form a ring with a greater number of stages can greatly improve performance. Intuitively, this improvement results from the duplication of hardware inside the loop and a corresponding increase in the ring occupancy. Thus, the unrolled loop is able to perform more "work" per unit time. More formally, the ring frequency (at ideal occupancy, cf. Equation 2.7) remains largely unchanged when the loop is unrolled, but every "tick" at the loop's interface now represents a greater amount of work completed. In particular, if the loop is unrolled u times, every time a data set crosses the interface stage, it indicates that u iterations have just been completed on that data set, rather than just one. Therefore, ignoring overheads, the loop's effective computation throughput increases by a factor equal to the number of times it is unrolled, u.
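The argument above can be condensed into one relation. The symbols are introduced here for illustration and are not the notation of Equation 2.7: $f_{\text{ring}}(u)$ denotes the rate at which data sets cross the interface stage at ideal occupancy for an unroll factor $u$, and $\Theta(u)$ denotes the number of loop iterations completed per unit time, with helper-stage overheads ignored.

```latex
\Theta(u) = u \cdot f_{\text{ring}}(u), \qquad
f_{\text{ring}}(u) \approx f_{\text{ring}}(1)
\;\Longrightarrow\;
\frac{\Theta(u)}{\Theta(1)} \approx u
```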

As a second-order effect, loop unrolling also somewhat reduces the overhead of the special-purpose "helper" stages: the bouncer, the counter, and the arbiter. That overhead is now amortized over a larger ring. As a result, the latency of each data set tends to decrease somewhat and hardware utilization increases slightly. One possible negative effect of loop unrolling is that it can cause some data sets to be iterated over more times than necessary, so extra checks are required within the unrolled loop to preserve the semantics of the computation.
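The need for extra checks can be seen in a software analogue. The sketch below, which is only an illustration and not the hardware transformation itself, unrolls the loop body of Figure 5.16 twice; the function names `body`, `rolled`, and `unrolled_by_2` are hypothetical. Without the check between the two copies of the body, an odd iteration count would execute one iteration too many.

```python
def body(b, c):
    """One iteration of the loop body from Figure 5.16."""
    a = b * 2
    return a + c


def rolled(b, c, n):
    """Reference version: the body runs exactly n times."""
    for _ in range(n):
        b = body(b, c)
    return b


def unrolled_by_2(b, c, n):
    """The same loop unrolled twice, with an extra exit check mid-lap."""
    i = 0
    while i < n:
        b = body(b, c)
        i += 1
        if i >= n:      # extra check: leave mid-lap when n is odd
            break
        b = body(b, c)
        i += 1
    return b
```

Both versions return the same value for every iteration count, including zero and odd counts, because the extra check stops the unrolled loop between the two copies of the body.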

Although loop unrolling is a common technique in both software and hardware optimization, its goals and performance effects generally differ between the two. In software compilers, the primary benefit of loop unrolling is to introduce more room for instructions to be reordered, with the purpose of reducing stalls due to branch and data hazards. In hardware translation approaches, such as [8, 63], loop unrolling is used in conjunction with compaction to increase concurrency within the loop body. However, these approaches do not allow an increase in the occupancy of the loop and therefore obtain limited throughput improvement. In contrast, unrolling pipelined loops increases the loop occupancy by the same factor as the number of times the loop is unrolled, thereby obtaining dramatically higher speedup.

Performance Benefit and Overheads

Previous work on performance analysis of rings allows us to predict the speedup obtained by the use of our method. As discussed in Section 2.2.2 and shown in Figure 2.6, the ring frequency is proportional to the total number of data sets that are revolving inside the ring, as long as the ring is not congested (i.e., "hole limited"). Therefore, the maximum speedup of our approach is proportional to the ideal ring occupancy.
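Stated as a formula, with notation assumed here for illustration only: let $f(k)$ be the ring frequency when $k$ data sets revolve inside the ring, and let $k^{*}$ be the ideal occupancy. Before the overheads discussed below are taken into account, proportionality in the uncongested region gives:

```latex
f(k) \approx k \, f(1) \quad \text{for } 1 \le k \le k^{*}
\qquad\Longrightarrow\qquad
\text{speedup}_{\max} = \frac{f(k^{*})}{f(1)} \approx k^{*}
```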

The speed improvement of our method is largely due to improved hardware utilization. A ring that holds only one data set leaves much of its hardware unused at any given time. By allowing multiple data sets, we are able to obtain high hardware utilization from the components within the ring.

Our approach adds some overhead that decreases the actual speedup and increases total area. Certain "helper" stages, such as the bouncer, counter, and arbiter, increase the latency of the ring and add a constant area overhead for each loop. Also, any counter implementation must have some latency and therefore will not allow new data to enter immediately after old data leaves. The effect is that the ring may be operating at an occupancy that is, on average, slightly lower than the nominal occupancy. This results in a decrease in throughput only if this slightly lower occupancy has a throughput lower than the maximum. The most notable increase in area comes from the extra storage elements that are required to hold the entire context at each ring stage. This overhead is necessary to allow each data set to hold its own copy of the loop's context, thereby enabling multiple data sets to coexist independently within the ring.
