In this chapter, the instruction mov pr.rot = immed is used to initialize rotating predicates. This instruction ignores the value of CFM.rrb.pr. Thus, the examples in this chapter are written assuming that CFM.rrb.pr is always zero prior to the initialization of predicate registers using mov pr.rot = immed.
11.4.3 Software-pipelined Loop Branches
The special software-pipelined loop branches allow the compiler to generate very compact code for software-pipelined loops by supporting register rotation and by controlling the filling and draining of the software pipeline during the prolog and epilog phases. Generally speaking, each time a software-pipelined loop branch is executed, the following actions take place:
1. A decision is made on whether or not to continue kernel loop execution.
2. p16 is set to a value to control execution of the stages of the software pipeline (p63 is written by the branch, and after rotation this value will be in p16).
3. The registers are rotated (rrb registers are decremented).
4. The Loop Count (LC) and/or the Epilog Count (EC) application registers are selectively decremented.
There are two types of software-pipelined loop branches: counted and while.
11.4.3.1 Counted Loop Branches
Figure 11-1 shows a flowchart for modulo-scheduled counted loop branches.
During the prolog and kernel phase, a decision to continue kernel loop execution means that a new source iteration is started. Register rotation must occur so that the new source iteration does not overwrite registers that are in use by prior source iterations that are still in the pipeline. p16 is set to 1 to enable the stages of the new source iteration. LC is decremented to update the count of remaining source iterations. EC is not modified.
During the epilog phase, the decision to continue loop execution means that the software pipeline has not yet been fully drained and execution of the source iterations in progress must continue. Register rotation must continue because the remaining source iterations are still writing results and the consumers of the results expect rotation to occur. p16 is now set to 0 because there are no more new source iterations and the instructions that correspond to non-existent source iterations must be disabled. EC contains the count of the remaining execution stages for the last source iteration and is decremented during the epilog. For most loops, executing a software-pipelined loop branch with EC equal to 1 indicates that the pipeline has been drained, and a decision is made to exit the loop. The special case in which a software-pipelined loop branch is executed with EC equal to 0 can occur in unrolled software-pipelined loops if the target of the cexit branch is set to the next sequential bundle.
There are two types of software-pipelined loop branches for counted loops. br.ctop is taken when a decision to continue kernel loop execution is made, and is not taken otherwise. It is used when the loop execution decision is located at the bottom of the loop. br.cexit is not taken when a decision to continue kernel loop execution is made, and is taken otherwise. It is used when the loop execution decision is located somewhere other than the bottom of the loop.
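The decision logic described above for br.ctop and br.cexit can be sketched as a small Python model. This is only an illustrative sketch, not the architectural definition: the LoopState class, the counted_loop_branch function, and the single rrb counter standing in for all rotating register bases are hypothetical names introduced here. The EC equal to 0 case follows the "special unrolled loops" path of Figure 11-1, in which nothing is decremented or rotated.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    lc: int        # Loop Count application register
    ec: int        # Epilog Count application register
    rrb: int = 0   # hypothetical single counter standing in for the rrb registers


def counted_loop_branch(state, kind="ctop"):
    """Apply one br.ctop or br.cexit to state; return (taken, value written to p63)."""
    if state.lc != 0:
        # Prolog / kernel: start a new source iteration.
        state.lc -= 1
        state.rrb -= 1
        p63, continue_loop = 1, True
    elif state.ec > 1:
        # Epilog: the pipeline is still draining.
        state.ec -= 1
        state.rrb -= 1
        p63, continue_loop = 0, True
    elif state.ec == 1:
        # Epilog: last stage drained; rotate one final time and exit.
        state.ec -= 1
        state.rrb -= 1
        p63, continue_loop = 0, False
    else:
        # EC == 0 (special unrolled loops): nothing is decremented or rotated.
        p63, continue_loop = 0, False
    # ctop is taken when the loop continues; cexit is taken when it does not.
    taken = continue_loop if kind == "ctop" else not continue_loop
    return taken, p63
```

For example, starting from LoopState(lc=1, ec=2), two successive br.ctop executions are taken (the first writes 1 to p63, the second writes 0), and a third execution with EC equal to 1 falls through.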
11.4.3.2 Counted Loop Example
A conceptual view of a pipelined iteration of the example counted loop on page 11-1 with II equal to one is shown below:
stage 1: (p16) ld4 r4 = [r5],4
stage 2: (p17) --- // empty stage
stage 3: (p18) add r7 = r4,r9
stage 4: (p19) st4 [r6] = r7,4
To generate an efficient pipeline, the compiler must take into account the latencies of instructions and the available functional units. For this example, the load latency is two and the load and add are scheduled two cycles apart. The pipeline below is coded assuming there are two memory ports and the loop count is 200.
Note: Rotating GRs are now used in the code (the conceptual view above did not use them). Also, induction variables that are post-incremented must be allocated to the static portion of the register file:
mov lc = 199 // LC = loop count - 1
mov ec = 4 // EC = epilog stages + 1
mov pr.rot = 1<<16 ;; // PR16 = 1, rest = 0
L1:
(p16) ld4 r32 = [r5],4 // Cycle 0
(p18) add r35 = r34,r9 // Cycle 0
(p19) st4 [r6] = r36,4 // Cycle 0
br.ctop L1 ;; // Cycle 0

Figure 11-1. ctop and cexit Execution Flow (flowchart not reproduced: LC != 0 is the prolog/kernel path; LC == 0 is the epilog path, which splits on EC > 1, EC == 1, and EC == 0 for special unrolled loops)
The memory ports are fully utilized. Table 11-1 shows a trace of the execution of this loop.
In cycle 3, the kernel phase is entered and the fourth iteration of the kernel loop executes the ld4, add, and st4 from the fourth, second, and first source iterations respectively. By cycle 200, all 200 loads have been executed, and the epilog phase is entered. When the br.ctop is executed in cycle 202, EC is equal to 1. EC is decremented, the registers are rotated one last time, and execution falls out of the kernel loop.
Note: After this final rotation, EC and the stage predicates (p16 – p19) are 0.
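The trace in Table 11-1 can be reproduced with a short simulation. The sketch below is a simplified model, not a cycle-accurate one: trace_ctop_loop is a hypothetical helper, only p16 through p19 are modeled, and it assumes EC is at least 1 on entry so the EC equal to 0 path is never reached.

```python
def trace_ctop_loop(loop_count=200, epilog_count=4):
    """Return (rows, final_preds, lc, ec); each row is the state before a br.ctop."""
    lc, ec = loop_count - 1, epilog_count      # mov lc = 199 / mov ec = 4
    preds = [1, 0, 0, 0]                       # p16..p19 after mov pr.rot = 1<<16
    guarded = ["ld4", None, "add", "st4"]      # instruction guarded by p16..p19
    rows = []
    cycle = 0
    while True:
        executed = [ins for ins, p in zip(guarded, preds) if ins and p]
        rows.append((cycle, executed, tuple(preds), lc, ec))
        # br.ctop: pick the value written to p63 and update LC or EC.
        if lc != 0:
            lc, p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, p63, taken = ec - 1, 0, True
        else:                                  # ec == 1: final rotation, then exit
            ec, p63, taken = ec - 1, 0, False
        # Rotation: p63 becomes p16, p16 becomes p17, and so on.
        preds = [p63] + preds[:-1]
        cycle += 1
        if not taken:
            return rows, tuple(preds), lc, ec


rows, final_preds, lc, ec = trace_ctop_loop()
for row in (rows[0], rows[3], rows[200], rows[202]):
    print(row)
print("after exit:", final_preds, lc, ec)
```

With the default arguments the model executes the branch 203 times (cycles 0 through 202) and leaves p16 through p19 and EC at 0, in agreement with the note above.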
It is desirable to allocate loop-variant variables to the rotating portion of the register file whenever possible, to preserve space in the static portion for loop-invariant variables. Induction variables that are post-incremented must be allocated to the static portion of the register file.
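The register numbers in the example loop (ld4 writes r32, add reads r34 two stages later; add writes r35, st4 reads r36 one stage later) follow from rotation: each time the rrb is decremented, a value becomes visible under the next higher register number. The toy model below illustrates this renaming. The RotatingRegs class and its region size of 8 are assumptions made for illustration, and the mapping shown (register number plus rrb, modulo the region size) is a simplified view of the hardware renaming.

```python
class RotatingRegs:
    """Toy model of a rotating general-register region that starts at r32."""

    def __init__(self, size=8):
        self.size = size              # hypothetical size of the rotating region
        self.phys = [None] * size     # physical storage behind the region
        self.rrb = 0                  # rotating register base

    def _index(self, name):
        # Map a register number such as 32 to a physical slot.
        return (name - 32 + self.rrb) % self.size

    def write(self, name, value):
        self.phys[self._index(name)] = value

    def read(self, name):
        return self.phys[self._index(name)]

    def rotate(self):
        # One rotation: the rrb is decremented (modulo the region size).
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingRegs()
regs.write(32, "loaded value")     # stage 1: ld4 r32 = ...
regs.rotate()                      # end of stage 1
regs.rotate()                      # end of the empty stage 2
print(regs.read(34))               # stage 3: add reads the loaded value as r34
```

Because a post-incremented induction variable such as r5 or r6 must keep the same name in every kernel iteration, this renaming is exactly what makes it unsuitable for the rotating region.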
11.4.3.3 While Loop Branches
Figure 11-2 shows the flowchart for while loop branches.
There are a few differences in the operation of the while loop branch compared to the counted loop branch. The while loop branch does not access LC; a branch predicate determines the behavior of this branch instead. During the kernel and epilog phases, the branch predicate is one and zero, respectively. During the prolog phase, the branch predicate may be either zero or one depending on the scheme used to program the while loop. Also, p16 is always set to zero after rotation. The reasons for these differences are related to the nature of while loops and will be explained in more depth with an example in a later section.
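The differences listed above can be sketched in Python as follows. This is a hedged illustration rather than the definition of the instructions: while_loop_branch is a hypothetical function, the mnemonics br.wtop and br.wexit are the Itanium while loop branches and are not named in this section, and the handling of EC when the branch predicate is zero is assumed to mirror the counted loop epilog; Figure 11-2 is the authoritative description.

```python
def while_loop_branch(branch_pred, ec, rrb, kind="wtop"):
    """Model one br.wtop or br.wexit; return (taken, p63, ec, rrb)."""
    p63 = 0                      # p16 is always zero after rotation
    if branch_pred == 1:
        # Branch predicate set: keep iterating; LC is never consulted.
        rrb -= 1
        continue_loop = True
    elif ec > 1:
        # Predicate clear, pipeline still draining (assumed same as counted case).
        ec -= 1
        rrb -= 1
        continue_loop = True
    elif ec == 1:
        # Predicate clear, last stage drained: rotate once more and exit.
        ec -= 1
        rrb -= 1
        continue_loop = False
    else:
        # EC == 0: nothing is decremented or rotated.
        continue_loop = False
    # wtop is taken when the loop continues; wexit is taken when it does not.
    taken = continue_loop if kind == "wtop" else not continue_loop
    return taken, p63, ec, rrb
```

Unlike the counted case, no path writes 1 to p63, so the stage predicates for new iterations have to be produced by the loop body itself.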
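Table 11-1. ctop Loop Trace
(Columns M, I, M, and B list the instructions issued on each port; columns p16 through EC show the state before br.ctop.)

Cycle | M   | I   | M   | B       | p16 | p17 | p18 | p19 | LC  | EC
0     | ld4 |     |     | br.ctop | 1   | 0   | 0   | 0   | 199 | 4
1     | ld4 |     |     | br.ctop | 1   | 1   | 0   | 0   | 198 | 4
2     | ld4 | add |     | br.ctop | 1   | 1   | 1   | 0   | 197 | 4
3     | ld4 | add | st4 | br.ctop | 1   | 1   | 1   | 1   | 196 | 4
…     | …   | …   | …   | …       |     |     |     |     |     |
100   | ld4 | add | st4 | br.ctop | 1   | 1   | 1   | 1   | 99  | 4
…     | …   | …   | …   | …       |     |     |     |     |     |
199   | ld4 | add | st4 | br.ctop | 1   | 1   | 1   | 1   | 0   | 4
200   |     | add | st4 | br.ctop | 0   | 1   | 1   | 1   | 0   | 3
201   |     | add | st4 | br.ctop | 0   | 0   | 1   | 1   | 0   | 2
202   |     |     | st4 | br.ctop | 0   | 0   | 0   | 1   | 0   | 1
...   |     |     |     |         | 0   | 0   | 0   | 0   | 0   | 0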