The Itanium enables instruction-level parallelism by letting the compiler/assembler explicitly indicate parallelism, by providing run-time support to execute instructions in parallel, and by providing a large number of registers to avoid register contention. First we discuss the instruction groups, and then see how the hardware facilitates parallel execution of instructions by bundling nonconflicting instructions together.
Itanium instructions are bound into instruction groups. An instruction group is a set of instructions that do not have conflicting dependencies among them (read-after-write or write-after-write dependencies, as discussed later on page 115), and may execute in parallel. The compiler or assembler can indicate instruction groups by using the ;; notation. Let us look at a simple example to get an idea. Consider evaluating a logical expression consisting of four terms. For simplicity, assume that the results of these four logical terms are in registers r10, r11, r12, and r13. Then the logical expression in
if (r10 || r11 || r12 || r13) { /* if-block code */
}
can be evaluated using or-tree reduction as
or r1 = r10,r11 /* Group 1 */
or r2 = r12,r13;;
or r3 = r1,r2;; /* Group 2 */
other instructions /* Group 3 */
The first group performs two parallel or operations. Once these results are available, we can compute the final value of the logical expression. This final value in r3 can be used by other instructions to test the condition. Inasmuch as we have not discussed Itanium instructions, it does not make sense to explain these instructions at this point. We have some examples in a later section.
In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources. An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;). An instruction group may also end dynamically at run time by a taken branch.
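The grouping rule can be sketched in a few lines of Python. This is only an illustrative model, not how an Itanium compiler works: the function name split_into_groups and the (destination, sources) tuple layout are assumptions made for this sketch, and it ignores instruction reordering, memory dependencies, and resource limits.

```python
def split_into_groups(instructions):
    """Greedily split a straight-line instruction list into instruction groups.

    Each instruction is (dest, sources). A new group starts (a ;; stop) when an
    instruction reads a register written earlier in the current group
    (read-after-write) or writes one already written there (write-after-write).
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        raw = any(src in written for src in sources)
        waw = dest in written
        if current and (raw or waw):
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# The or-tree example: r1 and r2 are independent, r3 needs both.
or_tree = [
    ("r1", ("r10", "r11")),
    ("r2", ("r12", "r13")),
    ("r3", ("r1", "r2")),
]
groups = split_into_groups(or_tree)
```

Applied to the or-tree example, the sketch places the first two or instructions in one group and the third in a group of its own, matching the ;; placement in the listing.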
[Figure: bits 127-87, instruction slot 2 (41 bits); bits 86-46, instruction slot 1 (41 bits); bits 45-5, instruction slot 0 (41 bits); bits 4-0, template (5 bits).]
Figure 7.3 Itanium instruction bundle format.
An advantage of instruction groups is that they reduce the need to optimize the code for each new microarchitecture. Processors with additional resources can take advantage of the existing ILP in the instruction group.
By means of instruction groups, compilers package instructions that can be executed in parallel. It is the compiler’s responsibility to make sure that instructions in a group do not have conflicting dependencies. Armed with this information, instructions in a group are bundled together as shown in Figure 7.3. Three instructions are collected into 128-bit, aligned containers called bundles. Each bundle contains three 41-bit instruction slots and a 5-bit template field.
The main purpose of the template field is to specify the mapping of instruction slots to execution unit types. Instructions are categorized into six instruction types: integer ALU, non-ALU integer, memory, floating-point, branch, and extended. A specific execution unit may execute each type of instruction. For example, floating-point instructions are executed by the F-unit, branch instructions by the B-unit, and memory instructions such as load and store by the M-unit. The remaining three types of instructions are executed by the I-unit. All instructions, except extended instructions, occupy one instruction slot. Extended instructions, which use long immediate integers, occupy two instruction slots.
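The bundle layout of Figure 7.3 can be modeled with a little bit arithmetic. The following Python sketch assumes the field positions given in the figure (template in bits 4-0, slots 0, 1, and 2 in bits 45-5, 86-46, and 127-87). The function names are hypothetical, and the template and slot values used at the end are arbitrary placeholders rather than real encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if template & ~TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if slot & ~SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return (template
            | (slot0 << 5)     # bits 45..5
            | (slot1 << 46)    # bits 86..46
            | (slot2 << 87))   # bits 127..87


def unpack_bundle(bundle):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle value."""
    return (bundle & TEMPLATE_MASK,
            (bundle >> 5) & SLOT_MASK,
            (bundle >> 46) & SLOT_MASK,
            (bundle >> 87) & SLOT_MASK)


# Arbitrary placeholder values, chosen only to exercise every field.
bundle = pack_bundle(0x10, 0x1, 0x2, SLOT_MASK)
```

The three 41-bit slots and the 5-bit template account for exactly 128 bits, so a bundle whose slot 2 has its top bit set occupies the full container.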
Instruction Set
As in the other chapters, we discuss several sample groups of instructions from the Itanium instruction set.
Data Transfer Instructions
The Itanium’s load and store instructions are more complex than those in a typical RISC processor. The Itanium supports speculative loads to mask high latency associated with reading data from memory.
The basic load instruction takes one of the three forms shown below depending on the addressing mode used:
(qp) ldSZ.ldtype.ldhint r1 = [r3] /* No update form */
(qp) ldSZ.ldtype.ldhint r1 = [r3],r2 /* Update form 1 */
(qp) ldSZ.ldtype.ldhint r1 = [r3],imm9 /* Update form 2 */
The load instruction loads SZ bytes from memory, starting at the effective address. The SZ completer can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes. In the first load instruction,
Chapter 7 • Itanium Architecture 107
register r3 provides the address. In the second instruction, contents of r3 and r2 are added to get the effective address. The third form uses a 9-bit signed immediate value, instead of register r2. In the last two forms, as explained earlier, the computed effective address is stored in r3.
The ldtype completer can be used to specify special load operations. For normal loads, the completer is not specified. For example, the instruction
ld8 r5 = [r6]
loads eight bytes from the memory starting from the effective address in r6. As mentioned before, the Itanium supports speculative loads. Two example instructions are shown below:
ld8.a r5 = [r6] /* advanced load */
ld8.s r5 = [r6] /* speculative load */
We defer a discussion of these load instruction types to a later section that discusses the speculative execution model of Itanium.
The ldhint completer specifies the locality of the memory access. It can take one of the following three values.
ldhint    Interpretation
None      Temporal locality, level 1
nt1       No temporal locality, level 1
nta       No temporal locality, all levels
A prefetch hint is implied in the two “update” forms of load instructions. The address in r3 after the update acts as a hint to prefetch the indicated cache line. In the “no update” form of load, r3 is not updated and no prefetch hint is implied. Level 1 refers to the cache level. Because we don’t cover temporal locality and cache memory in this book, we refer the reader to [6] for details on cache memory. It is sufficient to view the ldhint completer as giving a hint to the processor as to whether a prefetch is beneficial.
The store instruction is simpler than the load instruction. There are two types of store instructions, corresponding to the two addressing modes, as shown below:
(qp) stSZ.sttype.sthint [r3] = r1 /* No update form */
(qp) stSZ.sttype.sthint [r3] = r1,imm9 /* Update form */
The SZ completer can have four values as in the load instruction. The sttype can be none or rel. If the rel value is specified, an ordered store is performed. The sthint gives a prefetch hint as in the load instruction. However, it can be either none or nta. When no value is specified, temporal locality at level 1 is assumed. The nta has the same interpretation as in the load instruction.
The Itanium also has several move instructions to copy data into registers. We describe three of these instructions:
(qp) mov r1 = r3
(qp) mov r1 = imm22
(qp) movl r1 = imm64
These instructions move the second operand into the r1 register. The first two mov instructions are actually pseudoinstructions. That is, these instructions are implemented using other processor instructions. The movl is the only instruction that requires two instruction slots within the same bundle.
Arithmetic Instructions
The Itanium provides only the basic integer arithmetic operations: addition, subtraction, and multiplication. There is no divide instruction, either for integers or floating-point numbers. Division is implemented in software. Let’s start our discussion with the add instructions.
Add Instructions The format of the add instructions is given below:
(qp) add r1 = r2,r3 /* register form */
(qp) add r1 = r2,r3,1 /* plus 1 form */
(qp) add r1 = imm,r3 /* immediate form */
In the plus 1 form, the constant 1 is added as well. In the immediate form, imm can be a 14- or 22-bit signed value. If we use a 22-bit immediate value, r3 must be one of the first four general registers GR0 through GR3 (i.e., only 2 bits are used to specify the second operand register as shown in Figure 7.2).
The immediate form is a pseudoinstruction that selects one of the two processor immediate add instructions,
(qp) add r1 = imm14,r3
(qp) add r1 = imm22,r3
depending on the size of the immediate operand and the value of r3. The move instruction
(qp) mov r1 = r3
is implemented as
(qp) add r1 = 0,r3
The move instruction
(qp) mov r1 = imm22
is implemented as
(qp) add r1 = imm22,r0
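As a rough illustration of how an assembler might resolve these pseudoinstructions, the Python sketch below picks an add form from the immediate size and the register number, and rewrites the two mov forms as add instructions. The function names are hypothetical, and preferring the imm14 form whenever the immediate fits in 14 bits is an assumption of this sketch, not something the text specifies.

```python
def fits_signed(value, bits):
    """True if value is representable as a bits-wide two's complement integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def select_add_form(imm, r3):
    """Pick the processor add instruction for the pseudoinstruction add r1 = imm,r3.

    r3 is the general register number of the second operand.
    Assumption: the 14-bit form is preferred whenever the immediate fits.
    """
    if fits_signed(imm, 14):
        return "imm14"
    if fits_signed(imm, 22) and 0 <= r3 <= 3:   # imm22 needs GR0 through GR3
        return "imm22"
    raise ValueError("immediate does not fit any add form for this register")


def expand_mov(source):
    """Rewrite mov r1 = source as the add instruction that implements it."""
    if isinstance(source, int):
        return f"add r1 = {source},r0"   # mov r1 = imm22
    return f"add r1 = 0,{source}"        # mov r1 = r3
```

For instance, expand_mov("r3") yields the text "add r1 = 0,r3", and an immediate of 100000 is accepted only when the register operand is one of the first four general registers.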
Subtract Instructions The subtract instruction sub has the same format as the add instruction. The contents of register r3 are subtracted from the contents of r2. In the minus 1 form, the constant 1 is also subtracted. In the immediate form, imm is restricted to an 8-bit value.
The instruction shladd (shift left and add)
(qp) shladd r1 = r2,count,r3
is similar to the add instruction, except that the contents of r2 are left-shifted by count bit positions before adding. The count operand is a 2-bit value, which restricts the shift to 1-, 2-, 3-, or 4-bit positions.
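A typical use of shift-and-add is scaling an array index by the element size and adding the base address in one step. The Python sketch below models the arithmetic on 64-bit values; the function name mirrors the mnemonic, while the array base and index are made-up values for illustration.

```python
MASK64 = (1 << 64) - 1


def shladd(r2, count, r3):
    """Model shladd r1 = r2,count,r3 on 64-bit register values."""
    if not 1 <= count <= 4:
        raise ValueError("count must be 1, 2, 3, or 4")
    return ((r2 << count) + r3) & MASK64


# Address of element 5 in an array of 8-byte items starting at 0x1000:
# index * 8 + base, computed with a shift count of 3.
element_address = shladd(5, 3, 0x1000)
```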
Multiply Instructions Integer multiply is done using the xmpy and xma instructions. These instructions do not use the general registers; instead, they use the floating-point registers.
The xmpy instruction has the following formats.
(qp) xmpy.l f1 = f3,f4
(qp) xmpy.lu f1 = f3,f4
(qp) xmpy.h f1 = f3,f4
(qp) xmpy.hu f1 = f3,f4
The two source operands, floating-point registers f3 and f4, are treated either as signed or unsigned integers. The completer u in the second and fourth instructions specifies that the operands are unsigned integers. The other two instructions treat the two integers as signed. The l or h indicates whether the lower or higher 64 bits of the result should be stored in the f1 floating-point register.
The xmpy instruction multiplies the two integers in f3 and f4 and places the lower or upper 64-bit result in the f1 register. Note that we get a 128-bit result when we multiply two 64-bit integers.
The xma instruction has four formats as does the xmpy instruction, as shown below:
(qp) xma.l f1 = f3,f4,f2
(qp) xma.lu f1 = f3,f4,f2
(qp) xma.h f1 = f3,f4,f2
(qp) xma.hu f1 = f3,f4,f2
This instruction multiplies the two 64-bit integers in f3 and f4 and adds the zero-extended 64-bit value in f2 to the product.
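The effect of the l, h, and u completers is easiest to see on concrete values. The Python sketch below models the integer arithmetic only, treating each register as a 64-bit pattern (a nonnegative integer below 2^64). The function and parameter names are chosen for this illustration, and modeling xmpy as xma with nothing added is a convenience of the sketch.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


def xma(f3, f4, f2, high=False, unsigned=False):
    """Model xma: multiply f3 by f4, add zero-extended f2, keep 64 bits.

    high selects the upper 64 bits of the 128-bit result (h) instead of the
    lower 64 bits (l); unsigned selects the u completer.
    """
    a, b = (f3, f4) if unsigned else (to_signed64(f3), to_signed64(f4))
    result = (a * b + f2) & ((1 << 128) - 1)   # wrap to 128 bits
    return (result >> 64) & MASK64 if high else result & MASK64


def xmpy(f3, f4, high=False, unsigned=False):
    """Model xmpy as a multiply with nothing added."""
    return xma(f3, f4, 0, high=high, unsigned=unsigned)
```

Multiplying the all-ones pattern by 1 shows the difference between the signed and unsigned forms: as a signed value the pattern is -1, so the upper half of the product is all ones, whereas the unsigned product fits in 64 bits and its upper half is zero.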
Logical Instructions
Logical operations and, or, and xor are supported by three logical instructions. There is no not instruction. However, the Itanium has an and-complement (andcm) instruction that complements one of the operands before performing the bitwise-and operation.
All instructions have the same format. We illustrate the format of these instructions for the and instruction:
(qp) and r1 = r2,r3
(qp) and r1 = imm8,r3
The other three operations use the mnemonics or, xor, and andcm. The and-complement instruction complements the contents of r3 and ands it with the first operand (contents of r2 or immediate value imm8).
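Although there is no not instruction, and-complement can stand in for it when the first operand is all ones. The Python sketch below illustrates the idea on 64-bit values; the function names are hypothetical, and obtaining the all-ones operand from an immediate of -1 assumes that the 8-bit immediate is sign-extended.

```python
MASK64 = (1 << 64) - 1


def andcm(first, r3):
    """Model andcm r1 = first,r3: AND first with the complement of r3."""
    return first & ~r3 & MASK64


def bitwise_not(r3):
    """Synthesize NOT by using an all-ones first operand."""
    return andcm(MASK64, r3)
```

For example, andcm(0b1111, 0b0101) clears the bits that are set in the second operand and leaves 0b1010.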
Shift Instructions
Both left- and right-shift instructions are available. The shift instructions
(qp) shl r1 = r2,r3
(qp) shl r1 = r2,count
left-shift the contents of r2 by the count value specified by the second operand. The count value can be specified in r3 or given as a 6-bit immediate value. If the count value in r3 is more than 63, the result is all zeros.
Right-shift instructions use a similar format. Because right-shift can be arithmetical or logical depending on whether the number is signed or unsigned, two versions are available. The register versions of the right-shift instructions are shown below:
(qp) shr r1 = r2,r3 (signed right shift)
(qp) shr.u r1 = r2,r3 (unsigned right shift)
In the second instruction, the completer u is used to indicate the unsigned shift operation. We can also use a 6-bit immediate value for shift count as in the shl instruction.
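The difference between the signed and unsigned right shifts shows up only when the most significant bit is set. The Python sketch below models the three shifts on 64-bit patterns. The function names follow the mnemonics; the behavior of the right shifts for counts above 63 (all zeros for shr.u, all copies of the sign bit for shr) is an assumption of the sketch, since the text states the rule only for shl.

```python
MASK64 = (1 << 64) - 1


def shl(r2, count):
    """Left shift; counts above 63 give all zeros."""
    return 0 if count > 63 else (r2 << count) & MASK64


def shr_u(r2, count):
    """Unsigned (logical) right shift: vacated bits are filled with zeros."""
    return 0 if count > 63 else r2 >> count


def shr(r2, count):
    """Signed (arithmetic) right shift: vacated bits copy the sign bit."""
    signed = r2 - (1 << 64) if r2 & (1 << 63) else r2
    # Assumption: counts above 63 behave like a shift by 63.
    return (signed >> min(count, 63)) & MASK64
```

Shifting the pattern with only bit 63 set right by 63 positions gives 1 with shr_u but the all-ones pattern with shr, because the sign bit is replicated.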
Comparison Instructions
The compare instruction uses two completers as shown below:
(qp) cmp.crel.ctype p1,p2=r2,r3
(qp) cmp.crel.ctype p1,p2=imm8,r3
The two source operands are compared and the result is written to the two specified destination predicate registers. The type of comparison is specified by crel. We can specify one of 10 relations for signed and unsigned numbers. The relations “equal” (eq) and “not equal” (ne) are valid for both signed and unsigned numbers. For signed numbers, there are 4 relations to test for “<” (lt), “≤” (le), “>” (gt), and “≥” (ge). The corresponding relations for testing unsigned numbers are ltu, leu, gtu, and geu. The relation is tested as “r2 rel r3”.
The ctype completer specifies how the two predicate registers are to be updated. The normal type (default) writes the comparison result in the p1 register and its complement in the p2 register. This would allow us to select one of the two branches (we show an example on page 113). The ctype completer allows specification of other types such as and and or. If or is specified, both p1 and p2 are set to 1 only if the comparison result is 1; otherwise, the two predicate registers are not altered. This is useful for implementing or-type simultaneous execution. Similarly, if and is specified, both registers are set to 0 if the comparison result is 0 (useful for and-type simultaneous execution).
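The three update rules can be stated compactly in code. The Python sketch below models only the normal, or, and and types as described in the text; the function name cmp_update and the dictionary of predicate values are assumptions made for this illustration.

```python
def cmp_update(preds, p1, p2, result, ctype="normal"):
    """Update predicate registers p1 and p2 for a comparison outcome.

    preds maps predicate register names to 0 or 1; result is the outcome
    of the comparison (True or False).
    """
    if ctype == "normal":
        preds[p1], preds[p2] = int(result), int(not result)
    elif ctype == "or":
        if result:                 # untouched when the comparison fails
            preds[p1] = preds[p2] = 1
    elif ctype == "and":
        if not result:             # untouched when the comparison succeeds
            preds[p1] = preds[p2] = 0
    else:
        raise ValueError(f"unsupported ctype: {ctype}")
    return preds


# Or-type evaluation of (a == 1) || (b == 2): clear the predicates first,
# then let any successful comparison set them.
a, b = 7, 2
preds = {"p1": 0, "p2": 0}
cmp_update(preds, "p1", "p2", a == 1, "or")
cmp_update(preds, "p1", "p2", b == 2, "or")
```

Because an or-type compare either sets the predicates or leaves them alone, the order of the two compares does not matter, which is what allows them to be placed in the same instruction group.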
Branch Instructions
As in the other architectures, the Itanium uses the branch instruction for traditional jumps as well as procedure call and return. The generic branch is supplemented by a completer to specify the type of branch. The branch instruction supports both direct and indirect branching. All direct branches are IP relative (i.e., PC relative). Some sample branch instruction formats are shown below:
IP Relative Form:
(qp) br.btype.bwh.ph.dh target25 (Basic form)
(qp) br.btype.bwh.ph.dh b1=target25 (Call form)
br.btype.bwh.ph.dh target25 (Counted loop form)
Indirect Form:
(qp) br.btype.bwh.ph.dh b2 (Basic form)
(qp) br.btype.bwh.ph.dh b1=b2 (Call form)
As can be seen, branch uses up to four completers. The btype specifies the type of branch. The other three completers provide hints and are discussed later.
For the basic branch, btype can be either cond or none. In this case, the branch is taken if the qualifying predicate is 1; otherwise, the branch is not taken. The IP-relative target address is given as a label in the assembly language. The assembler translates this into a signed 21-bit value that gives the difference between the target bundle and the bundle containing the branch instruction. The target pointer is to a bundle of 128 bits, therefore the value (target25−IP) is shifted right by 4 bit positions to get a 21-bit value. Note that the format shown in Figure 7.2d uses a 21-bit displacement value.
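The displacement computation can be sketched as follows. This Python model assumes that both the branch and its target are identified by the 16-byte-aligned address of their bundles; the function names and the sample addresses in the usage note are hypothetical.

```python
BUNDLE_BYTES = 16   # a 128-bit bundle


def encode_displacement(target, ip):
    """Return the signed 21-bit field for a branch at bundle address ip."""
    if target % BUNDLE_BYTES or ip % BUNDLE_BYTES:
        raise ValueError("bundle addresses must be 16-byte aligned")
    imm21 = (target - ip) >> 4
    if not -(1 << 20) <= imm21 < (1 << 20):
        raise ValueError("target is out of range for a 21-bit displacement")
    return imm21


def decode_target(imm21, ip):
    """Recover the target bundle address from the 21-bit field."""
    return ip + (imm21 << 4)
```

A branch in the bundle at 0x1040 that targets the bundle at 0x1000 encodes a displacement of -4, that is, four bundles backward. The 4-bit shift is why a 21-bit field covers a 25-bit byte offset.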
To invoke a procedure, we use the second form and specify call for btype. This turns the branch instruction into a conditional call instruction. The procedure is invoked only if the qualifying predicate is true. As part of the call, it places the current frame marker and other relevant state information in the previous function state application register. The return link value is saved in the b1 branch register for use by the return instruction.
There is also an unconditional (no qualifying predicate) counted loop version. In this branch instruction (the third one), btype is set to cloop. If the Loop Count (LC) application register ar65 is not zero, it is decremented and the branch is taken.
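The counted loop rule can be traced with a small model. The Python sketch below follows the description above literally: each time control reaches the branch, a nonzero LC is decremented and the branch is taken; a zero LC lets control fall through. The function name is hypothetical.

```python
def run_counted_loop(lc):
    """Model a loop closed by br.cloop with the Loop Count register set to lc.

    Returns (times the branch was taken, final LC value).
    """
    taken = 0
    while True:
        # ... loop body would execute here ...
        if lc != 0:        # br.cloop: LC nonzero, so decrement and branch back
            lc -= 1
            taken += 1
        else:              # LC is zero: fall through and leave the loop
            break
    return taken, lc
```

Under this rule the branch is taken as many times as the initial LC value, and LC is zero when the loop exits.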
We can use ret as the branch type to return from a procedure. It should use the indirect form and specify the branch register in which the call has placed the return pointer. In the indirect form, a branch register specifies the target address. The return restores the caller’s stack frame and privilege level.
The last instruction can be used for an indirect procedure call. In this branch instruction, the b2 branch register specifies the target address and the return address is placed in the b1 branch register.
As an example, the instruction
(p3) br skip or (p3) br.cond skip
transfers control to the instruction labeled skip if the predicate register p3 is 1. The code sequence
mov lc = 100
loop_back: . . .
br.cloop loop_back
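takes the branch 100 times. Because the loop body runs once before br.cloop first tests LC, the body executes 101 times; initializing LC to 99 would give exactly 100 iterations. A procedure call may look like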
(p0) br.call br2 = sum
whereas the return from procedure sum uses the indirect form
(p0) br.ret br2
Because we are using predicate register 0, which is hardwired to 1, both the call and return become unconditional.
The bwh (branch whether hint) completer can be used to convey whether the branch is taken (see page 119). The ph (prefetch hint) completer gives a hint about sequential prefetch. It can take either few or many. If the value is few or none, few lines are prefetched; many lines are prefetched when many is specified. The two levels, few and many, are system defined. The final completer dh (deallocation hint) specifies whether the branch cache should be cleared. The value clr indicates deallocation of branch information.
Handling Branches
Pipelining works best when we have a linear sequence of instructions. Branches cause pipeline stalls, leading to performance problems. How do we minimize the adverse effects of branches? There are three techniques to handle this problem.
• Branch Elimination: The best solution is to avoid the problem in the first place. This argument may seem strange as programs contain lots of branch instructions. Although we cannot eliminate all branches, we can eliminate certain types of branches. This elimination cannot be done without support at the instruction-set level. We look at how the Itanium uses predication to eliminate some types of branches.
• Branch Speedup: If we cannot eliminate a branch, at least we can reduce the amount of delay associated with it. This technique involves reordering instructions so that instructions that are not dependent on the branch/condition can be executed while the branch instruction is processed. Speculative execution can be used to reduce branch delays. We describe the Itanium’s speculative execution strategies later.
• Branch Prediction: If we can predict whether the branch will be taken, we can load the pipeline with the right sequence of instructions. Even if we predict correctly all the time, it would only convert a conditional branch into an unconditional branch. We still have the problems associated with unconditional branches. We described three types of branch prediction strategies in Chapter 2 (see page 28).
Inasmuch as we covered branch prediction in Chapter 2, we discuss the first two techniques next.
Predication to Eliminate Branches In the Itanium, branch elimination is achieved by a technique known as predication. The trick is to make execution of each instruction conditional. Thus, unlike the instructions we have seen so far, an instruction is not automatically executed when the control is transferred to it. Instead, it will be executed only if a condition is true. This requires us to associate a predicate with each instruction. If the associated predicate is true, the instruction is executed; otherwise, it is treated as a nop instruction. The Itanium architecture supports full predication to minimize branches. Most of the Itanium’s instructions can be predicated.
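The idea can be mimicked in software. In the Python sketch below, every instruction carries a qualifying predicate and is skipped, like a nop, when that predicate is 0, so both arms of a conditional appear in one straight-line sequence with no branch. The function name run_predicated, the tuple layout, and the if-else statement being modeled are all hypothetical choices for this illustration.

```python
def run_predicated(program, regs, preds):
    """Execute every instruction in order; a false predicate makes it a nop.

    program is a list of (qp, dest, function) entries: when predicate qp is 1,
    dest receives function(regs); otherwise the instruction has no effect.
    """
    for qp, dest, function in program:
        if preds[qp]:
            regs[dest] = function(regs)
    return regs


# Branch-free form of: if (r1 == r2) r3 = r3 + 1 else r3 = r3 - 1
regs = {"r1": 4, "r2": 4, "r3": 10}
equal = regs["r1"] == regs["r2"]
preds = {"p1": int(equal), "p2": int(not equal)}   # as a normal-type cmp would
program = [
    ("p1", "r3", lambda r: r["r3"] + 1),
    ("p2", "r3", lambda r: r["r3"] - 1),
]
run_predicated(program, regs, preds)
```

Both instructions are fetched and issued on every pass; the complementary predicates p1 and p2 decide which one takes effect.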
To see how predication eliminates branches, let us look at the following example.
if (R1 == R2) cmp r1,r2