Overview of the Fruit and Vegetable Sector in Colombia


The Itanium enables instruction-level parallelism in three ways: by letting the compiler/assembler explicitly indicate parallelism, by providing run-time support to execute instructions in parallel, and by providing a large number of registers to avoid register contention. First we discuss instruction groups, and then see how the hardware facilitates parallel execution of instructions by bundling nonconflicting instructions together.

Itanium instructions are bound into instruction groups. An instruction group is a set of instructions that do not have conflicting dependencies among them (read-after-write or write-after-write dependencies, as discussed later on page 115), and may execute in parallel. The compiler or assembler can indicate instruction groups by using the ;; notation. Let us look at a simple example to get an idea. Consider evaluating a logical expression consisting of four terms. For simplicity, assume that the results of these four logical terms are in registers r10, r11, r12, and r13. Then the logical expression in

if (r10 || r11 || r12 || r13) { /* if-block code */

}

can be evaluated using or-tree reduction as

or r1 = r10,r11 /* Group 1 */

or r2 = r12,r13;;

or r3 = r1,r2;; /* Group 2 */

other instructions /* Group 3 */

The first group performs two parallel or operations. Once these results are available, we can compute the final value of the logical expression. This final value in r3 can be used by other instructions to test the condition. Because we have not yet discussed Itanium instructions, we do not explain these instructions in detail at this point. We have some examples in a later section.
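The grouping rule can be illustrated with a small checker. This is only a sketch: the function name find_conflicts and the (destination, sources) tuple encoding are assumptions made for illustration, and it tests only the read-after-write and write-after-write cases named above, not the full set of rules and exceptions a real assembler applies.

```python
def find_conflicts(group):
    """Return (kind, earlier, later, register) tuples for RAW/WAW conflicts.

    `group` is a list of (dest, sources) pairs in program order
    (a hypothetical encoding, not an Itanium data structure).
    """
    conflicts = []
    for later, (dst_later, srcs_later) in enumerate(group):
        for earlier in range(later):
            dst_earlier = group[earlier][0]
            if dst_earlier in srcs_later:      # later reads what earlier writes
                conflicts.append(("RAW", earlier, later, dst_earlier))
            if dst_earlier == dst_later:       # both write the same register
                conflicts.append(("WAW", earlier, later, dst_earlier))
    return conflicts


# Group 1 of the or-tree example: two independent or instructions.
group1 = [("r1", ("r10", "r11")), ("r2", ("r12", "r13"))]
# Hypothetical mistake: the final or is placed in the same group.
merged = group1 + [("r3", ("r1", "r2"))]

print(find_conflicts(group1))
print(find_conflicts(merged))
```

The first call reports no conflicts, whereas the second reports that the final or reads r1 and r2 in the same group that writes them, which is why the example ends Group 1 with ;; before computing r3.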

In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources. An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;). An instruction group may also end dynamically during run-time by a taken branch.

Figure 7.3 Itanium instruction bundle format: instruction slot 2 (bits 127-87, 41 bits), instruction slot 1 (bits 86-46, 41 bits), instruction slot 0 (bits 45-5, 41 bits), and template (bits 4-0, 5 bits).

An advantage of instruction groups is that they reduce the need to optimize the code for each new microarchitecture. Processors with additional resources can take advantage of the existing ILP in the instruction group.
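A rough way to see this benefit is to count cycles under an idealized model. The sketch below assumes that a group never shares a cycle with the next group and that any instruction can use any issue slot; neither assumption holds exactly on real hardware, and the group sizes are hypothetical.

```python
import math


def cycles_needed(group_sizes, issue_width):
    """Idealized cycle count: each group takes ceil(size / width) cycles."""
    return sum(math.ceil(size / issue_width) for size in group_sizes)


groups = [6, 2, 4]  # hypothetical instruction-group sizes

narrow = cycles_needed(groups, issue_width=2)
wide = cycles_needed(groups, issue_width=6)
print(narrow, wide)
```

The same unmodified code takes fewer cycles on the wider machine, because the parallelism is already recorded in the group boundaries.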

By means of instruction groups, compilers package instructions that can be executed in parallel. It is the compiler’s responsibility to make sure that instructions in a group do not have conflicting dependencies. Armed with this information, instructions in a group are bundled together as shown in Figure 7.3. Three instructions are collected into 128-bit, aligned containers called bundles. Each bundle contains three 41-bit instruction slots and a 5-bit template field.
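The bundle layout can be sketched as simple bit packing. The code assumes the arrangement of Figure 7.3, with the template in the low-order 5 bits followed by slots 0, 1, and 2; the function names and the sample values are hypothetical, and the slot contents are treated as opaque 41-bit integers.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Return (template, slot0, slot1, slot2) from a 128-bit integer."""
    template = bundle & TEMPLATE_MASK
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
                  for i in range(3))
    return (template, *slots)


bundle = pack_bundle(0x10, 1, 2, SLOT_MASK)
print(unpack_bundle(bundle) == (0x10, 1, 2, SLOT_MASK))
```

Note that 5 + 3 × 41 = 128, so the four fields fill the bundle exactly.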

The main purpose of the template field is to specify the mapping of instruction slots to execution unit types. Instructions are categorized into six instruction types: integer ALU, non-ALU integer, memory, floating-point, branch, and extended. A specific execution unit may execute each type of instruction. For example, floating-point instructions are executed by the F-unit, branch instructions by the B-unit, and memory instructions such as load and store by the M-unit. The remaining three types of instructions are executed by the I-unit. All instructions, except extended instructions, occupy one instruction slot. Extended instructions, which use long immediate integers, occupy two instruction slots.

Instruction Set

As in the other chapters, we discuss several sample groups of instructions from the Itanium instruction set.

Data Transfer Instructions

The Itanium’s load and store instructions are more complex than those in a typical RISC processor. The Itanium supports speculative loads to mask high latency associated with reading data from memory.

The basic load instruction takes one of the three forms shown below depending on the addressing mode used:

(qp) ldSZ.ldtype.ldhint r1 = [r3] /* No update form */

(qp) ldSZ.ldtype.ldhint r1 = [r3],r2 /* Update form 1 */

(qp) ldSZ.ldtype.ldhint r1 = [r3],imm9 /* Update form 2 */

Chapter 7 • Itanium Architecture 107

The load instruction loads SZ bytes from memory, starting at the effective address. The SZ completer can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes. In the first form, register r3 provides the address. In the second form, r3 again provides the address of the load; after the access, the contents of r3 and r2 are added and the sum is stored back in r3. The third form uses a 9-bit signed immediate value instead of register r2. In the last two forms, therefore, the load itself uses the original value of r3, and the updated address is available in r3 for a subsequent access.
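The three load forms can be modeled in a few lines. This is a sketch under stated assumptions: memory is a byte array, data is little-endian, and the base update is a post-increment (the access uses the old value of the base register and the sum is written back afterwards). The function name load and the dictionary-based register file are hypothetical.

```python
MASK64 = (1 << 64) - 1


def load(memory, regs, dest, base, size, update=0):
    """Model ldSZ dest = [base] with an optional base-register update.

    `update` stands for the value of r2 or imm9; 0 models the no-update form.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("SZ must be 1, 2, 4, or 8")
    address = regs[base]                                  # address before update
    regs[dest] = int.from_bytes(memory[address:address + size], "little")
    regs[base] = (address + update) & MASK64              # post-increment


memory = bytearray(range(32))      # byte i holds the value i
regs = {"r5": 0, "r6": 8}
load(memory, regs, "r5", "r6", 8, update=8)   # like ld8 r5 = [r6],8
print(hex(regs["r5"]), regs["r6"])
```

After the call, r5 holds the eight bytes that started at address 8 and r6 has advanced to 16, ready for the next element.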

The ldtype completer can be used to specify special load operations. For normal loads, the completer is not specified. For example, the instruction

ld8 r5 = [r6]

loads eight bytes from memory, starting at the effective address in r6. As mentioned before, the Itanium supports speculative loads. Two example instructions are shown below:

ld8.a r5 = [r6] /* advanced load */

ld8.s r5 = [r6] /* speculative load */

We defer a discussion of these load instruction types to a later section that discusses the speculative execution model of Itanium.

The ldhint completer specifies the locality of the memory access. It can take one of the following three values.

ldhint    Interpretation
None      Temporal locality, level 1
nt1       No temporal locality, level 1
nta       No temporal locality, all levels

A prefetch hint is implied in the two “update” forms of load instructions. The address in r3 after the update acts as a hint to prefetch the indicated cache line. In the “no update” form of load, r3 is not updated and no prefetch hint is implied. Level 1 refers to the cache level. Because we don’t cover temporal locality and cache memory in this book, we refer the reader to [6] for details on cache memory. It is sufficient to view the ldhint completer as giving a hint to the processor as to whether a prefetch is beneficial.

The store instruction is simpler than the load instruction. There are two types of store instructions, corresponding to the two addressing modes, as shown below:

(qp) stSZ.sttype.sthint [r3] = r2 /* No update form */

(qp) stSZ.sttype.sthint [r3] = r2,imm9 /* Update form */

The SZ completer can have four values as in the load instruction. The sttype can be none or rel. If the rel value is specified, an ordered store is performed. The sthint gives a prefetch hint as in the load instruction. However, it can be either none or nta. When no value is specified, temporal locality at level 1 is assumed. The nta has the same interpretation as in the load instruction.

The Itanium also has several move instructions to copy data into registers. We describe three of these instructions:

(qp) mov r1 = r3

(qp) mov r1 = imm22

(qp) movl r1 = imm64

These instructions move the second operand into the r1 register. The first two mov instructions are actually pseudoinstructions. That is, these instructions are implemented using other processor instructions. The movl is the only instruction that requires two instruction slots within the same bundle.

Arithmetic Instructions

The Itanium provides only the basic integer arithmetic operations: addition, subtraction, and multiplication. There is no divide instruction, either for integers or floating-point numbers. Division is implemented in software. Let’s start our discussion with the add instructions.

Add Instructions

The format of the add instructions is given below:

(qp) add r1 = r2,r3 /* register form */

(qp) add r1 = r2,r3,1 /* plus 1 form */

(qp) add r1 = imm,r3 /* immediate form */

In the plus 1 form, the constant 1 is added as well. In the immediate form, imm can be a 14- or 22-bit signed value. If we use a 22-bit immediate value, r3 must be one of the first four general registers GR0 through GR3 (i.e., only 2 bits are used to specify the second operand register, as shown in Figure 7.2).

The immediate form is a pseudoinstruction that selects one of the two processor immediate add instructions,

(qp) add r1 = imm14,r3

(qp) add r1 = imm22,r3

depending on the size of the immediate operand and the value of r3. The move instruction

(qp) mov r1 = r3

is implemented as

(qp) add r1 = 0,r3

The move instruction

(qp) mov r1 = imm22

is implemented as

(qp) add r1 = imm22,r0
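The way an assembler might resolve these pseudoinstructions can be sketched as follows. The function names are hypothetical, registers are identified by number or name for convenience, and preferring the 14-bit form whenever the immediate fits is an assumption; the text states only that the choice depends on the operand size and on r3.

```python
def fits_signed(value, bits):
    """True if `value` is representable as a signed `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def select_add_immediate(imm, r3_number):
    """Pick the processor add form for the pseudoinstruction add r1 = imm,r3."""
    if fits_signed(imm, 14):
        return "imm14"
    if fits_signed(imm, 22) and 0 <= r3_number <= 3:   # GR0 through GR3 only
        return "imm22"
    raise ValueError("no immediate add form fits this operand")


def expand_mov(dest, src):
    """Expand a mov pseudoinstruction; `src` is a register name or an int."""
    if isinstance(src, str):
        return f"add {dest} = 0,{src}"
    if fits_signed(src, 22):
        return f"add {dest} = {src},r0"
    raise ValueError("immediate needs more than 22 bits; use movl instead")


print(expand_mov("r1", "r3"))
print(expand_mov("r1", 100))
print(select_add_immediate(100000, 2))
```

The mov with an immediate works because r0 lies within GR0 through GR3, so the 22-bit add form is always available for it.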


Subtract Instructions

The subtract instruction sub has the same format as the add instruction. The contents of register r3 are subtracted from the contents of r2. In the minus 1 form, the constant 1 is also subtracted. In the immediate form, imm is restricted to an 8-bit value.

The instruction shladd (shift left and add)

(qp) shladd r1 = r2,count,r3

is similar to the add instruction, except that the contents of r2 are left-shifted by count bit positions before adding. The count operand is a 2-bit value, which restricts the shift to 1, 2, 3, or 4 bit positions.
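A small model makes the arithmetic concrete. The sketch assumes 64-bit wraparound for the result; the array-indexing use and its addresses are a hypothetical illustration, not taken from the text.

```python
MASK64 = (1 << 64) - 1


def shladd(r2, count, r3):
    """Model shladd r1 = r2,count,r3: (r2 << count) + r3, truncated to 64 bits."""
    if not 1 <= count <= 4:
        raise ValueError("count must be 1, 2, 3, or 4")
    return ((r2 << count) + r3) & MASK64


# Hypothetical use: address of element 5 in an array of 8-byte items at 0x1000.
print(hex(shladd(5, 3, 0x1000)))
```

Scaling an index by 2, 4, 8, or 16 and adding a base address is the typical reason to combine the shift and the add in one instruction.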

Multiply Instructions

Integer multiply is done using the xmpy and xma instructions. These instructions do not use the general registers; instead, they use the floating-point registers.

The xmpy instruction has the following formats.

(qp) xmpy.l f1 = f3,f4

(qp) xmpy.lu f1 = f3,f4

(qp) xmpy.h f1 = f3,f4

(qp) xmpy.hu f1 = f3,f4

The two source operands, floating-point registers f3 and f4, are treated either as signed or unsigned integers. The completer u in the second and fourth instructions specifies that the operands are unsigned integers. The other two instructions treat the two integers as signed. The l or h indicates whether the lower or higher 64 bits of the result should be stored in the f1 floating-point register.

The xmpy instruction multiplies the two integers in f3 and f4 and places the lower or upper 64-bit result in the f1 register. Note that we get a 128-bit result when we multiply two 64-bit integers.

The xma instruction has four formats, as does the xmpy instruction, as shown below:

(qp) xma.l f1 = f3,f4,f2

(qp) xma.lu f1 = f3,f4,f2

(qp) xma.h f1 = f3,f4,f2

(qp) xma.hu f1 = f3,f4,f2

This instruction multiplies the two 64-bit integers in f3 and f4 and adds the zero-extended 64-bit value in f2 to the product.
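The selection of the lower or upper half of the 128-bit result can be modeled with Python integers. This is a sketch: register contents are represented as plain 64-bit patterns, the keyword arguments high and unsigned stand in for the l/h and u completers, and treating xmpy as xma with a zero addend is a modeling choice made here.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def xma(f3, f4, f2, high=False, unsigned=False):
    """Multiply f3 by f4, add zero-extended f2, return one 64-bit half."""
    if unsigned:
        a, b = f3 & MASK64, f4 & MASK64
    else:
        a, b = to_signed64(f3), to_signed64(f4)
    total = (a * b + (f2 & MASK64)) & MASK128     # 128-bit result
    return (total >> 64) & MASK64 if high else total & MASK64


def xmpy(f3, f4, high=False, unsigned=False):
    """Multiply without an addend."""
    return xma(f3, f4, 0, high, unsigned)


ones = MASK64                       # -1 when signed, 2**64 - 1 when unsigned
print(hex(xmpy(ones, ones, high=True)))
print(hex(xmpy(ones, ones, high=True, unsigned=True)))
```

The lower half is the same for the signed and unsigned interpretations; only the upper half differs, as the two printed values show.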

Logical Instructions

Logical operations and, or, and xor are supported by three logical instructions. There is no not instruction. However, the Itanium has an and-complement (andcm) instruction that complements one of the operands before performing the bitwise-and operation.

All four instructions have the same format. We illustrate the format of these instructions for the and instruction:

(qp) and r1 = r2,r3

(qp) and r1 = imm8,r3

The other three operations use the mnemonics or, xor, and andcm. The and-complement instruction complements the contents of r3 and ands it with the first operand (contents of r2 or immediate value imm8).
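The and-complement operation can also stand in for the missing not instruction. The sketch below models andcm on 64-bit values and, as an assumption made for illustration, obtains a bitwise not by supplying an all-ones first operand (an imm8 value of -1, taken to be sign-extended to 64 bits).

```python
MASK64 = (1 << 64) - 1


def andcm(first, r3):
    """Model andcm: first operand AND (complement of r3), on 64 bits."""
    return first & ~r3 & MASK64


def bitwise_not(r3):
    """Hypothetical not built from andcm with an all-ones first operand."""
    return andcm(-1 & MASK64, r3)


print(bin(andcm(0b1100, 0b1010)))
print(hex(bitwise_not(0xFF)))
```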

Shift Instructions

Both left- and right-shift instructions are available. The shift instructions

(qp) shl r1 = r2,r3

(qp) shl r1 = r2,count

left-shift the contents of r2 by the count value specified by the second operand. The count value can be specified in r3 or given as a 6-bit immediate value. If the count value in r3 is more than 63, the result is all zeros.

Right-shift instructions use a similar format. Because right-shift can be arithmetic or logical depending on whether the number is signed or unsigned, two versions are available. The register versions of the right-shift instructions are shown below:

(qp) shr r1 = r2,r3 (signed right shift)

(qp) shr.u r1 = r2,r3 (unsigned right shift)

In the second instruction, the completer u is used to indicate the unsigned shift operation. We can also use a 6-bit immediate value for shift count as in the shl instruction.
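The difference between the three shifts shows up in how the vacated bit positions are filled. The sketch below models them on 64-bit patterns. The text specifies the result for counts above 63 only for shl; the corresponding behavior modeled here for shr (all bits equal to the sign bit) and shr.u (all zeros) is an assumption.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def shl(r2, count):
    """Left shift; counts above 63 give all zeros."""
    return (r2 << count) & MASK64 if count <= 63 else 0


def shr_u(r2, count):
    """Unsigned (logical) right shift: zeros enter from the left."""
    return (r2 & MASK64) >> count if count <= 63 else 0


def shr(r2, count):
    """Signed (arithmetic) right shift: copies of the sign bit enter."""
    return (to_signed64(r2) >> min(count, 63)) & MASK64


top_bit = 1 << 63
print(hex(shr(top_bit, 1)), hex(shr_u(top_bit, 1)))
```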

Comparison Instructions

The compare instruction uses two completers as shown below:

(qp) cmp.crel.ctype p1,p2=r2,r3

(qp) cmp.crel.ctype p1,p2=imm8,r3

The two source operands are compared and the result is written to the two specified destination predicate registers. The type of comparison is specified by crel. We can specify one of 10 relations for signed and unsigned numbers. The relations “equal” (eq) and “not equal” (ne) are valid for both signed and unsigned numbers. For signed numbers, there are 4 relations to test for “<” (lt), “≤” (le), “>” (gt), and “≥” (ge). The corresponding relations for testing unsigned numbers are ltu, leu, gtu, and geu. The relation is tested as “r2 rel r3”.

The ctype completer specifies how the two predicate registers are to be updated. The normal type (default) writes the comparison result in the p1 register and its complement in the p2 register. This would allow us to select one of the two branches (we show an example on page 113). The ctype completer allows specification of other types such as and and or. If or is specified, both p1 and p2 are set to 1 only if the comparison result is 1; otherwise, the two predicate registers are not altered. This is useful for implementing or-type simultaneous execution. Similarly, if and is specified, both registers are set to 0 if the comparison result is 0 (useful for and-type simultaneous execution).
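The three update rules described above can be written out directly. In this sketch, the function names and the test values are hypothetical, and predicates are modeled as the integers 0 and 1.

```python
def cmp_update(ctype, result, p1, p2):
    """Return the new (p1, p2) pair after a compare with outcome `result`."""
    if ctype == "normal":
        return int(result), int(not result)
    if ctype == "or":
        return (1, 1) if result else (p1, p2)     # write only on a true result
    if ctype == "and":
        return (0, 0) if not result else (p1, p2)  # write only on a false result
    raise ValueError(f"unsupported ctype: {ctype}")


def any_nonzero(values):
    """Or-type evaluation: predicates start at 0 and each compare may set them."""
    p1 = p2 = 0
    for value in values:
        p1, p2 = cmp_update("or", value != 0, p1, p2)
    return p1


print(any_nonzero([0, 0, 7, 0]))
print(cmp_update("normal", True, 0, 0))
```

Because every or-type compare that writes the predicates writes the same value, the order of the compares does not affect the outcome, which is what allows them to be issued together.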


Branch Instructions

As in the other architectures, the Itanium uses the branch instruction for traditional jumps as well as for procedure call and return. The generic branch is supplemented by a completer to specify the type of branch. The branch instruction supports both direct and indirect branching. All direct branches are IP relative (i.e., PC relative). Some sample branch instruction formats are shown below:

IP Relative Form:

(qp) br.btype.bwh.ph.dh target25 (Basic form)

(qp) br.btype.bwh.ph.dh b1=target25 (Call form)

br.btype.bwh.ph.dh target25 (Counted loop form)

Indirect Form:

(qp) br.btype.bwh.ph.dh b2 (Basic form)

(qp) br.btype.bwh.ph.dh b1=b2 (Call form)

As can be seen, branch uses up to four completers. The btype specifies the type of branch. The other three completers provide hints and are discussed later.

For the basic branch, btype can be either cond or none. In this case, the branch is taken if the qualifying predicate is 1; otherwise, the branch is not taken. The IP-relative target address is given as a label in the assembly language. The assembler translates this into a signed 21-bit value that gives the difference between the target bundle and the bundle containing the branch instruction. Because the target always points to a 128-bit (16-byte) bundle, the value (target25 − IP) is shifted right by 4 bit positions to get a 21-bit value. Note that the format shown in Figure 7.2d uses a 21-bit displacement value.
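The displacement computation can be sketched as follows. The function names and the sample addresses are hypothetical; the code assumes both addresses are 16-byte aligned bundle addresses and stores the displacement as a 21-bit two's-complement field.

```python
FIELD_BITS = 21
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode_displacement(target, ip):
    """Return the 21-bit field for a branch at bundle `ip` to bundle `target`."""
    if target % 16 or ip % 16:
        raise ValueError("bundle addresses must be 16-byte aligned")
    disp = (target - ip) >> 4                      # distance in bundles
    if not -(1 << (FIELD_BITS - 1)) <= disp < (1 << (FIELD_BITS - 1)):
        raise ValueError("target is out of range for a 21-bit displacement")
    return disp & FIELD_MASK


def decode_displacement(field, ip):
    """Recover the target address from the 21-bit field."""
    disp = field - (1 << FIELD_BITS) if field >> (FIELD_BITS - 1) else field
    return ip + (disp << 4)


ip = 0x40000040
print(encode_displacement(0x40000080, ip))        # forward by 4 bundles
print(hex(encode_displacement(0x40000000, ip)))   # backward by 4 bundles
```

Dropping the four low-order zero bits is what lets a 21-bit field cover a 25-bit byte range, roughly 16 MB in either direction.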

To invoke a procedure, we use the second form and specify call for btype. This turns the branch instruction into a conditional call instruction. The procedure is invoked only if the qualifying predicate is true. As part of the call, the processor places the current frame marker and other relevant state information in the previous function state application register. The return link value is saved in the b1 branch register for use by the return instruction.

There is also an unconditional (no qualifying predicate) counted loop version. In this branch instruction (the third one), btype is set to cloop. If the Loop Count (LC) application register ar65 is not zero, it is decremented and the branch is taken.
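The counted-loop rule can be traced with a short simulation. The sketch assumes the common arrangement in which the cloop branch sits at the bottom of the loop body; the function name is hypothetical.

```python
def run_counted_loop(initial_lc):
    """Return (body executions, final LC) for a loop closed by a cloop branch."""
    lc = initial_lc
    iterations = 0
    while True:
        iterations += 1        # the loop body runs
        if lc != 0:            # cloop: test LC, then decrement and branch
            lc -= 1
            continue
        break                  # LC is zero: fall through
    return iterations, lc


print(run_counted_loop(3))
```

Under this rule the body runs one more time than the initial LC value, because the first pass through the body happens before LC is tested.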

We can use ret as the branch type to return from a procedure. It should use the indirect form and specify the branch register in which the call has placed the return pointer. In the indirect form, a branch register specifies the target address. The return restores the caller’s stack frame and privilege level.

The last instruction can be used for an indirect procedure call. In this branch instruction, the b2 branch register specifies the target address and the return address is placed in the b1 branch register.

Let us look at some examples. Either of the instructions

(p3) br skip

(p3) br.cond skip

transfers control to the instruction labeled skip if the predicate register p3 is 1. The code sequence

mov lc = 100

loop_back: . . .

br.cloop loop_back

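executes the loop body 101 times: the body runs once before br.cloop first tests LC, and the branch is then taken 100 times as LC counts down from 100 to 0. A procedure call may look like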

(p0) br.call br2 = sum

whereas the return from procedure sum uses the indirect form

(p0) br.ret br2

Because we are using predicate register 0, which is hardwired to 1, both the call and return become unconditional.

The bwh (branch whether hint) completer can be used to convey whether the branch is taken (see page 119). The ph (prefetch hint) completer gives a hint about sequential prefetch. It can take either few or many. If the value is few or none, few lines are prefetched; many lines are prefetched when many is specified. The two levels, few and many, are system defined. The final completer dh (deallocation hint) specifies whether the branch cache should be cleared. The value clr indicates deallocation of branch information.

Handling Branches

Pipelining works best when we have a linear sequence of instructions. Branches cause pipeline stalls, leading to performance problems. How do we minimize the adverse effects of branches? There are three techniques to handle this problem.

• Branch Elimination: The best solution is to avoid the problem in the first place. This argument may seem strange as programs contain lots of branch instructions. Although we cannot eliminate all branches, we can eliminate certain types of branches. This elimination cannot be done without support at the instruction-set level. We look at how the Itanium uses predication to eliminate some types of branches.

• Branch Speedup: If we cannot eliminate a branch, at least we can reduce the amount of delay associated with it. This technique involves reordering instructions so that instructions that are not dependent on the branch/condition can be executed while the branch instruction is processed. Speculative execution can be used to reduce branch delays. We describe the Itanium’s speculative execution strategies later.

• Branch Prediction: If we can predict whether the branch will be taken, we can load the pipeline with the right sequence of instructions. Even if we predict correctly all the time, it would only convert a conditional branch into an unconditional branch. We still have the problems associated with unconditional branches. We described three types of branch prediction strategies in Chapter 2 (see page 28).

Inasmuch as we covered branch prediction in Chapter 2, we discuss the first two techniques next.

Predication to Eliminate Branches

In the Itanium, branch elimination is achieved by a technique known as predication. The trick is to make execution of each instruction conditional. Thus, unlike the instructions we have seen so far, an instruction is not automatically executed when control is transferred to it. Instead, it is executed only if a condition is true. This requires us to associate a predicate with each instruction. If the associated predicate is true, the instruction is executed; otherwise, it is treated as a nop instruction. The Itanium architecture supports full predication to minimize branches. Most of the Itanium’s instructions can be predicated.
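The idea can be modeled with a tiny interpreter in which every instruction carries a predicate. This is a hypothetical illustration, not the example from the text: it computes the larger of two values with two predicated moves and no branch, and all names in it are assumptions.

```python
def run_predicated(program, regs, preds):
    """Execute (predicate, dest, operation) triples in straight-line order."""
    for qp, dest, operation in program:
        if preds[qp]:
            regs[dest] = operation(regs)
        # A false predicate makes the instruction behave like a nop.
    return regs


def maximum(a, b):
    regs = {"r1": a, "r2": b, "r3": 0}
    less = regs["r1"] < regs["r2"]
    # A normal-type compare writes the result and its complement.
    preds = {"p1": less, "p2": not less}
    program = [
        ("p1", "r3", lambda r: r["r2"]),   # (p1) mov r3 = r2
        ("p2", "r3", lambda r: r["r1"]),   # (p2) mov r3 = r1
    ]
    return run_predicated(program, regs, preds)["r3"]


print(maximum(3, 9), maximum(9, 3))
```

Both moves are always fetched, but exactly one takes effect, so the control flow contains no branch to stall the pipeline.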

To see how predication eliminates branches, let us look at the following example.

if (R1 == R2) cmp r1,r2
