Tendencias - Tecnolog´ıa de Objetos de Aprendizaje

2.3. Tecnolog´ıa de Objetos de Aprendizaje

2.3.7. Tendencias

In addition to removing branches through the use of predication, several mechanisms are provided to decrease the branch misprediction rate and the cost of the remaining mispredicted branches. These mechanisms provide ways for the compiler to

communicate information about branch conditions to the processor.

Branch predict instructions are provided which can be used to communicate an early indication of the target address and the location of the branch. The compiler will try to indicate whether a branch should be predicted dynamically or statically. The processor can use this information to initialize branch prediction structures, enabling good prediction even the first time a branch is encountered. This is beneficial for

unconditional branches or in situations where the compiler has information about likely branch behavior.

For indirect branches, a branch register is used to hold the target address. Branch predict instructions provide an indication of which register will be used in situations when the target address can be computed early. A branch predict instruction can also signal that an indirect branch is a procedure return, enabling the efficient use of call/return stack prediction structures.

Special loop-closing branches are provided to accelerate counted loops and modulo-scheduled loops. These branches and their associated branch predict

instructions provide information that allows for perfect prediction of loop termination, thereby eliminating costly mispredict penalties and a reduction of the loop overhead.

2.9 Register Rotation

Modulo scheduling of a loop is analogous to hardware pipelining of a functional unit since the next iteration of the loop starts before the previous iteration has finished. The iteration is split into stages similar to the stages of an execution pipeline. Modulo scheduling allows the compiler to execute loop iterations in parallel rather than sequentially. The concurrent execution of multiple iterations traditionally requires unrolling of the loop and software renaming of registers. The Itanium architecture allows the renaming of registers which provide every iteration with its own set of registers, avoiding the need for unrolling. This kind of register renaming is called register rotation. The result is that software pipelining can be applied to a much wider variety of loops – both small as well as large with significantly reduced overhead.

2.10 Floating-point Architecture

The Itanium architecture defines a floating-point architecture with full IEEE support for the single, double, and double-extended (80-bit) data types. Some extensions, such as a fused multiply and add operation, minimum and maximum functions, and a register file format with a larger range than the double-extended memory format, are also included. 128 floating-point registers are defined. Of these, 96 registers are rotating (not stacked) and can be used to modulo schedule loops compactly. Multiple

The Itanium architecture has parallel FP instructions which operate on two 32-bit single precision numbers, resident in a single floating-point register, in parallel and

independently. These instructions significantly increase the single precision

floating-point computation throughput and enhance the performance of 3D intensive applications and games.

2.11 Multimedia Support

The Itanium architecture has multimedia instructions which treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. These instructions operate on each element in parallel, independent of the others. They are useful for creating high performance compression/decompression algorithms that are used by applications which have sound and video. Itanium multimedia instructions are

semantically compatible with HP’s MAX-2* multimedia technology and Intel’s MMX and SSE technology instructions.

2.12 Intel® Itanium® System Architecture Features

2.12.1 Support for Multiple Address Space Operating Systems

Most contemporary commercial operating systems utilize a Multiple Address Space (MAS) model with the following characteristics:

Protection is enforced among processes by placing each process within a unique address space. Translation Lookaside Buffers (TLBs), which hold virtual to physical mappings, often need to be flushed on a process context switch.

Some memory areas may be shared among processes, e.g. kernel areas and shared libraries. Most operating systems assume at least one local and one global space. To promote sharing of data between processes, MAS operating systems aggressively use virtual aliases to map physical memory locations into the address spaces of multiple processes. Virtual aliases create multiple TLB entries for the same physical data leading to reduced TLB efficiency.

The MAS model is supported by dividing the virtual address space into several regions. Region identifiers associated with each region are used to tag translations to a given address space. On a process switch, region identifiers uniquely identify the set of translations belonging to a process, thereby avoiding TLB flushes. Region identifiers also provide a unique intermediate virtual address that help avoid thrashing problems in virtual-indexed caches and TLBs. Regions provide efficient global/shared areas between processes, while reducing the occurrences of virtual aliasing.

2.12.2 Support for Single Address Space Operating Systems

A single address space (SAS) operating system style architecture is the basis for much of the current design work on future 64-bit operating systems. As operating systems (and other large, complex programs like databases) migrate from monolithic programs

into cooperating subsystems, an SAS architecture becomes an important performance differentiation in future systems. The SAS or hybrid environments enable a more efficient use of hardware resources.

Common mechanisms are used in both SAS and MAS models such as page level access rights to enforce protection, although the reliance on the feature set will differ under each model. While most of the architected features are utilized in each model, protection keys exist to enable a single global address space operating environment.

2.12.3 System Performance and Scalability

Performance and scalability are achieved through a variety of features. Memory attributes, locking primitives, cache coherency, and memory ordering model work together to allow the efficient sharing of data in a multiprocessor environment. In addition, the Itanium architecture enables low latency fault, trap, and interrupt

handlers along with light-weight domain crossings. Performance analysis is aided by the inclusion of several performance monitors, and mechanisms to support software profiling.

2.12.4 System Security and Supportability

Security and supportability result from a number of primitives which provide a very powerful runtime and debug environment. The protection model includes four protection rings and enables increased system integrity by offering a more

sophisticated protection scheme than has generally been available. The machine check model allows detailed information to be provided describing the type of error involved and supports recovery for many types of errors. Several mechanisms are provided for debugging both system and application software.

2.13 Terminology

This following terms are used in the remainder of this document:

• Itanium Instruction Set – The Itanium architecture defines the 64-bit instruction set extensions to the IA-32 architecture.

• IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the

Intel®_{64 and IA-32 Architectures Software Developer’s Manual.}

• Itanium System Environment – System environment that supports the execution of both IA-32 and Itanium architecture-based code.

• Platform – Application and operating system resources external to the processor such as: memory maps, external devices (e.g. DMA), keyboard controllers, buses (e.g. PCI), option cards, interrupt controllers, bridges, etc.

• Itanium architecture-based Firmware – The Processor Abstraction Layer (PAL) and System Abstraction Layer (SAL).

• Processor Abstraction Layer (PAL) – The firmware layer which abstracts processor features that are implementation dependent.

• System Abstraction Layer (SAL) – The firmware layer which abstracts platform features that are implementation dependent.

Execution Environment

3

In document Automatización de los procesos para la generación, ensamblaje y reutilización de objetos de aprendizaje (página 88-96)