Figure 3.2: Basic architecture of the accelerator.
Given the accelerator model and the basic architecture, we can now fill in the details of the xPU design. The meta-architecture provides us with a framework for exploring a broad design space. This space includes high-level design issues, such as the memory and execution model. It includes the architecture and microarchitecture of the compute array, the global cache, and the memory system. Finally, it includes the circuit and physical design of the chip.
Certain important design considerations are beyond the scope of the studies in this thesis. For instance, we do not address the details of the controller core. Moreover, we do not deal with issues arising from the connection between the accelerator and its memory and the rest of the system. These items are fundamentally performance overheads, and we assume they can be designed so as to minimize their impact. Similarly, we do not explore specific methods for thread scheduling, assuming that it can be done with minimal overhead. Previous work [12] [13] [14] has shown that this assumption is justified. We return to some of these issues in Chapter 10.
An important design consideration is the memory model. Multicore CPUs typically use a single, shared memory space. On the other hand, a number of existing accelerator architectures, including Cell and GPUs, use multiple memory spaces and scratchpad memories. Design decisions regarding the memory model include single versus multiple address spaces, the design of the memory hierarchy, and the bandwidth requirements. A related issue is the set of mechanisms provided for synchronization and communication. The xPU could provide anything from no synchronization support at all up to fine-grained locking at a global level. Similarly, it could provide no inter-thread communication (except through the CPU host), minimal support through memory, or extensive, high-bandwidth on-chip communication mechanisms. In Chapter 5, we examine how each of these design decisions affects the execution of the VISBench applications.
The execution model of the compute array is another portion of the design space. In GPUs, the compute array is arranged in clusters of pipelines which execute in SIMD lockstep. In other architectures, such as TILERA or multicore CPUs, the compute array consists of discrete cores which execute in a MIMD fashion. The SIMD approach allows the architecture to provide a higher density of peak throughput, at the cost of reduced flexibility in how individual threads execute. The tradeoffs of these approaches are examined in Chapter 6.
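The flexibility cost of lockstep execution can be illustrated with a toy cycle count. The sketch below is not the simulator used in this thesis. It assumes a simplified model in which a SIMD cluster must execute every control path taken by any of its lanes, one after another, while MIMD cores each execute only their own path. The function names and the path costs are hypothetical, and the model deliberately ignores the density advantage of SIMD (more lanes per unit area).

```python
def simd_cycles(paths_taken, path_cost):
    """Lockstep cluster: every distinct path taken by any lane is executed
    by the whole cluster, one path after another (inactive lanes are masked)."""
    return sum(path_cost[p] for p in set(paths_taken))


def mimd_cycles(paths_taken, path_cost):
    """Independent cores: each core runs only its own path, so the group
    finishes when the slowest core finishes."""
    return max(path_cost[p] for p in paths_taken)


# Hypothetical per-path costs in cycles (illustrative values only).
path_cost = {"short": 10, "long": 30}

uniform = ["short"] * 4                            # all threads agree
divergent = ["short", "short", "long", "short"]    # one thread diverges

print(simd_cycles(uniform, path_cost), mimd_cycles(uniform, path_cost))      # 10 10
print(simd_cycles(divergent, path_cost), mimd_cycles(divergent, path_cost))  # 40 30
```

When all threads follow the same path the two models cost the same. With a single divergent thread the lockstep cluster pays for both paths, which is the loss of flexibility described above.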
A further component of the design space for the compute array is the high-level architecture of the cores. The high-level architecture, or macro-architecture, defines the algorithm of the core: it determines the way the core executes its instruction stream. Elements of the macro-architecture include features of the overall pipeline design, such as superscalar issue and dynamic scheduling. They also include decisions that are exposed to the programmer, such as multithreading, and instruction set changes such as support for vector ALUs via SIMD instructions or special floating point instructions for the complex functions that occur frequently in visual computing workloads. The tradeoffs in the core-level macro-architecture are examined in Chapter 7.
A level beneath the macro-architecture is the core micro-architecture: those aspects of the core design that may affect performance but not the overall mechanism by which the instruction stream is executed. Micro-architectural design choices include pipeline depth, functional unit latencies, and predictor sizing. Because we consider the lowest-level caches to be part of the core, they also include the size and latency of the private first-level caches. The micro-architecture is closely coupled to the implementation. The implementation includes the logical design of the micro-architecture, the logic styles, and the circuit design. Finally, it includes the physical design, including circuit sizing and latency. Chapters 8 and 9 examine the tradeoffs of micro-architecture and implementation, including IPC, clock speed, area, and power consumption.
Table 3.1 summarizes the design space for the xPU.
Design level              Parameters
------------------------  ---------------------------
memory model              memory hierarchy
                          caching
                          synchronization mechanisms
                          communication support
execution model           SIMD/MIMD
chip-wide architecture    global cache bandwidth
                          memory bandwidth
core macroarchitecture    superscalar issue
                          dynamic scheduling
                          multithreading
                          instruction set
core microarchitecture    pipeline latency
                          functional unit latency
                          predictor sizing
                          L1 cache sizing
                          L1 cache latency
implementation            logical design
                          logic style
                          circuit design
                          circuit sizing
                          circuit latency
Table 3.1: Summary of the design space.
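One way to make the design space of Table 3.1 concrete is to hold it as a mapping from design level to parameters and to enumerate candidate configurations over a few of them. The sketch below is only an illustration. The parameter names come from the table, but the candidate values, the unit attached to the L1 size, and the helper `enumerate_configs` are assumptions and do not appear in the thesis.

```python
from itertools import product

# Design levels and parameters, as listed in Table 3.1.
DESIGN_SPACE = {
    "memory model": ["memory hierarchy", "caching",
                     "synchronization mechanisms", "communication support"],
    "execution model": ["SIMD/MIMD"],
    "chip-wide architecture": ["global cache bandwidth", "memory bandwidth"],
    "core macroarchitecture": ["superscalar issue", "dynamic scheduling",
                               "multithreading", "instruction set"],
    "core microarchitecture": ["pipeline latency", "functional unit latency",
                               "predictor sizing", "L1 cache sizing",
                               "L1 cache latency"],
    "implementation": ["logical design", "logic style", "circuit design",
                       "circuit sizing", "circuit latency"],
}


def enumerate_configs(candidates):
    """Yield one dict per point in the cross product of candidate values."""
    names = list(candidates)
    for values in product(*(candidates[n] for n in names)):
        yield dict(zip(names, values))


# Hypothetical candidate values for three of the parameters.
candidates = {
    "SIMD/MIMD": ["SIMD", "MIMD"],
    "multithreading": [1, 2, 4],        # hardware threads per core (assumed)
    "L1 cache sizing": [8, 16],         # KB (assumed unit)
}

configs = list(enumerate_configs(candidates))
print(len(configs))   # 2 * 3 * 2 = 12 design points
print(configs[0])
```

Even three parameters with a handful of values each produce a dozen design points, which suggests why the later chapters examine the levels of the design space separately.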
CHAPTER 4 METHODOLOGY
In Chapters 5 to 7, we deal with the macro-architectural optimization of the accelerator. This includes the higher-level issues of memory and execution model, as well as the more concrete architectural design of the core and memory system. Our goal is to search the architectural design space of the accelerator for the design that maximizes performance under a given area constraint.
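The optimization problem stated above can be sketched as an exhaustive search over candidate designs. This is a minimal illustration under stated assumptions, not the procedure used in this thesis. The functions `toy_perf` and `toy_area`, the configuration fields, and all numeric values are hypothetical stand-ins for the simulation-based performance measurement and the area model.

```python
def best_config(configs, perf, area, area_budget):
    """Return the highest-performing config whose area fits the budget,
    or None if no config fits."""
    feasible = [c for c in configs if area(c) <= area_budget]
    if not feasible:
        return None
    return max(feasible, key=perf)


# Hypothetical stand-ins: area grows with core count and L1 size,
# and a larger L1 makes each core more effective.
def toy_area(c):
    return c["cores"] * (1.0 + 0.1 * c["l1_kb"])


def toy_perf(c):
    return c["cores"] * (1.0 if c["l1_kb"] >= 16 else 0.6)


configs = [{"cores": n, "l1_kb": k} for n in (8, 16, 32) for k in (4, 16)]

print(best_config(configs, toy_perf, toy_area, area_budget=45))
print(best_config(configs, toy_perf, toy_area, area_budget=42))
```

In this toy setting the looser budget favors many cores with small caches, while the tighter one favors fewer cores with larger caches. This is the kind of tradeoff an area-constrained search is meant to expose.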
In this chapter, we describe our methodology for evaluating macro-architectural tradeoffs. As part of this methodology, we have developed a simulation-based approach to performance measurement. In addition, we have developed a model that computes the area cost of a variety of architectural features by mapping out the required hardware at the micro-architectural level and then at the logical level, and then determining the area cost of each component.
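The idea of building an area estimate from the bottom up can be sketched as a recursive sum over a component hierarchy. The following is an illustrative sketch only. The component names, the dictionary layout, and the area values (in arbitrary units) are assumptions and are not taken from the area model developed in this thesis.

```python
def total_area(component):
    """Area of a component: its own area plus the area of its parts,
    multiplied by how many copies of the component are instantiated."""
    own = component.get("area", 0)
    parts = sum(total_area(p) for p in component.get("parts", []))
    return component.get("count", 1) * (own + parts)


# Hypothetical hierarchy; all areas are in arbitrary units.
core = {
    "name": "core",
    "count": 4,
    "parts": [
        {"name": "ALU", "area": 10, "count": 2},
        {"name": "register file", "area": 15},
        {"name": "L1 cache", "area": 25},
    ],
}
chip = {
    "name": "chip",
    "parts": [core, {"name": "global cache", "area": 160}],
}

print(total_area(core))   # 4 * (2*10 + 15 + 25) = 240
print(total_area(chip))   # 240 + 160 = 400
```

Changing an architectural feature, such as adding an ALU or enlarging the L1 cache, then changes the total only through the affected leaf components.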