In CMPs, each core may have its own prefetcher. Allowing these prefetchers to operate in an uncontrolled fashion may create significant interference with both the demand and prefetch accesses of other cores. Ebrahimi et al. [31] provide a scheme for reducing prefetcher-caused interference by throttling the prefetchers depending on the state of the memory system. This throttling is based on both local and global feedback. The global feedback is used to avoid decisions that are good from a local point of view but reduce overall system performance.
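The interplay between local and global feedback can be illustrated with a minimal sketch. It is not the scheme from [31]: the feedback metrics (`accuracy`, `interference`), the thresholds and the aggressiveness levels are hypothetical and chosen only to show how a global signal can override a locally attractive decision.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen for illustration only.
ACCURACY_HIGH = 0.75
ACCURACY_LOW = 0.40
INTERFERENCE_HIGH = 0.30

MIN_LEVEL, MAX_LEVEL = 0, 4  # 0 = prefetcher off, 4 = most aggressive


@dataclass
class CoreFeedback:
    accuracy: float      # local: fraction of issued prefetches that were useful
    interference: float  # global: assumed measure of harm done to other cores


def local_decision(fb: CoreFeedback) -> int:
    """Return +1, 0 or -1 using local feedback only."""
    if fb.accuracy >= ACCURACY_HIGH:
        return +1
    if fb.accuracy < ACCURACY_LOW:
        return -1
    return 0


def throttle(level: int, fb: CoreFeedback) -> int:
    """Combine local and global feedback into a new aggressiveness level."""
    step = local_decision(fb)
    # Global override: an interfering prefetcher may never throttle up,
    # and it is throttled down unless it is highly accurate.
    if fb.interference >= INTERFERENCE_HIGH:
        step = 0 if fb.accuracy >= ACCURACY_HIGH else -1
    # Clamp the result to the valid range of levels.
    return max(MIN_LEVEL, min(MAX_LEVEL, level + step))
```

For example, an accurate prefetcher at level 2 moves to level 3 when it causes little interference, but stays at level 2 when its interference is high, even though its local feedback alone would favour throttling up.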

Lee et al. [86] provide a different way of integrating prefetchers in a CMP with no shared cache. Here, they choose prefetch requests to maximize Memory Level Parallelism (MLP). In modern DRAM systems, the amount of MLP is closely tied to how efficiently the process is able to utilize the available DRAM banks. Consequently, Lee et al. choose prefetches to maximize the bank parallelism of the concurrent requests. Furthermore, they load the concurrent requests of one processor into the memory bus queue at the same time to minimize request serialization.
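The idea of choosing prefetches for bank parallelism can be sketched as a greedy selection that favours candidates targeting banks that are not already busy. This is an illustration under stated assumptions, not the algorithm of [86]: the address-to-bank mapping in `bank_of`, the constants and the function names are all hypothetical.

```python
NUM_BANKS = 8      # assumed number of DRAM banks
ROW_BYTES = 8192   # assumed number of bytes mapped to one bank before moving on


def bank_of(addr: int) -> int:
    """Hypothetical mapping: consecutive rows are interleaved across banks."""
    return (addr // ROW_BYTES) % NUM_BANKS


def select_prefetches(demand_addrs, candidate_addrs, budget):
    """Greedily pick up to `budget` candidates that target banks not yet in use."""
    busy = {bank_of(a) for a in demand_addrs}  # banks occupied by demand requests
    chosen = []
    for addr in candidate_addrs:
        if len(chosen) == budget:
            break
        bank = bank_of(addr)
        if bank not in busy:
            chosen.append(addr)
            busy.add(bank)  # at most one selected prefetch per bank
    return chosen
```

With a demand access to bank 0, candidates that also map to bank 0 are skipped, and the selected prefetches are spread over distinct idle banks so that they can be serviced concurrently.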

Chapter 3

Methodology

This chapter discusses the experimental methodology used in the papers included in this thesis. This methodology is used to quantify the effects of our contributions. Section 3.1 explains why a simulator-based methodology is used and the reasons for choosing the M5 simulator [11]. Then, benchmarks are discussed in Section 3.2. In Section 3.3, multiprogrammed workload generation is discussed. In addition, this section contains an analysis of how accurately multiprogrammed metric results can be provided. Finally, our use of compute clusters for Design Space Exploration (DSE) is discussed in Section 3.4.

3.1 Simulators

Computer architectures can be evaluated in three main ways [129]:

• Performance measurement on real hardware

• Simulation

• Analytical modeling

In this thesis, we investigate new hardware techniques. Unfortunately, a significant effort is involved in implementing these techniques in real hardware. A simulator-based approach is more efficient since it enables rapid iterations through the improvement and evaluation loop. Furthermore, modern simulators have a sufficient level of detail to make the effects our techniques aim to alleviate observable. Analytical modeling has a significant advantage for exploration of large design spaces and early studies of future technologies that are very different from current simulation models [129]. Since our research focuses on architectures that are similar to current CMPs, the current simulators serve our purpose.

There is no shortage of computer architecture simulators. Therefore, finding the most suitable simulator can be a challenging task. Previously, our research group has used the SimpleScalar simulator [7, 27]. Unfortunately, SimpleScalar does not support simulating CMPs without modifications. Furthermore, memory latencies are calculated in a single operation, which makes it difficult to model request interleaving and queuing effects. Consequently, we started to look for a SimpleScalar replacement. An important step in this process was Lande's master's thesis [84], where he evaluated Rsim [57], Asim [33], SimOS [121], Simics [95], TFSim [98], SimFlex [49], GEMS [97] and M5 [11]. Then, he carried out a thorough evaluation of M5 to establish whether it met the needs of the research group. In the end, we decided to use M5 since it offered CMP support and an event-driven memory hierarchy. An event-driven memory hierarchy makes it possible to accurately model queuing and interleaving of memory requests, which is a central theme in this thesis.

M5 is an execution-driven simulator, which makes it possible to capture dynamic interactions between instructions and memory requests. In an execution-driven simulator, a benchmark binary is used to drive the simulated CMP. Alternatively, a trace-driven simulator uses a trace of the executed instructions or memory requests to drive the simulator model. Furthermore, M5 supports both system call emulation and full-system simulation. With full-system simulation, the simulator runs an Operating System (OS). In contrast, all system calls are handled by the host OS with system call emulation. Although full-system simulation is more realistic, it also makes it difficult to find the cause of the observed behavior. Therefore, we use system call emulation in this thesis and leave full-system evaluation as further work.
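The difference between computing a memory latency in a single operation and modeling the memory system with events can be shown with a minimal sketch. It does not reflect the internals of SimpleScalar or M5: the single shared bus, the constant `SERVICE_CYCLES` and both function names are assumptions made for illustration.

```python
import heapq

SERVICE_CYCLES = 40  # assumed time the shared bus is occupied by one request


def fixed_latency(arrivals):
    """Single-operation model: every request finishes after a constant delay."""
    return [t + SERVICE_CYCLES for t in arrivals]


def event_driven(arrivals):
    """Event-driven model: requests queue for one shared bus in arrival order."""
    events = [(t, i) for i, t in enumerate(arrivals)]  # (arrival time, request id)
    heapq.heapify(events)
    bus_free_at = 0
    done = [0] * len(arrivals)
    while events:
        now, req = heapq.heappop(events)
        start = max(now, bus_free_at)  # wait while an earlier request holds the bus
        bus_free_at = start + SERVICE_CYCLES
        done[req] = bus_free_at
    return done
```

For requests arriving at cycles 0, 10 and 200, the fixed-latency model reports completion at cycles 40, 50 and 240, whereas the event-driven model reports 40, 80 and 240. The second request is delayed by 30 cycles because it has to wait for the bus, a queuing effect that the single-operation model cannot express.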

Choosing system call emulation makes running multi-threaded benchmarks complicated. The reason is that communication libraries often have significant interaction with the OS. Consequently, it is likely that a large number of system calls would need to be implemented. Full-system simulation avoids this problem because the simulated OS provides these features. System call emulation also makes it challenging to adopt new benchmarks and compilers since they often require new system calls.

Although M5 was well suited to the needs of the research group, we had to implement significant extensions for it to fit our needs. Firstly, we have replaced the on-chip bus model with a range of different interconnect topologies. Secondly, the simple off-chip memory bus model has been replaced by a detailed DDR2 model and various memory bus schedulers (FCFS, FR-FCFS [120] and NFQ [119]). Thirdly, we have implemented multi-banked shared caches, an Auxiliary Tag Directory (ATD) [30, 117] and MTP cache partitioning [18]. Fourthly, we have extended M5 to collect basic block vectors for SimPoints [48] and improved the checkpointing support. Finally, we have developed a large number of Python scripts that help us run our experiments and analyze the results.
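To illustrate how two of the memory bus schedulers mentioned above differ, the following sketch contrasts FCFS with a simplified form of FR-FCFS. It is not the implementation used in our M5 extensions: the `Request` record, the `open_rows` dictionary and the reduction of FR-FCFS to "row hits first, then oldest first" are simplifying assumptions, and DRAM command timing is ignored entirely.

```python
from collections import namedtuple

# Hypothetical request record: arrival time, target bank and target row.
Request = namedtuple("Request", "arrival bank row")


def fcfs(queue, open_rows):
    """Pick the oldest request regardless of the DRAM state."""
    return min(queue, key=lambda r: r.arrival)


def fr_fcfs(queue, open_rows):
    """Prefer requests that hit the open row of their bank, oldest first."""
    def priority(r):
        row_hit = open_rows.get(r.bank) == r.row
        return (not row_hit, r.arrival)  # False sorts first, so row hits win
    return min(queue, key=priority)
```

Given a queue where the oldest request targets a closed row and a younger request targets the row that is currently open in the same bank, `fcfs` selects the oldest request while `fr_fcfs` selects the row hit. When no request hits an open row, both functions return the same request.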
