Table 4.3: Paper Category B
ID    Title                                                                                     Ref.
B.I   A Quantitative Study of Memory System Interference in Chip Multiprocessor Architectures   [69]
B.II  DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems  [70]
Since this goal was difficult to achieve with the Interference Point (IP) mechanism from Paper A.III, we decided to try a different approach.
The main inspiration for this work was Mutlu and Moscibroda [102] and their definition that interference is the additional time the processor is stalled waiting for memory in the shared mode. Processor stall time due to memory depends on the ability of the process to tolerate memory latencies and utilize the parallelism of the memory system. Consequently, it is difficult to measure this quantity directly. Instead, we decided to define interference in terms of the additional memory latency, which can be measured directly. Furthermore, the total latency is the sum of the latencies in each memory system unit, which makes the units' measurement techniques relatively independent.
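The latency-based definition can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the papers: U is the set of shared memory system units, and L_u is the latency one memory request experiences in unit u in the given mode.

```latex
\[
I = L^{\mathrm{shared}} - L^{\mathrm{private}}
  = \sum_{u \in U} \left( L_u^{\mathrm{shared}} - L_u^{\mathrm{private}} \right)
\]
```

Because the total is a sum over units, each term can in principle be estimated by a separate measurement technique for that unit.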
We started working on the system that would eventually become the main contribution of Paper B.II in the autumn of 2008. Although we quickly developed a prototype, its accuracy was poor and it proved difficult to find the cause of the inaccuracies. By December, it was clear that we needed to improve our understanding of the problem. To achieve this, we started to work on Paper B.I. In Paper B.I, we quantified the latency impact of each shared unit by comparing the latency of each memory request in the private and shared modes. As well as improving our understanding, this work helped us achieve synchronized measurements of shared and private mode memory requests.
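The per-request comparison described above can be sketched as follows. This is a minimal illustration and not the measurement infrastructure used in Paper B.I: the function name, the dictionary layout, and the unit names in the usage note are all hypothetical, and the sketch assumes that each shared-mode request can be matched to its private-mode counterpart by a request id.

```python
from collections import defaultdict


def per_unit_interference(private, shared):
    """Average extra latency per unit in shared mode relative to private mode.

    Both arguments map a request id to a dict of {unit name: latency in cycles}.
    This layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    matched = 0
    for req_id, shared_lat in shared.items():
        private_lat = private.get(req_id)
        if private_lat is None:
            continue  # no private-mode counterpart, so the request is skipped
        matched += 1
        for unit, cycles in shared_lat.items():
            # Extra cycles this request spent in this unit in shared mode.
            totals[unit] += cycles - private_lat.get(unit, 0)
    if matched == 0:
        return {}
    return {unit: total / matched for unit, total in totals.items()}
```

For example, with two matched requests whose shared-mode bus latencies exceed the private-mode ones by 60 and 20 cycles, the function reports an average bus interference of 40 cycles per request.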
When Paper B.I was finished, we continued work on Paper B.II. This time, progress was better and we managed to track down the major problems. These problems were either programming errors or related to combination effects between the shared cache and memory bus. In particular, shared cache writebacks occur at different points in the benchmark's execution in the private and shared modes. We finished Paper B.II at the beginning of summer 2009.
4.4 Category C: CMP Prefetch Scheduling
The feedback we got on the first submission of Paper A.II made it clear that the memory bus and DRAM model of the M5 simulator was too simplistic for our research topics. Concurrently, my fellow PhD student Marius Grannæs was porting his prefetcher implementations from SimpleScalar to M5. Grannæs had already observed that the benefits of prefetching are closely tied to DRAM page locality [41]. Consequently, we decided to join forces and develop a detailed memory model based on the DDR2 standard document [71].

Table 4.4: Paper Category C
ID    Title                                                                                                   Ref.
C.I   Low-Cost Open-Page Prefetch Scheduling in Chip Multiprocessors                                          [43]
C.II  Exploring the Prefetcher/Memory Controller Design Space: An Opportunistic Prefetch Scheduling Strategy  -
During the implementation of this model, we became interested in memory access scheduling. We started by implementing a simple First Come First Served (FCFS) scheduler and the First Ready First Come First Served (FR-FCFS) scheduler by Rixner et al. [120]. While porting his prefetcher implementations to M5, Grannæs observed that cleverly scheduling prefetches and demand reads can improve performance. In Paper C.I, we piggybacked prefetches to open pages on demand reads to these pages, which makes prefetches cheaper than ordinary reads. We observed that prefetching improved performance as long as the accuracy of the prefetcher was above 38%. This threshold was found empirically and indicates the break-even point between the cost of prefetching and the cost of regular reads. In other words, it indicates the amount of useless data we can allow the prefetcher to fetch from open pages without degrading performance.
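The difference between the two schedulers can be illustrated with a simplified sketch. It is not the M5 implementation and it reduces FR-FCFS to its core idea: requests that hit a currently open DRAM page are served before older requests that would require opening a new page. The Request tuple, the function names, and the open_rows mapping are assumptions made for this example; both functions expect a non-empty queue.

```python
from collections import namedtuple

# Hypothetical request record: arrival time, target bank, target row (page).
Request = namedtuple("Request", ["arrival", "bank", "row"])


def fcfs_pick(queue):
    """FCFS: always serve the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def fr_fcfs_pick(queue, open_rows):
    """Simplified FR-FCFS: oldest open-page hit first, otherwise oldest request.

    open_rows maps a bank id to the row currently open in that bank.
    """
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    return fcfs_pick(hits) if hits else fcfs_pick(queue)
```

With row 7 open in bank 0, a younger request to that row is selected ahead of an older request to row 5 of the same bank, whereas FCFS would serve the older request and pay for closing and reopening the page.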
Paper C.II was born as an idea for a new prefetching heuristic. Grannæs observed that the state of the DRAM system could be used to generate prefetches that can be efficiently executed. This is the opposite approach to conventional prefetching heuristics, which create prefetches based on the miss address stream. The key component of this system is the Page Vector Table (PVT), which contains one bit for each cache line in a DRAM page. While working on this idea, we realized that the PVT could be used as the interface between the prefetcher and the memory bus scheduler. In this system, the prefetcher sets the bits of the cache lines it wants to retrieve in the PVT, which facilitates efficient prefetch scheduling. In our opportunistic prefetch scheduling strategy, we fetch all marked cache blocks in the PVT at the time the memory bus scheduler closes the page if the accuracy of the prefetcher is sufficiently high. Paper C.II also explores the prefetch scheduling design space, indicating that the opportunistic strategy has an advantage when bandwidth-constrained CMPs are combined with aggressive prefetchers. At the time of writing, Paper C.II is being reviewed by the Journal of Computer Science and Technology.
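The role of the PVT as an interface can be sketched in a few lines. This is an illustrative model only, not the design evaluated in Paper C.II: the class and method names, the default page size of 16 lines, and the threshold parameter are hypothetical. The text does not give the accuracy threshold, so it is left as a caller-supplied argument, and discarding the marks when accuracy is too low is likewise an assumption of this sketch.

```python
class PageVectorTable:
    """One bit per cache line in a DRAM page, stored as an integer per page."""

    def __init__(self, lines_per_page=16):
        self.lines_per_page = lines_per_page
        self.vectors = {}  # page id -> bit vector of marked cache lines

    def mark(self, page, line):
        """Called by the prefetcher: request that this cache line be fetched."""
        if not 0 <= line < self.lines_per_page:
            raise ValueError("line index outside page")
        self.vectors[page] = self.vectors.get(page, 0) | (1 << line)

    def drain_on_close(self, page, accuracy, threshold):
        """Called by the scheduler when it closes a page.

        Returns the marked line indices to fetch, or an empty list if the
        prefetcher's measured accuracy is below the threshold.
        """
        bits = self.vectors.pop(page, 0)
        if accuracy < threshold:
            return []
        return [i for i in range(self.lines_per_page) if (bits >> i) & 1]
```

Keeping the prefetcher's output as marks in a table, rather than as queued requests, lets the scheduler decide when the prefetches are issued, which is the point at which the page is about to be closed in the opportunistic strategy.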
In this thesis, we investigate resource management in CMP memory systems. While the contributions in category A manage off-chip bandwidth to improve system-wide performance metrics, prefetching aims to put the available bandwidth to good use. Consequently, it provides more bandwidth to processes that have predictable memory access patterns. This may decrease the performance of processes with