A variety of simulators exists for shared-memory multi-core systems, offering different tradeoffs between simulation speed and detail. In this section, we introduce several state-of-the-art multi-core simulators. Table 2.1 gives a quick overview of the different simulators.
TABLE 2.1: Classification of shared-memory multi-core simulators

Simulator   Trace- / execution-driven   Functional / performance   Instruction scheduler / cycle timer
---------   -------------------------   ------------------------   -----------------------------------
gem5        Both                        Both                       Instruction scheduler
Graphite    Execution-driven            Performance                Cycle timer
Sniper      Execution-driven            Performance                Cycle timer
TaskSim     Trace-driven                Performance                Cycle timer
ZSim        Execution-driven            Performance                Cycle timer
COTSon      Execution-driven            Both                       Cycle timer
ESESC       Execution-driven            Performance                Cycle timer
The gem5 simulator [14] is a full-system simulator, i.e., it models an entire computer system including devices like I/O controllers and system timers. This allows gem5 to run unmodified versions of different operating systems on the simulated hardware. In addition, gem5 features core models at several levels of detail, ranging from a model employing virtualization and running at near-native speed [104] to a detailed model of a superscalar out-of-order core. Among others, gem5 supports the x86 and ARM architectures, which are the most common architectures today.
Graphite [83] is a simulator for shared- and distributed-memory systems. It achieves high simulation speed by parallelizing a simulation across multiple cores of the host system, or even across multiple systems. Graphite uses dynamic binary translation to perform functional simulation of the simulated application.
The binary translator instruments all instructions of the simulated program and feeds each thread’s instruction stream to an analytical core performance model. Memory requests from the application are serviced by Graphite’s simulated memory hierarchy. First, this provides the input to the performance models of the memory hierarchy, e.g., caches, on-chip interconnect, and DRAM. Second, this approach decouples the memory address space of the simulated system from that of the simulation host machine, which allows the simulation of a shared-memory system to be parallelized across multiple hosts of a distributed-memory system.
Sniper [21], proposed by Carlson et al., is a simulator for shared-memory systems based on the Graphite simulator. Carlson et al. show that overly simplistic core performance models can introduce high simulation errors. They extend Graphite with the interval model [48] as an improvement over Graphite’s core models, which process a fixed number of instructions per cycle. These models are also referred to as fixed-IPC or one-IPC models, since they model program execution at an IPC of one.
The interval model makes it possible to simulate processors with superscalar out-of-order execution, whereas the one-IPC model assumes in-order instruction issue and commit stages and a scalar execution pipeline. The interval model assumes out-of-order execution at the maximum steady-state IPC, which is interrupted by miss events. If a branch predictor miss or a cache miss occurs during steady-state execution, the model accounts for the number of cycles required to resolve the miss. Afterwards, execution at steady-state IPC is resumed. Consecutive, dependent misses are accounted for separately. The higher level of abstraction of interval simulation is directly reflected in a higher simulation speed compared to more detailed models.
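The accounting idea behind the interval model can be illustrated with a minimal sketch. It is not Sniper's implementation: the names `MissEvent` and `interval_cycles` are hypothetical, and the sketch assumes that every miss is serialized, i.e., that its full penalty is added to the steady-state execution time and that independent misses never overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MissEvent:
    """A miss event interrupting steady-state execution (hypothetical type)."""
    instruction_index: int  # position of the instruction that caused the miss
    penalty_cycles: int     # cycles required to resolve the miss


def interval_cycles(num_instructions: int,
                    steady_state_ipc: float,
                    misses: List[MissEvent]) -> float:
    """Estimate execution time as steady-state time plus miss penalties.

    Assumption of this sketch: each miss is accounted for separately and
    its penalty is fully exposed (no overlap between misses).
    """
    if steady_state_ipc <= 0:
        raise ValueError("steady_state_ipc must be positive")
    steady_state_time = num_instructions / steady_state_ipc
    miss_time = sum(m.penalty_cycles for m in misses)
    return steady_state_time + miss_time
```

For example, 1,000 instructions at a steady-state IPC of 4 with one 15-cycle branch misprediction and one 200-cycle cache miss yield 250 + 215 = 465 cycles. Calling the same function with `steady_state_ipc=1.0` gives a rough fixed-IPC-style estimate for comparison.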
TaskSim [99, 100] is a trace-based simulator, meaning that a trace of the simulated application is generated before simulation. This trace is afterwards used by all simulations involving the corresponding application. The TaskSim tracer records the computation phases of an application and the parallelism management operations, e.g., work creation and scheduling primitives in the runtime system. This allows the tracer to be single-threaded, while a trace can be used to simulate the execution of the application with an arbitrary number of execution threads. Another advantage is that the simulator itself can be a single-threaded process, since it does not need to perform functional simulation of the simulated application. TaskSim interfaces with an unmodified instance of the OmpSs runtime system. TaskSim exposes the simulated cores to the runtime system, which then schedules work units for execution on those simulated cores. The instruction streams of these work units are read from the application trace.
TaskSim’s detailed simulation mode, also referred to as Memory mode, models a superscalar processor core featuring out-of-order execution, based on the Reorder-Buffer Occupancy Analysis technique [77]. The core of this technique is a model of the reorder-buffer. According to the specified issue width of the simulated processor, a number of instructions is inserted into the head of the reorder-buffer in every cycle. If the reorder-buffer is full, the issue stage is halted. At the same time, instructions are committed from the tail of the reorder-buffer at a rate equivalent to the specified commit rate. Memory accesses are issued to an external model of the memory hierarchy, containing one or more levels of private cache, on-chip interconnect structures, shared caches, and DRAM.
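The reorder-buffer mechanism described above can be sketched in a few lines. This is an illustrative model only, not TaskSim's code: the function name `simulate_rob` and its parameters are hypothetical, and the external memory hierarchy is replaced by a fixed per-instruction latency list, which is an assumption of this sketch.

```python
from collections import deque
from typing import Sequence


def simulate_rob(latencies: Sequence[int], rob_size: int,
                 issue_width: int, commit_width: int) -> int:
    """Return the number of cycles needed to issue and commit all instructions.

    latencies[i] is the assumed number of cycles until instruction i completes
    (a stand-in for the external memory hierarchy model).
    """
    if rob_size < 1 or issue_width < 1 or commit_width < 1:
        raise ValueError("rob_size and widths must be at least 1")
    rob = deque()       # completion cycle of each in-flight instruction, oldest first
    next_instr = 0      # index of the next instruction to insert
    committed = 0
    cycle = 0
    while committed < len(latencies):
        # Commit completed instructions in program order, up to the commit rate.
        done = 0
        while rob and done < commit_width and rob[0] <= cycle:
            rob.popleft()
            committed += 1
            done += 1
        # Insert new instructions; insertion stalls while the buffer is full.
        issued = 0
        while (next_instr < len(latencies) and issued < issue_width
               and len(rob) < rob_size):
            rob.append(cycle + latencies[next_instr])
            next_instr += 1
            issued += 1
        cycle += 1
    return cycle
```

With four single-cycle instructions, a four-entry buffer, and issue and commit widths of two, the sketch returns 3 cycles. With latencies `[100, 1, 1, 1]` and a two-entry buffer, the long-latency instruction blocks the oldest slot, insertion stalls, and the result grows to 102 cycles.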
In the abstract simulation mode, also called Burst mode, TaskSim employs a high-level core performance model. In Burst mode, computational phases are assumed to have the same duration as during trace generation. Optionally, these durations can be scaled by a user-defined factor. Microarchitectural core structures, as well as the components of the memory hierarchy, are not simulated. Therefore, Burst mode simulations do not capture contention on shared system resources. Instead, they allow evaluating an application’s algorithmic scalability limit and its best-case scalability, assuming that the application does not cause significant contention on shared resources.
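A Burst-mode-style estimate can be sketched as replaying recorded phase durations on a configurable number of cores. The sketch below is hypothetical: `burst_makespan` is not a TaskSim function, tasks are assumed to be independent, and a greedy earliest-free-core policy stands in for the scheduling decisions that the OmpSs runtime system would make.

```python
import heapq
from typing import Sequence


def burst_makespan(task_durations: Sequence[float], num_cores: int,
                   scale: float = 1.0) -> float:
    """Replay traced task durations on num_cores cores and return the makespan.

    Durations are taken from the trace and optionally scaled by a
    user-defined factor; no core or memory structures are modeled.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    core_free_at = [0.0] * num_cores  # time at which each core becomes idle
    heapq.heapify(core_free_at)
    for duration in task_durations:
        start = heapq.heappop(core_free_at)  # earliest available core
        heapq.heappush(core_free_at, start + duration * scale)
    return max(core_free_at)
```

Four tasks of duration 4 finish after 16 time units on one core, 8 on two cores, and 4 on four or more cores, which exposes the best-case scalability limit of this task set without modeling any contention.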
The ZSim simulator [103], proposed by Sanchez et al., relies on parallel simulation in order to achieve high simulation speed. ZSim achieves good parallel simulation scalability by relaxing synchronization between simulated cores. To this end, simulated time is split into windows of typically 10,000 cycles. In each window, the different threads are simulated without synchronization, and a per-core event trace is generated.
At the end of each window, a dependency graph of all events is constructed, and a timing model is invoked in order to determine the actual interleaving of the per-core events. This timing model is also executed in parallel. The event dependency graph is partitioned into different domains, and the simulation is synchronized only in case of an event dependency crossing different domains.
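The second phase of this two-phase scheme can be illustrated with a strongly simplified, sequential sketch. It is not ZSim's algorithm: the function `weave_window` is hypothetical, the event dependency graph is reduced to accesses to a single shared resource with a fixed service time, and neither domain partitioning nor parallel execution of the timing model is shown.

```python
import heapq
from typing import Dict, List


def weave_window(per_core_events: Dict[int, List[int]],
                 service_cycles: int) -> Dict[int, int]:
    """Resolve the interleaving of per-core events recorded in one window.

    per_core_events maps a core id to the ascending cycles at which that core
    accessed the shared resource when simulated without synchronization.
    Returns the contention delay (in cycles) accumulated by each core.
    """
    delay = {core: 0 for core in per_core_events}
    # One pending event per core: (adjusted arrival cycle, core id, event index).
    pending = [(cycles[0], core, 0)
               for core, cycles in per_core_events.items() if cycles]
    heapq.heapify(pending)
    resource_free_at = 0
    while pending:
        arrival, core, idx = heapq.heappop(pending)
        start = max(arrival, resource_free_at)  # wait while the resource is busy
        resource_free_at = start + service_cycles
        delay[core] += start - arrival
        if idx + 1 < len(per_core_events[core]):
            # Later events of the same core shift by the delay accumulated so far.
            next_cycle = per_core_events[core][idx + 1] + delay[core]
            heapq.heappush(pending, (next_cycle, core, idx + 1))
    return delay
```

For the event traces `{0: [0, 10], 1: [0, 50]}` and a service time of 5 cycles, both cores access the resource at cycle 0 in the unsynchronized phase; the sketch orders the two accesses and charges core 1 a delay of 5 cycles, while core 0 is not delayed.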
Sanchez et al. report a simulation speed of 1,500 MIPS for simulations of a thousand-core system. Although ZSim shows absolute performance prediction errors of up to 20%, it achieves errors of less than 5% for scalability predictions of benchmarks of the PARSEC benchmark suite [13].
COTSon [6] is a full-system simulator decoupling functional and timing simulation. Functional simulation relies on just-in-time compilation of the simulated program. COTSon features simulation models at several levels of detail and supports sampling. Sampling reduces simulation time by simulating in detail only the representative phases of a program and is introduced in detail later in this chapter.
[...] consumption and thermal behavior. ESESC, an extension of the SESC simulator, is the first simulator applying time-based sampling to the simulation of multi-threaded applications. We elaborate more on time-based sampling in Section 2.4.2.