RESULTADOS DE LAS PRUEBAS
ILUSTRACIÓN 18 : RESULTADO PRUEBA
Our current implementation of interval simulation employs a function- al-first simulation approach. This means that the functional simula- tor generates a dynamic instruction stream, including user-level and system-level code, that is subsequently fed into the timing simulator. This implies that interval simulation does not simulate instructions along mispredicted paths, hence, it does not capture the positive or negative interference that wrong-path instructions may have on per- formance, e.g., through cache prefetching effects. Moreover, function- al-first interval simulation may lead to different thread interleavings than what may happen in real systems.
A more accurate approach is to build a timing-directed simulator in which the timing simulator directs the functional simulator along mis- predicted paths and determines thread interleavings. Unfortunately, timing-directed simulators are more difficult to develop. In our cur- rent implementation we opted for functional-first simulation because of its ease of development—this is a trade-off in development time, eval- uation time and accuracy—and our evaluation shows good accuracy against the cycle-accurate simulator. Implementing timing-directed in- terval simulation may be a course of future work.
benchmark input
blackscholes <cores> 4096
bodytrack sequenceB 1 4 1 1000 5 0 <cores>
canneal <cores> 10000 2000 100000.nets
dedup -c -p -f -t <cores> -i medias.dat -o output.dat.ddp
fluidanimate <cores> 5 in 35K.fluid out.fluid
streamcluster 10 20 32 4096 4096 1000 none output.txt <cores>
swaptions -ns 16 -sm 5000 -nt <cores>
vips im benchmark pomegranate 1600x1200.v output.v
x264 –quiet –qp 20 –partitions b8x8,i4x4 –bframes 3
–ref 5 –direct auto –b-pyramid –b-rdo –weightb –bime –mixed-refs –no-fast-pskip –me umh –subme 7 –analyse b8x8,i4x4 –threads <cores>
-o eledream.264 eledream 640x360 8.y4m
Table 6.1: The multithreaded PARSEC benchmarks and their reference inputs
used in the full-system simulation experiments.
An obvious limitation of interval simulation is that it does not model core-level performance in a detailed cycle-accurate way. Again, this choice is motivated by our goal to improve simulator development time and evaluation time.
Another limitation is that the timing of miss events may be different compared to real systems. In particular, interval simulation walks the pre-execution instruction window upon a long-latency load and sim- ulates all miss events that are independent of the long-latency load in the same cycle, which may be different from what may happen in a real system. This is a reasonable assumption, because the miss events that are independent of the long-latency load are hidden anyway.
6.3 Experimental setup
We use two benchmark suites, namely SPEC CPU2000 and PARSEC [5]. We use all of the SPEC CPU2000 benchmarks with the reference inputs in our experimental setup, see Table 4.1. The binaries of the CPU2000 benchmarks were taken from the SimpleScalar website; these binaries were compiled for Alpha using aggressive compiler optimizations. We considered one hundred million simulation points as determined by SimPoint [67] in all of our experiments in order to limit overall cycle- accurate simulation time—this is exactly the problem tackled by inter-
6.3 Experimental setup 111
Processor core
ROB 256 entries
issue queue 128 entries
load-store queue 128 entries
store buffer 64 entries
processor width decode, dispatch and commit 4 wide
issue 6 wide fetch 8 wide
functional units 4 integer, 4 load/store and 4 floating-point
functional unit latencies load (2), mul (3), fp (4), div (20)
fetch queue 16 entries
front-end pipeline depth 7 stages
branch predictor 12 Kbit local predictor,
8-way set-assoc 2 K-entry BTB, 32-entry RAS Memory subsystem
L1 I-cache 32 KB 4-way set-assoc 64 B line size
L1 D-cache 32 KB 4-way set-assoc 64 B line size
L2 cache unified, 4 MB 8-way set-assoc 64 B line size,
12 cycles access latency
coherence protocol MOESI
main memory 150 cycle access time
memory bandwidth 10.6 GB/s peak bandwidth
Table 6.2: Baseline processor core model assumed in our experimental setup;
simulated CMP architectures share the L2 cache.
val simulation. When simulating multiprogram workloads, we stop simulation as soon as one of the co-executing programs has executed one hundred million instructions.
In addition to the single-threaded user-level SPEC CPU bench- marks, we also use the multithreaded PARSEC benchmarks which spend a substantial fraction of their execution time in system code. We use 9 of the 13 PARSEC benchmarks that run on our simulator with the small input set, see Table 6.1, and run each benchmark to completion; the number of dynamically executed instructions per benchmark varies between five hundred million to thirteen billion instructions. The PAR- SEC benchmarks were compiled using the GNU C compiler for Alpha; we use aggressive optimization, including -O3, loop unrolling and software prefetching.
We use the M5 simulator [6] in all of our experiments. The SPEC CPU benchmarks are run in user-level simulation mode, and the PAR-
SEC benchmarks are run in full-system simulation mode.
Our baseline core microarchitecture is a 4-wide superscalar out-of- order core, see Table 6.2. When simulating a CMP, we assume that all cores share the L2 cache as well as the off-chip bandwidth for accessing main memory, and we assume a MOESI cache coherence protocol. For the multiprogram workloads, we were able to simulate up to 8 cores; physical memory constraints limited us from running larger multicore processor configurations. For the multithreaded workloads, we also run up to 8 cores; the benchmarks run on top of the 2.6.8.1 Linux kernel.
6.4 Evaluation
We now evaluate interval simulation in terms of development cost, ac- curacy and simulation speed. Accuracy is evaluated through a number of experiments, and we consider single-threaded workloads, homo- geneous and heterogeneous multiprogram workloads, multithreaded workloads, and a performance trend case study.