Cycle-level simulators for modern multi-core CPUs can be distinguished along the following general execution characteristics:

• trace-driven vs execution-driven

• clocked simulators vs discrete event simulators

• accelerated simulation availability

In addition to differences in the general simulation methodology, the simulated models may differ. The main components we are looking at are the CPU and the interconnect / memory hierarchy. CPU core models range from non-timing models and fixed-IPC models, through pipelined in-order core models, to very detailed out-of-order core models. Memory systems generally vary in the number of supported agents (single-core vs multi-core), the coherency protocol (fixed vs variable), and the topology (fixed vs flexible). Simulation fidelity in memory hierarchies differs with respect to modelling both latencies (fixed vs actual queueing) and bandwidths (fixed vs limited bandwidth and back-pressure). Finally, one significant distinction between memory hierarchies is whether they actually send and store data in memory requests and caches, or whether those structures are used exclusively for modelling placement, timing and bandwidth of requests.
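The difference between fixed latencies and actual queueing can be illustrated with a small sketch. It is not taken from either simulator; the types `FixedLatencyLink` and `QueuedLink` and their fields are hypothetical names chosen for illustration. The fixed model adds a constant to the arrival time, whereas the queued model additionally serialises requests on a link with limited bandwidth, so a burst of requests sees growing delays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical fixed-latency model: every request takes the same time,
// regardless of how many other requests are in flight.
struct FixedLatencyLink {
    std::uint64_t latency;
    std::uint64_t completion(std::uint64_t arrival) const {
        return arrival + latency;
    }
};

// Hypothetical queued model: the link is busy for `occupancy` cycles per
// request (limited bandwidth), so later requests wait for earlier ones.
struct QueuedLink {
    std::uint64_t latency;    // cycles from start of transfer to completion
    std::uint64_t occupancy;  // cycles the link is blocked per request
    std::uint64_t next_free;  // first cycle at which the link is idle again

    std::uint64_t completion(std::uint64_t arrival) {
        assert(occupancy > 0);
        std::uint64_t start = std::max(arrival, next_free);  // queueing delay
        next_free = start + occupancy;
        return start + latency;
    }
};
```

With a latency of 10 cycles and an occupancy of 4 cycles, two requests arriving at cycles 0 and 1 complete at cycles 10 and 11 in the fixed model, but at cycles 10 and 14 in the queued model.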

In this work, I have extended the PTLsim and Marss86 simulators [135, 253], with the latter derived from the former. They both feature a very detailed out-of-order core model with pipeline stages, reservation stations, replay, speculation, etc., faithfully modelling a modern out-of-order microprocessor core.

They also support seamless switching between fast emulation (Xen / QEMU) and detailed simulation. In addition to full-system simulation, both offer user-level-only simulation. Due to its limitations (no OS involvement, paging, or interrupts, and deficits in multi-threading), I have not made extensive use of this feature and instead focus on full-system execution.

Simulation is driven by time: every component of the simulator is clocked on every cycle and consults its data structures to see whether waiting entries are ready to progress to the next stage of execution, need sending on the bus, etc.
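A minimal sketch of this clocked style is given below. It is an illustration only and not PTLsim or Marss86 code; `Component`, `DelayStage` and `run_cycles` are hypothetical names. The point is that every component is polled on every cycle, whether or not any of its entries is ready.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interface: each component is ticked once per simulated cycle.
struct Component {
    virtual ~Component() = default;
    virtual void clock(std::uint64_t now) = 0;
};

// A stage whose entries become ready after a fixed number of cycles.
struct DelayStage : Component {
    std::vector<int> remaining;  // cycles left for each waiting entry
    int completed = 0;           // entries that progressed to the next stage

    void enqueue(int latency) {
        assert(latency > 0);
        remaining.push_back(latency);
    }

    void clock(std::uint64_t /*now*/) override {
        // Scan all waiting entries on every cycle, even if none is ready.
        for (std::size_t i = 0; i < remaining.size();) {
            if (--remaining[i] == 0) {
                remaining[i] = remaining.back();  // swap-remove finished entry
                remaining.pop_back();
                ++completed;
            } else {
                ++i;
            }
        }
    }
};

// Advance all components in lock-step; returns the new current cycle.
inline std::uint64_t run_cycles(std::vector<Component*>& parts,
                                std::uint64_t start, std::uint64_t n) {
    for (std::uint64_t c = start; c < start + n; ++c)
        for (Component* p : parts) p->clock(c);
    return start + n;
}
```

A discrete event simulator would instead skip directly to the cycle of the next scheduled event; the clocked variant trades that efficiency for simplicity.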

Both simulators decode user-level and privileged instructions of the AMD64 instruction set and break them up into micro-ops (µops) that flow through the pipeline. The various pipeline stages can process multiple instructions per cycle (super-scalar), and issue / execution of µops can start / complete out-of-order. A reorder buffer (ROB) keeps track of in-flight instructions, and registers are handled through a standard renaming table.

The core model consists of a variety of functional units with easily configurable execution capacity. µops can be specified to require specific functional units for a specific number of cycles, and communication latencies for the bypass networks between functional units can be customised, too.
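One plausible way to express such functional-unit requirements is with bit masks, in the spirit of the mask expression `fuinfo[opcode] & cluster.fu_mask & fu_avail` that appears in the issue code of Figure 5.15. The sketch below is an assumption about how such masks could be combined, not the simulators' implementation; `pick_fu` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns the index of the lowest-numbered functional unit that
//  (a) can execute the µop        (uop_fu_mask),
//  (b) belongs to the cluster     (cluster_fu_mask), and
//  (c) is free in this cycle      (fu_avail),
// or -1 if no such unit exists and the µop has to wait.
inline int pick_fu(std::uint32_t uop_fu_mask,
                   std::uint32_t cluster_fu_mask,
                   std::uint32_t fu_avail) {
    std::uint32_t candidates = uop_fu_mask & cluster_fu_mask & fu_avail;
    if (candidates == 0) return -1;
    int idx = 0;
    while ((candidates & 1u) == 0) {  // find the lowest set bit
        candidates >>= 1;
        ++idx;
    }
    assert(idx < 32);
    return idx;
}
```

For example, a µop that may run on units 1 and 2 (mask 0x6) is assigned unit 2 when only unit 2 is free (availability mask 0x4).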

Memory instructions flow through a load-store queue and can also be executed aggressively out-of-order. The memory hierarchy then dictates how many cycles a load / store requires.

Aside from general infrastructure differences between PTLsim and Marss86 (Xen vs QEMU), the memory hierarchy is the main difference between the two simulators. PTLsim models a separate cache hierarchy per core, simulating L1i, L1d, L2 and L3 caches (physically indexed and tagged, inclusive, strictly per-core). These are connected to a fixed-latency DRAM. By default, caches were not coherent, i.e., entries could be present in multiple caches in conflicting states. Marss86 heavily extended this memory model through earlier work on MPTLsim, which appeared in [208]. This improved cache model adds the following new features to the already very detailed core model from PTLsim: proper bandwidth modelling of caches and interconnects; coherency messages that ensure coherence and model delays due to interactions in the coherency protocol; multiple coherency protocols (I used the MOESI configuration); directories to keep track of cache state in specific core-local caches; and large shared caches.

Both memory subsystems rely on a single flat physical backing store to simplify the correctness of the coherency protocol. This means that in addition to the timing interactions (querying caches, sending out requests, etc.), loads and stores will access the single shared flat memory to read / write the data in question. This will ensure that for every address there is always only a single value that corresponds to the most recently written version of this address, irrespective of the coherence protocol actually implemented in the timing layer.
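The separation between the timing layer and the data can be sketched as follows. This is a simplified illustration under stated assumptions (64-byte lines, 8-byte aligned word accesses, made-up latencies); `TagOnlyCache`, `BackingStore` and `do_load` are hypothetical names and do not correspond to simulator classes. The cache only tracks which lines are present and therefore only influences latency, while every value is read from the single flat store.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Timing-only cache: remembers which lines are present, stores no data.
struct TagOnlyCache {
    std::unordered_set<std::uint64_t> lines;
    static std::uint64_t line_of(std::uint64_t addr) { return addr >> 6; }  // assumed 64-byte lines
    bool probe(std::uint64_t addr) const { return lines.count(line_of(addr)) != 0; }
    void fill(std::uint64_t addr) { lines.insert(line_of(addr)); }
};

// Single flat store shared by all cores; holds the only copy of every value.
struct BackingStore {
    std::unordered_map<std::uint64_t, std::uint64_t> mem;
    std::uint64_t load(std::uint64_t addr) const {
        assert((addr & 7) == 0);  // sketch handles aligned 8-byte words only
        auto it = mem.find(addr);
        return it == mem.end() ? 0 : it->second;
    }
    void store(std::uint64_t addr, std::uint64_t value) {
        assert((addr & 7) == 0);
        mem[addr] = value;
    }
};

struct LoadResult {
    std::uint64_t value;
    int latency;
};

// The value always comes from the backing store; the cache decides the latency.
inline LoadResult do_load(TagOnlyCache& cache, const BackingStore& mem,
                          std::uint64_t addr) {
    constexpr int kHitLatency = 3, kMissLatency = 100;  // illustrative numbers
    bool hit = cache.probe(addr);
    if (!hit) cache.fill(addr);
    return {mem.load(addr), hit ? kHitLatency : kMissLatency};
}
```

Note that a load which hits in a stale cache entry still returns the most recently stored value, because the value never lived in the cache in the first place.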

This simplification guarantees basic coherence properties essentially for free and greatly reduces the burden on the implementation and verification of the coherence protocol. It does, however, also lead to unfortunate side-effects in specific timing conditions that cause very unrealistic simulation of tight intra-thread communication. I will explore this issue further in Section 5.4.4.

In earlier work, I configured and tweaked the core and (single-core) memory hierarchy to produce timing / performance results similar to an AMD K8 CPU [158, 159]. For this thesis, I updated the adaptations so that the core matches an AMD Family 10h processor core (AMD Phenom / Phenom II).

Detailed Load Path Walkthrough

The load path of PTLsim is relatively straightforward, and its key code pieces are depicted in Figure 5.15. When a load µop (represented as a ROBEntry, i.e., a reorder-buffer entry) gets to the issue stage, it checks, like all instructions, that all its input operands (used for address calculation) are available. It then computes the effective virtual address and translates it to a physical address. Immediately after, the simulator loads the data from the single global physical address view (with loadphys).

Only after that are the various timing conditions checked (is there an L1 data cache bank conflict, are there any unresolved or overlapping earlier stores, are there any held bus locks). Finally, the load queries the TLB for a present translation entry and subsequently probes the L1 data cache directly.

Depending on the hit / miss information, the load is put on the issued / miss list and is assigned the appropriate latency value. The global PTLsim clock function clocks all components, and waiting misses decrement their remaining cycles. When they reach zero, a wake-up callback marks the destination register ready so that dependent instructions can consume the value. While there is a separation between the core and the memory hierarchy, L1 cache hits are handled through fast-path logic in the simulator for efficiency. The slow path of the memory hierarchy is shown in Figure 5.16, which details the rather explicit state machine and fixed layout of the cache hierarchy.

In the original PTLsim, multiple cores had completely independent cache hierarchies. Thanks to the data-less cache hierarchy, this was not a problem, as loads and stores all hit the same single global physical data store directly (and in sequence). For simulating the first-order effects of coherence, I added invalidations and cache-to-cache transfers by looking up in caches of the neighbouring hierarchies (in probe_other_caches) and invalidating their entries (in zero time).
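The zero-time probing of neighbouring hierarchies can be sketched as below. The function name `probe_other_caches` is taken from the text, but its signature and the `CoreCaches` type are assumptions made for this illustration; the real function operates on PTLsim's cache structures. The return value indicates whether any other core held the line, which the caller can use to select a cache-to-cache transfer latency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical per-core view: the set of line addresses present anywhere
// in that core's private cache hierarchy (tags only, no data).
struct CoreCaches {
    std::unordered_set<std::uint64_t> lines;
};

// Look up `line` in the hierarchies of all cores except `self`.
// If `invalidating` is set, remote copies are dropped immediately,
// i.e. the invalidation itself consumes no simulated time.
inline bool probe_other_caches(std::vector<CoreCaches>& cores, std::size_t self,
                               std::uint64_t line, bool invalidating) {
    assert(self < cores.size());
    bool found = false;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        if (i == self) continue;
        if (cores[i].lines.count(line) != 0) {
            found = true;
            if (invalidating) cores[i].lines.erase(line);
        }
    }
    return found;
}
```

Because the data lives only in the global physical store, dropping a remote tag is sufficient; no dirty data has to be written back.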

    void ROBEntry::issue() {
        /* Check functional unit availability: */
        fuinfo[opcode] & cluster.fu_mask & fu_avail
        /* load data from source registers */
        if (ld) issueload(..);
        if (asf) asf_pipeline_intercept.issue(..);
    }

    void ROBEntry::issueload(..) {
        physaddr = addrgen(..);
        data = loadphys(physaddr);
        asf_pipeline_intercept.issue_probe_and_merge(physaddr, data);
        /* Check store queue for earlier overlapping stores and fences */
        /* Perform forwarding if possible, replay otherwise */
        asf_pipeline_intercept.issue_load(..);
        /* Check L1 bank conflicts */
        /* Allocate entry on load queue */
        /* Check bus lock of others, acquire if necessary */
        /* Handle misaligned / small loads: merge two parts, shift data */
        caches.dtlb.probe(addr);
        probecache(addr);
    }

    void ROBEntry::probecache(..) {
        bool hit = caches.probe_cache_and_sfr(addr);
        if (hit) {
            cycles_left = LOADLAT;
            changestate(rob_issued_list);
            if (invalidating) caches.probe_other_caches(addr);
            return;
        }
        changestate(rob_cache_miss_list);
        caches.issueload_slowpath(addr);
        if (invalidating) caches.probe_other_caches(addr);
    }

    void OOOCore::dcache_wakeup(resp) {
        rob = resp.rob;
        rob->physreg->complete();
        rob->lsq->datavalid = 1;
    }

Figure 5.15: Key core-side code used in the PTLsim load path.

Finally, the code excerpt shows the various integration hooks of the normal pipeline into the ASF pipeline extensions that I will describe in the next section.

In the PTLsim-derived simulator Marss86, the core side of handling loads looks very similar. The key changes are: data is not loaded at issue time, but instead when the load completes; there are fewer alignment checks (to accelerate simulation); and the memory hierarchy is actuated through a data-driven request interface, including for L1 accesses. Figure 5.17 highlights the key source code constructs.

    void CacheHierarchy::issueload_slowpath(addr) {
        L1hit = L1.probe(addr);
        L2hit = L2.probe(addr);
        missbuf.initiate_miss(addr, L2hit);
    }

    MissBuffer::initiate_miss(addr, L2hit) {
        int idx = find(addr);
        if (idx >= 0) {
            /* Merge request with existing miss */
            return;
        }
        /* Handle full condition */
        /* Create new entry */
        Entry &mb = missbuf[..];
        bool L3hit = hierarchy.L3.probe(addr);
        if (L2hit || L3hit) {
            mb.cycles = ..; // L2 / L3 LATENCY
            return;
        }
        if (probe_other_caches(addr))
            mb.cycles = CROSS_CACHE_LATENCY;
        else
            mb.cycles = MAIN_MEM_LATENCY;
        return;
    }

    MissBuffer::clock() {
        foreach (mb) {
            mb.cycles--;
            if (mb.cycles == 0) {
                /* Install in L1 / L2 / L3 */
                /* Switch to next state */
                if (mb.state == L1) {
                    /* Set HTM bits */
                    lfrq.wakeup(mb.entry);
                }
            }
        }
    }

    LFRQ::wakeup(entry) {
        ready[entry] = true;
    }

    LFRQ::clock() {
        foreach (entry) {
            /* Only ready entries are woken up and freed */
            if (ready[entry]) {
                callback->dcache_wakeup(entry.req);
                ready[entry] = false;
                free[entry] = true;
            }
        }
    }

Figure 5.16: Slow path of the PTLsim memory hierarchy.

    ROB::issueload(..) {
        /* as before ... */
        /* do NOT load data here */
        req = memHier->get_free_req();
        req->init(.., MEMORY_OP_READ);
        req->set_coreSignal(dcache_wakeup);
        bool L1hit = memHier->access_cache(req);
        if (L1hit)
            dcache_wakeup(req);
        else
            physreg->changestate(WAITING);
        return;
    }

    OOOCore::dcache_wakeup(req) {
        if ((req->get_type() == ASF_ABORT) ||
            (req->get_type() == ASF_COMMIT)) {
            asf_pipeline_intercept.cache_done();
        }
        ..
        rob = req->rob;
        /* Data is bound only now, when the load completes */
        data = loadvirt(rob->lsq->virtaddr);
        asf_pipeline_intercept.load_binds_data(rob, &data);
        /* Check for store-to-load forwarding */
        rob->lsq->data = extract_bytes(data, ..);
        rob->physreg->complete();
        rob->lsq->datavalid = 1;
    }

    MemHier::access_cache(req) {
        ret = cpuController->access_fast_path(req);
        /* Success: L1 hit or write */
        if ((ret == 0) || (req->type == WRITE))
            return true;
        return false;
    }

Figure 5.17: Core logic and memory system glue logic used in Marss86. Notice how the code is similar to that of PTLsim.

The memory hierarchy of Marss86 is a complete rewrite: it supports directories, a flexible cache structure, realistic timing for all coherence interactions, and pluggable coherence modules that work independently of the cache structures. All communication between components is represented as messages / requests, and the wake-up actions are chained together by timed call-back functions. Further, the simulated components are split into cache controllers, interconnects, and signals (serving as the timed call-backs). Surprisingly, there is no direct distinction between requests and responses; instead, the direction of travel of a message (arriving from an upper or lower interconnect port) defines whether it is a request or a response. In addition, despite the elaborate modelling of timing of the cache tags, the caches still do not contain any data. While the pluggable coherence protocol module determines timing, the actual value coherence is again provided by a single global physical memory view. Figure 5.18 shows the high-level connection between the various interconnect and cache call-backs and how they are chained together depending on a hit / miss in specific caches. Note how the overall logic is distributed between the different caches; that way, caches can make decisions locally and can easily be connected in different topologies.
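The idea of chaining wake-up actions through timed call-backs can be sketched with a small event queue. This is a generic illustration, not the Marss86 signal implementation: `EventQueue`, `add_event` with this signature, `Port`, `Kind` and `classify` are all hypothetical. The `classify` helper mirrors the rule that a message arriving from the upper port is treated as a request, while anything arriving from the lower port is a response (or snoop traffic).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Event {
    std::uint64_t when;            // cycle at which the call-back fires
    std::uint64_t seq;             // tie-breaker: FIFO among same-cycle events
    std::function<void()> action;  // the timed call-back
};

struct Later {  // orders the priority queue so the earliest event is on top
    bool operator()(const Event& a, const Event& b) const {
        return a.when != b.when ? a.when > b.when : a.seq > b.seq;
    }
};

class EventQueue {
public:
    // Schedule `action` to run `delay` cycles after the current cycle.
    void add_event(std::uint64_t delay, std::function<void()> action) {
        q_.push(Event{now_ + delay, next_seq_++, std::move(action)});
    }
    // Fire all events due up to and including `cycle`; call-backs may
    // schedule follow-up events, which chains the actions together.
    void run_until(std::uint64_t cycle) {
        assert(cycle >= now_);
        while (!q_.empty() && q_.top().when <= cycle) {
            Event e = q_.top();
            q_.pop();
            now_ = e.when;
            e.action();
        }
        now_ = cycle;
    }
    std::uint64_t now() const { return now_; }

private:
    std::uint64_t now_ = 0, next_seq_ = 0;
    std::priority_queue<Event, std::vector<Event>, Later> q_;
};

enum class Port { Upper, Lower };
enum class Kind { Request, Response };

// Direction of travel decides the message kind (lower side also carries snoops).
inline Kind classify(Port arrived_on) {
    return arrived_on == Port::Upper ? Kind::Request : Kind::Response;
}
```

A cache access followed by a hit call-back would, for instance, be two chained events: the first fires after the interconnect delay and schedules the second after the cache access latency.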

Detailed Store Path Walkthrough

As opposed to loads, PTLsim and Marss86 handle stores at the commit stage of the pipeline, and their code is generally quite simple. Figure 5.19 shows the abbreviated code of PTLsim. Interestingly, stores do not actually interact with the cache hierarchy; instead, they write directly to the global physical memory view. Similar to loads, I added simple first-order coherence (or rather, its timing effects) by snooping the caches of other hierarchies and invalidating their copies for local stores in zero time. Furthermore, the code excerpt shows the hooks for the ASF pipeline integration layer.

    CacheController::handle_interconnect_cb(msg) {
        if (msg->sender == upperInter) {
            /* This is a request */
            /* Check space */
            /* Allocate pending request */
            deps = find_deps(msg->req);
            if (deps) {
                /* Deal with dependencies */
            } else {
                cache_access_cb(..);
            }
        } else {
            /* This is a response or snoop req / snoop resp */
            entry = find_match(msg);
            if (entry) {
                if (msg->hasData)
                    return complete_request(msg);
                else
                    return handle_response(msg);
            }
            /* Handle the remainder in the future */
            add_event(cache_access_cb, ..);
        }
    }

    CacheController::cache_access_cb(req) {
        if (req->isASF()) {
            /* Handle commit / abort messages */
            add_event(asfAbort, commitLatency, req); // or asfCommit
        }
        hit = probe(req);
        if (hit)
            signal = cache_hit_cb;
        else
            signal = cache_miss_cb;
        /* The hit / miss handler fires after the cache access latency */
        add_event(signal, cacheAccessLatency_, req);
    }

    CacheController::cache_hit_cb(req) {
        if (req->isSnoop)
            coherence_logic_->handle_interconn_hit(req);
        else
            coherence_logic_->handle_local_hit(req);
    }

    CacheController::cache_miss_cb(req) {
        if (req->isSnoop)
            coherence_logic_->handle_interconn_miss(req);
        else
            coherence_logic_->handle_local_miss(req);
    }

Figure 5.18: Chaining of interconnect and cache call-backs in Marss86.

    ROBEntry::commit() {
        /* Scan the ROB for all sub-uops of the current x86 instruction */
        /* Check for and handle exceptions */
        if (uop.opcode == OP_st) {
            lock = interlocks.probe(addr);
            if (lock)
                { /* Deal with bus locks */ }
            asf_pipeline_intercept.issue_probe_and_merge(addr);
        }
        if (uop.is_asf)
            asf_pipeline_intercept.commit(..);
        /* Update registers */
        if (uop.opcode == OP_st)
            caches.commitstore(lsq, addr);
        /* Cleanup */
    }

    CacheHierarchy::commitstore(lsq, addr) {
        storemask(addr, lsq.data, lsq.mask); // writes to backing store
        probe_other_caches(addr);
    }

Figure 5.19: Path for stores becoming visible in memory in PTLsim.
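The `storemask` call in Figure 5.19 writes the store data to the backing store under a mask. Assuming the mask selects individual bytes of an 8-byte word (an assumption made for this sketch; the excerpt does not spell out the mask semantics), the merge step could look like the following. The helper name `merge_masked` is hypothetical.

```cpp
#include <cassert>  // used by the usage example below
#include <cstdint>

// Merge `data` into `old_word`: byte i of the result is taken from `data`
// if bit i of `bytemask` is set, and from `old_word` otherwise.
inline std::uint64_t merge_masked(std::uint64_t old_word, std::uint64_t data,
                                  std::uint8_t bytemask) {
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        if (bytemask & (1u << i)) bits |= 0xFFull << (8 * i);
    return (old_word & ~bits) | (data & bits);
}
```

For example, a 4-byte store to the low half of a word uses the mask 0x0F: merging 0xAABBCCDDEEFF0011 into 0x1122334455667788 yields 0x11223344EEFF0011.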

In Marss86, the pipeline integration is again very similar. Instead of writing to the backing store directly, however, the store is sent as a proper request to the memory hierarchy, as shown in Figure 5.20. Subsequently, the core still updates the global physical memory view.
