Supporting multiple coherence implementations enables software to dynamically select the appropriate mechanism for blocks of memory. Supporting incoherent regions of memory allows more scalable hardware by reducing the number of shared lines, resulting in fewer coherence messages and less directory overhead for tracking sharer state. Furthermore, having coherence as an option enables trade-offs to be made regarding software design complexity and performance. Even for applications that do not need data to transition frequently between SWcc and HWcc, a hybrid memory model provides the runtime with a mechanism for managing coherence needs across applications. Put another way, hardware cache coherence allows the runtime or operating system to put memory into a consistent state even when software performs an incorrect action.
From the perspective of system software, HWcc has many benefits. HWcc allows for migratory data patterns not easily supported under SWcc. As discussed in Chapter 3, the implementation of Rigel’s runtime system, the Rigel Task Model (RTM), required coherence. The lack of hardware coherence for all memory led to the use of global memory operations, which are not locally cacheable, resulting in increased bandwidth use and latency for RTM memory accesses that are otherwise amenable to caching.
Threads that sleep on one core and resume execution on another would need to have their local modified stack data available, forcing coherence actions at each thread swap under SWcc. Likewise, task-based programming models [79, 80] are aided by coherence. HWcc allows child tasks to be scheduled on the same core as their parent or, when they run on a different core, allows their data to be pulled using HWcc.
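The cost of a thread swap under SWcc can be made concrete with a small model. The following is a minimal sketch, not Rigel's or Cohesion's implementation: `ToyCache` and `migrate_thread` are hypothetical names, each address stands for one cache line, and the caches are assumed to be private write-back caches with no hardware coherence.

```python
class ToyCache:
    """Hypothetical private write-back cache with no hardware coherence."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store: line address -> value
        self.lines = {}        # line address -> (value, dirty flag)

    def load(self, addr):
        if addr not in self.lines:
            # Miss: fetch the line from memory as a clean copy.
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        # The write stays local until software flushes it.
        self.lines[addr] = (value, True)

    def flush(self, addr):
        """Software write-back: push a dirty line to memory."""
        if addr in self.lines and self.lines[addr][1]:
            value, _ = self.lines[addr]
            self.memory[addr] = value
            self.lines[addr] = (value, False)

    def invalidate(self, addr):
        """Software invalidation: drop the local copy."""
        self.lines.pop(addr, None)


def migrate_thread(src, dst, stack_addrs):
    """Coherence actions software must issue when a thread moves cores."""
    for addr in stack_addrs:
        src.flush(addr)        # make modified stack data visible in memory
        src.invalidate(addr)   # the old core must not keep a copy
        dst.invalidate(addr)   # discard any stale copy on the new core
```

Without the call to `migrate_thread`, a load on the destination core could return a stale value or miss the dirty data held by the source core. Under HWcc the same lines would be pulled on demand with no explicit software action.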
Systems-on-a-chip, which incorporate accelerators and general-purpose cores, are available commercially [104] and are a topic of current research [16]. The assortment of cores makes supporting multiple memory models on the same chip attractive. A hybrid approach allows for cores without HWcc support, such as accelerator cores, to cooperate with cores that do have HWcc support and interface with coherent general-purpose cores. While a single address space is not a requirement for heterogeneous systems, as demonstrated by the Cell processor [57] and GPUs [4], it may aid in portability and programmability by extending the current shared memory model to future heterogeneous systems. A hybrid approach allows for HWcc to be leveraged for easier application porting from conventional shared memory machines and easier debugging for new applications. SWcc could then be used to reduce the stress on the hardware coherence mechanisms to improve performance.
8.1.4 Summary
Hardware-managed and software-managed cache coherence offer both advantages and disadvantages for applications and system software. We list many of the trade-offs in Table 8.2. A hybrid memory model such as Cohesion leverages the benefits of each while mitigating the negative effects of the respective models. The key benefits from SWcc are reduced network and directory costs and the potential to avoid false sharing without programmer intervention. The key benefit from HWcc is its ability to share data without explicit software actions, which, as we demonstrate, can be costly in terms of message overhead and instruction stream inefficiency. A hybrid approach can enable scalable hardware-managed coherence by supporting HWcc for the regions of memory that require it and using SWcc for data that do not. In comparison to a software-only approach, a hybrid memory model makes coherence management an optimization opportunity and not a correctness burden.

Table 8.2: Trade-offs for HWcc, SWcc, and Cohesion.

| | Programmability | Network Constraints | On-die Storage |
|---|---|---|---|
| HWcc | Conventional CMP shared-memory paradigm; supports fine-grained, irregular sharing without relying on compiler or programmer for correctness | Potential dependences handled by hardware instead of extra instructions and coherence traffic | Optimized for HWcc: when HWcc desired, coherence data stored efficiently |
| SWcc | Used in accelerators; provides programmer/compiler control over sharing | Eliminates probes/broadcasts for independent data, e.g., stack, private, immutable data | Optimized for SWcc: minimal hardware overhead beyond hardware-managed caches |
| Cohesion | Supports HWcc and SWcc; clear performance optimization strategies allowing SWcc ⇔ HWcc transitions | SWcc used to eliminate traffic for coarse-grain/regular sharing patterns; HWcc for unpredictable dependences | Reduces pressure on HWcc structures; enables hardware design optimizations based on HWcc and SWcc needs |
8.2 Design
Cohesion provides hardware support and a protocol to allow data to migrate between coherence domains at runtime, with fine granularity, and without the need for copy operations. Figure 8.1 shows the relationship between the HWcc and SWcc protocols. The default behavior for Cohesion is to keep all of memory coherent in the HWcc domain. Software can alter the default behavior by modifying tables in memory that control how the system enforces coherence. Data that are not shared, or that can have coherence handled at a coarse granularity by software, use the SWcc domain and no hardware coherence management is applied.
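One way to picture a software-writable table that selects the coherence domain is a bitmap with one bit per cache line. The following is a minimal sketch under stated assumptions, not the table format Cohesion defines: the class name `CoherenceTable`, the 32-byte line size, and the encoding (0 for HWcc, 1 for SWcc) are all hypothetical. It models only the lookup, not the protocol needed to move a line safely between domains.

```python
LINE_BYTES = 32  # assumed cache-line size, chosen only for illustration


class CoherenceTable:
    """Hypothetical per-line bitmap: bit = 0 means HWcc, bit = 1 means SWcc."""

    def __init__(self, mem_bytes):
        n_lines = (mem_bytes + LINE_BYTES - 1) // LINE_BYTES
        # All bits start at zero, so every line defaults to the HWcc domain.
        self.bits = bytearray((n_lines + 7) // 8)

    def _locate(self, addr):
        line = addr // LINE_BYTES
        return line // 8, 1 << (line % 8)

    def _lines_in(self, start, length):
        # Yield the aligned address of every line touched by the byte range.
        if length <= 0:
            return
        first = start - start % LINE_BYTES
        yield from range(first, start + length, LINE_BYTES)

    def set_swcc(self, start, length):
        """Software opts a byte range out of hardware coherence."""
        for addr in self._lines_in(start, length):
            byte, mask = self._locate(addr)
            self.bits[byte] |= mask

    def set_hwcc(self, start, length):
        """Software returns a byte range to hardware coherence."""
        for addr in self._lines_in(start, length):
            byte, mask = self._locate(addr)
            self.bits[byte] &= 0xFF ^ mask

    def domain(self, addr):
        byte, mask = self._locate(addr)
        return "SWcc" if self.bits[byte] & mask else "HWcc"
```

Because the table works at line granularity, marking a range that starts or ends mid-line moves the whole line. A runtime using a scheme like this would need to align the data it places in the SWcc domain.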
The rest of this section describes the protocols and hardware support required for Cohesion that enable on-the-fly coherence domain changes. Note that with minor restrictions, the selected hardware and software protocols used by Cohesion could be exchanged for other implementations, but the basic technique provided by this work would remain the same.

Figure 8.1: Cohesion state diagram.
SWcc states: Clean (SWCL), Immutable (SWIM), Private Clean (SWPC), Private Dirty (SWPD). HWcc states: Invalid (HWI), Shared (HWS), Modified (HWM).
Events: LD (load), ST (store), INV (software invalidation), WB (software flush), WrReq (write request, ST), RdReq (read request, LD), WrRel (write back dirty line), RdRel (release line, invalidate locally).
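The HWcc state names and messages in the legend of Figure 8.1 (HWI, HWS, HWM; RdReq, WrReq, RdRel, WrRel) match a conventional MSI-style protocol. The following is a minimal sketch of such a protocol for a single line with a single requester. It is an assumption based on that convention, not a transcription of the edges in Figure 8.1. The table `TRANSITIONS` and the function `step` are hypothetical names, and the SWcc states and the SW-to-HW transitions are deliberately left out.

```python
from enum import Enum


class HWState(Enum):
    HWI = "Invalid"
    HWS = "Shared"
    HWM = "Modified"


# Assumed MSI-style transitions for one line, seen from a single requester.
TRANSITIONS = {
    (HWState.HWI, "RdReq"): HWState.HWS,  # load miss obtains a shared copy
    (HWState.HWI, "WrReq"): HWState.HWM,  # store miss obtains ownership
    (HWState.HWS, "WrReq"): HWState.HWM,  # upgrade from shared to modified
    (HWState.HWS, "RdRel"): HWState.HWI,  # release line, invalidate locally
    (HWState.HWM, "WrRel"): HWState.HWI,  # write back dirty line
}


def step(state, message):
    """Return the next state, or raise if the message is not modeled here."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not modeled in state {state.name}")
```

A load followed by a store and an eviction walks the line through HWI, HWS, HWM, and back to HWI. A full model would also track the sharer set at the directory, which this sketch omits.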