Memory System Details One of the obvious bottlenecks of TCC is the single commit lock which forces serialisation of commits. Chafi, et al, extend the single commit point by allowing commits to happen in
parallel if they are to different addresses (clearly) and to different directories [138]. Overall, the protocol stays pretty true to TCC, by executing transactions to completion, collecting read and write sets, and sending them to the responsible directories for commit. These in turn use a two-phase approach: validation, then write-back. The authors ensure progress with a ticket-lock-like approach where waiters queue at the end of a logical queue [25]. The proposal is not fully parallel and decentralised, as parallelism is very much limited by the number of directories and the address distribution across them. The birthday “paradox” will quite likely restrict available parallelism. Finally, the authors have an unrealistic assumption that the directories can track the entirety of physical memory with tx-read and tx-write bits. The paper is notable for the level of protocol interleaving that the authors analysed, thanks to a fairly realistic implementation of the interconnect mesh (yet quite simplistic single-issue cores). Furthermore, the authors extract useful information about transaction sizes (instructions, number of loads / stores, etc.) for various workloads.
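The birthday argument can be made concrete with a small estimate. The following is a minimal sketch, not taken from the cited paper: it assumes that every committing transaction touches exactly one directory, chosen uniformly at random, and the function name `prob_all_distinct` is purely illustrative.

```python
from math import prod


def prob_all_distinct(num_directories: int, num_committers: int) -> float:
    """Probability that all committers target pairwise different directories.

    Assumption (not from the paper): each committer picks exactly one
    directory, uniformly at random and independently of the others.
    """
    if num_committers > num_directories:
        return 0.0
    return prod((num_directories - i) / num_directories
                for i in range(num_committers))


if __name__ == "__main__":
    for committers in (2, 4, 8):
        p = prob_all_distinct(16, committers)
        print(f"16 directories, {committers} committers: "
              f"P(no shared directory) = {p:.3f}")
```

Under these assumptions, 16 directories and 8 simultaneous committers leave only about a 12% chance that no two commits collide on a directory. Real write sets span several directories, which would make collisions more likely still.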
Similar attention to detail is present in Tomic, et al, EazyHTM: they propose eager conflict detection through coherence messages, yet resolve conflicts only at the end of execution [196]. They also stress that serialised commit in hardware transactions is a big scalability issue and propose fully parallel commit by eagerly detecting (the absence of) conflicts in a distributed fashion. They employ two-way tracking, i.e., incoming and outgoing conflicts, for additional stability of the algorithm, and add typical tx-write bits per cache line for reduced traffic. While commit is parallel between cores, each writing transaction needs to serially write back its write set, which reduces local MLP and commit performance. On the core / ISA side, this proposal is quite straightforward, with simple cores and a full register snapshot; the focus of the proposal is on the memory system interactions.
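The split between eager detection and late resolution can be sketched in a few lines. This is a simplified model under my own assumptions, not the EazyHTM protocol itself: conflicts are modelled as directed (a writer whose commit would invalidate a reader), and the names `Tx`, `note_conflict` and `commit` are hypothetical.

```python
class Tx:
    """Hypothetical transaction record; all names are illustrative only."""

    def __init__(self, name):
        self.name = name
        self.aborted = False
        self.committed = False
        self.outgoing = set()  # transactions to abort when this one commits
        self.incoming = set()  # transactions whose commit would abort this one


def note_conflict(writer, reader):
    """Eager detection: record the conflict on both sides, abort nobody yet."""
    writer.outgoing.add(reader)
    reader.incoming.add(writer)


def commit(tx):
    """Late resolution: recorded conflicts are only acted upon at commit."""
    if tx.aborted:
        return False
    for victim in tx.outgoing:
        if not victim.committed:
            victim.aborted = True
    # Transactions that could have aborted tx no longer need to track it.
    for killer in tx.incoming:
        killer.outgoing.discard(tx)
    tx.committed = True
    return True


if __name__ == "__main__":
    w, r = Tx("writer"), Tx("reader")
    note_conflict(w, r)
    print(commit(r), commit(w), r.aborted)  # reader commits first, nobody aborts
```

In this model, whichever side reaches commit first wins: a reader that commits before the conflicting writer survives, while a writer that commits first aborts the reader. The two-way bookkeeping is what lets a committing transaction clean up the lists of its peers.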
Another detailed look at TM implementation aspects is the study of signature implementation options by Sanchez, et al [143]. They show that instead of using a single k-valued Bloom filter, it is almost as good to use k single-valued functions in a split filter, which is much easier and smaller to implement. They also find that degenerate signatures can be especially limiting for larger systems, and note that other snoop filtering techniques (such as inclusive higher-level caches, snoop filters, or directories) can restore precision and thus reduce the impact of the Bloom filter.
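A split signature of the kind described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' design: hardware would use cheap bit-selection or XOR hashes, whereas SHA-256 merely stands in as a convenient hash here, and the class name `SplitSignature` and its default sizes are made up for the example.

```python
import hashlib


def _index(addr, seed, modulus):
    """Illustrative hash: SHA-256 over seed and address, reduced to a bit index."""
    digest = hashlib.sha256(f"{seed}:{addr}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % modulus


class SplitSignature:
    """k partitions, each indexed by its own single hash function."""

    def __init__(self, total_bits=1024, k=4):
        self.k = k
        self.part_bits = total_bits // k
        self.parts = [0] * k  # one integer bitmap per partition

    def insert(self, addr):
        for i in range(self.k):
            self.parts[i] |= 1 << _index(addr, i, self.part_bits)

    def may_contain(self, addr):
        # True may be a false positive; False is always exact.
        return all((self.parts[i] >> _index(addr, i, self.part_bits)) & 1
                   for i in range(self.k))


if __name__ == "__main__":
    sig = SplitSignature()
    sig.insert(0x1000)
    print(sig.may_contain(0x1000))
```

Each insert sets exactly one bit per partition, so every partition can be a small single-ported array, which is where the implementation saving comes from.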
Yen, et al, also investigate optimisations to Bloom filters for Transactional Memory [178]. They show that selecting the right subset of bits (lower level) for hashing can reduce Bloom filter degradation, and that filtering thread-local accesses from the TM (and the filter) leaves valuable room for those accesses that require synchronisation.
Bobba, et al, lift the concept of token cache coherence [71] to transactions and support arbitrarily sized transactions by tracking tokens for data items in the entire memory hierarchy (using ECC for main memory) [157]. Readers need to acquire a single token, while writers need to acquire all tokens for a specific data item. On top of the full credit / token mechanism, the authors implement faster simplifications for small transactions that fit into the L1. Writes are buffered (again!) in a software-visible log. In their evaluation, the authors remove the logging problem by actually performing logging in software (with very little overhead claimed).
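The reader / writer token rule stated above is easy to model. The sketch below is a toy under my own assumptions (a fixed token count per data item, one bookkeeping object per item, no token transfer between holders); the class `TokenLine` and its methods are hypothetical names.

```python
class TokenLine:
    """Token bookkeeping for one data item (illustrative only)."""

    def __init__(self, total_tokens):
        self.total = total_tokens
        self.free = total_tokens
        self.held = {}  # holder -> number of tokens held

    def acquire_read(self, who):
        """A reader needs a single token."""
        if self.held.get(who, 0) >= 1:
            return True
        if self.free >= 1:
            self.free -= 1
            self.held[who] = 1
            return True
        return False

    def acquire_write(self, who):
        """A writer needs every token of the item."""
        mine = self.held.get(who, 0)
        if self.free + mine == self.total:
            self.free = 0
            self.held[who] = self.total
            return True
        return False

    def release(self, who):
        self.free += self.held.pop(who, 0)


if __name__ == "__main__":
    line = TokenLine(total_tokens=4)
    print(line.acquire_read("t0"), line.acquire_write("t1"))
```

Because a writer must hold all tokens, a single outstanding reader is enough to make `acquire_write` fail, which is the conflict signal in this model.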
Blundell, et al, propose an interesting way to precisely capture a larger working set in their OneTM extension to LogTM [121]. On overflow, they convert a part of the L2 cache to hold only permissions, rather than permissions and data, for a specific address range. That effectively increases the coverage of a cache line to N ∗ 8/2 cache lines (with N being the cache line size in bytes). Similar techniques are being used in sub-sectored caches [36], and in AMD’s probe filter design, which uses part of the LLC capacity to track only the coherency state of more lines [216]. For full capacity virtualisation, the authors propose a somewhat heavy scheme in the CPU that supports transactional meta-data bits in memory. In comparison to the heavy proposals in VTM [86], OneTM has a lighter approach, as only a single overflowed transaction
can exist (concurrently with non-overflowed transactions) and so only a single copy of dedicated transactional meta-data bits is ever required. OneTM allocates these bits in physical memory and thus reduces the available memory space.
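As a worked instance of the N ∗ 8/2 coverage figure given for OneTM above: the divisor of 2 is read here as two permission bits per tracked line, which is my interpretation rather than something the text states, and the helper name `lines_covered` is illustrative.

```python
def lines_covered(line_bytes: int, bits_per_tracked_line: int = 2) -> int:
    """Number of lines whose permissions fit into one converted cache line.

    Assumption: each tracked line needs `bits_per_tracked_line` bits
    (two by default, matching the N * 8 / 2 figure).
    """
    return line_bytes * 8 // bits_per_tracked_line


if __name__ == "__main__":
    n = 64  # cache line size in bytes
    covered = lines_covered(n)
    print(f"{n}-byte line tracks {covered} lines, "
          f"i.e. {covered * n // 1024} KiB of address range")
```

For a 64-byte line this gives 256 tracked lines, so one converted line covers 16 KiB of addresses instead of 64 bytes.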
Blurring the Eager / Lazy Distinction Lupon, et al, revisit the logging TM systems and propose an eager / lazy hybrid, the fast-path of which turns out to look very similar to what I explore in this work (and most industry proposals) [205]. Their main contribution is improving the abort speed for logging TMs and showing that this indeed has a significant impact on overall application throughput. While they are trying to hang on to the eager / lazy terminology, it becomes apparent that in hardware TM implementations, these distinctions are not always useful / obvious. They do, however, identify the technique of eagerly acquiring write permissions for transactionally written lines, and lazily hiding their values in the L1 data cache (plus also putting them in a log for virtualisation). In the fast path (no overflowing lines), their aborts can execute in hardware, rather than using a software abort handler to consult the undo log. They also explicitly mention that overflowed lines need to NACK a requesting sender and require SW involvement before the requester can actually read the value, a property which is often unacceptable in modern, complex, deadlock-free interconnects and coherence protocols. One interesting observation of the authors, however, is that transaction write sets very often fit into the L1 cache, and that using signatures for tracking the read sets and keeping transactionally written data in the cache as long as possible often avoids the need for a logging slow path.
In a follow-on publication, Lupon, et al, further dissect the coexistence of eager / lazy transactions and switching between them with a predictor; they also seem to be the first to realise that with the right amount of eager / lazy mix, distributed commit seems possible [228]. They show how “eager” (really fully logging, decoupled from caches) transactions and “lazy” (using the cache hierarchy and coherence mechanisms) transactions can coexist. Both modes are similar when the working set (the write set) fits into the L1 cache: conflicts are detected eagerly; in the “eager” mode, they abort transactions immediately, while in the “lazy” mode, handling the conflict is postponed until transaction commit time. The authors also propose several extensions for the coherence protocol that enable coexistence of eager / lazy transactions, but at the cost of making the design significantly more complex.
Improving Progress One important aspect of flexible eager / lazy switching is the different progress characteristics. Similar to work in software TM, improving progress of hardware TM implementations has been looked at in the literature. The challenge is to balance the complexity (resources, changes to coherence protocol and structures) of the scheme employed against the benefits provided.
Bobba, et al, investigate performance pathologies in HTM, and they find that under low contention levels, most systems perform similarly [142]. Under high contention, however, they identify several patterns depending on the different conflict detection / conflict resolution / versioning design points. In particular, they find that no single design point outperforms the others for all workload cases. Simple additional policies, such as exponential back-off, selective early write acquisition, and adding timestamps to memory requests to abort younger transactions, provide significant performance improvement in the pathological cases.
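Two of the simple policies named above, exponential back-off and timestamp-based resolution, can be sketched as follows. The constants (`base`, `cap`), the randomised delay, and the function names are my own choices for illustration and are not taken from the cited study.

```python
import random


def backoff_window(attempt, base=16, cap=4096):
    """Back-off window in cycles: doubles with each retry, capped at `cap`."""
    return min(cap, base << attempt)


def backoff_delay(attempt, rng=random):
    """Random delay within the current window, to de-synchronise retries."""
    return rng.randrange(backoff_window(attempt))


def younger_must_abort(requester_ts, holder_ts):
    """Timestamp policy: the younger (larger timestamp) transaction loses.

    Returns True if the requester should abort, False if the holder should.
    Ties favour the holder in this sketch.
    """
    return requester_ts > holder_ts


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, backoff_window(attempt), backoff_delay(attempt))
    print(younger_must_abort(requester_ts=7, holder_ts=3))
```

A transaction that keeps its original timestamp across retries eventually becomes the oldest in the system, which is the usual argument for why this kind of policy prevents starvation.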
Ramadan, et al, go further than Bobba, et al: they maximise transactional throughput under contention by tracking dependencies between communicating in-flight transactions, rather than aborting conservatively on any communication [167]. For that, they order commits (and subsequent writes) of communicating transactions, and only abort in cyclical dependency cases. The authors use a distributed ordering vector that is read and updated frequently, and extend the cache coherence protocol to eleven stable states to track all occurring forwarding conditions (typical protocols such as MOESI have five stable states!). They also show that their mechanism requires sub-cacheline coherence tracking to unlock the full performance potential. Extending a MetaTM implementation, they obtain 30% speedups in STAMP.
Similarly, Titos, et al, also track dependencies and use a hybrid eager / lazy mechanism to get transactions committed in conflict cases [203]. Instead of aborting writers in RAW conflict cases, the readers read the pre-transactional value from where it is stored. In turn, the reader then has to commit before the writer. Similarly, in WAR cases, aborts are not necessary if the writer commits after the readers. In experiments on STAMP in a tiled SMP, extending a LogTM-SE protocol in GEMS, the authors show that their system adapts to the better eager / lazy policy and thus reduces memory traffic and improves throughput.
In most TMs, transactional data is typically either stored in the caches or appended to a write log. Yan, et al, identified the required data movement in case of aborts / commits as a major bottleneck [302]. The authors instead propose to store both transactionally produced and pre-transactional versions of data in the same pseudo-associative cache [32], and flip the N-th bit of the address to denote the N-th version of the data. Together with tracking the partial order of concurrent transactions, they can enforce that there are no cycles, by having transactions read the right version instead of creating a dependency in the wrong direction. In addition to a changed coherence protocol, the authors use counting Bloom filters [45] to track overflowing locations.
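The dependency-tracking schemes above share one core check: a new commit-order constraint is acceptable as long as it does not close a cycle. The sketch below models that check with a centralised graph, which is a deliberate simplification (the cited designs use distributed hardware structures); `DependencyTracker` and `order` are hypothetical names.

```python
class DependencyTracker:
    """Tracks 'must commit before' edges between in-flight transactions."""

    def __init__(self):
        self.successors = {}  # tx -> set of txs that must commit after it

    def _reaches(self, src, dst):
        """Depth-first search: is dst reachable from src along recorded edges?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.successors.get(node, ()))
        return False

    def order(self, first, later):
        """Record that `first` must commit before `later`.

        Returns False without recording anything if the edge would close a
        cycle; that is the case in which some transaction has to abort.
        """
        if self._reaches(later, first):
            return False
        self.successors.setdefault(first, set()).add(later)
        return True


if __name__ == "__main__":
    deps = DependencyTracker()
    # RAW case: the reader saw the pre-transactional value,
    # so it must commit before the writer.
    print(deps.order("reader", "writer"))
    print(deps.order("writer", "reader"))  # would form a cycle
```

As long as `order` keeps returning True, the recorded edges form a partial order and all involved transactions can commit in some sequence consistent with it.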
Negi, et al, propose a similar design, but keep all the required additional information in a separate cache array [293]. Their separate cache tracks the transactional dirty copy and the corresponding pre-transactional clean copy of the data in the same cache, and also supports clearing of the transactional dirty bits per way. When the main cache observes multiple tag hits, the separate cache distinguishes which version of the data to access. Furthermore, the by-way management can allow simple partitioning of the transactional resources and simplify conflict detection between multiple threads on a single core. Similar to Yan, et al, the authors here also find that storing both copies in the same cache can have a significant benefit for high-contention scenarios, and observe speedups of 30% in STAMP intruder.