As discussed in Section 3.5, programs with large data caches pose a challenge for tracing garbage collectors. I wrote a small cross-language microbenchmark, called minicache, to quantify the cost of tracing GC for such caches. The structure of the minicache heap is illustrated in Figure 6 (minus the pointer between entries): an array holds B binary trees of N nodes each. The minicache workload makes K passes over the array, allocating and inserting one replacement subtree at a time.

The results of running minicache (B = 1000, N = 700, K = 300) are presented in Figure 19. The left plot shows the behavior of a variety of collectors (parallel, serial, and concurrent) for the benchmark running on several widely used Java virtual machines. The right plot shows, with the same axes, the behavior for the same benchmark written in Go (using a concurrent collector) and Foster (with and without subheap augmentation). Each program was run five times per heap size. The plots show raw datapoints, not averages. Run-to-run timing variance was low.

Figure 20 explores different Immix variants on the minicache workload. Generational collection with sticky mark bits performs worse than plain Immix, which is unsurprising because the lifetime of cached data does not follow the generational hypothesis. Meanwhile, ImmixRC’s behavior is independent of heap size. It is faster than non-RC Immix in small heaps and slower in larger heaps. While both ImmixRC and subheaps have flat profiles, ImmixRC is slower by a large constant factor. The difference is that subheaps reduce the benchmark’s workload by making sure objects need not be traced, whereas ImmixRC merely enforces a heap-size-independent workload: every allocation is effectively traced twice with little amortization (once each for recursive marking and unmarking).

Figure 20: Comparison of Foster’s Immix variants on minicache.
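For concreteness, the minicache workload described above can be sketched in Go as follows. This is a minimal sketch rather than the benchmark's actual source. The node layout, the balanced tree shape, and the reading of "one replacement subtree at a time" as replacing the whole tree in each array slot are all assumptions, and the function names are hypothetical.

```go
package main

import "fmt"

// node is one binary-tree node. The key payload is an assumption;
// the text only specifies that trees have N nodes each.
type node struct {
	left, right *node
	key         int
}

// buildTree allocates a roughly balanced binary tree with exactly n nodes.
func buildTree(n int) *node {
	if n <= 0 {
		return nil
	}
	leftSize := (n - 1) / 2
	return &node{
		left:  buildTree(leftSize),
		right: buildTree(n - 1 - leftSize),
		key:   n,
	}
}

// countNodes walks a tree and returns how many nodes it contains.
func countNodes(t *node) int {
	if t == nil {
		return 0
	}
	return 1 + countNodes(t.left) + countNodes(t.right)
}

// runMinicache fills an array with b trees of n nodes, then makes k passes
// over the array. On each pass every slot receives a freshly allocated
// tree, so the previous occupant of the slot becomes garbage.
func runMinicache(b, n, k int) []*node {
	cache := make([]*node, b)
	for i := range cache {
		cache[i] = buildTree(n)
	}
	for pass := 0; pass < k; pass++ {
		for i := range cache {
			cache[i] = buildTree(n)
		}
	}
	return cache
}

func main() {
	// Scaled-down parameters for a quick run; the measurements in the
	// text use B = 1000, N = 700, K = 300.
	cache := runMinicache(100, 700, 30)
	fmt.Println(len(cache), countNodes(cache[0]))
}
```

The live heap stays at roughly B × N nodes throughout, while total allocation grows with K. A tracing collector must re-mark that live data on every cycle, and cycles become more frequent as the heap shrinks.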
What makes these graphs interesting is the shape of the results: the tracing collectors exhibit classic space-time tradeoff curves, which reference counting avoids, even when “emulated” with subheaps. By putting each subtree in a separate subheap we combine the low cost of region-based reclamation with a reliably flat performance profile.
The authors of M3 [TAV14] were partially motivated by the poor performance of a
Recent versions of Go offer a state-of-the-art concurrent collector. Would it have avoided their woes? The results in Figure 19b suggest not. Especially in tight heaps, concurrent collection cannot overcome the sheer amount of work generated by the minicache workload.
4.4.1. memcached & ghost thereof
Minicache is designed to throw the memory management issues of a cache into stark relief. These effects are muted in a real cache for several reasons. First, minicache simulates a cache’s workload with no superfluous influences: the workload involves no hashing, nondeterminism, or I/O. A real cache must do this extra work, which obscures the costs of GC. Second, minicache stores large, pointer-dense object graphs, which amplify GC work. Many caches store data such as strings or binary blobs, which do not need tracing. Thus, while caches are not GC-friendly, most caches will not observe the severe (exponential decay) throughput impacts illustrated.
However, throughput is not the only relevant performance metric for a cache. Latency is often a more critical concern for network-enabled cache servers. The minicache benchmark cannot realistically measure end-to-end latency. To demonstrate the effect of subheaps on a more realistic server, I implemented mcd: a minimal network-enabled clone of Memcached in Foster. Using a lexer compiled from C into Foster, plus bindings to the POSIX sockets API, mcd parses and implements the GET and SET commands in the memcached wire protocol.
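To make the protocol surface concrete, the following Go sketch parses the command lines for GET and SET in the memcached text protocol. It is not how mcd works internally (mcd uses a lexer compiled from C into Foster); the `command` struct, the `parseCommandLine` name, and the error value are hypothetical. The sketch assumes a single key per `get` and ignores the optional `noreply` token on `set`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// command is a parsed request line. The layout is illustrative only.
type command struct {
	name    string // "get" or "set"
	key     string
	flags   uint32 // opaque client flags (set only)
	exptime int64  // expiration time (set only)
	nbytes  int    // length of the data block following a set line
}

var errMalformed = errors.New("malformed command line")

// parseCommandLine parses one request line of the text protocol:
//
//	get <key>\r\n
//	set <key> <flags> <exptime> <bytes>\r\n
//
// For set, the data block of <bytes> bytes arrives on the wire after
// this line and must be read separately by the caller.
func parseCommandLine(line string) (command, error) {
	// Fields splits on any whitespace, which also drops the trailing \r\n.
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return command{}, errMalformed
	}
	switch fields[0] {
	case "get":
		if len(fields) != 2 {
			return command{}, errMalformed
		}
		return command{name: "get", key: fields[1]}, nil
	case "set":
		if len(fields) != 5 {
			return command{}, errMalformed
		}
		flags, err := strconv.ParseUint(fields[2], 10, 32)
		if err != nil {
			return command{}, errMalformed
		}
		exptime, err := strconv.ParseInt(fields[3], 10, 64)
		if err != nil {
			return command{}, errMalformed
		}
		nbytes, err := strconv.Atoi(fields[4])
		if err != nil || nbytes < 0 {
			return command{}, errMalformed
		}
		return command{
			name:    "set",
			key:     fields[1],
			flags:   uint32(flags),
			exptime: exptime,
			nbytes:  nbytes,
		}, nil
	default:
		return command{}, errMalformed
	}
}

func main() {
	lines := []string{"set greeting 0 0 5\r\n", "get greeting\r\n", "bogus\r\n"}
	for _, line := range lines {
		cmd, err := parseCommandLine(line)
		fmt.Printf("%q -> %+v, err=%v\n", line, cmd, err)
	}
}
```

Note that the key only becomes available after the line has been read and split, which is the ordering constraint behind the key-allocation issue discussed later in this section.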
The mutilate program [LK14, Lev14] was used to generate memcached wire traffic with a mix of 90% reads and 10% updates. Three configurations for mcd were tested. Since our workload induces 166 MB of allocation, a 170 MB heap is large enough to avoid GC entirely. Reducing the heap size to 130 MB results in one garbage collection cycle, which subheaps avoid. The results of testing these variants of mcd, along with memcached itself, are shown in Figure 21. Reading left to right: When the heap is large enough to avoid garbage collection, mcd shows max latencies comparable to memcached.⁶ In a smaller heap, the cost of GC is reflected in severe degradation of max read request latency. The application of subheaps, in the smaller heap, successfully replaces one costly GC with almost thirty thousand cheap GCs, each of which costs barely more than a microsecond. This effectively eliminates the latency impact of garbage collection for the mcd server. The GC-induced throughput degradation for mcd-130 is 1.8%, increased to 4.2% with subheaps. Most of the lost throughput for subheaps is due to repeated stack scans.

Figure 21: End-to-end memcached workload latency
Experience. Although the mcd server loop is relatively simple, it still highlights four interesting phenomena surfaced by applying subheaps in practice. Some of these findings have also been explored in Section 3.4.
First, when the goal is to not merely reduce but entirely eliminate full GCs, we must capture all allocated data within the server loop, not merely the subset of data allocated within each cache bucket. Otherwise, the un-captured data will accumulate and eventually trigger a collection. When data of varying lifetimes is interleaved, proper separation can increase the subheap annotation burden.

⁶ memcached’s throughput is roughly four times that of mcd; like most functional languages,
Second, circumstances sometimes force allocation to occur before the “proper” destination subheap is known. The Memcached protocol has clients send servers lines with a command name, followed by a key, followed by command-specific fields. There is a bit of a catch-22 with the key’s memory management: it must be allocated in a bucket’s subheap to detect hash collisions, but the choice of which bucket (and therefore which subheap) to use can only be made after the key has been extracted from the network, and thus allocated in some other subheap. To resolve the mismatch in object lifetimes, the programmer must store a fresh copy of the key in the chosen bucket’s subheap. Failure to do so creates a long-lived subheap-crossing pointer, destroying the potential for subheaps to improve performance.
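The copy-on-insert pattern described above can be illustrated by analogy in Go. Go has no subheaps, so this is only an analogy under stated assumptions: the network read buffer plays the role of the subheap in which the key was first allocated, and each bucket's own storage plays the role of the bucket's subheap. The `bucket` type, `insertKey`, and `bucketFor` are hypothetical names, and the FNV-1a hash is chosen purely for illustration.

```go
package main

import "fmt"

// bucket stands in for a cache bucket that owns its own storage.
type bucket struct {
	keys [][]byte
}

// insertKey stores a fresh copy of key, so the bucket never retains a
// pointer into the buffer that the key was originally read into.
func (b *bucket) insertKey(key []byte) {
	owned := make([]byte, len(key))
	copy(owned, key)
	b.keys = append(b.keys, owned)
}

// bucketFor chooses a bucket by hashing the key with 32-bit FNV-1a.
// The key must already exist in memory before this choice can be made.
func bucketFor(buckets []bucket, key []byte) *bucket {
	h := uint32(2166136261)
	for _, c := range key {
		h ^= uint32(c)
		h *= 16777619
	}
	return &buckets[h%uint32(len(buckets))]
}

func main() {
	buckets := make([]bucket, 8)

	// The key first exists only as a slice of the network buffer.
	netBuf := []byte("set greeting 0 0 5\r\n")
	key := netBuf[4:12] // the bytes "greeting"

	b := bucketFor(buckets, key)
	b.insertKey(key)

	// The buffer is reused for the next request; the stored key is unaffected.
	copy(netBuf, "get other-key\r\n")
	fmt.Printf("%s\n", b.keys[0])
}
```

Had `insertKey` appended `key` itself rather than a copy, the bucket would hold a reference into the request buffer. That is the analogue of the long-lived subheap-crossing pointer the text warns against.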
Third, there can be tension between separation of concerns in code versus data. Cache buckets can be empty, and each non-empty cache bucket needs an associated subheap. One scheme for this is to create subheaps on demand, as each bucket transitions from empty to non-empty. This allows the use of subheapOf (see Section 2.7.2) without needing any changes in data representation, but it mixes unrelated concerns in the server response loop. Alternatively, creating a subheap in advance for each bucket leads to better separation of concerns in code, but it no longer suffices to use a null pointer to represent an empty cache bucket. Some change in data representation is needed for the server to map empty buckets to their respective subheaps.

Finally, care must be taken not to capture too much data in a subheap. Interleaved with the allocations that must go in each bucket, the server also generates some short-lived garbage. Sticking this garbage in the long-lived cache entries inflates the amount of space needed to store cache entries in subheaps. Whether this is acceptable or not depends on the amount of provisioned heap space.