8. Conclusions
8.2 Future Work
The work reported here focuses on cache issues; other parts of the memory hierarchy can also play a significant role in performance. Also, as cache miss costs increase, there should be a growing convergence between issues in conventional shared memory architectures and in distributed shared memory architectures. This section examines potential future work which would take into account the virtual memory system, followed by possible extensions to take into account issues in distributed shared memory. It would also be desirable to do detailed optimization of both the library and specific applications, to further reduce overheads (especially in the case of MP3D).
Although such applications are not considered here, another area for future work is investigating the applicability of techniques developed in this research to a wider range of problem areas. Although OOSH is designed for time-stepped simulation, much of the code could be adapted to other purposes, especially the memory allocators.
8.2.1 Virtual Memory Issues
As early as the 1970s, there was work on restructuring programs for good performance in paged virtual memory systems [Ferrari 1976, Hatfield and Gerald 1971, Snyder 1978, Spirn 1977].
A modern paged virtual memory system with page cache management outside the kernel offers possibilities for designing page replacement strategies tuned to a specific application [Harty and Cheriton 1991; Subramanian 1991].
A related issue, discovered as a side-effect of this research, is the potential mismatch between object-oriented code and translation lookaside buffers (TLBs). A TLB is a small cache of recent page translations, which relies heavily on spatial locality to give good performance. A small TLB with 32 entries, each mapping a 4K page, may have few misses with code referencing approximately sequential addresses (as in typical vector or matrix processing). However, if objects are allocated without attempting to keep those
112 AN OBJECT-ORIENTED LIBRARY FOR SHARED-MEMORY PARALLEL SIMULATIONS
University
of Cape
Town
[Figure 8.1 TLB Effects in Large MP3D Run. Panel (a): TLB misses per processor per timestep. Panel (b): execution time per particle per timestep. Both panels plot Random Allocation and Contiguous Allocation against Timestep (0 to 1000).]
accessed at roughly the same time close in memory, a high number of TLB misses may result, especially with a small TLB. A study of this problem with the MP3D application revealed that TLB misses could account for an increase of as much as 25% in run time in a run with randomly allocated objects, as compared with a run with contiguously allocated objects.
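The effect of object placement on a small TLB can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the TLB is modelled as fully associative with LRU replacement, objects are assumed to be 64 bytes, and the names `count_tlb_misses` and `object_addresses` are hypothetical. It does not model the Silicon Graphics hardware or MP3D itself. With 32 entries of 4K each, the TLB covers only 128 Kbytes at a time.

```python
import random
from collections import OrderedDict

PAGE_SIZE = 4096      # 4K pages, as in the text
TLB_ENTRIES = 32      # small TLB, as in the text
OBJECT_SIZE = 64      # assumed object size in bytes (illustrative only)


def count_tlb_misses(addresses, entries=TLB_ENTRIES, page_size=PAGE_SIZE):
    """Count misses in a fully associative LRU TLB for an address trace."""
    tlb = OrderedDict()   # page number -> None, ordered oldest to newest
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)         # refresh LRU position on a hit
        else:
            misses += 1
            tlb[page] = None
            if len(tlb) > entries:
                tlb.popitem(last=False)   # evict the least recently used page
    return misses


def object_addresses(n_objects, contiguous, seed=0):
    """Addresses of objects, listed in the order the simulation visits them."""
    slots = list(range(n_objects))
    if not contiguous:
        random.Random(seed).shuffle(slots)   # scatter objects over the heap
    return [slot * OBJECT_SIZE for slot in slots]


n = 100_000
contiguous_misses = count_tlb_misses(object_addresses(n, contiguous=True))
random_misses = count_tlb_misses(object_addresses(n, contiguous=False))
print("TLB reach (bytes):", TLB_ENTRIES * PAGE_SIZE)
print("contiguous misses:", contiguous_misses)
print("random misses:", random_misses)
```

In the contiguous case each page is missed once and then reused for every object it holds, whereas in the scattered case almost every reference touches a page that is no longer in the TLB.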
Figure 8.1 illustrates how, on a large 8-processor run on a Silicon Graphics 4D/380, MP3D incurs an increasing number of TLB misses. The run is on a wind tunnel with 131 by 131 by 7 cells, and 1 million particles. Initially, particles are allocated contiguously within cells. As they move through space, particles close in space gradually cease to be close in memory.
Figure 8.1a shows how the number of TLB misses increases with time for a run with particles initially contiguously allocated until, after around 800 timesteps, the number of misses is about the same as for a run with particles initially randomly allocated. Figure 8.1b shows how this effect impacts performance.
In an application like MP3D (or to a lesser extent Barnes-Hut, where movement of bodies is slower), this is a difficult problem to solve. Copying data as it moves through the simulation to keep it contiguous can be as expensive as the TLB misses, as each additional copy can potentially result in a cache miss [Cheriton et al. 1993].
Adding user-level page replacement could be a solution to the TLB miss problem. If data that moves in the simulation (particles in the case of MP3D; bodies in Barnes-Hut)
[Figure 8.2 Copy on Move to allow deallocation on page replacement. Diagram labels: active pages, cache of free pages, pages on disk; the contents of a page about to be replaced are copied on move.]
is allocated a page at a time, when a page is moved from being memory resident to being cached, its contents could be copied to free memory belonging to the part of the simulation the object has moved to. The old page would be marked as garbage for a garbage collector. Some limit is needed to ensure that thrashing does not occur. For example, reallocation could be permitted only as long as it does not lead to a miss to disk.
Another possibility would be to exploit the fact that the simulation is set up with a mean free path of about a third of a cell, so in about 5 timesteps, there is a high probability that all particles will have left a given cell. If each cell reallocates particles on arrival and uses a simple scheme to keep track of the range of addresses currently in use, a simple page replacement-based garbage collector could be implemented.
Reallocation would still be expensive, but the cost of appending deleted data objects to a free list would be eliminated. The viability of this strategy depends on a number of issues including the relative costs of TLB misses and cache misses.
Figure 8.2 illustrates the idea.
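The per-cell reallocation idea, with each cell tracking the range of addresses currently in use, might be sketched as follows. This is not the OOSH implementation and it does not hook into a real page replacement mechanism. The class `CellArena`, its method names and the 64-byte particle size are all assumptions made for illustration. Pages are represented only by live-particle counts.

```python
PAGE_SIZE = 4096
PARTICLE_SIZE = 64                      # assumed particle record size
PER_PAGE = PAGE_SIZE // PARTICLE_SIZE   # particles that fit on one page


class CellArena:
    """Hypothetical per-cell arena: particles are copied in on arrival."""

    def __init__(self):
        self.pages = []                 # live-particle count per page, oldest first
        self.first_live = 0             # oldest page not yet reclaimed
        self.used_in_last = PER_PAGE    # forces a fresh page on first arrival

    def arrive(self):
        """Copy an arriving particle into this cell; return (page, slot)."""
        if self.used_in_last == PER_PAGE:
            self.pages.append(0)        # start a new page at the end of the range
            self.used_in_last = 0
        page = len(self.pages) - 1
        slot = self.used_in_last
        self.pages[page] += 1
        self.used_in_last += 1
        return page, slot

    def depart(self, page):
        """A particle left the cell; its old slot becomes garbage."""
        self.pages[page] -= 1

    def reclaim(self):
        """Release leading pages that hold no live particles.

        The newest page is never released, since it may still be filling.
        """
        freed = 0
        last = len(self.pages) - 1
        while self.first_live < last and self.pages[self.first_live] == 0:
            self.first_live += 1
            freed += 1
        return freed

    def live_range(self):
        """Inclusive range of page indices currently in use by this cell."""
        return self.first_live, len(self.pages) - 1


cell = CellArena()
locations = [cell.arrive() for _ in range(3 * PER_PAGE)]   # fills three pages
for page, _slot in locations[:PER_PAGE]:                   # oldest particles leave
    cell.depart(page)
freed = cell.reclaim()
print("pages freed:", freed, "live range:", cell.live_range())
```

Because particles leave a cell in roughly the order they arrived, whole pages at the old end of the range tend to empty together, so no per-object free list is needed in this sketch.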
8.2.2 Slower Memory Hierarchies
One of the most important strategies for reducing the cost of misses on distributed shared memory systems is weaker models of consistency. For example, release consistency assumes that all shared data structures that are written are protected by a lock. When a lock is released, consistency must be ensured, but while the lock is still held, it is assumed that other processors that would have referenced the shared data will
not see an inconsistent copy since they should be blocked by the lock [Gharachorloo et al. 1990].
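The idea that updates need only become visible at lock release can be illustrated with a toy model. This is a simplified sketch, not a description of DASH or of any real distributed shared memory system. The class `ReleaseConsistentRegion` and its methods are hypothetical, and a single process stands in for both the lock holder and a remote observer.

```python
import threading


class ReleaseConsistentRegion:
    """Toy model: writes become globally visible only at lock release."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._shared = dict(initial)   # the globally visible copy
        self._pending = {}             # writes made while the lock is held

    def acquire(self):
        self._lock.acquire()
        self._pending = {}

    def write(self, key, value):
        self._pending[key] = value     # buffered, not yet propagated

    def read(self, key):
        # The lock holder always sees its own writes.
        return self._pending.get(key, self._shared[key])

    def release(self):
        self._shared.update(self._pending)   # propagate all writes at release
        self._pending = {}
        self._lock.release()

    def visible(self, key):
        """Value a remote copy would hold (exposed for illustration only)."""
        return self._shared[key]


region = ReleaseConsistentRegion({"x": 0})
region.acquire()
region.write("x", 1)
local_view = region.read("x")        # lock holder's view inside the section
before = region.visible("x")         # remote view before release
region.release()
after = region.visible("x")          # remote view after release
print(local_view, before, after)
```

The stale remote view before release is harmless under the model's assumption, since any processor that wanted the data would first have to acquire the lock.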
Release consistency has been used not only in distributed shared memory systems [Dwarkadas et al. 1993] but also in the DASH system [Lenoski et al. 1992]. In fact, the DASH designers did much of the early work in this area.
Future shared memory multiprocessor systems with higher memory latencies will increasingly need to use techniques such as release consistency.
The aligned memory allocators used for OOSH are highly suited to release consistency. If a lock is to protect not only a logical data structure but all data structures contained within a given group of blocks, it is essential that there be no false sharing.
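A minimal sketch of padding and aligning allocations to block boundaries, so that no two objects share a block, is given below. It is not the OOSH allocator. The 64-byte block size, the class `AlignedBumpAllocator` and the helper names are assumptions for illustration, and offsets stand in for real addresses.

```python
BLOCK_SIZE = 64   # assumed cache block size in bytes


def padded_size(size, block=BLOCK_SIZE):
    """Round size up to a whole number of blocks."""
    return -(-size // block) * block


class AlignedBumpAllocator:
    """Hypothetical allocator handing out block-aligned, block-padded offsets."""

    def __init__(self, block=BLOCK_SIZE):
        self.block = block
        self.next_offset = 0          # always a multiple of the block size

    def allocate(self, size):
        offset = self.next_offset
        self.next_offset += padded_size(size, self.block)
        return offset


def blocks_touched(offset, size, block=BLOCK_SIZE):
    """Set of block numbers covered by an object at the given offset."""
    return set(range(offset // block, (offset + size - 1) // block + 1))


allocator = AlignedBumpAllocator()
a = allocator.allocate(40)    # padded to one block
b = allocator.allocate(100)   # padded to two blocks
shared = blocks_touched(a, 40) & blocks_touched(b, 100)
print("offsets:", a, b, "shared blocks:", shared)
```

Since the two objects cover disjoint sets of blocks, a lock protecting one of them never implicitly covers data belonging to the other.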
Padding and aligning to blocks as big as typical page sizes (4K or 8K bytes, though recent architectures such as the MIPS R4000 support much larger sizes) is not practical, so techniques to aggregate blocks of related data will become increasingly important.
Implementation of such memory management strategies at a low level as part of a memory allocator is clearly a superior strategy to ad hoc approaches, which are likely to result in considerable rewriting of code as architecture trade-offs change.
Since distributed shared memory systems usually transfer pages on cache misses, strategies such as release consistency would have to be addressed for a DSM version of OOSH.