spatial and temporal locality.
This principle can also be applied to algorithm and data structure design. Ladner et al. [20] provide a good illustration in their analysis of cache-efficient binary search. In the standard implementation the search keys are stored in sorted order in an array. The first search probe is at the middle (median element) of the array, the second probe is at one of the two quartiles, and so forth, like this:
Key      10  20  30  40  50  60  70
Probes    3   2   3   1   3   2   3
This access pattern has no locality. A better alternative is to organize the array so that the first element probed is at the first position in the array, the second probes are in the next two positions, and so forth, like this:
Key      40  20  60  10  30  50  70
Probes    1   2   2   3   3   3   3
This arrangement has good spatial locality and ensures that the array is accessed in increasing index order, although some indices are skipped.
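The rearranged array can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of Ladner et al.: the function names `level_order_layout` and `level_order_search` are hypothetical, and the code assumes the keys are distinct and already sorted. The children of the element at index i are found arithmetically at indices 2i+1 and 2i+2.

```python
def level_order_layout(sorted_keys):
    """Rearrange sorted keys so the first probe is at index 0,
    the second probes at indices 1 and 2, and so on (hypothetical helper)."""
    out = [None] * len(sorted_keys)
    keys = iter(sorted_keys)

    def fill(i):
        # In-order walk of the implicit tree: the left subtree gets
        # smaller keys, then this node, then the right subtree.
        if i < len(out):
            fill(2 * i + 1)
            out[i] = next(keys)
            fill(2 * i + 2)

    fill(0)
    return out


def level_order_search(layout, target):
    """Return the index of target in the rearranged array, or -1 if absent."""
    i = 0
    while i < len(layout):
        if layout[i] == target:
            return i
        # Child positions are computed arithmetically; indices only increase.
        i = 2 * i + 1 if target < layout[i] else 2 * i + 2
    return -1


layout = level_order_layout([10, 20, 30, 40, 50, 60, 70])
print(layout)                          # [40, 20, 60, 10, 30, 50, 70]
print(level_order_search(layout, 50))  # 5
```

Every probe lands at a higher index than the one before it, which is the increasing-index access pattern described above.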
An even better idea is to organize the keys into groups so that each group is contained in the node of a perfectly balanced binary tree. Such a tree can be laid out in memory (like a heap) so that the locations of parents and children can be calculated arithmetically, rather than with explicit pointers. Ladner et al. [20] describe two tree-based implementations of binary search, as follows.
• A cache-aware implementation exploits platform-specific knowledge of how many keys can fit into a cache line. In general, an interior node in the tree contains a key stored together with its children and immediate tree descendants, as many as the line will hold. For example, the root node may hold the median and the two quartiles; the children of the root hold the octiles; and so forth. This exploits temporal locality because when x is accessed, its immediate descendants in the search tree will be loaded together with it into the cache. Here is an example layout assuming three keys per cache line.
Keys          (40, 20, 60)   (10, 30, x)   (50, 70, x)
Cache loads         1              2             2
• The cache-oblivious version decomposes the tree to exploit spatial locality, without making assumptions about cache capacities. Here is how it works: break the binary tree $T$ into groups according to some level $h$ in the tree ($h$ is a power of 2). The top of the tree forms subtree $T_0$, which has $2^{h/4}$ leaves. The remaining $2^{h/2}$ (disconnected) subtrees form trees $T_1, \ldots, T_{2^{h/2}}$. Store the trees $T_0, T_1, \ldots$ sequentially in memory, so that nodes likely to be accessed sequentially in time are near one another, and all probes are in increasing order by address. The authors observe that cache-aware and cache-oblivious variations of this tree structure can improve overall computation time by factors of 2 to 8 over classic binary search, even though the number of elements accessed is identical in all versions.
Tuning for I/O Efficiency
A large body of research has developed around design of I/O-efficient algorithms, also called external memory or “big data” algorithms. These types of algorithms are critically important when the data set to be processed is too large to fit within main memory and must be accessed on disk using file I/O.
I/O efficiency involves a combination of algorithm tuning – so that data elements are accessed in an order that matches their storage layout – and code tuning – so that source code instructions with unusually high cost overhead can be minimized. Here are some sources of high costs in I/O processing, and what to do about them. References to more resources on I/O efficiency may be found in the Chapter Notes.
• Minimize open/close operations. Opening and closing files require scores of machine instructions. To minimize this cost, avoid repeated opening and closing and instead make one pass through the data file whenever possible. Storing data in one big file rather than many small files also reduces open/close costs.
• Reduce latency. Read/write operations create two kinds of time delays: latency refers to the amount of time needed to contact the disk and get it ready for data transmission, and transfer time refers to the time actually spent moving data elements between disk and main memory. Reduce latency by using a few large data transfer operations – that is, reads or writes with many data elements specified – instead of many small ones.
• Decouple I/O and instruction execution. When possible, remove reads and writes from inside the loops so that instructions do not have to wait on I/O operations. I/O buffering and threading can be used to decouple I/O operations from instruction executions by running the two tasks in separate computation threads.
• Exploit locality. Data access in files can be optimized in ways similar to data access in memory: organize the data on disk to match the computation order and organize the computation order to make best use of spatial and temporal locality.
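As a rough illustration of the first two guidelines, the following Python sketch opens a data file once and reads it in a few large blocks rather than one element at a time. The function name `sum_int_file` and the file format are assumptions made for this example: a flat binary file of native-endian C ints whose size is a multiple of the item size.

```python
from array import array


def sum_int_file(path, chunk_items=1 << 16):
    """Sum the integers in a binary file using one open and large reads.

    Assumes (for illustration) a file of native C ints, as written by
    array('i').tofile().
    """
    item_size = array("i").itemsize
    total = 0
    with open(path, "rb") as f:          # opened and closed exactly once
        while True:
            # One request moves many elements, so latency is paid per
            # block rather than per element.
            block = f.read(chunk_items * item_size)
            if not block:
                break
            values = array("i")
            values.frombytes(block)
            total += sum(values)
    return total
```

The block size is a tuning parameter. A suitable value depends on the platform and is best chosen by measurement.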
This last strategy can lead to dramatic reductions in computation time for I/O-bound applications. Here are two of many examples that may be found in the algorithm engineering literature.
Ajwani, Dementiev, and Meyer [1] describe an external memory breadth-first search (BFS) algorithm that can traverse enormous sparse graphs too big for main memory. They show how to decompose these massive graphs into smaller subgraphs to be stored in files for fast processing in BFS order. On graphs containing $2^{28}$ nodes, their I/O-efficient algorithm takes around 40 to 50 hours (depending on graph type) both to decompose the graphs and to perform the BFS traversal, while conventional methods take 140 to 166 days to traverse the same graphs.
Arge et al. [2] describe a project to develop I/O-efficient algorithms for problems on grid-based terrains. A grid-based terrain is a geospatial data set where each point in the grid is labeled with spatial coordinates (such as latitude and longitude) and an elevation. One problem is to compute the flow accumulation points of the terrain – the low points where water will likely flow. Computing the flow accumulation requires initially placing one unit of flow at every grid point and then distributing flow to neighbor points according to their height differences. For a $\sqrt{N} \times \sqrt{N}$ grid this can be done in memory in $O(N \log N)$ time by sorting grid points by height, then scanning the sorted points and distributing flows to downhill neighbors. This algorithm is not I/O efficient, however, because sorting destroys the geospatial
locality needed to transfer flow from points to their neighbors. In the worst case, processing each point in sorted order would require a read and write for half its neighbors, totaling $O(N^2)$ I/O operations.
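The in-memory algorithm just described can be sketched as follows. This is a minimal illustration, not the code of Arge et al.: the function name `flow_accumulation` is hypothetical, and the sketch assumes an 8-neighbor grid in which each point passes its flow to strictly lower neighbors in proportion to the height drop.

```python
def flow_accumulation(elev):
    """In-memory flow accumulation on a grid of elevations (list of rows)."""
    rows, cols = len(elev), len(elev[0])
    flow = [[1.0] * cols for _ in range(rows)]   # one unit of flow per point

    # Sort grid points from highest to lowest: the O(N log N) step.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: elev[rc[0]][rc[1]],
        reverse=True,
    )

    for r, c in order:
        # Collect strictly lower neighbors and their height drops.
        lower = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    drop = elev[r][c] - elev[nr][nc]
                    if drop > 0:
                        lower.append((nr, nc, drop))
        total_drop = sum(d for _, _, d in lower)
        # Assumed rule: share flow in proportion to the height drop.
        for nr, nc, d in lower:
            flow[nr][nc] += flow[r][c] * d / total_drop
    return flow


print(flow_accumulation([[3, 2, 1]]))   # [[1.0, 2.0, 3.0]]
```

The scan visits points in height order, so consecutive iterations may touch distant parts of the grid. When the grid lives on disk, this is the access pattern that produces the poor I/O behavior noted above.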
The authors show how to organize the computation so that flow distribution can be performed in a single I/O pass. They compare the standard internal algorithm and their I/O-efficient version using five geospatial data sets ranging in size from 12MB to 508MB. On small inputs the internal algorithm runs slightly faster than the external algorithm, but once a threshold based on main memory size is reached, the internal algorithm grinds to a halt, spending all its time thrashing among I/O accesses. On one data set containing 512MB grid points the I/O-efficient algorithm finished in about four hours; the authors estimate that the internal algorithm (halted after four days of computation) would have taken several weeks to finish.
Guideline 4.20 Pay attention to the frequency, order, and size of I/O requests in I/O-bound computations.
4.2.2 Concurrency
Nowadays every desktop or laptop is a multicore platform with two to eight separate processors (sometimes more) capable of executing code in parallel. The main tool for speeding up algorithms to run on these new platforms is to apply multithreading, which splits a given process into two (or more) separate instruction streams: each stream is called a thread. In a perfect world, a process could be split into p threads to run on p processors and finish p times faster than on one processor. Of course, this so-called perfect parallel speedup cannot always be realized, since some parts of a computation are necessarily sequential. Algorithm and code tuning strategies can be applied to achieve partial – but still significant – parallel speedups in many cases.
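The effect of the sequential part can be estimated with the standard bound known as Amdahl's law, which the text above does not name. As a sketch, assume a fraction $f$ of the single-processor running time parallelizes perfectly across $p$ processors, the remaining $1 - f$ is strictly sequential, and threading itself costs nothing:

```latex
S(p) = \frac{1}{(1 - f) + f/p},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{1 - f}

% Worked example with the assumed values f = 0.9, p = 8:
S(8) = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7
```

Under these assumptions, a computation that is 90 percent parallelizable runs less than five times faster on eight processors and can never run more than ten times faster, however many processors are added.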
Finding general strategies for exploiting low-level parallelism is a relatively new area of experimental algorithmic research, and there are more questions than answers about how to proceed. One obstacle to progress is the absence of a general model of parallel computation that reflects real computation times on a wide variety of modern architectures. As a result, an implementation tuned for performance on one platform may require substantial reworking to achieve similar results on another. Even within a single platform, performance can depend dramatically on how the process scheduler maps threads onto processors and on the order in which separate threads are executed: scheduler decisions are impossible to predict yet may have more impact on computation time than any particular tuneup. Finally, inadequate time measurement tools on concurrent systems make it difficult to measure properly the effects of any given tuneup.
Although our understanding of best practice in this area is nowhere near fully developed, a handful of general tuning techniques can be identified. The basic idea is to decompose an algorithm into some number of distinct threads that work on separate subproblems (with separate memory address spaces) and do not need to communicate with one another. Threads slow down when information must be shared, because communication requires synchronization, which means that one process is likely to be stuck waiting for another. This is true even when communication takes place via data reads and writes to the same virtual address. The cache coherence problem refers to the possibility that processor-specific caches may hold different values for the same element (at the same virtual address), without being aware of one another. If the runtime system does not take steps to ensure cache coherence, the programmer must incorporate synchronization code into the threads: either way, synchronization slows down the parallel computation.
Here is a list of algorithm and code tuning strategies for exploiting threading on multicore computation.
• Divide-and-conquer algorithms often are natural candidates for parallelization, since they work by breaking the problem into independent subproblems. Therefore, each recursive procedure call can trigger a new thread that works independently of sibling threads. A small amount of synchronization may be required if the divide-and-conquer algorithm performs a postorder processing step.
• Branch-and-bound algorithms can sometimes be structured so that multiple threads can work independently on solution subsets, except for intermittent sharing of their currently optimal solutions. The question is how to balance the synchronization costs of sharing new solutions against the instruction savings from pruning when better solutions are shared.
• Many array-based computations are natural candidates for parallel decomposition, if the arrays can be separated into independent sections processed in separate threads.
• Decoupling slow processes, such as those involving user interfaces and I/O, from the main instruction-heavy computation allows the main thread to avoid being continually interrupted by synchronization requests.
• Minimize threading overhead. Thread creation and destruction have high overhead, so a few long-lived threads may run faster than many short-lived threads. When communication among threads is necessary, a few synchronization phases with larger blocks of shared data are more efficient than many small synchronization phases with smaller blocks of shared data.
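The array-decomposition and low-overhead strategies can be sketched together in Python. The names `chunk_bounds` and `parallel_sum_of_squares` are hypothetical, chosen for this illustration. The sketch uses a small pool of long-lived worker threads, gives each one a contiguous section of the array, and synchronizes only once, when the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal sections."""
    step, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def parallel_sum_of_squares(data, workers=4):
    """Sum x*x over data, one independent array section per worker."""

    def work(bound):
        lo, hi = bound
        # Each worker reads only its own section and shares nothing.
        return sum(data[i] * data[i] for i in range(lo, hi))

    # A few long-lived threads, created once and reused by the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(work, chunk_bounds(len(data), workers)))
    return sum(partials)   # the only point where results are combined


print(parallel_sum_of_squares(list(range(10)), workers=3))   # 285
```

One caveat applies to this sketch: in the standard CPython interpreter, the global interpreter lock keeps threads from executing Python bytecode in parallel, so this code shows the decomposition pattern rather than a real speedup. The same structure carries over to process pools or to languages with native threads.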
4.3 The Tuning Process
We turn now from the question of how to tune algorithms and code to consider questions of when and why.
Certainly performance considerations should come into play well before implementation – let alone code tuning – begins. The code to be tuned should have “good bones,” which can only be obtained by proper decomposition of the system into well-structured components and by choosing the right algorithm to implement in the first place. No amount of tuning can rescue a fundamentally flawed design with poor asymptotic performance from the start.
Furthermore, algorithm and code tuning should not begin until after all the code is written. The tune-as-you-go strategy is a recipe for failure: the important performance bottlenecks in a program can only be identified once the code is complete and can be run on realistic-sized inputs.
Start by building a simple, straightforward implementation of the algorithm and apply your best verification and validation skills to ensure that the implementation is correct.