The stencil problem has been widely studied due to its importance and occurrence in computational sciences. Several optimization techniques have been proposed over time.
Stencil codes are often easy to parallelize, but it is difficult to reach a good fraction of peak performance because of their usually low arithmetic intensity. The proposed techniques act either on the loop structure, rearranging it while performing exactly the same operations, or on the algorithm, dramatically changing the number of operations required.
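As a rough illustration of why a low arithmetic intensity limits the attainable fraction of peak, the following sketch estimates a roofline-style bound for a 7-point constant-coefficient stencil in double precision. The flop count, the memory traffic model, and the machine figures are assumptions chosen for illustration only; they are not measurements from this work.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Flops executed per byte moved to or from main memory."""
    return flops_per_point / bytes_per_point


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline-style bound: the lower of compute peak and memory ceiling."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Assumption: 8 flops per grid point, and one 8-byte read plus one 8-byte
# write per point (all neighbour accesses are served from cache).
ai = arithmetic_intensity(8, 16)

# Hypothetical machine: 500 GFlop/s peak, 100 GB/s memory bandwidth.
bound = attainable_gflops(ai, 500.0, 100.0)

print(f"arithmetic intensity: {ai} flop/byte")
print(f"attainable: {bound} GFlop/s ({100 * bound / 500.0:.0f}% of peak)")
```

Under these assumed numbers the kernel is memory bound at a small fraction of peak, which is why the optimizations discussed here aim at reducing memory traffic rather than the operation count.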
Cache blocking and time skewing belong to the first group: they are implementation-only optimizations in which the loops are restructured to improve performance. Multigrid and adaptive mesh refinement belong to the second group.
Since rearranging the loops by hand is complicated, ad hoc stencil compilers have been developed and studied; among these are Pochoir, Halide, PLUTO, and PATUS.
Recently, there have been efforts to port stencil codes to GPUs via compilers such as TOAST [111].
In this work, we consider two stencil compilers that use different approaches: PLUTO [18], a source-to-source C compiler that exploits the polyhedral model to find affine transformations for efficient tiling in both space and time, and PATUS [26], which uses a DSL to describe a stencil productively and then auto-tunes the generated code according to a predefined strategy.
Cache Blocking
As previously discussed, the capacity of the cache affects the number of cache misses. Considering the example of stencil sweeps in Section 8.2.3, it is possible to block the loops so that a useful working set is held in cache, a well-known concept deriving from the cache blocking techniques applied to dense matrix-matrix multiplication.
In two dimensions, it is necessary to block only the unit-stride loop, while in three dimensions either the unit-stride loop, the middle-dimension loop, or both need to be blocked to maintain a cache-friendly working set.
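The loop restructuring described above can be sketched as follows for a single sweep of a 7-point stencil on a three-dimensional grid, blocking both the middle dimension (j) and the unit-stride dimension (k). This is a minimal sketch under stated assumptions: the function names, the averaging stencil, and the block sizes bj and bk are hypothetical, and nested Python lists only illustrate the loop structure (the cache effect itself would be observed with the same loop nest in a compiled language such as C).

```python
def make_grid(nx, ny, nz, fill=0.0):
    """Allocate an nx x ny x nz grid; the last index (k) is unit-stride."""
    return [[[fill] * nz for _ in range(ny)] for _ in range(nx)]


def point_update(src, i, j, k):
    """Assumed 7-point stencil: average of a point and its six neighbours."""
    return (src[i][j][k]
            + src[i - 1][j][k] + src[i + 1][j][k]
            + src[i][j - 1][k] + src[i][j + 1][k]
            + src[i][j][k - 1] + src[i][j][k + 1]) / 7.0


def sweep_naive(src, dst):
    """One sweep over the interior points in plain i, j, k order."""
    nx, ny, nz = len(src), len(src[0]), len(src[0][0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                dst[i][j][k] = point_update(src, i, j, k)


def sweep_blocked(src, dst, bj, bk):
    """Same sweep, with the j and k loops split into bj x bk blocks.

    For each (jj, kk) block the i loop streams through the whole grid, so
    only a bj x bk slab of a few neighbouring planes must stay in cache.
    """
    nx, ny, nz = len(src), len(src[0]), len(src[0][0])
    for jj in range(1, ny - 1, bj):
        for kk in range(1, nz - 1, bk):
            for i in range(1, nx - 1):
                for j in range(jj, min(jj + bj, ny - 1)):
                    for k in range(kk, min(kk + bk, nz - 1)):
                        dst[i][j][k] = point_update(src, i, j, k)
```

Both functions perform exactly the same operations on every point; only the traversal order differs, so they produce identical results for any positive block sizes.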
Time Skewing
A stencil is applied multiple times, over several sweeps, and cache blocking only works inside a single sweep. The same concept of blocking the loops can also be applied to the time loop, thus obtaining a space-time blocking. In this way, once the points of interest are in cache, they are advanced in time to maximize their reuse, possibly increasing the arithmetic intensity. It must be noted that such a technique is limited both by the bandwidth to the cache and by the in-core performance.
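The space-time blocking described above can be sketched for a one-dimensional 3-point Jacobi stencil with fixed boundary values and two buffers. This is a minimal sketch, not the implementation used in this work: the names, the averaging stencil, and the tile sizes bx (space) and bt (time) are hypothetical. Each spatial tile is shifted one point to the left at every time step, so that the values it needs have already been produced by the tiles to its left.

```python
def jacobi_naive(u0, steps):
    """Reference: one full sweep per time step, swapping two buffers."""
    src, dst = list(u0), list(u0)
    for _ in range(steps):
        for i in range(1, len(src) - 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        src, dst = dst, src
    return src


def jacobi_time_skewed(u0, steps, bx, bt):
    """Same computation, traversed in skewed space-time tiles.

    buf[t % 2] holds time level t. Within a block of bt time steps, each
    spatial tile of width bx is advanced through all bt steps before the
    next tile is started; the tile moves left by one point per step.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # boundary values are present in both buffers
    for t0 in range(0, steps, bt):
        t1 = min(t0 + bt, steps)
        # Tile origins run past the right edge so that the shifted tiles
        # still cover the last interior points at the later time steps.
        for xs in range(1, n - 1 + (t1 - t0), bx):
            for t in range(t0, t1):
                shift = t - t0
                lo = max(1, xs - shift)
                hi = min(n - 1, xs + bx - shift)
                src, dst = buf[t % 2], buf[(t + 1) % 2]
                for i in range(lo, hi):
                    dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]
```

With bt = 1 the traversal degenerates to plain spatial blocking; larger values of bt keep a tile in cache for several time steps, at the price of the skewed loop bounds.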
Cache-oblivious algorithms have been applied to structured grid codes [44, 78] by organizing the space-time in trapezoids and parallelepipeds, which are traversed in a recursive order. Such a traversal is so expensive that it cancels the benefits of the reduction in cache misses, thus resulting in even slower code. Cache-aware implementations have been introduced [91, 119, 121, 137]: they adopt the idea of dividing the space-time into trapezoids and parallelepipeds but use complex loop nests instead of recursion. The code complexity increases drastically with the number of dimensions that are blocked.
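To make the recursive traversal concrete, the following is a minimal sketch of a trapezoidal space-time decomposition for a one-dimensional 3-point stencil, written in the spirit of the cache-oblivious approach. It is an illustrative assumption, not the code of the cited works: the function names, the averaging stencil, and the fixed boundary values are hypothetical. A trapezoid spans the time range [t0, t1) and the space range [x0, x1) at time t0, with edges of slope dx0 and dx1; it is cut in space when it is wide enough and in time otherwise.

```python
def kernel(buf, t, x):
    """Compute time level t + 1 at point x; buf[t % 2] holds level t."""
    src, dst = buf[t % 2], buf[(t + 1) % 2]
    dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0


def walk(buf, t0, t1, x0, dx0, x1, dx1):
    """Recursively traverse one space-time trapezoid."""
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(buf, t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # Wide trapezoid: space cut along a line of slope -1.
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(buf, t0, t1, x0, dx0, xm, -1)
            walk(buf, t0, t1, xm, -1, x1, dx1)
        else:
            # Tall trapezoid: time cut, lower half before upper half.
            s = dt // 2
            walk(buf, t0, t0 + s, x0, dx0, x1, dx1)
            walk(buf, t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)


def stencil_recursive(u0, steps):
    """Apply the stencil `steps` times to the interior points of u0."""
    buf = [list(u0), list(u0)]  # boundary values are present in both buffers
    walk(buf, 0, steps, 1, 0, len(u0) - 1, 0)
    return buf[steps % 2]
```

No cache size appears anywhere in the code, which is what makes the scheme cache oblivious; the recursion itself is the overhead that the loop-nest based, cache-aware variants try to avoid.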
Chapter 9
Experimental Testbeds
We set up and execute two macro-experiments (as defined in Section 4.1) using up to three different methods, on two systems. A precise description of the experiment, in terms of (Problem, Method, System), is a basic step towards the reproducibility of the research.
As argued in Section 5.1, sharing the source code is beneficial, but its availability is not sufficient for reproducibility. In fact, the code may not compile, or the results could be affected by differences in other components of the software stack. Pieces of information such as the version of the compiler, compilation flags, configurations, experiment parameters, and raw results are fundamental for the reproducibility of an experiment.
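As an illustration of the kind of information listed above, the following sketch collects it into a single record that can be stored next to the raw results. It is a hypothetical layout, not the format used by any specific tool: the field names, the example compiler, flags, parameters, and timings are all assumptions, and the compiler version is obtained by assuming that the compiler accepts a --version option.

```python
import json
import platform
import shutil
import subprocess


def compiler_version(compiler):
    """First line printed by `<compiler> --version`, or None if unavailable."""
    path = shutil.which(compiler)
    if path is None:
        return None
    try:
        out = subprocess.run([path, "--version"], capture_output=True,
                             text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.stdout.splitlines()
    return lines[0] if lines else None


def experiment_record(compiler, flags, parameters, raw_results):
    """Bundle environment, build settings, parameters and results."""
    return {
        "system": {"machine": platform.machine(), "os": platform.platform()},
        "compiler": {"name": compiler,
                     "version": compiler_version(compiler),
                     "flags": list(flags)},
        "parameters": dict(parameters),
        "raw_results": list(raw_results),
    }


# Hypothetical example values, for illustration only.
record = experiment_record("gcc", ["-O3", "-fopenmp"],
                           {"grid": [512, 512, 512], "threads": 8},
                           [1.93, 1.91, 1.95])
print(json.dumps(record, indent=2))
```

Storing such a record in a plain, serializable form makes it possible to compare runs a posteriori even when the original software stack is no longer available.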
The most important conference in the field of high performance computing, SuperComputing, has launched a reproducibility effort since its 2016 edition, inviting authors to submit, together with their papers, an artifact description, i.e. an appendix describing the details of their software environments and computational experiments, so that an independent person could replicate their results. Such an appendix will be mandatory for papers submitted to the main track, starting from SuperComputing 2019.
It is worth noting that the information stored by PROVA! provides all the necessary fields to fill such an appendix, plus additional details that may be used for further analysis a posteriori. The configurations, methods, source code, and methodTypes (with their respective run instructions and easyconfigs) used in the experiments discussed later in this work are available at [54].
9.1 Systems
Unless otherwise specified, the experiments in this work have been executed in two high performance computing facilities, geographically far away from each other, composed of compute nodes with different architectures. The stencil experiments have been run on a single node of the Emmy and miniHPC clusters.
The software stack is maintained by PROVA! v0.3 (on both clusters) and, at any given time, only the needed software is present in PATH.