Now we give an implementation strategy for VEC-BSP on its target implementation model, BSP. Because BSP has no shared memory, the main issue is to specify the placement and movement of data in the style of message-passing programming, while keeping things simple enough to predict cost automatically. Furthermore, our implementation model should follow the superstep structure of BSP. We use execution diagrams to explain the implementation model and the data movements on it. In our execution diagrams, time proceeds from left to right, with the activities of each processor proceeding horizontally.
Data flows from left to right, being manipulated and transmitted according to instructions derived from the terms of the source program. We use one of the processors (at the top in our diagrams) as a master processor, in which the necessary data is stored at the beginning of the computation and in which the result is eventually stored at the end of the computation.
A complete computation of a program t t' has a nested structure consisting of four ordered parts, illustrated in the diagram of figure 3.3, in which shaded processors indicate the existence of data in those processors.
• Et': an evaluation of the argument t'
• Et: an evaluation of the function t
• C: a communication, in which the data of the results of Et' and Et are redistributed to processors for the next process if necessary, followed by a barrier synchronisation
• A: an application, in which the result of Et is applied to the result of Et'.
Figure 3.3: Parallel application (phases Et', Et, C and A; horizontal axis: time; vertical axis: processors)
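The nested four-part structure can be sketched as a recursive cost sum. This is only an illustration under stated assumptions: the names `Leaf`, `App` and `total_cost` are hypothetical and do not come from the VEC-BSP tool, and the costs of C and A are supplied as plain numbers rather than derived from cost tuples.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """A term that is not an application; its evaluation cost is given directly."""
    cost: float


@dataclass
class App:
    """An application t t' (hypothetical representation, for illustration only)."""
    fun: "Term"    # t, evaluated in phase Et
    arg: "Term"    # t', evaluated in phase Et'
    comm: float    # cost of phase C (0 for a sequential application)
    apply: float   # cost of phase A


Term = Union[Leaf, App]


def total_cost(term: Term) -> float:
    """Sum the four ordered parts Et', Et, C and A, recursing into nested applications."""
    if isinstance(term, Leaf):
        return term.cost
    return total_cost(term.arg) + total_cost(term.fun) + term.comm + term.apply


# A nested application: the argument of the outer application is itself an application.
inner = App(fun=Leaf(1.0), arg=Leaf(2.0), comm=3.0, apply=4.0)
outer = App(fun=Leaf(5.0), arg=inner, comm=6.0, apply=7.0)
print(total_cost(outer))
```

The recursion mirrors the nesting described above: whenever Et' or Et is itself an application, its own four parts are costed first.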
Nesting arises because Et' and Et can themselves be application terms. At the end of each part the data is stored in the master processor, except where automatic optimisation is used to remove some of the overhead incurred, as explained later. The application phase A may be either sequential or parallel. A sequential application, illustrated in figure 3.4, is executed only in the master processor when t is a sequential function (this information can be obtained from the application tuple). There is no communication in C because the necessary data already resides in the master processor.
Figure 3.4: Sequential application (phases Et', Et and A; horizontal axis: time; vertical axis: processors)
The parallel application illustrated in figure 3.3 is executed among the processors when the function t is a skeleton combinator (this information can also be obtained from the application tuple) whose parallel implementation template is predefined. We place the following restrictions on the implementation template.
• The template must follow the BSP model, that is, computation and communication are separated by machine-wide synchronisation.
• The data of the argument is distributed evenly at the beginning of the template.
• All processors perform the same operation.
• The result is eventually stored in the master processor.
Combinators map and fold are typical examples of second-order functions which have parallel templates. The implementation template of map applies the function sequentially to the vector segment in each processor, then gathers the results to the master. The fold implementation template folds sub-vectors sequentially on each processor. The results are transferred to the master processor, which folds them together sequentially to compute the overall result. Figure 3.5 illustrates these applications. Solid lines indicate computation and dotted lines indicate gather. The narrow vertical box denotes machine-wide synchronisation. The details of the full set of our current implementation skeletons are presented in chapter 4.
Figure 3.5: Application examples (labels: map, fold, sequential)
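The map and fold templates can be simulated sequentially as follows. This is a minimal sketch rather than the implementation presented in chapter 4: the function names are hypothetical, the p processors are modelled as list segments, and fold is assumed to take an associative operator together with its unit.

```python
from functools import reduce


def split_evenly(vec, p):
    """Split vec into p contiguous segments whose lengths differ by at most one."""
    q, r = divmod(len(vec), p)
    segments, start = [], 0
    for i in range(p):
        size = q + (1 if i < r else 0)
        segments.append(vec[start:start + size])
        start += size
    return segments


def map_template(f, vec, p):
    """Apply f sequentially on each processor's segment, then gather to the master."""
    segments = split_evenly(vec, p)                      # data is distributed evenly
    local = [[f(x) for x in seg] for seg in segments]    # same operation on every processor
    # A machine-wide synchronisation would occur here, followed by the gather.
    return [y for seg in local for y in seg]             # result held by the master


def fold_template(op, unit, vec, p):
    """Fold each sub-vector locally, then let the master fold the partial results."""
    segments = split_evenly(vec, p)
    partial = [reduce(op, seg, unit) for seg in segments]  # local sequential folds
    # A machine-wide synchronisation would occur here, followed by the gather.
    return reduce(op, partial, unit)                        # sequential fold at the master


print(map_template(lambda x: x * x, [1, 2, 3, 4, 5], 3))
print(fold_template(lambda a, b: a + b, 0, list(range(1, 11)), 4))
```

Both functions respect the four restrictions listed above: the input is split evenly, every simulated processor performs the same operation, computation and gathering are separated, and the result ends up in a single place standing for the master.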
Using a skeleton requires communication for data rearrangement in C, in which the data describing the result of Et' is scattered evenly to all processors. If the result of Et has any data component, it is broadcast to all processors. For example, if t is map (+ t''), where t'' generates some value, then that value must be broadcast to all processors. Therefore, costing the communication in C means costing the broadcast and scatter communication. The formula for this is given in section 3.5.3.
Figure 3.6 illustrates the scatter and broadcast-scatter communication. In the scatter diagram the master processor (at the top) scatters data between itself and three other processors. Dotted lines indicate scatter and dashed lines indicate broadcast.
Figure 3.6: Communication patterns (panels: scatter, broadcast-scatter)
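To make the two patterns concrete, the sketch below estimates the cost of C as a single superstep using the standard BSP charge of h · g + l. It is not the formula of section 3.5.3, which is not reproduced here; the function names are hypothetical, the master is assumed to keep its own share of the scattered data, and the broadcast is assumed to be a direct one-stage send from the master.

```python
def scatter_h(s, p):
    """Words leaving the master when s words are spread evenly over p processors."""
    return s * (p - 1) / p


def broadcast_h(b, p):
    """Words leaving the master in a direct one-stage broadcast of b words (assumption)."""
    return b * (p - 1)


def comm_cost(s, b, p, g, l):
    """Cost of one superstep that scatters s words and broadcasts b words, then synchronises."""
    h = scatter_h(s, p) + broadcast_h(b, p)   # the master is the busiest processor
    return h * g + l


# Hypothetical parameter values, chosen only for illustration.
print(comm_cost(s=1000, b=0, p=4, g=2, l=50))    # scatter only
print(comm_cost(s=1000, b=10, p=4, g=2, l=50))   # broadcast-scatter
```

With p = 4, as in the scatter diagram, the master sends three quarters of the data and retains the remaining quarter.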
Additional comments are required for the implementation of the application of a lambda term. The application (λx.t(x)) a is implemented as follows: a is evaluated first, x is substituted by the result of evaluating a, and t(a) is then evaluated in the application part A following the strategy described above. We assume that it takes no time to evaluate the term (λx.t(x)) itself; that is, we count the cost of evaluating a and the cost of evaluating t(a), which is performed in the application part, and ignore the other costs involved.
Figure 3.7: Removal of unnecessary communication (upper and lower diagrams, each with phases Et', Et, C and A)
Efficiency Problem
We required the parallel implementation templates to store data in the master processor at the end of A. Consequently, the results of Et' and Et are also always stored in the master processor, since these are either themselves nested parallel applications abiding by the same rule, or are already sequential. This rule simplifies implementation and the costing of communication by providing a common interface for communication patterns across the nested term. However, it also causes an efficiency problem. For example, if a parallel application finishes by gathering the local result in each processor to the master, only for these results to be subsequently scattered as the inputs to an enclosing parallel function, then the gathering and scattering are superfluous. The upper half of figure 3.7 illustrates the structure of such a computation, for a term of the form map f (map g v). The first phase implements the map of g (with g assumed to be primitive), the second phase computes f (sequentially in this example), the third phase broadcasts data describing f and scatters the result of the first phase, and the final phase computes the outer map. The gathering and subsequent scattering of the result of the inner map is clearly redundant. When the data size of the result of map g v is s, a cost of $2 \cdot s \cdot \frac{p-1}{p} \cdot g + l$, where p, g and l are the BSP parameters, can be saved.
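As a worked instance of the saving, the expression 2 · s · (p − 1)/p · g + l can be evaluated for concrete parameters. The function name and the parameter values below are hypothetical, chosen only to show the arithmetic, and the expression is read exactly as stated, with a single l term.

```python
def redundant_saving(s, p, g, l):
    """Cost removed when a gather of s words and the matching scatter are both eliminated."""
    return 2 * s * (p - 1) / p * g + l


# Hypothetical values: 1000 words, 4 processors, g = 2, l = 50.
print(redundant_saving(s=1000, p=4, g=2, l=50))

# The factor (p - 1) / p approaches 1 as p grows, so the saving tends towards 2 * s * g + l.
print(redundant_saving(s=1000, p=64, g=2, l=50))
```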
There are several possible solutions. One would be to define several versions of each skeleton, with implementations differing only in the data distribution at the end of the application process, and expect the programmer to choose one of them to optimise each C. This would have to be done by hand and would make programming more difficult. Another solution would be to predefine combining skeletons, which combine skeletons so that the interface communication of the component skeletons can be easily optimised, following To's work [81]. Instead we chose an automated route, demonstrating that our static analysis can be extended to analyse the interface communication pattern by adding an argument data pattern to cost tuples. The next section describes how our tool detects and resolves such inefficient cases and excludes the unnecessary costs from the predicted BSP cost.