1666615x

Texto completo

(1)c British Computer Society 2001 . A Data-Parallel Formulation for Divide and Conquer Algorithms M. A MOR1 , F. A RG ÜELLO2 , J. L ÓPEZ3 , O. P LATA3. AND. E. L. Z APATA3. 1 Department of Electronics and Systems, University of La Coruña, E–15071 La Coruña, Spain 2 Department of Electronics and Computation, University of Santiago de Compostela,. E–15782 Santiago de Compostela, Spain 3 Department of Computer Architecture, University of Málaga, E–29071 Málaga, Spain. Email: [email protected] This paper presents a general data-parallel formulation for a class of problems based on the divide and conquer strategy. A combination of three techniques—mapping vectors, index-digit permutations and space-filling curves—are used to reorganize the algorithmic dataflow, providing great flexibility to efficiently exploit data locality and to reduce and optimize communications. In addition, these techniques allow the easy translation of the reorganized dataflows into HPF (High Performance Fortran) constructs. Finally, experimental results on the Cray T3E validate our method. Received 23 December 1999; revised 5 April 2001. 1. INTRODUCTION. deal with DC problems, typically the data is already structured either as linear arrays or as trees. In the second case, we may use space-filling curves [4], such as, for instance, Peano–Hilbert (PH) curves.. The design of compilers for parallel machines that generate (semi-)automatically parallel programs with acceptable performance is a research area of increasing interest. In order to facilitate the analysis done by the compiler, parallelization environments based on language extensions have been proposed. High Performance Fortran (HPF) [1, 2, 3] is one of the most significant cases, based on the data-parallel paradigm. HPF permits the programmer to specify data distributions, parallel sections, communications/synchronization optimizations and so on. However, writing efficient HPF programs is not necessarily a trivial task. Indeed, the high-level nature of the language often leads to inefficient parallel code. Additionally, the suitability of HPF to obtain efficient parallel codes for a large variety of complex applications has not been sufficiently proven. The programmer needs a good understanding of both the application and the HPF execution model in order to map data efficiently on the processors’ memory and to organize communications. In order to help the programmer in this difficult task, we have developed a data-parallel framework that permits the description in a uniform, methodical and precise way, of an important class of complex problems, based on the divide and conquer (DC) method. Under this framework, the problem may be adapted to the data-parallel programming paradigm, in such a way that it can be easily translated into HPF. Our data-parallel formulation of a DC problem requires four phases.. Workload balancing. Finally, in dynamic computations, a workload balancing scheme should be included in the parallel problem.. Linearization. The programming model implies the use of linear arrays to represent the data. As our aim is to. Figure 1 shows schematically the two key components of the described parallelization framework.. T HE C OMPUTER J OURNAL,. Data distribution. After linearizing the problem, mapping vectors [5] are used to describe the mapping problem, that is, how data and computations are mapped on the processors of the parallel computer. In a data-parallel paradigm, only data distributions are considered, as computations are assigned to processors based on the owner-computes rule. Computation/communication organization. This is one of the key steps when obtaining an efficient parallel implementation of the DC problem. Index-digit operators [6] are used to organize the problem dataflow, expressing computations and communications as operator strings. This operational description of the problem may be used to rearrange its dataflow, with the aim of optimizing data locality (minimizing communications) and workload balancing. As data are already arranged as linear arrays (sequences), each data item may be identified by its position (or index) in such a sequence. Hence, index-digit operators are defined as permutations of the digits in the numerical representation of the indices of the data sequence. Such permutations result in permutations of the data sequence itself.. Vol. 44, No. 4, 2001.

(2) 304. M. A MOR et al. Mapping Vector. Index-Digit permutation. Data Distribution. Communication/Computation. HPF distribution and aligning directives. Array features. Local extrinsic procedures. FIGURE 1. Key components in the data-parallel framework for divide and conquer problems.. We have used this formulation previously for mesh connected multicomputers, but using the message-passing programming model [7, 8]. In this paper, however, we adapt our parallelization framework to the data-parallel paradigm, and present it in a more detailed and general way (for instance, a performance model is developed), including a guideline for applying it to a particular DC problem. Computation and communication operators are designed in such a way that they can be easily translated into HPF constructs, without needing to add new features to the language and compiler. In particular, computations are expressed as extrinsic (local) procedures, while communications are basically expressed as array sectioning operations. As a result, the formulated problem can be easily translated into an equivalent HPF program. The main goal of our data-parallel formulation is to reach a trade-off parallel solution considering a general formulation of the problem and overall performance. In this paper we will first consider regular DC dataflows, which we classify into two groups, successive doubling and trees. As examples to validate our method, we will take the self-sorting fast Fourier transform and some tridiagonal system solvers. Afterwards, an irregular DC dataflow (the Barnes–Hut method for solving N-body problems) will be taken into consideration. Octtrees resulting from this method are linearized by a space-filling curve, permitting in this way the use of similar techniques to those used for the regular case. HPF formulations obtained from these algorithms have been implemented on a Cray T3E multiprocessor. In these experiments, we show that the cost of data redistributions among different processors for irregular dataflows is comparable to that for regular dataflows. The organization of the rest of the paper is as follows. In Section 2 we present our data-parallel formulation in terms of the index-digit operators and the mapping vector technique. Section 3 introduces the parallelization guideline and the performance model. DC problems are also introduced and analysed in the context of the data-parallel formulation. Section 4 evaluates experimentally the selected problems and discusses the results obtained. Finally, some related work is discussed in Section 5 and in Section 6 the paper is concluded. 2. MAPPING VECTORS AND INDEX-DIGIT OPERATORS Let us define the index-digit representation [6] of a data sequence of size N = r n , where r is some integer greater T HE C OMPUTER J OURNAL,. than 1 (called base or radix). In this representation, each data item of the sequence is denoted by the base r decomposition of its index. That is, data item A(t) with index t = tn r n−1 + . . . + t2 r + t1 is written as [tn . . . t2 t1 ]. Consider now that our parallel machine is a distributed memory multiprocessor. Without loss of generality, we can assume that the processors are organized as a linear array, Q = r q . After distributing the data sequence A among the processors, each data item A(t) may be characterized by a pair of two coordinates (distribution representation), the first one representing the processor assigned (processor coordinate) and the second one representing the local memory address (memory coordinate) where it is stored. Considering perfect load balancing, each processor will be assigned a total of U = N/Q = r u data items (u = n − q). If a different processor topology is considered, the processor coordinate could be subdivided into other coordinates. As an example, for a mesh topology, the (row, column) coordinates should be adequate. 2.1. Mapping vectors A mapping vector is defined as a one-to-one assignment between both representations, the index-digit and the distribution. That is, it specifies which of the ti (1 ≤ i ≤ n) digits correspond to the processor coordinate (a total number of q digits) and which ones correspond to the memory coordinate. As an example, if we assign the q most significant digits of the index-digit representation to the processor coordinate and, therefore, the remaining n − q digits to the memory coordinate, then the mapping vector corresponds to a block distribution of the data sequence. On the other hand, if the q least significant digits are assigned to the processor coordinate, a cyclic distribution results. In general, a block-cyclic distribution, with a block size of r k (0 ≤ k ≤ u, where u = n − q), is specified by assigning the set of digits [tq+k . . . tk+1 ] to the processor coordinate, and the remaining set [tn . . . tq+k+1 , tk . . . t1 ] to the memory coordinate. Note that these mapping vector assignments have a direct translation into HPF data distributions. 2.2. Index-digit operators Index-digit operators are used to specify the problem dataflow. Thus we will have two types of operators, corresponding to computations and data rearrangements (including communications). Operators of the first type transform the data sequence into another of the same size. They represent in-place arithmetic operations on the data items, and hence do not modify the data coordinates. Operators of the second type represent data reordering, which can be local or imply data movement among processors (i.e. communications). These operators modify the data coordinates, and are defined as permutations of the data item index-digit representation. 2.2.1. Computational operators Now let us define now some basic computational operators. Vol. 44, No. 4, 2001.

(3) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS 000 0. 0. 0. 0. 0. 0. 001 1. 1. 1. 1. 1. 1. 010 2. 2. 2. 2. 2. 2. 011 3. 3. 3. 3. 3. 3. 100 4. 4. 4. 4. 4. 4. 101 5. 5. 5. 5. 5. 5. 110 6. 6. 6. 6. 6. 6. 111 7. 7. 7. 7. 7. 7. Bi. B’i. B’’i. FIGURE 2. Butterfly operators (Bi , Bi , Bi ) for a data sequence of length N = 8, radix r = 2 and i = 1. Data items are represented by their indices in the sequence, while the black filled circles represent an arithmetic operation. The leftmost column shows the index-digit representation of the data sequence.. D EFINITION 2.1. The butterfly operators, Bi , Bi and Bi , perform the following data operations, • Bi reads N/r sets of r distinct data items in the sequence differing only in their ith digit ({[tn . . . ti . . . t1 ], 0 ≤ ti < r}), • Bi is similar to Bi , except that it also reads, for each set, the r − 1 data items of the form {[tn . . . ti+1 0ti−1 . . . t1 ] − [0 . . . 0ti 0 . . . 0] ≥ [0 . . . 0], 0 < ti < r}, • Bi is similar to Bi , except that it also reads the data item ([tn . . . ti+1 {r − 1}ti−1 . . . t1 ] + [0 . . . 010 . . . 0] ≤ [1 . . . 1]), then perform any required arithmetic operations on data and write the r results in positions {[tn . . . ti . . . t1 ], 0 ≤ ti < r}. The + and − operations carried out on the index-digit representation are understood as follows. Given the indexdigit representations a = [an . . . a1 ] and b = [bn . . . b1 ], then [an . . . a1 ] + [bn . . . b1 ] is the index-digit representation of a + b. Examples of butterfly operators are graphically depicted in Figure 2, where N = 8, r = 2 and i = 1. The three operators compute four butterflies. Each butterfly in operator B1 , B1 and B1 reads two, three and four data items, respectively. For the case r = 3, operators B1 and B2 are depicted at the top of Figure 3. Let us analyse the behaviour of the butterfly operators on a data-parallel programming environment. A data-parallel model makes explicit the parallelism in terms of parallel operations on data aggregates. This paradigm is based on a single thread of control and a globally shared address space. Parallel executions are determined by both how data is distributed among processors and the owner-computes rule. Each sentence is executed by the processor that owns the data item to be written or modified, that is, the data item that appears in the left-hand side of the sentence. The data needed for such computation (variables at the right-hand side of the sentence) may be local or not. In the latter case, a T HE C OMPUTER J OURNAL,. 305. 00. 00. 00. 01. 01. 01. 02. 02. 02. 10. 10. 10. 11. 11. 11. 12. 12. 12. 20. 20. 20. 21. 21. 21. 22. 22. 22. B1. B2. Proc 0. Proc 1. Proc 2. 0 1 2. 3 4 5. 6 7 8. B1. first butterfly of the operator. 0. 1 2. 3 4 5. 6 7 8. B2. FIGURE 3. Dataflow of a two-stage algorithm using butterfly operators B1 and B2 , for a data sequence of length N = 9 and radix r = 3. Data items are specified by their index-digit representation. Assuming a block data distribution on a three-processor machine, the leftmost digit in the above representation is assigned to the processor field, while the other digit corresponds to the memory coordinate. Thus, all butterflies in B1 are computed locally, while the second stage (B2 ) implies global communications. In the bottom of the figure, the sets of data we have marked are those that recombine in the first butterfly of B2 and the arrows represent the communications needed to compute such butterfly.. communication operation to gather that data is inserted by the compiler in front of the sentence. Consider again the butterfly operator Bi . Subindex i represents the digit with respect to which the index-digit representation of all input data differs. Thus, and once the mapping vector has been chosen, if the ith digit has been assigned to the memory coordinate, any set of r input data to operator Bi will have the same processor coordinate. Then, Bi will execute computations concurrently in all processors using only local data. Otherwise, if the ith digit is in the processor field, then each input of a single butterfly is in a different processor. Hence global communications among r processors are needed in order to have a copy of the Vol. 44, No. 4, 2001.

(4) 306. M. A MOR et al.. input sequence to this butterfly in the r processors (each sends/receives r − 1 data to/from the other r − 1 processors). Due to the owner-computes rule, each of these processors computes one output of the butterfly and stores the result in-place. Note, however, that not all inputs to the butterfly operators Bi and Bi are local, even in the case of the ith digit belonging to the memory field. As an example of a dataflow, in the bottom of Figure 3 we show a block distribution of a data sequence of length N = 9 and r = 3 on a three-processor machine. Thus each data index consists of two digits, the most significant one assigned to the processor coordinate and the least significant one to the memory coordinate. The dataflow is described by the string of operators B1 B2 . In that condition, operator B1 (first stage) is computed locally in each processor (index 1 is in the memory field). However, operator B2 (second stage) requires communications in order to provide the correct input data in each processor. In this case, each processor sends/receives two pieces of data to/from the other two processors. The owner-computes rule implies that processor zero must compute the first output of all butterflies in B2 , while processors two and three compute the second and third outputs, respectively, of such butterflies. As an important conclusion, the parallel performance of the algorithm may improve if its dataflow is reorganized in such a way that all used butterfly operators have their indices belonging to the memory coordinate. This way, these operators will compute butterflies with maximum locality (complete in the case of Bi and almost complete for Bi and Bi ). There are several possibilities in HPF to express this kind of parallelism. One of the methods is the use of an HPF LOCAL procedure, which is executed concurrently in all processors. Variables in an HPF LOCAL procedure are private to the processor which is running that particular instance of the procedure, and thus may store different values on different processors. 2.2.2. Data rearrangement operators To reorganize the algorithm dataflow and obtain, for instance, maximum locality butterfly operators, some intermediate communication stages may be necessary. These stages are described using data rearrangement operators, that will be described in the rest of the section. D EFINITION 2.2. The exchange operator (transposition), Ei,j , applied to the data sequence, exchanges the i and j th digits of the index-digit representation of the sequence, as follows, Ei,j [tn . . . ti . . . tj . . . t1 ] = [tn . . . tj . . . ti . . . t1 ].. (1). Figure 4 shows examples of how to apply this operator to a radix-2 sequence of length N = 8. Turning again to the data-parallel implementation, data rearrangements may be specified into HPF by means of array-sectioning operations. The exchange operator, for instance, corresponds to an array-section swap and thus may be implemented as in the Fortran 90 subroutine shown in Figure 5a (bit-intrinsic procedures are explained, for T HE C OMPUTER J OURNAL,. 0. 1. 2. 3. 4. 5. 6. 7. a. b. c. d. e. f. g. h. E3,1. 0. 1. 2. 3. 4. 5. 6. 7. a. e. c. g. b. f. d. h. Sections permuted: {b,d}. E3,2. 0. 1. 2. 3. 4. 5. 6. 7. a. e. b. f. c. g. d. h. Sections permuted: {c,g}. {e,g}. {b,f}. FIGURE 4. Application of the operator E3,1 and E3,2 to a data sequence of length N = 8 (and r = 2). The rounded data sections are those that permute. SUBROUTINE Exchange(A,i,j,N) REAL A(:) INTEGER i,j,k,N,v(N) FORALL (k=1:N) v(k) = k-1 WHERE ((btest(v,i)) .NEQV. (btest(v,j))) v = ibits(v,i,1)*ibset(v,j) + ibits(v,j,1)*ibclr(v,j) v = ibits(v,i,1)*ibclr(v,i) + ibits(not(v),j,1)*ibset(v,i) v = v+1 A(v) = A ENDWHERE END SUBROUTINE (a) EXTRINSIC (HPF LOCAL) SUBROUTINE PerfectUnshuffle(A,i,j,U) REAL A(:) INTEGER i,j,k,U,v(U) IF (i .LEQ. j) THEN RETURN FORALL (k=1:U) v(k) = INT((k-1)/2j −1 ) v = ishftc(v,i-j,i-j+1) FORALL (k=1:U) v(k) = v(k)*2j −1 + mod(k-1,2j −1 )+1 A(v) = A END SUBROUTINE (b) FIGURE 5. Fortran 90 subroutine for implementing (a) the exchange operator and (b) the perfect unshuffle operator.. instance, in Appendix A of [9]). This subroutine performs the transposition operator Ei,j on an array A of size N. In Fortran 90, an array section can be extracted from a parent array in a completely general way by using vector subscript notations. A vector subscript, v(N), is a onedimensional integer array expression, in which each of its elements has the value of a subscript in the array section being defined. The array sectioning shown in Figure 5a is determined by using a masked array assignment, by means of a WHERE statement. The result is that the assignment is only accomplished for those elements that Vol. 44, No. 4, 2001.

(5) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS. 307. TABLE 1. Meaning of index-digit operators used in the text. Operator. Symbol. Expression. Exchange Reduction. Ei,j . Extension Perfect unshuffle Inverse-digit. ε i,j ρi,j. Ei,j [tn . . . ti . . . tj . . . t1 ] = [tn . . . tj . . . ti . . . t1 ] [tn . . . t2 t1 ] = [tn . . . t2 ], if t1 = 0 [tn . . . t2 t1 ] = ∅, if t1 = 0 ε[tn . . . t2 ] = [tn . . . t2 t1 ] i,j [tn . . . t1 ] = [tn . . . ti+1 tj ti . . . tj +1 tj −1 . . . t1 ] ρi,j [tn . . . t1 ] = [tn . . . ti+1 tj tj +1 . . . ti tj −1 . . . t1 ]. fulfil the condition ((btest(v,i)) .NEQV. (btest(v,j))). As a consequence, the transposition is only performed for those elements whose digits ti = tj . Table 1 provides the definitions of other basic data rearrangement operators. The reduction operator, , transforms a data sequence into another whose length is reduced by a factor of r, by eliminating the least significant digit from its index-digit representation. This operation can also be easily implemented through Fortran 90 arraysectioning operations. For example, in the case of radix r = 2, the reduction operator applied to an array A[N] may be implemented as follows, FORALL (i=0:Q-1) A(i*U+1:i*U+U/2) = A(i*U+1:i*U+U:2), FORALL (i=0:Q-1) A(i*U+U/2+1:i*U+U) = 0.. Taking Table 1 again, the extension operator, ε, corresponds to the reverse reduction operator. The perfect unshuffle operator, , performs a right cyclic shift between two digits in base r representation. This operator may be implemented in Fortran 90 as shown in Figure 5b. This subroutine applies i,j to an input array A. In this paper the perfect unshuffle will be only used to make local data rearrangements. Hence the vector subscript has a size U and the subroutine can be declared as EXTRINSIC (HPF LOCAL). Finally, the inverse-digit operator, ρ, reverses the order of the digits between two positions in the data sequence representation. This operator may be implemented as a combination of exchange operations. Note that if a permutation operator only modifies the memory field then a local data rearrangement is accomplished, with no communications. However, if the processor field is modified then data communication among processors occurs as a consequence of the operator. 3. DIVIDE AND CONQUER ALGORITHMS A classical and well-known strategy to solve problems is the DC paradigm. This method consists of dividing a problem into several subproblems of smaller size, solving each subproblem independently and then combining the results obtained to construct a solution to the original problem. In parallel computation, the DC technique is a powerful method, as subproblems generated at each level may be solved independently and, thus, concurrently. In this section we consider three classes of DC algorithms: T HE C OMPUTER J OURNAL,. successive doubling based, regular trees (standard and extended) and irregular trees (the Barnes–Hut method for the N-body problem). First, however, a parallelization guideline and a performance model are described. 3.1. Parallelization guideline In this section we propose a guideline to apply the parallelization framework to the first two classes of DC algorithms: successive doubling and regular trees. The irregular case is very difficult to formalize. However, our framework can also be applied, as shown at the end of this paper for a particular case. Note that the dataflow of the above two DC classes exhibits the following characteristics. • A sequence of n stages (see Figures 6a and 9a), each one may be described by means of some of the butterfly operators defined in Section 2.2.1. • At a stage, the distance of the input data indices to each butterfly is r times greater (lower) than the distance in the previous stage, r being the radix. Keeping in mind these characteristics, our guideline proceeds as follows. 1. For each algorithm we consider the straightforward formulation of its dataflow using only butterfly operators (see Figure 3). 2. In a data-parallel environment, and independently of how the data sequence is mapped on the machine, some of the butterfly operators will execute locally while others will require data communications. Once data distribution of the input sequence has been decided (by the programmer or externally by the problem), we identify the memory field using the mapping vector. From this field we determine which butterflies in the operator string have indices in it. These operators are executed with maximum locality (see Section 2.2.1). 3. The rest of the butterfly operators have their indices in the processor field. Locality may be optimized by interchanging that index with another in the memory field. As a result, the original string operator (composed only by butterflies) is translated into an equivalent one made of locality-optimized butterflies and index-digit data rearrangement operators (see Section 2.2.2). Vol. 44, No. 4, 2001.

(6) 308. M. A MOR et al.. 4. Each one of the data rearrangement operators in the resulting string operator in step three may be implemented by a specific HPF routine, as was explained in Section 2.2.2. On the other hand, butterfly operators will become calls to HPF LOCAL subroutines. The key point in the above guideline is step three, because there are usually many ways of transforming a string of butterfly operators. The final string operator should imply data communications organized in well-defined stages (this also simplifies its translation into an HPF program). To decide the most suitable transformation it is important to take into account what data movement is implied by each index-digit operator when it is applied to an input data sequence. In addition, we also need to know which of these data movements are best performed (in efficiency terms) in our particular target parallel machine, according to the performance model introduced in the next section. Sometimes, the given performance model is not enough for our particular target machine. In such cases, some actual experiments may be necessary in order to determine the data movements with best performance. As a result, the operators for those data movements should be chosen to build the transformed string operator. As an example, in the case of a mesh-connected machine, we should restrict our communications to those operators that interchange indices between the memory field and only one of the dimensions of the mesh (row or column). This results in parallel communications along rows or columns. 3.2. Performance model We need an execution model in order to characterize and compare the different data-parallel formulations for DC problems that we will introduce in the following sections. The model calculates the total execution time of a parallel program as Ttotal =. TB × k + Tcomm , Q. where Ttotal is the total execution time, TB is the execution time of a butterfly operator B, k is the number of butterfly operators (k = (N × n)/r for most algorithms we have considered) and Tcomm is the cost of communications, which includes communications due to non-local butterflies. Communication cost is proportional to the number of communication steps or messages, assuming that all of them are of similar length. This way, the communication cost for the butterfly operator, Bi , will be 0 Tcomm (Bi ) = N × (r − 1). if i ∈ memory otherwise.. As an example, the communication overhead for an algorithm with a dataflow similar to that of the example in T HE C OMPUTER J OURNAL,. Figure 3, but with n = u + q stages, is given by n Bi Tcomm = Tcomm i=1. = Tcomm. u i=1. . . Bi + Tcomm . Bi. i=u+1. . 0. n . . q×N×(r−1). Tcomm = q × N × (r − 1). Communication cost for the index-digit operators shown in Table 1 may be computed in a similar way,   if i, j ∈ memory 0 Tcomm (Ei,j ) = N/r if i ∈ memory and j ∈ processor   N if i, j ∈ processor 0 if t1 ∈ memory Tcomm (, ε) = 1 if t1 ∈ processor   0 if i, j ∈ memory Tcomm (i,j ) = N if i ∈ memory and j ∈ processor   N if i, j ∈ processor (i−j )/2. Tcomm (ρi,j ) =. Ei−k,j −k . k=0. Assuming a block distribution of the input sequence and tl as the most significant index in the memory field, the above expression becomes Tcomm (ρi,j ) (i − l)(N/r) = (l − j )(N − r) + ((i + j )/2 − l)N. if (i + j )/2 ≤ l otherwise.. 3.3. Successive doubling The successive doubling (SD) method was introduced by Cooley and Tukey [10] to compute the FFT. We refer to this class of algorithms as standard SD. The Wang and Mou algorithm [11] for solving tridiagonal systems is an important example of standard SD. Other examples are the Batcher’s bitonic sort algorithm [12], the Hartley transform [13] and the cosine transform [14]. There are variants of the SD method that we denote as extended SD, which we will consider next. Examples of such variants are the Hockney’s parallel cyclic reduction [15] method, the wavelet transform [16] and filter banks [17]. Finally we will also study the Cooley–Tukey FFT algorithm in full, which is based on the SD method, but includes partial reordering stages. 3.3.1. Standard SD Figure 6a depicts an example of a standard SD dataflow for a data sequence of length N = 8. A straightforward formulation of such dataflow using base r = 2 butterflies operators is B1 B2 B3 (operators applyfrom left to right). In general, the string formula would be ni=1 Bi . Vol. 44, No. 4, 2001.

(7) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS 0 1. 0 1. 0 1. 0 1. 2 3 4. 2 3 4. 2 3 4. 2 3 4. 5. 5. 5. 5. 6 7. 6 7. 6 7. 6 7. B1. B2. 309. B3. (a) 0 1. 0 1. 0 2. 0 2. 0 4. 0 4. 2 3 4. 2 3 4. 1 3 4. 1 3 4. 1 5 2. 1 5 2. 5. 5. 6. 6. 6 7. 6 7. 5. 5. 6 3. 6 3. 7. 7. 7. 7. B1. E 2, 1. B1. E 3, 1. Proc 0 Proc 1 Proc 2 Proc 3. B1. (b) FIGURE 6. (a) Standard successive doubling dataflow for a data sequence of length N = 8. (b) The reorganized dataflow block partitioned for a four-processor computer, given by Equation (2).. Let us assume a block distribution of the data sequence, of arbitrary length N, on a parallel computer. In such a case, butterflies with input data close together (that is, Bi , with i = 1, 2, . . . , u belonging to the memory coordinate) will execute locally, while the rest of them will require data communications. As indicated in Section 3.1, this fact introduces complexities when writing the HPF parallel code. These difficulties may be reduced by inserting in the dataflow, data rearrangement operators that transform nonlocal butterfly operators into local ones. For our standard SD dataflow, after the first u local stages we can insert exchange operators to interchange digits between the processor and memory fields. As a result, indices of the non-local butterfly operators will be shifted to the memory coordinate, and thus they become local. The new dataflow can then be formulated as u i=1. Bi. q . Eu+i,1 B1 u,1 ,. (2). i=1. where u + q = n. As u is the length of the memory field, the first u stages of the algorithm are locally computed. After these stages, each one of the q next stages includes, in this order, an exchange operator, that affects both the memory and processor fields, a local butterfly computation and a local data rearrangement operator u,1 (perfect unshuffle). Figure 6b shows the new dataflow assuming a block data distribution on a four-processor machine. Note that the local perfect unshuffle does not occur for this simple case. A more complete example is outlined in Figure 7, for a data sequence of length N = 128, block distributed on 32 processors. Observe in both examples that the output data sequences appear cyclically reordered with regard to the original order. However, the original order can be obtained by applying the perfect unshuffling operator u times. T HE C OMPUTER J OURNAL,. The communication cost associated with Equation (2) is q u Bi Eu+i,1 B1 u,1 Tcomm (SD) = Tcomm i=1. = Tcomm . u i=1. 0. i=1. . Bi +Tcomm . q i=1. Eu+i,1 B1 u,1 . . q× Nr. N , r which improves the communication cost of the straightforward implementation in a factor of r(r − 1). The above is the best option we have found for the standard SD dataflow. The dataflow described by Equation (2) can now be easily translated into an HPF program, as shown in Figure 8. Routines for computing data rearrangement operators, Exchange() and PerfectUnshuffle(), appear in Figure 5. The Butterfly routine simply implements the specific arithmetic computation associated with the operator. Tcomm (SD) = q ×. 3.3.2. Extended SD Figure 9a depicts an example of an extended SD dataflow, for a data sequence of length N = 16 with r = 2. It is also shown how the data sequence will be mapped on a four-processor machine if a block data distribution is used. A straightforward formulation of such dataflow using the operators described in the previous section is B1 B2 B3 B4 . In general, it would be ni=1 Bi . In principle, extended SD dataflows appear to show communication complexities similar to standard ones. Therefore, a similar solution could be used, that is, by transforming the dataflow in a similar way to that described by Equation (2), but replacing the butterfly operators by the extended ones. Vol. 44, No. 4, 2001.

(8) 310. M. A MOR et al. Proc. Index-digit representation of the data sequence. Mem. 7. 6 5. 4 3. 2 1. B1. 7. 6 5. 4 3. 2 1. B2. 7 7 7 7 7 7 7. 6 5 6 6 6 6. 4 3. 5 4 1 5 4 1 5 2 1 5 2 1. 6 3 6 3. 2 1 2 1. 0000. 0. 0. 0. 0. 0. 0001. 1. 1. 1. 1. 1. 0010. 2. 2. 2. 2. 2. 0011. 3. 3. 3. 3. 3. 0100. 4. 4. 4. 4. 4. 0101. 5. 5. 5. 5. 5. 6. 6. 6. 6. 6. 2 1. E 3, 1. 2 3. B 1 Γ2, 1. 0110 0111. 7. 7. 7. 7. 7. E 4, 1. 1000. 8. 8. 8. 8. 8. 1001. 9. 9. 9. 9. 9. 1010. 10. 10. 10. 10. 10. 1011. 11. 11. 11. 11. 11. 1100. 12. 12. 12. 12. 12. 13. 13. 13. 13. 13. 3 2 3 4. B 1 Γ2, 1. 4 3. E 5, 1. 4 5. B 1 Γ2, 1. 1101 1110. 14. 14. 14. 14. 14. E 6, 1. 1111. 15. 15. 15. 15. 15. 5 4. proc mem. 7. 4. 3. 2 1. 5 6. B 1 Γ2, 1. 7. 4. 3. 2 1. 6 5. E 7, 1. 5. 4. 3. 2 1. 6 7. B 1 Γ2, 1. 5. 4. 3. 2 1. 7 6. FIGURE 7. Outline of the application of the operator string in Equation (2) on a 128-item data sequence, block distributed on 32 processors (circles represent butterfly operations, flat brackets represent exchange operations and arrows represent perfect unshuffle operations). !HPF$ PROCESSORS linear(Q) !HPF$ DISTRIBUTE(BLOCK) ONTO linear :: A ! First, local butterfly stages DO i=1,u CALL Butterfly(A,i) END DO ! Second, exchange-butterfly-unshuffle stages DO i=1,q CALL Exchange(A,u+i,i,N) CALL Butterfly(A,1) CALL PerfectUnshuffle(A,u,1,U) END DO FIGURE 8. HPF program for computing algorithms with the standard SD dataflow.. However, this is not so as, assuming again a block data distribution, all extended butterflies imply data communications. In addition, the message size to be communicated increases with the index i of Bi . For instance, consider the straightforward formulation of the dataflow shown in Figure 9a. In the first stage, each processor needs the following input data sets for computing the two butterflies assigned, {[t4 t3 11 − 0100] if t4 t3 = 00, [t4 t3 00], [t4 t3 01], [t4t3 10]} {[t4 t3 01], [t4t3 10], [t4t3 11], [t4 t3 00 + 0100] if t4 t3 = 11}. T HE C OMPUTER J OURNAL,. (a). Proc. Index-digit representation of the data sequence. Mem. 7. 6. 5. 4. 3. 2. 1. B’’1 Γ3,1. 7. 6. 5. 4. 1. 3. 2. E 4,3. 7. 6. 5. 1. 4. 3. 2. B’’1 Γ3,1. 7. 6. 5. 1. 2. 4. 3. E 5,3. 7. 6. 2. 1. 5. 4. 3. B’’1 Γ3,1. 7. 6. 2. 1. 3. 5. 4. E 6,3. 7 3. 2. 1. 6. 5. 4. B’’1 Γ3,1. 7 3. 2. 1. 4. 6. 5. E 7,3. 4 3. 2. 1. 7. 6. 5. B ’’1. 4 3. 2. 1. 7. 6. 5. B ’’2. 4 3. 2. 1. 7. 6. 5. B ’’3. (b) FIGURE 9. (a) Extended successive doubling dataflow for a data sequence of length N = 16 (r = 2) and its block mapping on four processors. (b) Outline of the application of the operator string in Equation (3) on a 128-item data sequence block distributed on 16 processors (circles represent extended butterfly operations, flat brackets represent exchange operations and arrows represent perfect unshuffle operations).. The first (upper) butterfly uses the data item of index [t4 t3 11−0100], which is in the previous processor, while the second (lower) butterfly needs the data item [t4 t3 00 + 0100] in the next processor. In the second stage, each processor uses the next data sets, Vol. 44, No. 4, 2001.

(9) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS one for each butterfly operator, {[t4 t3 10 − 0100] if t4 t3 = 00, [t4t3 00], [t4 t3 10], [t4 t3 00 + 0100] if t4 t3 = 11} {[t4 t3 11 − 0100] if t4 t3 = 00, [t4t3 01], [t4 t3 11], [t4 t3 01 + 0100] if t4 t3 = 11}, and similarly in the rest of the stages. As a result, the communication cost for the straightforward formulation is given by n Bi Tcomm = Tcomm i=1. = Tcomm . u i=1. . . Bi + Tcomm . 2×(q−1)×(r u −1). n i=u+1. . . Bi. . q×N×(r−1)+Q×(q−r)+r. Tcomm = 2 × (q − 1) × (r u − 1) + q × N × (r − 1) + Q × (q − r) + r, where u + q = n (and Q = 2q ). Considering that message length increases with the index i of Bi , which also represents the stage in the algorithm dataflow, better parallel performance would be expected if such dataflow is reorganized in a reversed order to (2). That is, having the local computations at the end, q i=1. B1 u,1 Eu+i,u. u . Bi .. (3). i=1. This way, we begin by performing the communications due to operator E in the first q stages. Operator u,1 means local rearrangement of data (perfect unshuffle permutation). In the remaining u stages there are no communications because the data required to compute the butterflies are in the same processor. In Figure 9b we schematically illustrate the reorganized dataflow for a data sequence of length N = r 7 block mapped on 16 processors. Extended butterfly operators, B1 , are executed in the first q = 4 stages, followed by a perfect unshuffle permutation and an exchange. At the end of these stages, processor and memory fields appear to be interchanged, i.e. the original block data distribution has been transformed into a cyclic distribution. As a result, the following extended butterfly operators, Bi , will execute with no communications. As an example, consider the case N = 16, r = 2 and a four-processor machine (see Figure 9a). Butterfly operator B3 , in the third stage of the original dataflow, requires the following input data for each processor: Proc.0 :{[000t1], [010t1], [100t1]}, t1 = 0, 1 Proc.1 :{[001t1], [011t1], [101t1]}, t1 = 0, 1 Proc.2 :{[010t1], [100t1], [110t1]}, t1 = 0, 1 Proc.3 :{[011t1], [101t1], [111t1]}, t1 = 0, 1. As q = 2, the data sequence at this third stage is already cyclically distributed. Thus, t2 t1 digits in the index-digit T HE C OMPUTER J OURNAL,. 311. !HPF$ PROCESSORS linear(Q) !HPF$ DISTRIBUTE(BLOCK) ONTO linear :: A ! First, (extended butterfly)-unshuffle-exchange stages DO i=1,q CALL Extended Butterfly(A,1) CALL PerfectUnshuffle(A,u,1,U) CALL Exchange(A,u+i,u,N) END DO ! Second, local butterfly stages DO i=1,u CALL Extended Butterfly(A,i) END DO FIGURE 10. HPF program for computing algorithms with the extended SD dataflow.. representation of the data sequence belong to the processor field, while the first two most significant digits, t4 t3 , belong to the memory field. Hence all the inputs to each butterfly in B3 are in the same processor (although each processor will now execute different butterflies to those of the original dataflow). In general, the communication cost for the new formulation of the extended SD is q u B1 u,1 Eu+i,u Bi Tcomm = Tcomm i=1. = Tcomm . q i=1. i=1. B1 u,1 Eu+i,u . + Tcomm . 2×(q−1)+q× Nr. Tcomm = 2 × (q − 1) + q ×. . . u i=1. . Bi. . 0. N r. Note that the output data sequences appear cyclically reordered with regard to the original order. As in the case of the standard SD, the HPF implementation of extended SDs is obtained in a similar way, as shown in Figure 10. 3.3.3. FFT The Cooley–Tukey FFT algorithm dataflow consists of an initial-digit reversal permutation (inverse-digit operator) followed by a standard SD. Alternatively, the dataflow of the standard SD may be reversed, locating the permutation at the end of the algorithm. We will take this last option with the aim of presenting a situation different from that studied in the previous sections. Observe that, in this case, it is more convenient to interchange the fields processor and memory. That is, the data sequence will be initially distributed in a cyclic way instead of block. Communications associated with the digit reversal permutation degrade the performance significantly. So, a large number of algorithms have been designed [18] to reduce the impact of this permutation. We consider here the selfsorting algorithm because, on the one hand, it provides the resulting sequence in the correct order and, on the other hand, the digit reversal permutation can be decomposed into exchange operations and a local inverse-digit operation. This Vol. 44, No. 4, 2001.

(10) 312. M. A MOR et al.. Mem. Proc. B6 E6,1 6 5 4 3. 2 1. B5 E5,2 1 5 4 3. 2 6. ρ. 1 2 4 3. 5 6. B3. 1 2 3 4. 5 6. B4. 1 2 3 4. 5 6. B5. 1 2 3 4. 5 6. B6. 1 2 3 4. 4,3. !HPF$ PROCESSORS linear(Q) !HPF$ DISTRIBUTE(BLOCK) ONTO linear :: A ! First, butterfly-exchange stages DO i=1,q CALL Butterfly(A,n-i+1) CALL Exchange(A,n-i+1,i,N) END DO ! Second, local digit reversal stages p = ceiling((n-2*q-1)/2) DO i=1,p CALL Exchange(A,u-i,q+i+1,U) END DO ! Third, local butterfly stages DO i=q+1,n CALL Butterfly(A,i) END DO. Index-digit representation of the data sequence. FIGURE 12. HPF program for computing the self-sorting FFT.. 5 6. FIGURE 11. Outline of the application of the operator string in Equation (4) on a 64-item data sequence cyclically distributed on four processors (circles represent butterfly operations, flat brackets represent exchange operations and dashed flat brackets represent inverse-digit operations).. formulation, presented in [7, 19], is q. n Bn−i+1 En−i+1,i ρu,q+1 Bi ,. i=1. (4). i=q+1. where n > 2q. Note that the last u (from q + 1 to n) stages of the algorithm are executed locally as index i belongs to the memory field (assuming as indicated before a cyclic distribution of the data sequence). The first q stages, in contrast, require communications due to the exchange operator. Between both bunches of stages, an inverse-digit operator accomplishes local data rearrangements. Figure 11 illustrates the dataflow of the self-sorting FFT for a data sequence of N = r 6 data on four processors. As before, Equation (4) can be easily translated into a HPF program, as shown in Figure 12. 3.4. Standard regular trees Regular trees are a different and frequent dataflow pattern for DC problems. We consider that in a regular direct tree each node has r inputs and r outputs, but only one of the outputs progresses to the next stage. So, the number of nodes computed in each stage is reduced by a factor of r as we progress through the tree. While computations on nodes are represented by the butterfly operator, the reduction in the number of nodes may be represented by the reduction operator, . Therefore, following the above description, the dataflow of a regular direct tree is similar to that of the standard SD, but inserting a reduction operation between each pair of butterflies. Having in mind the processor and memory coordinates in the index-digit representation, a block data T HE C OMPUTER J OURNAL,. distribution of the data sequence implies that both fields appear in the order (processor, memory). In the first u = n − q stages of the tree algorithm, the reduction operator eliminates a digit from the memory field. After these u stages, the field memory disappears, which means that a unique data item remains in each processor. During the execution of the rest of the stages a digit is eliminated from the processor coordinate, stage by stage. This means that, stage by stage, the number of working processors is reduced by r, and each active processor computes a unique butterfly. This is inefficient in parallel because it implies a large number of short data communications. We may reorganize the tree dataflow in a much more efficient way if a number of short communications are grouped into only one long communication, enough to completely fill the available memory in the processors. That is, in each stage of the tree, the number of working processors is reduced by a factor of r, each active processor computing several butterflies locally. This can be expressed as q u (5) B1 mem M proc,∗ B1 mem . i=1. i=1. In this equation we have assumed, as it is the most usual situation, that the memory field is longer than the processor field. Otherwise, the last two stages in Equation (5) must be iterated q/u times. The mutation operator M proc,∗ gathers all data in each processor and allocates them in processor zero. In other words, it transfers the digits of the processor coordinate to the memory coordinate,. M proc,∗. proc mem proc mem tq . . . t1 - - = - - tq . . . t1 . . . . (6). . This operator can be implemented through Fortran 90 array sections using subscript triplet notation. For example, in the case of a block distribution, the mutation operator M proc,∗ transforms the array A[N] by means of the following sentence, A(1:Q:1) = A(1:N:U). Vol. 44, No. 4, 2001.

(11) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS. Index-digit representation of the data sequence. Proc. Mem. 6 5 4 3. 2 1. B 1Ω. mem. 6 5 4 3. 2. B 2Ω. mem. 6 5 4 3. M. 4 3. B 1Ω. mem. 6 5. 4. B 2Ω. mem. 6 5. M. 6 5 6. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15. proc,*. 6 5. proc,*. B 1Ω. mem. B 2Ω. mem. FIGURE 14. Extended regular tree dataflow for a 16-item data sequence.. FIGURE 13. Outline of the application of the operator string in Equation (5) on a 64-item data sequence block distributed on 16 processors (circles represent butterfly operations, flat brackets represent exchange operations and crosses represent reduction operations).. In Figure 13 we present an example of the dataflow described by Equation (5) for a data sequence of 64 items on 16 processors. Following a similar argument, an inverse tree can be formulated by reversing the formula in Equation (5), as u. q εmem B1 M ∗,mem εmem B1 .. i=1. (7). i=1. The mutation operator M ∗,mem is the reverse of M proc,∗ , that is, it scatters the data on processor zero to the rest of the processors, M mem,∗. proc mem proc mem - - tu . . . t1 = tu . . . t1 - - . . . . 313. (8). . As in the case of operator the M ∗,mem operator can be implemented through Fortran 90 array sections using subscript triplet notation, as in the following sentence, A(1:N:U) = A(1:Q:1). M proc,∗ ,. !HPF$ PROCESSORS linear(Q) !HPF$ DISTRIBUTE(BLOCK) ONTO linear :: A ! First, (extended butterfly)-reduction stages DO i=1,u CALL Extended Butterfly(A,1) FORALL (j=0:Q-1) A(j*U+1:j*U+U/2) = A(j*U+1:(j+1)*U:2) FORALL (j=0:Q-1) A(j*U+U/2+1:(j+1)*U) = 0 END DO ! Second, mutation stage A(1:Q)=A(1:N:U) ! Third, (extended exchange)-reduction stage DO i=1,u CALL Extended Butterfly(A,1) A(1:U/(2**i))=A(1:U/(2**(i-1)):2) A(U/(2**i)+1:U/(2**(i-1)))=0 END DO FIGURE 15. HPF program for computing general regular trees.. directly from the operator formulation explained above. Data rearrangement operators, M proc,∗ and mem , appear as array sections using subscript triplet notation as pointed out before. The Extended Butterfly() routine implements the specific arithmetic computations associated with the operator. 3.6. Irregular trees. 3.5. Extended regular trees A variant of the standard tree used in many DC algorithms is what we call an ‘extended tree’. In these trees, the output of some of the nodes can be used as input for more than one node in the following stage. A typical example of an extended tree dataflow is the elimination stage of the cyclic reduction (CR) algorithm [15, 20], shown in Figure 14. The formulation of this type of tree is similar to that of the standard trees (see Equation (5)), but replacing butterflies by extended butterflies. For example, in the case of the CR algorithm, extended butterflies of type Bi are used. As extended trees constitute a general case, including the standard trees, in Figure 15 we show the general HPF program for such general trees. This program derives T HE C OMPUTER J OURNAL,. For irregular DC applications, special data distributions are required to provide both data locality and load balancing. The operational description of the problem and its implementation into a data-parallel language is significantly more difficult than in the problems analysed in the previous sections. Examples of irregular trees appear in many problems in physics, engineering and computer graphics, such as N-body, molecular dynamics, 3-D rendering and radiosity. As a working example of an irregular-tree-structured DC problem, the hierarchical N-body problem is considered. Recent work [21, 22] has shown that hierarchical N-body simulations can be implemented into HPF. All these implementations, however, use non-adaptive and adaptive Vol. 44, No. 4, 2001.

(12) 314. M. A MOR et al. 2D system of 16 bodies P. L. H. E. N B. K. G. D. J. I A. M. O. A. E. L. H. B. F G. M. C. C. F. O. I. J. K. N. P. D quad-tree associated to the system. A(1,2,...,16)={M,C,F,O,I,A,E,L,B,J,K,N,H,P,D,G} Peano-Hilbert ordered data sequence. Peano-Hilbert ordering P. L. E. N B. Peano-Hilbert filling curve. 14. 8. H K. 13. 7. D. 12 9. J. I A. 15. 5 6. O. 4. F. 3 G. M. 11. 10. 16. 1. C. 2. FIGURE 16. Example of a set of bodies in a 2-D space partitioned to build a quadtree, and the corresponding linearization using a PH space-filling curve.. versions of the Anderson method for far-field force computations. This section shows an efficient HPF implementation of a different hierarchical force calculation method, the Barnes–Hut (BH) algorithm [23]. There are significant differences between their implementations and ours. Basically, in our case communications are implemented using array copy sections, obtaining efficient point-to-point data transferences. In addition, HPF LOCAL constructs are used to obtain very efficient node local codes. The BH algorithm is the simplest hierarchical N-body method, widely used in astrophysics. As an astrophysics application simulates a large number of bodies (millions of stars), the BH method employs a DC strategy to reduce the number of particles in the force sum. A simple physical intuition is used: the force contribution of a far enough galaxy can be completely replaced by a single point mass located at its centre of mass. This idea is recursively applied with the help of a tree data structure, an octtree in 3-D or a quadtree in 2-D. The root of a quadtree (octtree) is a square (cube) comprising the whole body domain. This large box is subdivided into four (eight) equal-sized subboxes. This partitioning process is repeated recursively until at most one body is found in each subbox. Basically, the BH method is organized into three stages. First, the quadtree (octtree) for the body domain is built. Second, for each subbox of the tree structure, the centre of mass and the total mass for all the bodies inside are computed. And third, for each body, the tree data structure is traversed to compute all the force contributions following a DC strategy. That is, if the distance from the body to a subbox is much larger than its size, then all the bodies T HE C OMPUTER J OURNAL,. contained are approximated by a single point mass. One typical distance test, known as the multipole acceptability criterion (MAC), is based on checking if the fraction l/d is less than some user-defined threshold [23], where l is the length of a side of the subbox and d is the distance from the body to the subbox centre of mass. 3.6.1. Data-parallel implementation For an irregular problem like the BH method, our parallelization framework is used as follows: Linearization. In our formulation, we linearize the tree data structures (quadtrees or octtrees) through the use of space-filling curves [4]. We use, in particular, a Peano–Hilbert (PH) linear ordering, which has the property that, to a certain degree, two points adjacent on the curve are adjacent in the tree structure. In addition, the PH curve is easy to generate and results in simple indexing. As an example, Figure 16 shows a 2-D ensemble of bodies and its associated quadtree. In the bottom of the figure, the corresponding PH curve is depicted together with the corresponding linear ordering of the bodies. Data distribution. After linearizing the problem, the linear sequence of bodies is block distributed among processors to keep the locality exhibited by the PH curve, and the computations of the simulation are organized as array operations. Communication organization. We have used array copy Vol. 44, No. 4, 2001.

(13) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS sections to organize communications, instead of array indirections or irregular gather/scatter operations, as in [21, 22]. This facilitates the compiler work to generate efficient communication operations. Workload balancing. Finally, as BH contains dynamic computations, a diffusive type workload balancing scheme is included in the parallel algorithm. As an example, Figure 17 represents the complete dataflow of one simulation step of the BH method for a 2-D system. The upper part of this figure shows the original ensemble of bodies, its corresponding quadtree, the PH linearization of the tree and the block partitioning of the resulting linear array of bodies. To simplify the explanation, we have represented the BH method by the following string of operators, end U t =0 i=1. Tz,i ξ ϒz,Y ψ. UU . . . Tz,i F V X .. (9). i=U +1. Note that although the BH dataflow can be organized as a string of operators, these operators cannot be defined as permutations of the index-digit representation of the data sequence. We have to introduce new computational and data arrangement operators. In the above operator string (Equation (9)), the initial product runs over the iterations or steps of the simulation. For each iteration, we have three computing stages. • In the first stage ( U i=1 Tz,i ), each processor builds a local quadtree from the assigned block of bodies. Symbol z represents the root of the local tree, which is built incrementally by adding one by one all the local bodies (U = N/Q bodies in each processor). This is a completely local computation that can be easily written into HPF as follows (operations to build a quadtree are not essential in the explanation of the BH method implementation, and thus were encapsulated into a procedure), DO i = 1, U Tz,i → CALL BuildTree(Tree, A, i) i=1 END DO.. U . • The second stage represents an information exchange, to enable computation of force contributions on the local bodies. Each processor must determine what information it owns is needed by the rest of the processors. This information is obtained by applying the MAC heuristic to benchmark bodies coming from the other processors. In our implementation, each processor broadcasts the extreme subboxes in its local tree. This interchange is represented by operator ξ , while operator ϒ represents the application of the MAC test. The results are stored in three tables, slice, list and list t (see Figure 17). The ith entry of table list stores the number of nodes to be transferred to processor i. The corresponding nodes are stored in table slice. As T HE C OMPUTER J OURNAL,. 315. an example, processor zero in Figure 17 must send the three nodes c, d, g to processor one and the node b to processors two and three. Finally, the ith entry of table list t stores the number of nodes to be received from the processor i. Afterwards, this information is communicated to the corresponding processors by means of a global communication interchange, an all-to-all broadcast (operator ψ). The use of those three tables allows grouping of all communications in each simulation step and, in addition, such communications can be easily expressed in terms of Fortran 90 array operations. No array indirections nor irregular gather operations are needed, obtaining a better communication efficiency. As an example, the three operators explained in this second stage of the simulation step can be translated into HPF as follows, buffer(1 : 2 ∗ Q : 2) = A(1 : N : U ) ξ→ buffer(2 : 2 ∗ Q : 2) = A(U : N : U ), ϒ → CALL MAC(Tree, buffer, slice, list),   list t = transpose(list)      list = sum prefix(list, dim = 1,      exclusive = .true.) + 1    list t = sum prefix(list t, dim = 1,     exclusive = .true.) + U + 1 ψ→  DO i = 1, Q      FORALL(j = 1 : Q)      A(list t (j, i) : list t (j + 1, i) − 1, i)     = slice(list(i, j ) : list(i + 1, j ) − 1, j )    END DO. U • In the third stage ( U i=U +1 Tz,i ), the information exchanged is added to the corresponding local tree, thereby obtaining an extended local tree. U U is the total number of bodies that each processor now has, also considering the owned ones. Finally, each processor is now able to compute forces on the owned bodies, and update their velocities and positions (F , V and X operators). All these operations are also local computations, as in the first stage, and can be written into HPF in a similar way, as follows,   DO k = 1, Q      UU  DO i = U + 1, U U Tz,i → CALL BuildTree(Tree, A, i)   i=U +1  END DO    END DO; F → CALL Force(A), V → CALL Velocity(A), X → CALL Position(A). Figure 18 shows the complete HPF program that directly implements the process explained above. This code is static, that is, no workload balancing is guaranteed. To Vol. 44, No. 4, 2001.

(14) 316. M. A MOR et al.. 2D system of 16 bodies P. L. H K. N. E B. G. D. J. I A. M. O. A. E. L. B. H. P. F G. M. C. C. F. O. I. J. K. D. quadtree associated to the system. N. Peano-Hilbert ordered data sequence. A(1:16)={M,C,F,O,I,A,E,L,B,J,K,N,H,P,D,G} Block partitioning of the data sequence on a 4-processor computer. Processor 1. Processor 0 A(1:4)={M,C,F,O}. A(5:8)={I,A,E,L}. a. c. g. d e. f. C. F. slice(*,0). list(*,0). c d g b b. 0 3 1 1 0. c. d I. O. A. slice(*,1). list(*,1). d e c b f g b e. 3 0 3 2 0. list_t(*,0). 0 3 2 2 0. slice(*,2). list(*,2). b d b d b f g h. 2 2 0 4 0. transpose. list_t(*,1). 3 0 2 2 0. local quadtree. D. g K. N. transpose. list_t(*,2). 1 3 0 4 0. slice(*,3). list(*,3). b f b f d e f c. 2 2 4 0 0. transpose. list_t(*,3). 1 2 4 0 0. list(*,0). list_t(*,0). slice(*,1). list(*,1). list_t(*,1). slice(*,2). list(*,2). list_t(*,2). slice(*,3). list(*,3). list_t(*,3). c d g b b. 0 1 4 5 6. 1 4 6 8 8. d e c b f g b e. 1 4 4 7 9. 1 4 4 6 8. b d b d b f g h. 1 3 5 5 9. 1 2 5 5 9. b f b f d e f c. 1 3 5 9 9. 1 2 4 6 6. ϕ. A(1,...,4,...)={I,A,E,L,C0,D0,G0,B2 B3D2,F3} a. a k. d. E1B2 B3D2 F3 j. g D1. e. f F. h O. i. l. m. b. M. C. J. Y. P. G. e. slice(*,0). A(1,...,4,...)={M,C,F,O,C1,D1,E1B2, B3D2,F3}. c. f. f. d H. e. local quadtree. transpose. b c. d. local quadtree. L. ξ. Number of nodes to be sent to processor 1 (stored in slice(*,0)). c B. g E. a. b. f. A(13:16)={H,P,D,G}. a. e. b. b. Processor 3. A(9:12)={B,J,K,N}. ΠT. a. local quadtree. M. Processor 2. e. b h. i. d f. c. A E. C0 D0 j. m. g L. ΠT. A(1,...,4,...)={B,J,K,N,B0B1,F1,G1, C3,D3,E3,F3}. n. B2. b. B0B1 i j F1. G1. F3 k. d. l. B. k. m. n. C3 D3 E3 e. J. I. g. o. c. C1 G0. a. a h. B3D2 F3 l. A(1,...,4,...)={H,P,D,G,B0B1,E1B2, E2,F2,G2}. g. f K. N. h. b. E2 F2. G2. FVX. FIGURE 17. Dataflow of the BH method for a 16-body 2-D system parallelized on a four-processor computer.. T HE C OMPUTER J OURNAL,. Vol. 44, No. 4, 2001. f. B0B1 E1B2 G e c d i H P D j k l.

(15) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS. 317. !HPF$ PROCESSORS linear(Q) TYPE(BODY), ALLOCATABLE :: A(:) TYPE(QUADTREE), ALLOCATABLE :: Tree(:) INTEGER, ALLOCATABLE :: PH(:) INTEGER :: slice(max,Q), list(Q+1,Q), list t(Q+1,Q) !HPF$ ALIGN PH(i) WITH A(i) !HPF$ ALIGN (*,i) WITH slice(*,i) :: list, list t !HPF$ DISTRIBUTE(BLOCK) ONTO linear :: A, Tree !HPF$ DISTRIBUTE(*,BLOCK) ONTO linear :: slice READ (*,*) A, PH A(PH) = A ! Redistribute A using the PH ordering DO t=0, end ! Iterate the simulation ! First stage: Quadtree building DO i=1,U CALL BuildTree(Tree,A,i) END DO ! Second stage: Extreme subboxes interchange buffer(1:2*Q:2) = A(1:N:U) buffer(2:2*Q:2) = A(U:N:U) ! Second stage: MAC heuristic computation CALL MAC(Tree,buffer,slice,list) ! Second stage: Global communication list t = TRANSPOSE(list) list = SUM PREFIX(list,DIM=1,EXCLUSIVE=.TRUE.) + 1 list t = SUM PREFIX(list t,DIM=1,EXCLUSIVE=.TRUE.) + U + 1 DO i=1,Q FORALL (j=1:Q) A(list t(j,i):list t(j+1,i)-1,i) = slice(list(i,j):list(i+1,j)-1,j) END DO ! Second stage: Local tree expansion DO k=1,Q DO i=U+1,UU CALL BuildTree(Tree,A,i) END DO END DO ! Third stage: Compute forces, update velocities and positions CALL Force(A) CALL Velocity(A) CALL Position(A) END DO FIGURE 18. HPF program for solving N-body problems using the BH method.. overcome this problem we have also implemented a dynamic load balanced version of the BH method, using a diffuse type dynamic scheduling algorithm [24], which preserves some data locality. At the end of each simulation step, the scheduling piece of code tests if the workload imbalance is greater than some threshold. The imbalance is calculated by taking the difference between the loads of the most-loaded and the least-loaded processors divided by the average load in the full system. In the case of high imbalance, the most-loaded processor sends a fraction of its set of bodies to the least-loaded of its neighbouring processors (a linear processor array is considered). The workload is estimated by counting the total number of force contributions with regard to its set of bodies. 4. EVALUATION We have implemented an example set of algorithms that exhibit the dataflows described in previous sections. All programs were written in HPF and compiled using the T HE C OMPUTER J OURNAL,. Portland Group [25] commercial HPF compiler. The experiments were conducted on a Cray T3E multiprocessor. First, we have considered the original FFT (standard SD with inverse-digit permutation) which is straightforwardly formulated as the string B1 B2 . . . Bn ρ (N = 2n is the problem size), as well as the formulation given by Equation (4). Figure 19e presents the speedups for both implementations, for an input data sequence of length N = 220 . The improvement in the parallel performance is significant. Figure 19a depicts the speedup of this algorithm for two problem sizes. Figure 19 also presents the speedups for the parallel cyclic reduction algorithm, PARACR [15] (extended SD dataflow) and the radix-2 cyclic reduction, CR [15, 20] (extended and inverse regular tree dataflows). These numbers have been measured with respect to the straightforward sequential algorithm before dataflow reorganization, and show the low overhead associated with our method. In all cases we obtain superlinearity, due to the Vol. 44, No. 4, 2001.

(16) 318. M. A MOR et al. 16. 32. Speedup. Speedup. 19. 18. 16. N=2 20 N=2. 8. N=2 19 N=2. 8 4 4. ideal. ideal. 2. 2. 1. 1 1. 4. 2. 8. 1. 16. 2. 4. 8. 16. Number of processors. Number of processors. (a). (b). 64. 16. Speedup. Time(s). 32. Communication Stage Computation Stages Total time FFT. 8. 16. N=2 17 N=2. 16. 4 2. 8. 1. 4. 0.5. ideal. 2. 0.25 0.125. 1. 0.0625. 1. 2. 4. 8. 16. 1. 2. 4. 8. 16. Number of processors. Number of processors. (c). (d) 16. Our Method. Speedup. Original Formulation. 8. 4. 2. ideal 1. 0.5. 0.25. 1. 2. 4. 8. 16. Number of processors. (e) FIGURE 19. Speedup for the (a) FFT (standard SD plus index-digit permutation), (b) PARACR (extended SD) and (c) CR (extended and inverse regular trees) algorithms. (d) Execution time for the FFT algorithm for a data sequence of size N = 220 . (e) Speedup of the original dataflow and the reformulated (optimized) dataflow (our method) for the FFT algorithm with a data sequence of size N = 220 .. dataflow reorganization, which introduces high data locality, with a significant reduction in cache misses. The speedup plots in Figure 19 show a severe knee around eight processors. This knee appears for all our test cases due to limited range in sizes of the input datasets used. Figure 19d shows, for instance, the execution behaviour of the FFT algorithm for an input data sequence of length N = 220 . Note that with more than two processors, the communication time increases very slowly with the number of processors, but the computation time reduces much faster at the same time. Thus at the end the total running time is dominated by the communications, which is almost constant (which justifies the slowdown in the speedup). In the case of the BH method, a total of 100 iterations (timesteps) were executed and measured. Input bodies were generated randomly according to the Plummer model with T HE C OMPUTER J OURNAL,. monopolar approximation and potential softening, which is widely used to generate spherical galaxies made up of equal mass bodies. The softening parameter avoids singularities when particles get too close to each other. The code for the particle generator was lifted from the SPLASH2 Barnes application [26], and modified for use with this simulation. Figure 20a shows the performance of the HPF code. Speedups are calculated with respect to the sequential program running on one processor, and 100 timesteps were executed. Superlinearity is obtained as the dataflow reorganization introduces high data locality. Beyond eight processors there is a significant reduction of performance (see Figure 20b) due to the fact that the tree building stage does not speedup quite as well as the other stages. This performance knee depends on the size of the input data set. Concerning the load balancing diffuse algorithm, Vol. 44, No. 4, 2001.

(17) A DATA -PARALLEL F ORMULATION FOR D IVIDE AND C ONQUER A LGORITHMS 16. Speedup. ideal. 8 4. 13. N=2 14 N=2 15 N=2. 2 1 1. 2. 4. 8. 16. Number of processors. (a) 16. Speedup. Rest of Stages Tree-Building Stage. 8 4 2. ideal 1. 1. 2. 4. 8. 16. Number of processors. (b) Processor Count 1 2 4 8 16. No Balance. Balance. 603.4868 321.5060 170.2943 96.5241 86.5910. 603.4868 305.9281 159.2657 92.5561 84.4197. (c) FIGURE 20. (a) Speedup for the BH method (with no dynamic workload balance), (b) breakdown for the tree-building and the rest of stages and (c) execution times (in s) on 16-processor Cray T3E and N = 214 , comparing dynamic load balancing and no balancing schemes.. performance gains are presented in Figure 20c compared to the extreme case of ‘no balance’. 5. RELATED WORK Much attention has been paid in the literature to the formal parallelization of DC algorithms. In [27] a transformational approach is presented to obtain generic parallel DC algorithms, that are further implemented on linear arrays, mesh or hypercube multicomputers. Starting from an operational DC program, mapping sequences to sequences, they apply a set of semantics preserving transformational rules, which transform the parallel control structure of DC into a sequential control flow, thereby making the implicit data parallelism in a DC scheme explicit. A similar approach is used in [28], where a provably correct, architecture-independent family of parallel implementations for a class of data-parallel algorithms is derived, called DH (distributable homomorphisms). The implementations are well-structured SPMD programs with groupwise personalized all-to-all exchanges. These exchanges T HE C OMPUTER J OURNAL,. 319. may be directly translated into MPI using MPI ALLtoall sentences. As a case study, FFT is analysed. A rather different methodology, based on a space–time mapping, is presented in [29]. This work uses a geometric computational model based on coordinate transformations with which time (the schedule) and space (the processor allocation) can be made explicit. This technique may be applied to a class of DC recursions, resulting in a functional program skeleton and its parallel implementation on message-passing multiprocessors. A similar methodology to the one presented in this paper has been developed in [30]. This work designs a framework based on induced permutations to compute the different FFT algorithms. Induced permutations are based on operations on the tensor sum decomposition of the data indices, and represent a generalization of our indexdigit permutations (which are included as a particular case). Induced permutations permit the implementation of a large class of data-sorting procedures, and are used in the work of formulating mixed-radix versions of the FFT. Finally, in [31] a parallel implementation of the Barnes– Hut method is studied, very similar to the one considered in this paper. The parallel method is structured into a sequence of alternating computation and communication phases. The distribution of the bodies among processors is accomplished by means of an orthogonal recursive bisection. Load balancing is introduced in each phase, taking as weights the number of interactions being computed by each processor. Next, each processor builds a locally essential tree, which carries an irregular communication pattern, unknown in advance. To simplify the procedure a global reduction is used to first compute the number of messages destined for each processor. Finally, after building the tree, the processors can proceed to compute without any communication. Our parallel implementation is similar but with the difference that the data distribution is carried out using a space-filling curve. 6. CONCLUSIONS We have presented a data-parallel formulation for problems and demonstrated its use in implementing regular and irregular algorithms. Our experiments provide promising evidence that a data-parallel approach can be used to efficiently solve regular and irregular DC problems without adding new features to the currently available HPF languages and compilers. The methodology we have used to reorganize the algorithmic dataflows provides great flexibility for efficiently exploiting data locality and reducing and optimizing communications. REFERENCES [1] High Performance Fortran Forum (1993) High Performance Fortran language specification. Sci. Prog., 2, 1–170. [2] Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Steele, G. L., Jr. and Zosel, M. E. (1994) The High Performance Fortran Handbook. MIT Press, Cambridge, MA.. Vol. 44, No. 4, 2001.

(18) 320. M. A MOR et al.. [3] Merlin, J. and Hey, A. (1995) An introduction to High Performance Fortran. Sci. Prog., 4, 87–113. [4] Ou, C., Ranka, S. and Fox, G. (1996) Fast and parallel mapping algorithms for irregular problems. J. Supercomput., 10, 119–140. [5] Flanders, P. M. (1982) A unified approach to a class of data movements on an array processor. IEEE Trans. Comput., 31, 809–819. [6] Fraser, D. (1976) Array permutation by index-digit permutation. J. ACM, 23, 298–309. [7] Argüello, F., Amor, M. and Zapata, E. L. (1996) FFTs on mesh connected computers. J. Parallel Comput., 22, 19–38. [8] Amor, M., López, J., Argüello, F. and Zapata, E. L. (1997) Mapping tridiagonal system algorithms onto mesh connected computers. Int. J. High Speed Computing, 9, 101–126. [9] Ellis, T. M. R., Phillips, I. R. and Lahey, T. M. (1995) Fortran90 Programming. Addison-Wesley, USA. [10] Cooley, J. W. and Tukey, J. W. (1965) An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19, 297–301. [11] Wang, X. and Mou, Z. G. (1991) A divide and conquer method of solving tridiagonal systems on hypercube massively parallel computers. In 3rd IEEE Symp. of Parallel and Distributed Processing (SPDP’91), Dallas, TX, pp. 810–817. IEEE Computer Society Press, Los Alamitos, CA. [12] Batcher, K. E. (1968) Sorting networks and their applications. In AFIPS Conf. Proc. 32, Atlantic City, NJ, 30 April–2 May. pp. 307–314. [13] Hartley, R. V. L. (1942) A more symmetrical Fourier analysis applied to transmission problems. Proc. IRE, 30, 142–150. [14] Sánchez, M., López, J., Plata, O. and Zapata, E. L. (1997) An efficient architecture for the in-place fast cosine transform. In IEEE Int. Conf. on Application Specific Systems, Architectures, and Processors (ASAP), Zurich, Switzerland, pp. 499–508. IEEE Computer Society Press, Los Alamitos, CA. [15] Hockney, R. W. and Jesshope, C. R. (1988) Parallel Computers 2. Adam Hilger, Bristol, UK. [16] Trenas, M. A., López, J., Sánchez, M., Argüello, F. and Zapata, E. L. (2000) Architecture for wavelet packet transform with best tree searching. In IEEE Int. Conf. on Application Specific Systems, Architectures, and Processors (ASAP), Boston, MA, pp. 289–298. IEEE Computer Society Press, Los Alamitos, CA. [17] Strang, G. and Nguyen, T. (1997) Wavelets and Filter Banks. Wellesley–Cambridge Press, USA.. T HE C OMPUTER J OURNAL,. [18] Rabiner, L. R. and Gold, B. (1975) Theory and Application of Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ. [19] Amor, M. (1997) Divide and Conquer Algorithms on Processor Meshes. PhD Dissertation, Department of Electronics and Systems, Universidad de Santiago de Compostela, Spain. [20] De Groen, P. P. N. (1991) Base-p-cyclic reduction for tridiagonal system of equations. Appl. Num. Math., 8, 117–125. [21] Hu, Y. C., Johnsson, S. L. and Teng, S.-H. (1997) High Performance Fortran for highly irregular problems. In ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP’97), Las Vegas, NV, pp. 13–24. ACM Press, New York. [22] McCurdy, C. and Mellor-Crummey, J. (1999) An evaluation of computing paradigms for N-body simulations on distributed memory architectures. In ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP’99), Atlanta, GA, pp. 25–36. ACM Press, New York. [23] Barnes, J. and Hut, P. (1986) A hierarchical O(N log N) force-calculation algorithm. Nature, 324, 446–449. [24] Fonlupt, C., Marquet, P. and Dekeyser, J.-L. (1998) Dataparallel load balancing strategies. J. Parallel Comput., 24, 1665–1684. [25] The Portland Group Inc. (1997) pghpf Reference Manual. The Portland Group Inc., Wilsonville, OR. [26] Woo, S. C., Ohara, M., Torrie, E., Sing, J. P. and Gupta, A. (1995) The SPLASH2 programs: Characterization and methodological considerations. In 22nd Ann. Int. Symp. on Computer Architecture (ISCA’95), Santa Margherita Ligure, Italy, pp. 24–26. IEEE Computer Society Press, Los Alamitos, CA. [27] Achatz, K. and Shulte, W. (1995) Architecture independent massive parallelization of divide-and-conquer algorithms. Lecture Notes in Computer Science, 947, 97–127. Springer, Berlin. [28] Gorlatch, S. and Bischof, H. (1998) A generic MPI implementation for a data-parallel skeleton: Formal derivation and application to FFT. Parallel Process. Lett., 8, 447–458. [29] Herrmann, C. and Lengauer, C. (1996) On the space-time mapping of a class of divide-and-conquer recursions. Parallel Process. Lett., 6, 525–537. [30] Seguel, J., Bollman, D. and Feo, J. (2000) A framework for the design and implementation of FFT permutation algorithms. IEEE Trans. Parallel Distrib. Syst., 11, 625–625. [31] Liu, P. and Bhatt, S. N. (2000) Experiences with parallel N-body simulation. IEEE Trans. Parallel Distrib. Syst., 11, 1307–1323.. Vol. 44, No. 4, 2001.

(19)