2. Marco Teórico
2.2 Acerca de los significados
Let us assume for the sake of simplicity that we are required to solve a given partial dierential equation on a 2-d domain subject to given boundary and/or initial conditions. Also suppose that there is a 2-d mesh of triangles which covers this domain. As the problem is computationally extensive we would like to use a parallel solver. So the rst step in this direction is to partition the mesh into a number of subdomains.
Let us suppose the parallel solver assumes that each subdomain is assigned to a single processor of a parallel machine and that all the processors are identical (i.e. we do not consider here, the possibility of a parallel machine consisting of heterogeneous components). We assume that this partition is slightly unbalanced
(but has a reasonably low cut weight) and our job is to balance it dynamically. We wish to do this without re-partitioning the entire mesh from scratch and so we look for a more local approach. This method is described in what follows.
3.2.1 Group Balancing
In this chapter we are interested in those cases where there is a root (also called coarse) and a computational (also called ne) mesh where the latter is a renement of the former. Suppose there are P processors involved and we have a coarse mesh which is already distributed across these processors (or the mesh may itself be generated on these processors in parallel). We dene the weight of a coarse element to be the number of ne elements inside the coarse element and the weight of a coarse edge to be the number of ne mesh edges along the coarse element edge. From now onwards, by a node we shall mean a node of the weighted dual graph (seex2.3) of the coarse mesh (i.e. a coarse element).
Let us recall fromx2.4 the denition of the Weighted Partition Communication
Graph (WPCG) which represents the face adjacency of the partitions in the system (processors that share at least one edge of a coarse element with a given processor are said to be face adjacent to that processor): A WPCG is obtained by having one vertex for every subdomain in the partition and an edge between two vertices if and only if they are face adjacent to each other. The weightwNi of thei
th vertex
is equal to the sum of weights of all coarse elements on the ith processor (which
is the same as the total number of \ne" elements on the ith processor) and the
weightwEij of the edge connecting thei
th and jth processors is equal to the sum of
weights of all coarse element edges on the interpartition boundary between the two processors.
We rst divide the WPCG into two groups either by using processor IDs as in [104] or by using some other bisection method. In fact, we choose to use a weighted version of the method of spectral bisection (see [47]). This method is often considered to be computationally expensive, however it does perform better than most algorithms in minimising the cut-weight. Also the computational expense is not problematic for the WPCG as the number of processors in the system is always assumed to be small compared to the number of coarse elements in the mesh.
xi =u2 i=
p
wNi, where u2 is the Fiedler vector of the matrix S = D
TLD, L is the
weighted Laplacian matrix of the WPCG and D = diag 1 p wN i .
The groups are dened by sorting theP vertices of the WPCG according to the size of their entry in x and placing elements represented by x0
i: i = 1:::n in one
group (with x0
being the sorted vector) and those by x0
i: i = n + 1;:::;P in the
other, withn chosen so that
n X i=1 w0 Ni ? P X i=n+1 w0 Ni
is as small as possible (where w0
Ni is the weight of the vertex represented by x 0
i).
These groups are called Group1 and Group2 respectively. Ideally we would like each group to contain the same number of processors and an equal total weight, but this may not be possible due to large variations in the weights when a mesh is locally rened and the fact that P need not be even. However the cut weight resulting from this bisection is generally relatively small. The group with the higher
averageload per processor is termed as thelargergroup and the other one is called
the smaller group. In the second stage of the algorithm we try to use the idea of
local migration from the \larger" to the \smaller" group so that after migration each group contains approximately the sameaverageweight per processor without there being a signicant increase in the cut-weight.
3.2.2 Local Migration
As mentioned above the groups formed in the last section may not be ideally bal- anced. To balance them we now migrate nodes from the \larger" to the \smaller" group. There are many ways to do this. Due to the non-linear complexity of the Kernighan and Lin algorithm ([65]) we decided to apply the ideas of Fiduccia and Mattheyses ([32]) who have suggested an algorithm for the same purpose but whose complexity is linear.
For obvious reason the \larger" group is called the Sender and the \smaller" group is called Receiver group respectively. The quantity Migtot stands for total
weights of all the nodes which are to be migrated from the Sender to the Receiver group in order to leave them perfectly balanced.
LetN1 and N2 be the number of processors in Group1 and Group2 respectively.
if(Ave1 Ave 2) f Sender = Group2; Receiver = Group1; Migtot =N2 (Ave 2 ?Ave); g elsef Sender = Group1; Receiver = Group2; Migtot =N1 (Ave 1 ?Ave); g & %
Figure 3.1: Calculation of Sender, Receiver and Migtot:
are the average weights per processor in Group1 and Group2 respectively. The calculation of Sender, Receiver and Migtot is shown in Figure 3.1 below.
Note that if we transfer a set of nodes from Sender to Receiver whose combined weight is nearly or exactly equal to Migtot then after the transferring process the
average weights of both the groups will be equal to that of global average Ave, i.e. the two groups will be load balanced. The next issue which must be addressed is that of how much load from which processors in the Sender group should be transferred to which processors in the Receiver group. There are many possible ways of tackling this. Our choice is closely related to that of Vidwanset al. ([104]). Following [104] we dene the concept of candidate processors. Processors in each group that are face-adjacent to at least one processor in the other group are called candidate processors. We only allow the candidate processors to be involved in the actual migration of nodes from Sender to Receiver. LetNtot be the total weights on
all candidate processors of the Sender group. Then if the ith candidate processor
in the Sender group is face adjacent to more than one candidate processor in the Receiver group we migrate nodes to that candidate processor in the Receiver group which has the least weight. The amount of load shifted fromith candidate processor
in Sender group is denoted by Migi and is given by,
Migi = Ni Ntot Migtot; (3.1)
where Ni is the total weight of the ith candidate processor in the Sender group.
gain(k) = X (k;l) 8 > > > > < > > > > : wEk l if l 2jth processor, ?wE k l if l 2ith processor, 0 otherwise. ' & $ %
Figure 3.2: The calculation of gain.
be decided is which nodes of the weighted dual graph (i.e. coarse elements) should actually be transferred. This is naturally accomplished by aiming to transfer those nodes which result in a new cut weight which is as low as possible.
The fundamental idea behind the algorithm for transferring these nodes which minimise the cut-weight is the concept of thegain andgain density associated with moving a node onto a dierent processor. Dene thegain(k)of nodek to be the net reduction in the cost of cut edges that would result if node k were to migrate from ith candidate processor in the Sender group to the jth candidate processor in the
Receiver group. The calculation of gain(k) is shown in Figure 3.2. It is important to observe that in calculating the gain of a node k the sum is taken over all edges which have node k at one end. The gain density of a node is dened as the gain of the node divided by the weight of the node. It may also be pointed out here that we also calculate the gain densities of all the nodes of the jth processor in the Receiver
group (by denition the gain of a node in jth processor is the net reduction in the
cost of cut edges that would result if the node were to migrate fromjth processor to
the ith processor) as these densities are required in the x3.3 below where we move
nodes around between the i;j processor pair to further minimise the number of cut edges whilst retaining the load balance.
The bulk of the work needed to make a move consists of selecting the base node (a node which is about to be shifted from one processor to another processor is called a base node), moving it, and then updating the gain densities of its neighbouring nodes. We solve the rst problem, that of selecting a base node, by choosing the node with the largest gain density on the ith processor and whose weight is less
than or equal to Migi. We shift the node to the jth processor and update the gain
densities of its neighbouring nodes (observe that in general the node k will have three neighbours when we have two-dimensional domains and four neighbours when we have three-dimensional domains) by the logic explained in Figure 3.3 below,
For eachnk which is a neighbour of the node k f
Letpk be the processor to whichnk belongs;
if (pk ==j) then
decrement gain(nk) by 2*wEn k
k;
nd the new gain density of the node nk;
else if(pk ==i) then
increment gain(nk) by 2*wEn k
k;
nd the new gain density of the node nk;
else
increment the edges cut between pk and j by wEn k
k;
decrement the edges cut between pk andi by wEn k
k; g
change the sign of gain(k);
& %
Figure 3.3: Updation of gain densities and edges cut between the processors. which also update the cut weights between the processors which are aected by the move. Observe that, if the gain density associated with the base node is positive, then making that move will not only make the groups closer to load-balance but it will also reduce the total cost of the edges cut in between the two processors involved, and hence between the processors groups. The above logic is repeated for all possible candidate processors in the Sender group.
Now we are in a position to present version one of our group balancing algorithm. The main three steps of this version of the algorithm are summarised in Figure 3.4.