• No se han encontrado resultados

2. Marco Teórico

2.2 Acerca de los significados

Let us assume for the sake of simplicity that we are required to solve a given partial di erential equation on a 2-d domain subject to given boundary and/or initial conditions. Also suppose that there is a 2-d mesh of triangles which covers this domain. As the problem is computationally extensive we would like to use a parallel solver. So the rst step in this direction is to partition the mesh into a number of subdomains.

Let us suppose the parallel solver assumes that each subdomain is assigned to a single processor of a parallel machine and that all the processors are identical (i.e. we do not consider here, the possibility of a parallel machine consisting of heterogeneous components). We assume that this partition is slightly unbalanced

(but has a reasonably low cut weight) and our job is to balance it dynamically. We wish to do this without re-partitioning the entire mesh from scratch and so we look for a more local approach. This method is described in what follows.

3.2.1 Group Balancing

In this chapter we are interested in those cases where there is a root (also called coarse) and a computational (also called ne) mesh where the latter is a re nement of the former. Suppose there are P processors involved and we have a coarse mesh which is already distributed across these processors (or the mesh may itself be generated on these processors in parallel). We de ne the weight of a coarse element to be the number of ne elements inside the coarse element and the weight of a coarse edge to be the number of ne mesh edges along the coarse element edge. From now onwards, by a node we shall mean a node of the weighted dual graph (seex2.3) of the coarse mesh (i.e. a coarse element).

Let us recall fromx2.4 the de nition of the Weighted Partition Communication

Graph (WPCG) which represents the face adjacency of the partitions in the system (processors that share at least one edge of a coarse element with a given processor are said to be face adjacent to that processor): A WPCG is obtained by having one vertex for every subdomain in the partition and an edge between two vertices if and only if they are face adjacent to each other. The weightwNi of thei

th vertex

is equal to the sum of weights of all coarse elements on the ith processor (which

is the same as the total number of \ ne" elements on the ith processor) and the

weightwEij of the edge connecting thei

th and jth processors is equal to the sum of

weights of all coarse element edges on the interpartition boundary between the two processors.

We rst divide the WPCG into two groups either by using processor IDs as in [104] or by using some other bisection method. In fact, we choose to use a weighted version of the method of spectral bisection (see [47]). This method is often considered to be computationally expensive, however it does perform better than most algorithms in minimising the cut-weight. Also the computational expense is not problematic for the WPCG as the number of processors in the system is always assumed to be small compared to the number of coarse elements in the mesh.

xi =u2 i=

p

wNi, where u2 is the Fiedler vector of the matrix S = D

TLD, L is the

weighted Laplacian matrix of the WPCG and D = diag 1 p wN i  .

The groups are de ned by sorting theP vertices of the WPCG according to the size of their entry in x and placing elements represented by x0

i: i = 1:::n in one

group (with x0

being the sorted vector) and those by x0

i: i = n + 1;:::;P in the

other, withn chosen so that

n X i=1 w0 Ni ? P X i=n+1 w0 Ni

is as small as possible (where w0

Ni is the weight of the vertex represented by x 0

i).

These groups are called Group1 and Group2 respectively. Ideally we would like each group to contain the same number of processors and an equal total weight, but this may not be possible due to large variations in the weights when a mesh is locally re ned and the fact that P need not be even. However the cut weight resulting from this bisection is generally relatively small. The group with the higher

averageload per processor is termed as thelargergroup and the other one is called

the smaller group. In the second stage of the algorithm we try to use the idea of

local migration from the \larger" to the \smaller" group so that after migration each group contains approximately the sameaverageweight per processor without there being a signi cant increase in the cut-weight.

3.2.2 Local Migration

As mentioned above the groups formed in the last section may not be ideally bal- anced. To balance them we now migrate nodes from the \larger" to the \smaller" group. There are many ways to do this. Due to the non-linear complexity of the Kernighan and Lin algorithm ([65]) we decided to apply the ideas of Fiduccia and Mattheyses ([32]) who have suggested an algorithm for the same purpose but whose complexity is linear.

For obvious reason the \larger" group is called the Sender and the \smaller" group is called Receiver group respectively. The quantity Migtot stands for total

weights of all the nodes which are to be migrated from the Sender to the Receiver group in order to leave them perfectly balanced.

LetN1 and N2 be the number of processors in Group1 and Group2 respectively.

if(Ave1  Ave 2) f Sender = Group2; Receiver = Group1; Migtot =N2 (Ave 2 ?Ave); g elsef Sender = Group1; Receiver = Group2; Migtot =N1 (Ave 1 ?Ave); g & %

Figure 3.1: Calculation of Sender, Receiver and Migtot:

are the average weights per processor in Group1 and Group2 respectively. The calculation of Sender, Receiver and Migtot is shown in Figure 3.1 below.

Note that if we transfer a set of nodes from Sender to Receiver whose combined weight is nearly or exactly equal to Migtot then after the transferring process the

average weights of both the groups will be equal to that of global average Ave, i.e. the two groups will be load balanced. The next issue which must be addressed is that of how much load from which processors in the Sender group should be transferred to which processors in the Receiver group. There are many possible ways of tackling this. Our choice is closely related to that of Vidwanset al. ([104]). Following [104] we de ne the concept of candidate processors. Processors in each group that are face-adjacent to at least one processor in the other group are called candidate processors. We only allow the candidate processors to be involved in the actual migration of nodes from Sender to Receiver. LetNtot be the total weights on

all candidate processors of the Sender group. Then if the ith candidate processor

in the Sender group is face adjacent to more than one candidate processor in the Receiver group we migrate nodes to that candidate processor in the Receiver group which has the least weight. The amount of load shifted fromith candidate processor

in Sender group is denoted by Migi and is given by,

Migi =  Ni Ntot  Migtot; (3.1)

where Ni is the total weight of the ith candidate processor in the Sender group.

gain(k) = X (k;l) 8 > > > > < > > > > : wEk l if l 2jth processor, ?wE k l if l 2ith processor, 0 otherwise. ' & $ %

Figure 3.2: The calculation of gain.

be decided is which nodes of the weighted dual graph (i.e. coarse elements) should actually be transferred. This is naturally accomplished by aiming to transfer those nodes which result in a new cut weight which is as low as possible.

The fundamental idea behind the algorithm for transferring these nodes which minimise the cut-weight is the concept of thegain andgain density associated with moving a node onto a di erent processor. De ne thegain(k)of nodek to be the net reduction in the cost of cut edges that would result if node k were to migrate from ith candidate processor in the Sender group to the jth candidate processor in the

Receiver group. The calculation of gain(k) is shown in Figure 3.2. It is important to observe that in calculating the gain of a node k the sum is taken over all edges which have node k at one end. The gain density of a node is de ned as the gain of the node divided by the weight of the node. It may also be pointed out here that we also calculate the gain densities of all the nodes of the jth processor in the Receiver

group (by de nition the gain of a node in jth processor is the net reduction in the

cost of cut edges that would result if the node were to migrate fromjth processor to

the ith processor) as these densities are required in the x3.3 below where we move

nodes around between the i;j processor pair to further minimise the number of cut edges whilst retaining the load balance.

The bulk of the work needed to make a move consists of selecting the base node (a node which is about to be shifted from one processor to another processor is called a base node), moving it, and then updating the gain densities of its neighbouring nodes. We solve the rst problem, that of selecting a base node, by choosing the node with the largest gain density on the ith processor and whose weight is less

than or equal to Migi. We shift the node to the jth processor and update the gain

densities of its neighbouring nodes (observe that in general the node k will have three neighbours when we have two-dimensional domains and four neighbours when we have three-dimensional domains) by the logic explained in Figure 3.3 below,

For eachnk which is a neighbour of the node k f

Letpk be the processor to whichnk belongs;

if (pk ==j) then

decrement gain(nk) by 2*wEn k

k;

nd the new gain density of the node nk;

else if(pk ==i) then

increment gain(nk) by 2*wEn k

k;

nd the new gain density of the node nk;

else

increment the edges cut between pk and j by wEn k

k;

decrement the edges cut between pk andi by wEn k

k; g

change the sign of gain(k);

& %

Figure 3.3: Updation of gain densities and edges cut between the processors. which also update the cut weights between the processors which are a ected by the move. Observe that, if the gain density associated with the base node is positive, then making that move will not only make the groups closer to load-balance but it will also reduce the total cost of the edges cut in between the two processors involved, and hence between the processors groups. The above logic is repeated for all possible candidate processors in the Sender group.

Now we are in a position to present version one of our group balancing algorithm. The main three steps of this version of the algorithm are summarised in Figure 3.4.

3.3 Further Re nement of the Algorithm : Lo-