
The second step of the PCFG construction is the computation of the synchronization edges. To that end, we first identify the nodes that can synchronize tasks:

– Task nodes ($n_T$) synchronize previous sibling tasks whose dependences match (the next paragraph explains the algorithm that decides whether the dependences of two tasks match).

– Taskwait nodes ($n_{TW}$) synchronize previous tasks that are child tasks of the current task.

– Barrier nodes ($n_B$) synchronize any previous task in the same binding region.

– The virtual post-synchronization node ($n_{VPS}$) is a unique node added to every PCFG to virtually synchronize the tasks that may not be synchronized within the scope of the graph.

The methods involved in the task synchronization algorithm are described in Figure 3.2. There, given that an inout dependence is equivalent to an out dependence, a task $T_2$ synchronizes a task $T_1$ if the tasks are siblings and at least one of the following conditions holds: a) $T_1$ designates an out object that $T_2$ designates as in or out (RAW and WAW data hazards, respectively), or b) $T_1$ designates an in object that $T_2$ designates as out (WAR data hazard).
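The hazard conditions above can be made concrete with a small classifier. This is a minimal C sketch, assuming both tasks are siblings, $T_1$ is created before $T_2$, and both designate the same object; the names `dep_type`, `hazard` and `classify_hazard` are hypothetical and not part of any compiler API. As in the text, `inout` is treated as `out`.

```c
#include <assert.h>

/* Dependence types of a task directive; inout is treated as out. */
typedef enum { DEP_IN, DEP_OUT, DEP_INOUT } dep_type;

/* Hazard that makes a later sibling task T2 wait for an earlier task T1
   when both designate the same object (hypothetical names). */
typedef enum { HAZARD_NONE, HAZARD_RAW, HAZARD_WAW, HAZARD_WAR } hazard;

static int is_write(dep_type d) { return d == DEP_OUT || d == DEP_INOUT; }

static hazard classify_hazard(dep_type t1, dep_type t2) {
    if (is_write(t1))                 /* condition a): T1 designates out */
        return is_write(t2) ? HAZARD_WAW : HAZARD_RAW;
    if (is_write(t2))                 /* condition b): T1 in, T2 out */
        return HAZARD_WAR;
    return HAZARD_NONE;               /* in/in: no synchronization needed */
}
```

For example, `classify_hazard(DEP_INOUT, DEP_IN)` yields `HAZARD_RAW`, while two `in` dependences yield `HAZARD_NONE`.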

It may not be possible to determine statically whether two tasks synchronize, because it cannot always be asserted whether two dependences designate the same object (e.g., dependences of the form var[expr]). Thus, this process, modeled with the function $\oplus$ in Figure 3.2, can answer {yes, no, unknown}.

Let $N_T$ be the set of task nodes $n_T$ in a given PCFG, and $N_{deps}$ the maximum number of dependence clauses of a task directive. The cost of computing all synchronizations over that PCFG, which means calling $\mathit{synchronizes}$ for each pair of nodes in $N_T$, is $O(|N_T|^2 \times N_{deps}^2)$.
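The quadratic bound can be checked by counting the `match` evaluations of a naive all-pairs pass. The sketch is illustrative only: `task_node` and `count_match_calls` are hypothetical names, and each task is reduced to the number of dependence items it carries. Because every ordered pair of tasks compares every pair of their items, the count equals $(\sum_t \mathit{ndeps}_t)^2$, which is at most $|N_T|^2 \times N_{deps}^2$.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task node: only its number of dependence items matters here. */
typedef struct { size_t ndeps; } task_node;

/* Number of match(d1, d2) evaluations performed when synchronizes is called
   for every ordered pair of task nodes (self-pairs included, since a task
   inside a loop can synchronize with a later instance of itself). */
static size_t count_match_calls(const task_node *tasks, size_t ntasks) {
    size_t calls = 0;
    for (size_t a = 0; a < ntasks; ++a)
        for (size_t b = 0; b < ntasks; ++b)
            calls += tasks[a].ndeps * tasks[b].ndeps;
    return calls;
}
```

With three tasks of one, two and three items, the pass performs 36 evaluations, below the bound $3^2 \times 3^2 = 81$.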

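$$
\mathit{match}(d_1, d_2) =
\begin{cases}
\mathit{YES}, & \text{if } (d_1 = v_1 \wedge d_2 = v_2 \wedge v_1 = v_2) \vee (d_1 = v[k_1] \wedge d_2 = v[k_2] \wedge k_1 = k_2)\\
\mathit{NO}, & \text{if } d_1 = v_1[e_1] \wedge d_2 = v_2[e_2] \wedge v_1 \neq v_2 \wedge v_1, v_2 \text{ are arrays or restrict pointers}\\
\mathit{UNK}, & \text{otherwise}
\end{cases}
$$

$$
a \oplus b =
\begin{cases}
\mathit{YES}, & \text{if } a = \mathit{YES} \vee b = \mathit{YES}\\
\mathit{NO}, & \text{if } a = \mathit{NO} \wedge b = \mathit{NO}\\
\mathit{UNK}, & \text{otherwise}
\end{cases}
$$

$$
\mathit{siblings}(n_{T1}, n_{T2}) =
\begin{cases}
\mathit{YES}, & \text{if } n_{T1}, n_{T2} \text{ are child tasks of the same task region}\\
\mathit{NO}, & \text{otherwise}
\end{cases}
$$

$$
\mathit{synchronizes}(n_{T1}, n_{T2}) = \mathit{siblings}(n_{T1}, n_{T2}) \wedge \Big( \bigoplus_{d_1 \in \mathit{out}(n_{T1})} \; \bigoplus_{d_2 \in \mathit{in}(n_{T2}) \cup \mathit{out}(n_{T2})} \mathit{match}(d_1, d_2) \;\oplus \bigoplus_{d_1 \in \mathit{in}(n_{T1})} \; \bigoplus_{d_2 \in \mathit{out}(n_{T2})} \mathit{match}(d_1, d_2) \Big)
$$

Figure 3.2: Methods involved in the task synchronization algorithm.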
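A direct transcription of Figure 3.2 into C may help when reading the definitions. It is a sketch under several assumptions: dependence items are reduced to three hypothetical shapes (a whole variable `v`, `v[k]` with a constant `k`, and `v[e]` with an arbitrary expression `e`), variables are identified by integers, and `siblings` is approximated by comparing the identifier of the parent task region. `tri_or` implements the operator $\oplus$, and every case not covered by the first two rows of `match` falls back to UNK, as in the figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NO, YES, UNK } tri;                 /* three-valued answer */
typedef enum { DEP_IN, DEP_OUT, DEP_INOUT } dep_type;
typedef enum { D_VAR, D_CONST_INDEX, D_EXPR_INDEX } dep_form;

typedef struct {                  /* hypothetical dependence item */
    dep_type type;
    dep_form form;                /* v, v[k] (constant k) or v[e] */
    int var;                      /* identifier of the base variable v */
    long k;                       /* subscript, used only for D_CONST_INDEX */
    bool array_or_restrict;       /* v is an array or a restrict pointer */
} dep;

typedef struct {                  /* hypothetical task node */
    int parent;                   /* enclosing task region */
    const dep *deps;
    size_t ndeps;
} task;

/* Operator (+) of Figure 3.2: YES if either is YES, NO if both are NO. */
static tri tri_or(tri a, tri b) {
    if (a == YES || b == YES) return YES;
    if (a == NO && b == NO) return NO;
    return UNK;
}

static tri match(const dep *d1, const dep *d2) {
    if (d1->var == d2->var &&
        ((d1->form == D_VAR && d2->form == D_VAR) ||
         (d1->form == D_CONST_INDEX && d2->form == D_CONST_INDEX && d1->k == d2->k)))
        return YES;
    if (d1->form != D_VAR && d2->form != D_VAR && d1->var != d2->var &&
        d1->array_or_restrict && d2->array_or_restrict)
        return NO;
    return UNK;
}

static bool is_out(const dep *d) { return d->type != DEP_IN; }   /* inout == out */

/* synchronizes(t1, t2), with t1 created before its sibling t2. */
static tri synchronizes(const task *t1, const task *t2) {
    if (t1->parent != t2->parent) return NO;       /* siblings(t1, t2) = NO */
    tri r = NO;                                    /* neutral element of (+) */
    for (size_t i = 0; i < t1->ndeps; ++i)
        for (size_t j = 0; j < t2->ndeps; ++j)
            /* out(t1) x (in(t2) U out(t2)) and in(t1) x out(t2):
               every pair of items except in/in */
            if (is_out(&t1->deps[i]) || is_out(&t2->deps[j]))
                r = tri_or(r, match(&t1->deps[i], &t2->deps[j]));
    return r;
}
```

Two subscripted accesses to the same array, such as `y[e]` written by one task and `y[e']` updated by a sibling, yield UNK: the same kind of situation that produces the Maybe edge discussed for Listing 3.1.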

Synchronization edges have a kind $k$ that may take one of the following values:

– strict: a task node $n_{T1}$ certainly synchronizes in a node $n$ because either:

* $n = n_{TW}$ and both are in the same binding region.

* $n = n_B$ and $n$ is a region that encloses, or is the same region as, the binding region of $n_{T1}$.

* $n = n_{T2}$ and $\mathit{synchronizes}(n_{T1}, n_{T2}) = \mathit{YES}$.

– maybe: it cannot be statically decided whether a task node $n_{T1}$ synchronizes with $n_{T2}$ (i.e., $\mathit{synchronizes}(n_{T1}, n_{T2}) = \mathit{UNK}$).

– post: the synchronization may occur at any time after the function ends.
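The three kinds can be summarized as a decision on the node where the synchronization happens. In this sketch, `edge_kind_of` and its boolean inputs are hypothetical; the caller is assumed to have already computed the binding-region facts and the result of `synchronizes`. Post edges are not produced here, because they are only added at the end of the analysis, for tasks still live at the exit of the graph.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NO, YES, UNK } tri;
typedef enum { AT_TASK, AT_TASKWAIT, AT_BARRIER } sync_point;
typedef enum { EDGE_NONE, EDGE_STRICT, EDGE_MAYBE } edge_kind;

/* Kind of the synchronization edge from a task node n_T1 to a node n.
   same_binding: n_T1 and the taskwait n are in the same binding region.
   encloses:     the barrier region n encloses, or equals, the binding
                 region of n_T1.
   sync:         synchronizes(n_T1, n), only used when n is a task node. */
static edge_kind edge_kind_of(sync_point n, bool same_binding, bool encloses, tri sync) {
    switch (n) {
    case AT_TASKWAIT: return same_binding ? EDGE_STRICT : EDGE_NONE;
    case AT_BARRIER:  return encloses ? EDGE_STRICT : EDGE_NONE;
    case AT_TASK:
        if (sync == YES) return EDGE_STRICT;
        if (sync == UNK) return EDGE_MAYBE;
        return EDGE_NONE;
    }
    return EDGE_NONE;
}
```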

Synchronization edges are computed using a forward data-flow algorithm that defines the tasks live at the entry point, $\mathit{LITask} \subseteq N$, and at the exit point, $\mathit{LOTask} \subseteq N$, of each node in a PCFG.

A task node nT > LIT askˆn if:

nT > ancestorˆn ,

~§ nœ> predecessorˆn  e ˆnT, nœ, strict > ESˆnœ

A task node nT > LOT askˆn if:

synchronizesˆnT, n ˜NO, UNK -

all matched dependences in n are inputs- nT has unmatched dependences

Additionally, when computing the $\mathit{LOTask}$ set, the tasks that remain live because all the matched dependences of the target are inputs are singled out. These tasks may be the source of dependences for several target tasks with input dependences on the same variables, and they definitely synchronize when a taskwait or a barrier is reached.
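The conditions for staying in $\mathit{LOTask}(n)$, together with the singled-out case, can be sketched as a small decision function. The names are hypothetical, and the two boolean flags stand for facts the analysis is assumed to derive from the matched dependences.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NO, YES, UNK } tri;

/* State of a live task n_T after node n (hypothetical names). */
typedef enum {
    SYNCED,            /* removed from LOTask(n) */
    LIVE,              /* stays live: not certainly synchronized in n */
    LIVE_INPUT_ONLY    /* stays live only because n's matched deps are inputs;
                          it definitely synchronizes at a taskwait or barrier */
} live_state;

/* sync:       synchronizes(n_T, n)
   all_inputs: all dependences of n that match n_T are inputs
   unmatched:  n_T has dependences that n does not match */
static live_state state_after(tri sync, bool all_inputs, bool unmatched) {
    if (sync != YES || unmatched) return LIVE;
    if (all_inputs) return LIVE_INPUT_ONLY;   /* the singled-out case */
    return SYNCED;
}
```

For instance, a task that writes `x` followed by a sibling that only reads `x` ends up as `LIVE_INPUT_ONLY`, since further readers of `x` may still depend on it.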

Theorem 1. $\mathit{TSDFA}_F = \langle L, T_f \rangle$ is the bounded monotone forward Tasks Synchronization Data-Flow Algorithm that computes the task synchronizations over a graph $G$, and consists of:

– $L = \langle S \times R, A \rangle$ is the meet-semilattice [8] that imposes a partial order over all possible data-flow values in the algorithm, where:

* $S \subseteq \{n_T \in N\}$ is a subset of all task nodes, with two special elements: $\top$, the lattice top element, equivalent to the empty set, and $\bot$, the lattice bottom element, equivalent to $S$.

* $R = N \times \mathit{KIND}'$, where $\mathit{KIND}' = \{\mathit{strict}, \mathit{maybe}\}$, is the set of kind relationships of two synchronized nodes.

* $A = (\cup, \emptyset)$ is the meet operator that merges flow values and imposes an order over the lattice by using only the first element of the pair representing each data-flow value. The meet operator is used to compute the live tasks at the entry of a node $n \in N$ as follows:

$$
\mathit{LITask}(n) = \Big( \bigcup_{p \in \mathit{pred}(n)} \mathit{LOTask}(p),\ \emptyset \Big)
$$

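The meet operator is monotone. Given the elements $x_1$, $x_2$, $y_1$ and $y_2$, it fulfills: $x_1 \sqsubseteq y_1 \wedge x_2 \sqsubseteq y_2 \Rightarrow (x_1 \sqcap x_2) \sqsubseteq (y_1 \sqcap y_2)$.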

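– $T_f = \{f : S \times R \to S \times R\}$ is the family of transfer functions that maps the program behavior onto $A$, computing $\mathit{LOTask}(n)$ for each $n \in N$ as follows:

$$
\begin{aligned}
f(n_T) = \Big( &\{ n_T' \mid n_T' \in \mathit{LITask}(n_T) \wedge (\neg \mathit{siblings}(n_T, n_T') \vee \mathit{synchronizes}(n_T, n_T') \neq \mathit{YES}) \},\\
&\{ (n_T', \mathit{strict}) \mid n_T' \in \mathit{LITask}(n_T) \wedge \mathit{siblings}(n_T, n_T') \wedge \mathit{synchronizes}(n_T, n_T') = \mathit{YES} \}\\
&\cup \{ (n_T'', \mathit{maybe}) \mid n_T'' \in \mathit{LITask}(n_T) \wedge \mathit{siblings}(n_T, n_T'') \wedge \mathit{synchronizes}(n_T, n_T'') = \mathit{UNK} \} \Big) \quad \text{/* task */}
\end{aligned}
$$

$$
f(n_{TW}) = \Big( \{ n_T \mid n_T \in \mathit{LITask}(n_{TW}) \wedge \neg \mathit{siblings}(n_{TW}, n_T) \},\ \{ (n_T, \mathit{strict}) \mid n_T \in \mathit{LITask}(n_{TW}) \wedge \mathit{siblings}(n_{TW}, n_T) \} \Big) \quad \text{/* taskwait */}
$$

$$
f(n_B) = \Big( \emptyset,\ \{ (n_T, \mathit{strict}) \mid n_T \in \mathit{LITask}(n_B) \} \Big) \quad \text{/* barrier */}
$$

$$
f(n) = \Big( \{ n_T \mid n_T \in \mathit{LITask}(n) \},\ \emptyset \Big) \quad \text{/* any other node */}
$$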

All transfer functions are monotone. Given the elements $x$ and $y$, they fulfill: $x \sqsubseteq y \Rightarrow f(x) \sqsubseteq f(y)$.
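The lattice $L$ and its meet can be represented compactly with bit sets. The following C sketch assumes at most 64 task nodes, one bit per task; `task_set`, `lattice_leq` and `meet_preds` are hypothetical names. It shows that $\top$ is the empty set, $\bot$ is the full set, and the order is reverse inclusion, so the union used by the meet moves values down the lattice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t task_set;               /* bit i set <=> task node i is live */

#define LATTICE_TOP    ((task_set)0)     /* empty set */
#define LATTICE_BOTTOM (~(task_set)0)    /* every task node */

/* x is below or equal to y in L  iff  x is a superset of y. */
static bool lattice_leq(task_set x, task_set y) { return (x & y) == y; }

/* First component of the meet: LITask(n) = union of LOTask(p) over pred(n). */
static task_set meet_preds(const task_set *lotask_of_preds, size_t npreds) {
    task_set r = LATTICE_TOP;            /* neutral element of the union */
    for (size_t i = 0; i < npreds; ++i)
        r |= lotask_of_preds[i];
    return r;
}
```

The meet of any set of values is below each of them: `lattice_leq(meet_preds(p, n), p[i])` holds for every `i`.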

Each transfer function computes the pair $\langle \mathit{LOTask}(m), \mathit{SynchronizedTask}(m) \rangle$ of a given node $m$. The first element is the set of tasks that are still live after the execution of $m$. The second element is the set of tasks synchronized in $m$, each paired with the kind of its synchronization. E.g., for a task node $n_T$, the transfer function $f(n_T)$ returns a pair where: a) the first element contains the tasks $n_T'$ in $\mathit{LITask}(n_T)$ that either are not siblings of $n_T$ or are not certainly synchronized in $n_T$ ($\mathit{synchronizes}(n_T, n_T') \neq \mathit{YES}$), and b) the second element contains the tasks in $\mathit{LITask}(n_T)$ that are siblings of $n_T$ and are certainly synchronized in $n_T$ ($\mathit{synchronizes}(n_T, n_T') = \mathit{YES}$), tagged strict, together with those whose synchronization cannot be decided ($\mathit{UNK}$), tagged maybe.
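The transfer functions can likewise be sketched over bit sets. This is an illustrative C version, not the compiler's implementation: `transfer`, `flow` and the per-node arrays `sib` and `sync` (standing for `siblings` and `synchronizes` evaluated against each live task) are hypothetical. Two assumptions are labeled in the code: the task created at a task node is added to its own live-out set, so that later nodes (including the same node through a loop back edge) can synchronize with it, and the $\mathit{LOTask}$ refinements for input-only or unmatched dependences are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t task_set;                  /* bit i <=> task node i (at most 64) */
typedef enum { NO, YES, UNK } tri;
typedef enum { NODE_OTHER, NODE_TASK, NODE_TASKWAIT, NODE_BARRIER } node_kind;

/* Pair <LOTask(m), synchronized tasks>, the latter split by kind. */
typedef struct { task_set live_out, strict, maybe; } flow;

static task_set bit(int i) { return (task_set)1 << i; }

/* sib[i]: siblings(m, i); sync[i]: synchronizes(m, i) (task nodes only).
   self:   id of the task created at m (task nodes only). */
static flow transfer(node_kind kind, int self, task_set live_in,
                     const bool *sib, const tri *sync) {
    flow r = { live_in, 0, 0 };             /* any other node: identity */
    if (kind == NODE_BARRIER) {             /* every live task is synchronized */
        r.live_out = 0;
        r.strict = live_in;
    } else if (kind == NODE_TASKWAIT || kind == NODE_TASK) {
        for (int i = 0; i < 64; ++i) {
            if (!(live_in & bit(i)) || !sib[i]) continue;
            if (kind == NODE_TASKWAIT || sync[i] == YES) {
                r.live_out &= ~bit(i);      /* certainly synchronized in m */
                r.strict |= bit(i);
            } else if (sync[i] == UNK) {
                r.maybe |= bit(i);          /* maybe edge; task stays live */
            }
        }
        if (kind == NODE_TASK)
            r.live_out |= bit(self);        /* assumption: new task becomes live */
    }
    return r;
}
```

Keeping the maybe-synchronized tasks in `live_out` mirrors the first component of $f(n_T)$, which removes a task only when $\mathit{synchronizes}$ answers YES.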

The framework is monotone, and the semi-lattice $L$ has finite height: the number of tasks in a program is finite, so the number of sets combining these tasks is finite as well. Therefore, the algorithm is guaranteed to converge.

Algorithm 1 shows the high-level iterative algorithm that computes the task synchronizations over a PCFG $G$. The algorithm initializes the entry node of the graph with the top element of the lattice, $\top$. Then, it performs forward traversals over $G$, computing the $\mathit{LITask}(n)$ and $\mathit{LOTask}(n)$ sets of each node $n$, until no data-flow value changes. At this point there may still be live tasks at the exit node of $G$; these are synchronized with the virtual post-synchronization node, $n_{VPS}$, of the graph.

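Algorithm 1: High-level algorithm for synchronizing tasks within a PCFG.

1: $\mathit{LITask}(n_{EN}) = \mathit{LOTask}_{new}(n_{EN}) = \top$

2: for each $n \in N$ do $\mathit{LOTask}_{new}(n) = \top$

3: $\mathit{worklist} = \{p \mid p \in \mathit{succ}(n_{EN})\}$

4: while !worklist.empty() do

5:   take a node $n$ out of the worklist: $\mathit{worklist} = \mathit{worklist} \setminus \{n\}$

6:   $\mathit{LITask}(n) = \bigcup_{p \in \mathit{pred}(n)} \mathit{LOTask}_{new}(p)$

7:   $\mathit{LOTask}_{old}(n) = \mathit{LOTask}_{new}(n)$

8:   $\mathit{LOTask}_{new}(n) = f(n)$

9:   if $\mathit{LOTask}_{old}(n) \neq \mathit{LOTask}_{new}(n)$ then

10:     $\mathit{worklist} = \mathit{worklist} \cup \{s \mid s \in \mathit{succ}(n)\}$

11:   end if

12: end while

13: for each $n_T \in \mathit{LOTask}(n_{EX})$ do

14:   add edge($n_T$, $n_{VPS}$, post, NULL) to $G$

15: end for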
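Algorithm 1 can be turned into a small runnable C sketch. Everything in it is an assumption made for illustration: the `pcfg` structure, node 0 as ENTRY and the last node as EXIT, at most 64 nodes, and precomputed `sib`/`sync` tables. Only the first component of each transfer function is tracked. One deliberate difference from the pseudocode: the worklist is seeded with every node except ENTRY rather than only with the successors of ENTRY, so that each node is processed at least once even if its input never differs from the initial empty set (under a literal reading of the pseudocode, a task node placed after such a node would never be visited).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64
typedef uint64_t node_set;        /* bit i <=> node i; tasks are named by the
                                     id of their task node */
typedef enum { NO, YES, UNK } tri;
typedef enum { NODE_OTHER, NODE_TASK, NODE_TASKWAIT, NODE_BARRIER } node_kind;

typedef struct {                  /* hypothetical PCFG encoding */
    int n;                        /* node 0 is ENTRY, node n-1 is EXIT */
    node_kind kind[MAX_NODES];
    node_set pred[MAX_NODES], succ[MAX_NODES];
    bool sib[MAX_NODES][MAX_NODES];   /* siblings(a, b) */
    tri sync[MAX_NODES][MAX_NODES];   /* synchronizes(a, b) */
} pcfg;

static node_set bit(int i) { return (node_set)1 << i; }

/* f(n), first component only (LOTask). */
static node_set transfer(const pcfg *g, int n, node_set in) {
    node_set out = in;
    if (g->kind[n] == NODE_BARRIER) return 0;
    for (int t = 0; t < g->n; ++t) {
        if (!(in & bit(t)) || !g->sib[n][t]) continue;
        if (g->kind[n] == NODE_TASKWAIT ||
            (g->kind[n] == NODE_TASK && g->sync[n][t] == YES))
            out &= ~bit(t);
    }
    if (g->kind[n] == NODE_TASK) out |= bit(n);   /* assumption: new task is live */
    return out;
}

/* Iterates to a fixed point and returns LOTask(EXIT): the tasks that
   receive a post edge to the virtual node n_VPS. */
static node_set tasks_escaping(const pcfg *g, node_set lotask[MAX_NODES]) {
    node_set worklist = 0;
    for (int i = 0; i < g->n; ++i) lotask[i] = 0;      /* top: empty set */
    for (int i = 1; i < g->n; ++i) worklist |= bit(i); /* every node but ENTRY */
    while (worklist) {
        int n = 0;
        while (!(worklist & bit(n))) ++n;              /* pick lowest id */
        worklist &= ~bit(n);
        node_set in = 0;                               /* meet: union */
        for (int p = 0; p < g->n; ++p)
            if (g->pred[n] & bit(p)) in |= lotask[p];
        node_set out = transfer(g, n, in);
        if (out != lotask[n]) {
            lotask[n] = out;
            worklist |= g->succ[n];
        }
    }
    return lotask[g->n - 1];
}
```

For a loop whose body only creates one task (ENTRY, loop header, task node back to the header, header to EXIT) with $\mathit{synchronizes}$ answering UNK for the task against itself, the function returns the task node: the task escapes and gets a post edge, as in Figure 3.3.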

As an illustration, Figure 3.3 shows a simplified version of the PCFG resulting from the code in Listing 3.1, a blocked matrix multiplication using OpenMP tasks. The information related to the tasks is drawn in red (task and task creation nodes, and synchronization edges with their corresponding labels). Note the synchronization edge from the task to itself, tagged as Maybe, because the inout dependence on C[i:BS][j:BS] cannot be statically decided at this point, as its value may vary between task instances (A and B are not considered to compute this edge because both are input dependences). Furthermore, the task escapes its scope because no taskwait or barrier synchronizes it within the function, so it is connected to the virtual post-synchronization node.

void matmul_depend(int N, int BS, float A[N][N], float B[N][N], float C[N][N]) {
  int ii, jj, kk;  /* declared here so that the private clause can name them */
  for (int i = 0; i < N; i += BS)
    for (int j = 0; j < N; j += BS)
      for (int k = 0; k < N; k += BS) {
        #pragma omp task private(ii, jj, kk) \
            depend(in: A[i:BS][k:BS], B[k:BS][j:BS]) \
            depend(inout: C[i:BS][j:BS])
        for (ii = i; ii < i + BS; ii++)
          for (jj = j; jj < j + BS; jj++)
            for (kk = k; kk < k + BS; kk++)
              C[ii][jj] = C[ii][jj] + A[ii][kk] * B[kk][jj];
      }
}

Listing 3.1: Matrix multiplication using OpenMP tasks (example task_dep.5.c from the specification examples [112])
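Why the self-edge of the task in Listing 3.1 can only be tagged Maybe becomes visible by enumerating its instances. The sketch below is a hypothetical model, not part of the PCFG analysis: it assumes `BS` divides `N`, interprets `C[i:BS][j:BS]` as an OpenMP array section (start `i`, length `BS` in each dimension), and counts the pairs of instances, in creation order, whose inout dependences on `C` overlap. Only instances that share `i` and `j` conflict, so whether two instances synchronize depends on the loop values of each instance, which a static comparison of the expression `C[i:BS][j:BS]` with itself cannot resolve.

```c
#include <assert.h>
#include <stdbool.h>

/* One instance of the task in Listing 3.1, identified by its loop values. */
typedef struct { int i, j, k; } instance;

/* Do the sections C[a.i:BS][a.j:BS] and C[b.i:BS][b.j:BS] overlap? */
static bool c_blocks_overlap(instance a, instance b, int BS) {
    bool rows = a.i < b.i + BS && b.i < a.i + BS;
    bool cols = a.j < b.j + BS && b.j < a.j + BS;
    return rows && cols;
}

/* Decodes the linear creation index (loop order i, j, k) of an instance. */
static instance nth_instance(int idx, int nb, int BS) {
    instance t = { idx / (nb * nb) * BS, idx / nb % nb * BS, idx % nb * BS };
    return t;
}

/* Counts pairs (earlier, later) of instances whose inout dependences on C
   overlap; assumes BS divides N. */
static int count_dependent_pairs(int N, int BS) {
    int nb = N / BS, total = nb * nb * nb, count = 0;
    for (int a = 0; a < total; ++a)
        for (int b = a + 1; b < total; ++b)
            if (c_blocks_overlap(nth_instance(a, nb, BS), nth_instance(b, nb, BS), BS))
                ++count;
    return count;
}
```

With `N = 4` and `BS = 2` there are 8 instances and 4 conflicting pairs, one per block of `C`.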

[Figure 3.3 (image not reproduced): simplified PCFG with the FunctionCode node; LoopFor nodes for the i, j and k loops; a TASK_CREATION node leading to the OmpTask node, whose body holds the ii, jj and kk loops around C[ii][jj] = C[ii][jj] + A[ii][kk] * B[kk][jj]; a Maybe edge from the task to itself; and a Post edge to the POST_SYNC node.]

Figure 3.3: PCFG for code in Listing 3.1.
