The second step of the PCFG construction is the computation of the synchronization edges. For that purpose, first we recognize the nodes that are able to synchronize tasks:
– Task nodes ($n_T$) synchronize previous sibling tasks whose dependences match (the next paragraph explains the algorithm that computes whether two tasks' dependences match).
– Taskwait nodes ($n_{TW}$) synchronize previous tasks that are child tasks of the current task.
– Barrier nodes ($n_B$) synchronize any previous task in the same binding region.
– Virtual post-synchronization node ($n_{VPS}$): a unique node added to every PCFG that needs to virtually synchronize those tasks that may not be synchronized within the scope of the graph.
The methods involved in the task synchronization algorithm are described in Figure 3.2. There, given that an inout dependence is equivalent to an out dependence, a task $T_2$ synchronizes a task $T_1$ if the tasks are siblings and at least one of the following conditions holds: a) $T_1$ designates an out object that $T_2$ designates as in or out (RAW and WAW data hazards, respectively); b) $T_1$ designates an in object that $T_2$ designates as out (WAR data hazard).
It may not be possible to statically determine whether two tasks synchronize, because it cannot always be asserted whether two dependences designate the same object (e.g., dependences of the form var[expr]). Thus, this process, modeled with the function $\vee_3$, can answer {yes, no, unknown}.
Consider $N_T$ the set of nodes $n_T$ in a given PCFG, and $N_{deps}$ the maximum number of dependence clauses a task directive has. The cost of computing all synchronizations over that PCFG, which means calling synchronizes for each pair of nodes in $N_T$, is $O(|N_T|^2 \cdot N_{deps}^2)$.
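As a rough sanity check of this bound, the count can be broken down per pair of task nodes. The figures used below (10 task nodes with at most 3 dependence clauses each) are hypothetical and only illustrate the arithmetic:

```latex
\underbrace{|N_T|^2}_{\text{ordered pairs of task nodes}}
\cdot
\underbrace{N_{deps}^2}_{\text{dependence comparisons per pair}}
\quad\Rightarrow\quad
10^2 \cdot 3^2 = 900 \text{ dependence comparisons at most}
```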
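$$
match(d_1, d_2) =
\begin{cases}
YES, & \text{if } (d_1 = v_1 \wedge d_2 = v_2 \wedge v_1 = v_2) \vee (d_1 = v[k_1] \wedge d_2 = v[k_2] \wedge k_1 = k_2) \\
NO, & \text{if } d_1 = v_1[e_1] \wedge d_2 = v_2[e_2] \wedge v_1 \neq v_2 \wedge v_1, v_2 \text{ are arrays or restrict pointers} \\
UNK, & \text{otherwise}
\end{cases}
$$
$$
a \vee_3 b =
\begin{cases}
YES, & \text{if } a = YES \vee b = YES \\
NO, & \text{if } a = NO \wedge b = NO \\
UNK, & \text{otherwise}
\end{cases}
$$
$$
siblings(n_{T1}, n_{T2}) =
\begin{cases}
YES, & \text{if } n_{T1}, n_{T2} \text{ are child tasks of the same task region} \\
NO, & \text{otherwise}
\end{cases}
$$
$$
synchronizes(n_{T1}, n_{T2}) = siblings(n_{T1}, n_{T2}) \wedge
\Big(
\underset{d_1 \in out(n_{T1}),\; d_2 \in in(n_{T2}) \cup out(n_{T2})}{{\bigvee}_3} match(d_1, d_2)
\;\vee_3\;
\underset{d_1 \in in(n_{T1}),\; d_2 \in out(n_{T2})}{{\bigvee}_3} match(d_1, d_2)
\Big)
$$
Here ${\bigvee}_3$ folds $\vee_3$ over every listed pair of dependences.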
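The definitions of Figure 3.2 can be sketched in C as follows. This is a minimal illustration, not the implementation behind the figure: the `Dep` and `Task` structures, the `SubKind` encoding of subscripts, the `no_alias` flag, and the helper `any_match` are all assumptions introduced here. The three answers are ordered NO < UNK < YES so that the three-valued disjunction reduces to a maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Three-valued answer; the order NO < UNK < YES makes or3 a maximum. */
typedef enum { NO = 0, UNK = 1, YES = 2 } Tri;

/* Hypothetical encoding of a dependence: a base variable plus an optional
 * subscript that is either a compile-time constant or an opaque expression. */
typedef enum { SUB_NONE, SUB_CONST, SUB_EXPR } SubKind;
typedef struct {
    const char *var;  /* base variable name */
    SubKind sub;      /* kind of subscript, if any */
    long k;           /* constant subscript value when sub == SUB_CONST */
    bool no_alias;    /* true for arrays and restrict pointers */
} Dep;

#define MAX_DEPS 8
typedef struct {
    int parent;                        /* id of the enclosing task region */
    Dep in[MAX_DEPS];  size_t n_in;    /* in dependences */
    Dep out[MAX_DEPS]; size_t n_out;   /* out dependences (inout stored here) */
} Task;

/* a or3 b: YES if either is YES, NO if both are NO, UNK otherwise. */
Tri or3(Tri a, Tri b) { return a > b ? a : b; }

Tri match(const Dep *d1, const Dep *d2)
{
    bool same_var = strcmp(d1->var, d2->var) == 0;
    /* YES: the same plain variable, or the same variable with equal
     * constant subscripts. */
    if (same_var && d1->sub == SUB_NONE && d2->sub == SUB_NONE)
        return YES;
    if (same_var && d1->sub == SUB_CONST && d2->sub == SUB_CONST &&
        d1->k == d2->k)
        return YES;
    /* NO: subscripted accesses to distinct arrays or restrict pointers. */
    if (!same_var && d1->sub != SUB_NONE && d2->sub != SUB_NONE &&
        d1->no_alias && d2->no_alias)
        return NO;
    return UNK;
}

Tri siblings(const Task *t1, const Task *t2)
{
    return t1->parent == t2->parent ? YES : NO;
}

/* Folds or3 over every pair (a[i], b[j]). */
static Tri any_match(const Dep *a, size_t na, const Dep *b, size_t nb)
{
    assert(na <= MAX_DEPS && nb <= MAX_DEPS);
    Tri r = NO;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            r = or3(r, match(&a[i], &b[j]));
    return r;
}

/* Does t2 synchronize the earlier task t1? */
Tri synchronizes(const Task *t1, const Task *t2)
{
    if (siblings(t1, t2) == NO)
        return NO;
    Tri raw = any_match(t1->out, t1->n_out, t2->in, t2->n_in);
    Tri waw = any_match(t1->out, t1->n_out, t2->out, t2->n_out);
    Tri war = any_match(t1->in, t1->n_in, t2->out, t2->n_out);
    return or3(raw, or3(waw, war));
}
```

For a task with in dependences on `A[expr]` and `B[expr]` and an inout dependence on `C[expr]`, comparing the task against another instance of itself yields UNK under this encoding, because the two subscripts of `C` cannot be compared statically.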
Synchronization edges have a kind $k$ that may take one of the following values:
– strict: a task node $n_{T1}$ certainly synchronizes in a node $n$ because either:
* $n = n_{TW}$ and both are in the same binding region.
* $n = n_B$ and $n$ is a region that encloses, or is the same region as, the binding region of $n_{T1}$.
* $n = n_{T2}$ and $synchronizes(n_{T1}, n_{T2}) = YES$.
– maybe: a task node $n_{T1}$ cannot be statically decided to synchronize with $n_{T2}$ (i.e., $synchronizes(n_{T1}, n_{T2}) = UNK$).
– post: the synchronization may occur any time after the function ends.
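The classification above can be sketched as a small C function. The enumerations, the function name `edge_kind`, and the `region_ok` flag (assumed to be computed by the caller from the binding-region conditions) are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NO, UNK, YES } Tri;
typedef enum { NODE_TASK, NODE_TASKWAIT, NODE_BARRIER, NODE_POST_SYNC } NodeType;
typedef enum { EDGE_NONE, EDGE_STRICT, EDGE_MAYBE, EDGE_POST } EdgeKind;

/* Kind of the synchronization edge from a task node to a target node.
 * region_ok: the binding-region condition for taskwait/barrier holds
 *            (assumed to be precomputed by the caller).
 * sync:      result of synchronizes(task, target); used for task targets. */
EdgeKind edge_kind(NodeType target, bool region_ok, Tri sync)
{
    assert(target >= NODE_TASK && target <= NODE_POST_SYNC);
    switch (target) {
    case NODE_TASKWAIT:
    case NODE_BARRIER:
        return region_ok ? EDGE_STRICT : EDGE_NONE;
    case NODE_TASK:
        if (sync == YES) return EDGE_STRICT;
        if (sync == UNK) return EDGE_MAYBE;
        return EDGE_NONE;
    case NODE_POST_SYNC:
        return EDGE_POST;
    }
    return EDGE_NONE;
}
```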
Synchronization edges are computed using a forward data-flow algorithm that defines the tasks live at the entry point, $LI_{Task} \subseteq N$, and at the exit point, $LO_{Task} \subseteq N$, of each node in a PCFG.
A task node $n_T \in LI_{Task}(n)$ if:
$$
n_T \in ancestor(n) \;\wedge\; \nexists\, n' \in predecessor(n) : e(n_T, n', strict) \in ES(n')
$$
A task node $n_T \in LO_{Task}(n)$ if:
$$
synchronizes(n_T, n) \in \{NO, UNK\} \;\vee\; \text{all matched dependences in } n \text{ are inputs} \;\vee\; n_T \text{ has unmatched dependences}
$$
Additionally, when computing the $LO_{Task}$ set, those tasks that remain alive because all of the target's matched dependences are inputs are singled out. These tasks may be the source dependence of several target tasks with input dependences on the same variables, and they definitely synchronize when a taskwait or barrier is reached.
Theorem 1. $TSDFA = \langle L, T_f \rangle$ is the bounded monotone forward Tasks Synchronization Data-Flow Algorithm that computes the task synchronizations over a graph $G$, and consists of:
– $L = \langle S \times R, \sqcap \rangle$ is the meet-semilattice [8] that imposes a partial order over all possible data-flow values in the algorithm, where:
* $S \subseteq \{n_T \in N\}$ is a subset of all task nodes, with two special elements: $\top$, the lattice top element equivalent to the empty set, and $\bot$, the lattice bottom element equivalent to $S$.
* $R = N \times KIND$, where $KIND = \{strict, maybe\}$, is the set of kind relationships of two synchronized nodes.
* $\sqcap = (\cup, \emptyset)$ is the meet operator that merges flow values and imposes an order over the lattice by using just the first element in the pair representing each data-flow value. The meet operator is used to compute the live tasks at the entry of a node $n \in N$ as follows:
$$
LI_{Task}(n) = \bigsqcap_{p \in pred(n)} \langle LO_{Task}(p), \emptyset \rangle
$$
The meet operator is monotone. Given the elements $x_1$, $x_2$, $y_1$ and $y_2$, it fulfills:
$$
x_1 \sqsubseteq x_2 \wedge y_1 \sqsubseteq y_2 \Rightarrow x_1 \sqcap y_1 \sqsubseteq x_2 \sqcap y_2
$$
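– $T_f = \{f : S \times R \to S \times R\}$ is the family of transfer functions that maps the program behavior onto $\sqcap$, computing $LO_{Task}(n)$ for each $n \in N$ as follows:
$$
f_{n_T} = \Big\langle \{n'_T \mid n'_T \in LI_{Task}(n_T) \wedge (\neg siblings(n_T, n'_T) \vee synchronizes(n'_T, n_T) \neq YES)\},
$$
$$
\{(n'_T, strict) \mid n'_T \in LI_{Task}(n_T) \wedge siblings(n_T, n'_T) \wedge synchronizes(n'_T, n_T) = YES\}
\;\cup\; \{(n'_T, maybe) \mid n'_T \in LI_{Task}(n_T) \wedge siblings(n_T, n'_T) \wedge synchronizes(n'_T, n_T) = UNK\} \Big\rangle \quad /*task*/
$$
$$
f_{n_{TW}} = \Big\langle \{n'_T \mid n'_T \in LI_{Task}(n_{TW}) \wedge \neg siblings(n_{TW}, n'_T)\},\; \{(n'_T, strict) \mid n'_T \in LI_{Task}(n_{TW}) \wedge siblings(n_{TW}, n'_T)\} \Big\rangle \quad /*taskwait*/
$$
$$
f_{n_B} = \Big\langle \emptyset,\; \{(n'_T, strict) \mid n'_T \in LI_{Task}(n_B)\} \Big\rangle \quad /*barrier*/
$$
$$
f_n = \Big\langle \{n'_T \mid n'_T \in LI_{Task}(n)\},\; \emptyset \Big\rangle \quad /*any\ other\ node*/
$$
All transfer functions are monotonic. Given the elements $x$ and $y$, they fulfill:
$$
x \sqsubseteq y \Rightarrow f(x) \sqsubseteq f(y)
$$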
Each transfer function computes the pair $\langle LO_{Task}(m), SynchronizedTask(m) \rangle$ of a given node $m$. The first element is the set of tasks that are still live after the execution of $m$. The second element is the set $S \times R$ of tasks synchronized in $m$. For example, for a task node $n_T$, the transfer function $f_{n_T}$ returns a pair where: a) the first element contains those tasks $n'_T$ in the set $LI_{Task}(n_T)$ that either are not siblings of $n_T$ or are not certainly synchronized in $n_T$ ($synchronizes(n'_T, n_T) \neq YES$), and b) the second element contains those tasks in $LI_{Task}(n_T)$ that are siblings of $n_T$ and are certainly synchronized in $n_T$ ($synchronizes(n'_T, n_T) = YES$, tagged strict), together with the siblings whose synchronization cannot be decided ($synchronizes(n'_T, n_T) = UNK$, tagged maybe).
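The four transfer functions can be sketched in C with task sets stored as bit masks. This is an illustrative sketch under several assumptions: at most 32 tasks, a precomputed `sync` table holding the answers of synchronizes, a `parent` table identifying the enclosing task region of each task, and a `region` argument naming the task region that executes the taskwait. The type and function names (`TaskSet`, `Flow`, `TaskInfo`, `f_task`, `f_taskwait`, `f_barrier`, `f_other`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { NO, UNK, YES } Tri;

#define MAX_TASKS 32
typedef uint32_t TaskSet;        /* bit u set <=> task u belongs to the set */

typedef struct {
    TaskSet live_out;            /* tasks still live after the node */
    TaskSet strict;              /* tasks synchronized with kind strict */
    TaskSet maybe;               /* tasks synchronized with kind maybe */
} Flow;

typedef struct {
    int parent[MAX_TASKS];           /* enclosing task region of each task */
    Tri sync[MAX_TASKS][MAX_TASKS];  /* sync[u][t] = synchronizes(u, t) */
} TaskInfo;

static bool has(TaskSet s, int u) { return ((s >> u) & 1u) != 0; }
static TaskSet bit(int u) { return (TaskSet)1 << u; }

/* Task node t: siblings that certainly synchronize leave the live set. */
Flow f_task(const TaskInfo *ti, int t, TaskSet live_in)
{
    assert(t >= 0 && t < MAX_TASKS);
    Flow out = { 0, 0, 0 };
    for (int u = 0; u < MAX_TASKS; u++) {
        if (!has(live_in, u)) continue;
        bool sib = ti->parent[u] == ti->parent[t];
        Tri s = ti->sync[u][t];
        if (!sib || s != YES) out.live_out |= bit(u);
        if (sib && s == YES)  out.strict   |= bit(u);
        if (sib && s == UNK)  out.maybe    |= bit(u);
    }
    return out;
}

/* Taskwait executed in task region `region`: its child tasks synchronize. */
Flow f_taskwait(const TaskInfo *ti, int region, TaskSet live_in)
{
    Flow out = { 0, 0, 0 };
    for (int u = 0; u < MAX_TASKS; u++) {
        if (!has(live_in, u)) continue;
        if (ti->parent[u] == region) out.strict   |= bit(u);
        else                         out.live_out |= bit(u);
    }
    return out;
}

/* Barrier: every live task synchronizes. */
Flow f_barrier(TaskSet live_in) { return (Flow){ 0, live_in, 0 }; }

/* Any other node: live tasks pass through unchanged. */
Flow f_other(TaskSet live_in) { return (Flow){ live_in, 0, 0 }; }
```

Keeping the strict and maybe sets separate from `live_out` mirrors the pair returned by each transfer function: the first field drives the data-flow iteration, while the other two only record which synchronization edges to add.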
The semilattice $L$ is monotone and of finite height (the number of tasks in a program is finite; thus, the number of sets with different combinations of these tasks is finite). Because of that, the algorithm is guaranteed to converge.
Algorithm 1 shows the high-level iterative algorithm that computes the task synchronizations over a PCFG $G$. The algorithm initializes the root node of the graph with the lattice least upper bound, $\top$. Then, it performs forward traversals over $G$, computing the $LI_{Task}(n)$ and $LO_{Task}(n)$ sets of each node $n$, until no data-flow value changes. At this point there may still be live tasks at the exit node of $G$, which shall be synchronized with the virtual post-synchronization node, $n_{VPS}$, of the graph.
Algorithm 1: High-level algorithm for synchronizing tasks within a PCFG.
    $LI_{Task}(n_{EN}) = LO_{Task}^{new}(n_{EN}) = \emptyset$
    for each $n \in N$ do
        $LO_{Task}^{new}(n) = \emptyset$
    end for
    worklist = $\{p \mid p \in succ(n_{EN})\}$
    while !worklist.empty() do
        take a node $n$ from worklist
        worklist = worklist $\setminus \{n\}$
        $LI_{Task}(n) = \bigsqcap_{p \in pred(n)} LO_{Task}^{new}(p)$
        $LO_{Task}^{old}(n) = LO_{Task}^{new}(n)$
        $LO_{Task}^{new}(n) = f_n$
        if $LO_{Task}^{old}(n) \neq LO_{Task}^{new}(n)$ then
            worklist = worklist $\cup \{s \mid s \in succ(n)\}$
        end if
    end while
    for each $n_T \in LO_{Task}(n_{EX})$ do
        add edge $(n_T, n_{VPS}, post, NULL)$ to $G$
    end for
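The iteration of Algorithm 1 can be sketched in C as a standard worklist solver. The sketch makes simplifying assumptions that are not part of the algorithm as stated: each transfer function is abstracted as a gen/kill pair (`gen[n]` holds the tasks that become live at node `n`, `kill[n]` the tasks strictly synchronized there), the graph has at most 16 nodes and 32 tasks, and the worklist is seeded with every node so that each node is evaluated at least once. The names `Pcfg` and `solve` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 16
typedef uint32_t TaskSet;            /* bit u set <=> task u is live */

typedef struct {
    int n_nodes;
    bool edge[MAX_NODES][MAX_NODES]; /* edge[p][s]: control flow p -> s */
    TaskSet gen[MAX_NODES];          /* tasks that become live at the node */
    TaskSet kill[MAX_NODES];         /* tasks strictly synchronized there */
} Pcfg;

/* Iterates to a fixed point and returns the tasks still live at the exit
 * node, i.e. those that would get a post edge to the virtual
 * post-synchronization node. */
TaskSet solve(const Pcfg *g, int exit_node,
              TaskSet live_in[], TaskSet live_out[])
{
    assert(g->n_nodes > 0 && g->n_nodes <= MAX_NODES);
    assert(exit_node >= 0 && exit_node < g->n_nodes);

    int work[MAX_NODES];             /* worklist kept as a stack */
    bool queued[MAX_NODES];          /* avoids duplicate entries */
    int top = 0;

    for (int n = 0; n < g->n_nodes; n++) {
        live_in[n] = live_out[n] = 0;   /* empty set */
        queued[n] = true;
        work[top++] = n;
    }

    while (top > 0) {
        int n = work[--top];
        queued[n] = false;

        /* Meet over predecessors: union of their live-out sets. */
        TaskSet in = 0;
        for (int p = 0; p < g->n_nodes; p++)
            if (g->edge[p][n])
                in |= live_out[p];
        live_in[n] = in;

        /* Transfer function in gen/kill form. */
        TaskSet old = live_out[n];
        live_out[n] = (in & ~g->kill[n]) | g->gen[n];

        /* Revisit the successors only if the value changed. */
        if (live_out[n] != old)
            for (int s = 0; s < g->n_nodes; s++)
                if (g->edge[n][s] && !queued[s]) {
                    queued[s] = true;
                    work[top++] = s;
                }
    }
    return live_out[exit_node];
}
```

Termination follows from the same argument as in the text: live-out sets only grow along the iteration and the number of tasks is finite, so each node can change a bounded number of times.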
As an illustration, Figure 3.3 shows a simplified version of the PCFG resulting from the code in Listing 3.1, a blocked matrix multiplication using OpenMP tasks. The information related to the tasks is drawn in red (task and task creation nodes, and synchronization edges with their corresponding labels). Note the synchronization edge from the task to itself, tagged as Maybe because the inout dependence on C[i:BS][j:BS] cannot be statically decided at this point, as its value may vary between task instances (A and B are not considered to compute this edge because both are input dependences). Furthermore, the task escapes its scope because there is no synchronization, so it is connected to the virtual post-synchronization node.
    void matmul_depend(int N, int BS, float A[N][N], float B[N][N], float C[N][N])
    {
        /* Declared before the pragma so that the private clause can name them. */
        int i, j, k, ii, jj, kk;
        for (i = 0; i < N; i += BS)
            for (j = 0; j < N; j += BS)
                for (k = 0; k < N; k += BS)
    #pragma omp task private(ii, jj, kk) \
            depend(in: A[i:BS][k:BS], B[k:BS][j:BS]) \
            depend(inout: C[i:BS][j:BS])
                    for (ii = i; ii < i + BS; ii++)
                        for (jj = j; jj < j + BS; jj++)
                            for (kk = k; kk < k + BS; kk++)
                                C[ii][jj] = C[ii][jj] + A[ii][kk] * B[kk][jj];
    }
Listing 3.1: Matrix multiplication using OpenMP tasks (Example task_dep.5.c from the specification examples [112])
[Figure 3.3 graph: a FunctionCode node enclosing three nested LoopFor nodes; a TASK_CREATION node with a Create edge to the OmpTask node; the OmpTask node with a Maybe synchronization edge to itself and a Post edge to the POST_SYNC node.]
Figure 3.3: PCFG for code in Listing 3.1.
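A minimal sketch, assuming the caller needs every task finished before the function returns: the same code with a taskwait after the loop nest. Since a taskwait node synchronizes the previous child tasks of the current task, the Post edge to the virtual post-synchronization node would presumably be replaced by a strict edge to the taskwait node. The function name `matmul_depend_sync` is hypothetical and is not part of the OpenMP examples.

```c
#include <assert.h>

/* Hypothetical variant of Listing 3.1 with an explicit taskwait.
 * Assumes BS divides N. Without OpenMP support the pragmas are ignored
 * and the loops simply run sequentially. */
void matmul_depend_sync(int N, int BS,
                        float A[N][N], float B[N][N], float C[N][N])
{
    assert(BS > 0 && N % BS == 0);
    int i, j, k, ii, jj, kk;
    for (i = 0; i < N; i += BS)
        for (j = 0; j < N; j += BS)
            for (k = 0; k < N; k += BS)
#pragma omp task private(ii, jj, kk) \
        depend(in: A[i:BS][k:BS], B[k:BS][j:BS]) \
        depend(inout: C[i:BS][j:BS])
                for (ii = i; ii < i + BS; ii++)
                    for (jj = j; jj < j + BS; jj++)
                        for (kk = k; kk < k + BS; kk++)
                            C[ii][jj] = C[ii][jj] + A[ii][kk] * B[kk][jj];

    /* Wait for every child task created above before returning. */
#pragma omp taskwait
}
```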