Solving parallel problems by OTMP model


F. Piccoli    M. Printista

J.A. Gonzalez    C. Leon    J.L. Roda    C. Rodriguez    F. Sande

Universidad Nacional de San Luis. Centro Superior de Informatica.

Ejercito de los Andes 950, San Luis, Argentina. Universidad de La Laguna, Tenerife, Spain.

e-mail: mpiccoli@unsl.edu.ar

Abstract

Since the early stages of parallel computing, one of the most common solutions to introduce parallelism has been to extend a sequential language with some sort of parallel version of the for construct, commonly denoted as the forall construct. Although similar in syntax, these forall loops differ in their semantics and implementations. The High Performance Fortran (HPF) and OpenMP versions are likely among the most popular. This paper presents yet another forall loop extension for the C language.

In this work, we introduce a parallel computation model: the One Thread Multiple Processor Model (OTMP). This model proposes an abstract machine, a programming model and a cost model. The programming model defines another forall loop construct, the theoretical machine aims at both homogeneous shared and distributed memory computers, and the cost model allows the prediction of the performance of a program. OTMP not only integrates and extends sequential programming, but also includes and expands the message passing programming model. The model allows and exploits any nested levels of parallelism, taking advantage of situations where there are several small nested loops.

Keywords: Computation Model, Abstract Machine, Programming Model, Cost Model.

1 Introduction

When parallel computing is compared with sequential computing, the current situation is different. No single model of parallel computation has yet come to dominate developments in parallel computing in the way that the von Neumann model has dominated sequential computing. The aims of a parallel computing model are to describe classes of architectures in simple and realistic terms and to propose a design methodology for parallel algorithms. It provides an abstract view of both the technologies and the applications. An abstract model defines how algorithms are designed and analyzed in the abstract model and coded in the programming model.

The design of parallel programs requires solutions that are not present in sequential programming. Thus, a designer of parallel applications is concerned with the problem of ensuring the correct behavior of all the processes that the program comprises.

Generally, a programming model describes a programmer's view of the parallel system. It has a different meaning in other areas of computer science, but in particular, it often stands for the complete semantics of a language. A good parallel programming model has to satisfy the following properties: programmability, reflectivity, cost model, architecture independence and flexibility [8].

In this paper, we present the OTMP computation model, its abstract machine, and its differences with the current forall versions. The next section presents the theoretical machine, the programming model (syntax and semantics), and the associated complexity model. Section three deals with load balancing issues and the mapping and scheduling policies used. Finally, section four shows several examples and computational results.

2 OTMP Model

One of the most common solutions to introduce parallelism has been to extend a sequential language with a small set of parallel constructors. Such constructors are a version of the for construct, commonly denoted as the forall construct. This model proposes another forall loop extension for the C language. These extensions are aimed at both homogeneous distributed memory and shared memory architectures.

This model has several differences with other models [6][11]. The most important are:

- The parallel programming model introduced has an associated complexity model that allows the analysis and prediction of performance.

- It allows and exploits any nested levels of parallelism, taking advantage of situations where there are several small nested loops: although each loop does not produce enough work to parallelize, their union suffices. Recursive divide and conquer algorithms are the most paradigmatic example.

- It not only integrates and extends the sequential programming model, but also includes and expands the message passing programming model.

2.1 Abstract Machine

The OTMP theoretical computer is a machine composed of an infinite number of processors, each one with its own private memory and a network interface connecting them. The processors are organized in groups. At any time, the memory state of all the processors in the same group is identical. An OTMP computation assumes that they also have the same input data and the same program in memory. The initial group is composed of all the processors in the machine. Each processor is a RAM machine [1]; the only difference among them is an internal register, the NAME of the processor. This register is not available to the programmer. According to this definition, every real parallel machine is an OTMP machine.

2.2 Syntax and Semantics

The programming model being introduced extends the classic sequential imperative paradigm with a new construct: parallel loops. The model aims at both distributed and shared memory architectures. The implementation on the latter can be considered the "easy part" of the task.

With the parallel loop

forall(i = first; i <= last; (r[i], s[i]))
  compound_statement_i

the programmer states that the different iterations i of the loop can be performed independently in parallel. (r[i], s[i]) is the memory area of the results of the i-th iteration, where r[i] points to the first position and s[i] is the size in bytes.
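For instance, a minimal hypothetical use of the construct (the names a, b, c and n are ours), where iteration i owns the n doubles of row c[i] of a matrix sum:

    /* Hypothetical example: iteration i owns row c[i], n doubles wide. */
    forall(i = 0; i <= n-1; (c[i], n * sizeof(double)))
    {
        for (j = 0; j < n; j++)
            c[i][j] = a[i][j] + b[i][j];
    }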

When reaching the former forall loop, each processor decides, in terms of its NAME, the iteration it has to execute. The equations ruling the division process for any processor NAME are (with % and / denoting integer modulus and division, as in C):

Number of iterations to do:   M = last − first + 1                         (1)

Iteration to do:              i = first + NAME % M                        (2)

Neighbors:                    for(j = 1; j < M; j++)
                                neighbor[j] = δ + (NAME + j) % M          (3)
                              where δ is given by: δ = M · (NAME / M)

New value of register:        NAME = NAME / M                             (4)

These equations decide the mapping of processors to tasks, and how the exchange of results will take place. After the forall has finished, the processors recover their former value of NAME.
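As an illustration only, the following standalone C sketch (not part of the OTMP runtime; the function split and its output format are ours) applies formulas (1)-(4) for every processor of a group of six executing forall(i=1; i<=3; ...):

    #include <stdio.h>

    /* Applies the division formulas (1)-(4) for one processor NAME. */
    static void split(unsigned name, unsigned first, unsigned last) {
        unsigned M = last - first + 1;            /* (1) number of iterations  */
        unsigned i = first + name % M;            /* (2) iteration to execute  */
        unsigned delta = M * (name / M);          /* base of this M-tuple      */
        unsigned j;

        printf("P%u runs iteration %u; neighbors:", name, i);
        for (j = 1; j < M; j++)                   /* (3) result-exchange peers */
            printf(" P%u", delta + (name + j) % M);
        printf("; NAME becomes %u\n", name / M);  /* (4) register in subgroup  */
    }

    int main(void) {
        unsigned name;
        for (name = 0; name < 6; name++)  /* six processors, forall(i=1;i<=3) */
            split(name, 1, 3);
        return 0;
    }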

Each time a forall loop is executed, the memory of the group up to that point contains exactly the same values. At such a point the memory is divided in two parts: the one that is going to be modified and the one that is not changed inside the loop. Variables in the second set are available inside the loop for reading. The others are partitioned among the new groups.

The last parameter in the forall has the purpose of informing about the new "ownership" of the part of the memory that is going to be modified. It announces that the group performing thread i "owns" (and presumably is going to modify) the memory areas delimited by (r[i], s[i]). r[i]+j is a pointer to the memory area containing the j-th result of the i-th thread.

To guarantee that, after returning to the previous group, the processors in the father group have a consistent view of the memory and that they will behave as the same thread, it is necessary to provide the exchange among neighbors of the variables that were modified inside the forall loop. Let us denote by T_i the execution of the body of the i-th thread (compound_statement_i). The semantics imposes two restrictions:

1. Given two different independent threads T_i and T_k and two different result items r[i] and r[k], it holds:

   [r[i], s[i]] ∩ [r[k], s[k]] = ∅   for all i ≠ k

2. For any thread T_i and any result j, all the memory space defined by [r[i]+j, s[i]] has to be allocated previously to the execution of the thread body. This makes impossible the use of non-contiguous dynamic memory structures.
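Restriction 1 can be tested mechanically; a minimal hypothetical sketch of such a check (the names results_disjoint and nthreads are ours):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical check of restriction 1: the byte intervals
       [r[i], r[i]+s[i]) owned by different threads must not overlap. */
    static int results_disjoint(char *r[], size_t s[], int nthreads) {
        int i, k;
        for (i = 0; i < nthreads; i++)
            for (k = i + 1; k < nthreads; k++) {
                uintptr_t ai = (uintptr_t)r[i], ak = (uintptr_t)r[k];
                if (ai < ak + s[k] && ak < ai + s[i])
                    return 0;  /* intervals intersect: the forall is illegal */
            }
        return 1;
    }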

The programmer has to be specially conscious of the first restriction: it is mandatory that the address of any memory cell written during the execution of T_i belongs to one of the intervals in the list of results for the thread. As an example, consider the code in figure 1.

1  forall(i=1; i<=3; (ri[i], si[i]))
2  { ...
3    forall(j=0; j<=i; (rj[j], sj[j])) {
4      int a, b;
5      ...
6      if (i % 2 == 1)
7        if (j % 2 == 0)
8          send(j+1, &a, sizeof(int));
9        else receive(j-1, &b, sizeof(int));
10     ...
11   }
12   ...
13 }

Figure 1: Two nested foralls

Assume a machine with 24 processors, all holding the same values in their local memory. Applying equations 1, 2, 3 and 4, the parallel loop in line 1 of figure 1 divides the group in three. After the execution of the loop, and to keep the coherence of the memory, each processor exchanges the corresponding results with its two neighbors. Thus, processor 0 in the group executes iteration i=1 and sends its memory area (ri[1], si[1]) to processors 1 and 2. Furthermore, it receives (ri[2], si[2]) from processor 1 and (ri[3], si[3]) from processor 2. The same exchange is repeated among the other corresponding triplets ((3,4,5), (6,7,8), ...). Every new nested forall structures the current subgroup as an M-ary hypercube, where M is the number of iterations in the parallel loop and the neighborhood relation is given by formula 3. Thus, this first forall produces "the face" of a ternary hypercubic dimension, where every corner has two neighbors. The second nested forall at line 3 requests a different number of threads in the different groups. For example, the last group (i=3) executes a forall of size 4, and consequently the group is partitioned in 4 subgroups. In this 4-ary dimension, each processor is connected with 3 neighbors in the other subgroups. Therefore, at the end of the nested compound statement, processor 17 will send its data, rj[1], to processors 14, 20 and 23, and will receive from them their data rj[0], rj[2] and rj[3].

Figure 2: The mapping associated with the two nested foralls in figure 1

2.3 Other Clauses

The OTMP model admits several coherent extensions; in this section we present two: reduction clauses and global communication.

The reduction clause has syntax and semantics similar to the above forall; its simplified syntax is

forallR(i = first; i <= last; (r[i], s[i]); fr[i]; (rr[i], sr[i]))
{
  compound_statement_i;
  fr;
}

The forallR works the same way as forall: it divides the processors, executes compound_statement_i in parallel over its own data and exchanges the results, r. When all is done, each processor reduces the results through fr. The result of fr is rr and it has size sr. fr can be any function, but it is mandatory that it be commutative and associative.
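As a hypothetical usage sketch only (the names part, acc, addf, dot, chunk, x, y and ntasks are ours), a parallel dot product could be folded with forallR:

    /* Hypothetical sketch: each task computes a partial dot product; the
       commutative and associative function addf reduces them into acc. */
    forallR(i = 0; i <= ntasks-1; (&part[i], sizeof(double)); addf;
            (&acc, sizeof(double)))
    {
        part[i] = dot(x + i*chunk, y + i*chunk, chunk);
        addf;
    }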

The other OTMP clauses are the global communication clauses: result and resultP. result communicates the same data to every processor in the other groups, while resultP also communicates data to every processor in the other groups, but the data are different and depend on the receiving processor's group. The corresponding syntax is

result(i, data[i], s[i])
resultP(i, data_to[i], s_to[i], data_from[i], s_from[i])

where i identifies the task T_i. In the first clause, data[i] points to the data of T_i and s[i] is its size. In the second clause, data_to[i] is a pointer to the memory area containing the data to send to T_i, and s_to[i] contains its size; data_from[i] and s_from[i] have the same function, but for the data received from T_i. These clauses are similar to the gather and scatter functions in standard libraries such as MPI [10], but one of the differences is the neighborhood relation: the exchange takes place every time that an exchange of data is necessary.
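For instance, a hypothetical sketch (the names sample, SBYTES and collect_local_samples are ours), where each task publishes its slice gather-style, mirroring the use of result in figure 11:

    /* Hypothetical sketch: each task fills its slice of sample and then
       publishes it to the processors of the other groups. */
    forall(i = 0; i <= ntasks-1; (sample + i*SBYTES, SBYTES))
    {
        collect_local_samples(sample + i*SBYTES);
        result(i, sample + i*SBYTES, SBYTES); /* all groups get all samples */
    }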

2.4 Cost Model

In this section, we analyze the cost of the forall loop. A similar analysis applies to the other clauses.

The complexity T(P) of any OTMP program P can be computed, in what refers to ordinary sequential constructs (while, for, if, ...), as in the RAM machine [1]. The cost of the forall loop is given by the recursive formula:

T(forall) = A_0 + A_1·M + max_{i=e_1..e_2} T(T_i) + Σ_{i=e_1..e_2} (g·N_i) + L        (5)

where T_i is the code of compound_statement_i, M is the size of the loop and N_i is the size of the messages transferred in iteration i:

M = e_2 − e_1 + 1,        N_i = Σ_{k=1..m} s_{ki}

A_0 is the constant time invested computing formulas 1, 2 and 4. A_1 is the time spent finding the neighbors (formula 3). The constant g is the inverse of the bandwidth, and L is the startup latency. Assume the rather common case in which the volume of communication of each iteration is roughly the same:

N ≈ N_i ≈ N_j    for all i ≠ j,    i, j = e_1, ..., e_2

From formula 5, and from the fact that in current machines the scheduling and mapping time A_0 + A_1·M is dominated by the communication time Σ_{i=e_1..e_2} (g·N_i + L), it follows the result that establishes when an independent for loop is worth converting into a forall:

T(forall) ≤ T(for)  ⟺  M·(g·N) + L + max_{i=e_1..e_2} T(T_i) ≤ Σ_{i=e_1..e_2} T(T_i)        (6)
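Inequality (6) is directly computable; a minimal hypothetical C sketch (the function name and parameters are ours):

    /* Hypothetical evaluation of inequality (6): convert the loop into a
       forall only if the communication M*g*N + L plus the critical path
       max(t) does not exceed the sequential work sum(t). */
    static int worth_forall(unsigned M, double g, double N, double L,
                            const double t[]) {   /* t[i] = T(T_i) */
        double sum = 0.0, max = 0.0;
        unsigned i;
        for (i = 0; i < M; i++) {
            sum += t[i];
            if (t[i] > max) max = t[i];
        }
        return M * (g * N) + L + max <= sum;
    }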

A remarkable fact is that the OTMP machine not only generalizes the sequential RAM but also the Message Passing Programming Model. Lines 6-8 in figure 1 illustrate the idea. Each new forall "creates" a communicator (in the MPI sense). The execution of lines 7 and 8 in the fourth group (i=3) implies that thread j=0 sends a to thread j=1. This operation carries that, at the same time that processor 2 sends its replica of a to processor 5, processor 14 sends its copy to processor 17, and so on. To summarize: any send or receive is executed by the infinite couples involved. Still, the two aforementioned constraints have to hold: any variable modified inside the loop has to belong to the memory announced in the list of results.

3 Load Balancing

Unfortunately, an infinite machine, such as that described in the previous section, is only an idealization. Real machines have a restricted number of processors. Each processor has an internal register NUMPROCESSORS which stores the number of available processors in its current group. Three different situations have to be considered in the execution of a forall construct with M = e_2 − e_1 + 1 threads:

- NUMPROCESSORS is equal to 1;
- M is larger than or equal to NUMPROCESSORS;
- NUMPROCESSORS is larger than M.

The first case is trivial: there is only one processor, which executes all the threads sequentially, and there is no opportunity to exploit any intrinsic parallelism.

The second case has been extensively studied as flat parallelism. The main problem that arises is the load balancing problem. To deal with it, several scheduling policies have been proposed [11]. Many assignment policies are also possible: block, cyclic-block, guided or dynamic.

The third case was studied in the previous section. But the fact that the number of available processors is larger than the number of threads introduces several additional problems: the first is load balancing; the second is that the groups are no longer divided into subgroups of the same size.

If a measure of the work w_i per thread T_i is available, the processor distribution policy seen in the previous section can be modified to guarantee an optimal mapping [3]. The syntax of the forall is revisited to include this feature:

forall(i = first; i <= last; w[i]; (r[i], s[i]))
  compound_statement_i

If there is no weight specification, the same workload is assumed for every task. The semantics is similar to that proposed in [2]. Therefore, the mapping is computed according to a policy similar to that sketched in [3]. Figure 3 shows the applied mapping algorithm. There is, however, the additional problem of establishing the neighborhood relation. This time the simple "one to (M−1)" hypercubic relation of the former section does not hold. Instead, the hypercubic shape is distorted to a polytope holding the property that each processor in every group has one and only one incoming neighbor in any of the other groups.

for(i = 0; i < M; i++)
  p_i = 1;
for(j = M+1; j < NUMPROCESSORS; j++)
{
  Get p_k such that w_k/p_k = max_{i=1,...,M} w_i/p_i;
  p_k = p_k + 1;
}

Figure 3: Optimal Mapping Algorithm

This algorithm satisfies the postulate of optimal mapping.
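A runnable C sketch of the same greedy policy (the names map_processors, p and nprocs are ours; we read the loop of figure 3 as handing out the nprocs − M remaining processors, so the second loop below starts at j = M):

    /* Hypothetical rendering of figure 3: distribute nprocs processors
       among M weighted tasks; p[k] counts the processors of task k. */
    static void map_processors(const double w[], int p[], int M, int nprocs) {
        int i, j, k;
        for (i = 0; i < M; i++)
            p[i] = 1;                  /* every task gets one processor */
        for (j = M; j < nprocs; j++) { /* hand out the remaining ones   */
            k = 0;
            for (i = 1; i < M; i++)
                if (w[i] / p[i] > w[k] / p[k])
                    k = i;             /* task with largest w_k/p_k     */
            p[k]++;
        }
    }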

4 Examples

Several examples have been chosen to illustrate the use of the OTMP model: Matrix Multiplication, Fast Fourier Transform, QuickSort and Parallel Sorting by Regular Sampling. The results are presented for several machines, using the current software system, which consists of a C prototype.

4.1 Matrix Multiplication

The problem to solve is to compute tasks matrix multiplications (C_i = A_i × B_i, i = 0, ..., tasks−1). Matrices A_i and B_i have dimensions m×q_i and q_i×m respectively. Therefore, the product A_i × B_i takes a number of operations w[i] proportional to m²·q_i. Figure 4 shows the algorithm.

Variables A, B and C are arrays of pointers to the matrices. The loop in line 1 deals with the different matrices, the loops in lines 5 and 7 traverse the rows and columns and, finally, the innermost loop in line 8 produces the dot product of the current row and column. Although all the for loops are candidates to be converted to forall loops, we will focus on two cases: the parallelization of only the loop in line 5 (labeled FLAT in the results, figure 5) and the one shown in figure 4 where, additionally, the loop at line 1 is also converted to a forall (labeled NESTED in figure 5).

1  forall(i = 0; i < tasks; w[i]; (C[i], m * m))
2  {
3    q = ...;
4    Ci = C+i; Ai = A+i; Bi = B+i;
5    forall(h = 0; h < m; (Ci[h], m))
6    {
7      for(j = 0; j < m; j++)
8        for(r = &Ci[h][j], *r = 0.0, k = 0; k < q; k++)
9          *r += Ai[h][k] * Bi[k][j];
10   }
11 }

Figure 4: Exploiting 2 levels of parallelism
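The weight vector fed to line 1 can be derived directly from the complexity estimate above; a hypothetical sketch (we assume an array q[i] holding the inner dimension of task i, which figure 4 leaves as "q = ...;"):

    /* Hypothetical setup: the weight of task i is proportional to the
       m*m*q_i operations of the product A_i x B_i. */
    for (i = 0; i < tasks; i++)
        w[i] = m * m * q[i];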

This example illustrates one of the common situations where one can take advantage of nested parallelism: when neither the inner loop (lines 5-10) nor the external loop (line 1) has enough work to reach a satisfactory speedup, but the combination of both does. We will denote by SP_R(A) the speedup of an algorithm A with R processors and by T_P(A) the time spent executing algorithm A on P processors. Let us also simplify to the case when all the inner tasks take the same time, i.e. q_i = m. Under these assumptions, the previous statement can be rewritten more precisely:

Lemma 1. Let tasks < P, SP_P(inner) < SP_{P/tasks}(inner) and tasks·(g·m²) + L ≤ T_{P/tasks}(inner). Then

SP_P(FLAT) < SP_P(NESTED)

Figure 5 shows this behavior for this example. Both matrices have dimensions 45×45. The number of tasks is 8, and when the number of processors is 2 and 4 there is processor virtualization. In all cases and on the different architectures, the nested parallelism works well. Moreover, when the size of the problems to solve is large, the speedup grows to be quasi-linear. We checked that the speedup reached is linear, as figure 6 shows. In the next examples we will concentrate on the more interesting case where the nested loops are small and, according to the complexity analysis, there are few opportunities for parallelization.

4.2 Fast Fourier Transform

The Discrete Fourier Transform is widely used in solving problems in science and engineering. It is defined as

A_j = Σ_{k=0..N−1} a_k · W^{jk},    j = 0, ..., N−1,    where W = e^{2πi/N}

Figure 5: Nested versus flat parallelism. (a) Cray T3E: time (secs.) vs. number of processors (matrix N=46, tasks=4). (b) Origin 2000: time vs. NP, FLAT and NESTED curves.

Figure 6: Speedup reached for large size problems (matrix N=500, tasks=4, Cray T3E)

The OTMP code in figure 7 implements the Fast Fourier Transform (FFT) algorithm developed by Cooley and Tukey [4]. Parameter W contains the powers of the N-th root of unity, W_k = e^{2πik/N}. Figure 8 shows the computational results for such an implementation on an SGI Origin 2000 and a Cray T3E. The speedup curve behaves according to what a complexity analysis following formula 5 predicts. Moreover, when the size of the problem is small, the communications influence the performance.
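For reference, the twiddle-factor table could be filled as follows (a hypothetical sketch; init_W is our name, and the Complex layout matches the re/im fields used in figure 7):

    #include <math.h>

    typedef struct { double re, im; } Complex;

    /* Hypothetical initialization of W: W[k] = e^{2*pi*i*k/N},
       the powers of the N-th root of unity used by llFFT. */
    static void init_W(Complex *W, unsigned N) {
        const double pi = acos(-1.0);
        unsigned k;
        for (k = 0; k < N; k++) {
            W[k].re = cos(2.0 * pi * k / N);
            W[k].im = sin(2.0 * pi * k / N);
        }
    }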

4.3 QuickSort

The well-known quicksort algorithm [5] is a divide-and-conquer sorting method. As such, it is amenable to a nested parallel implementation. This example is specially interesting, since the sizes of the generated sub-vectors are used as weights for the forall; see figure 9, line 14. Remember that, depending on the goodness of the pivot chosen, the new subproblems may have rather different weights.

Figure 10 presents the speed-ups on a Digital Alpha Server, an Origin 2000, an IBM SP2, a Cray T3E and a Cray T3D.

void llFFT(Complex *A, Complex *a, Complex *W,
           unsigned N, unsigned stride, Complex *D) {
  Complex *B, *C, Aux, *pW;
  unsigned i, n;

  if (N == 1) {
    A[0].re = a[0].re;
    A[0].im = a[0].im;
  }
  else {
    n = (N >> 1);
    forall(i = 0; i <= 1; (D+i*n, n))
    {
      llFFT(D+i*n, a+i*stride, W, n, stride<<1, A+i*n);
    }
    B = D;        /* Combination phase */
    C = D + n;
    for (i = 0, pW = W; i < n; i++, pW += stride) {
      Aux.re = pW->re*C[i].re - pW->im*C[i].im;
      Aux.im = pW->re*C[i].im + pW->im*C[i].re;
      A[i].re = B[i].re + Aux.re;
      A[i].im = B[i].im + Aux.im;
      A[i+n].re = B[i].re - Aux.re;
      A[i+n].im = B[i].im - Aux.im;
    }
  }
}

Figure 7: The OTMP implementation of the FFT

Figure 8: FFT, different problem sizes. (a) Cray T3E: speed-up vs. processors (64 Kb, 256 Kb, 1024 Kb). (b) Origin 2000: speedup vs. NP (64, 512, 1024).

4.4 Parallel Sorting by Regular Sampling

Parallel Sorting by Regular Sampling (PSRS) is a parallel sorting algorithm proposed by Li et al. [9]. It is an example of a simple and synchronous algorithm, and it has been shown effective for a wide variety of MIMD architectures.

PSRS works with p processes and assumes that the input list has n unsorted elements. It proceeds in four synchronous phases:

Phase 1
Each of the p processors uses the sequential quicksort algorithm to sort its local ⌈n/p⌉ elements. Then, each processor chooses, from its sorted data, a sorted list as regular sample.

1  void qs(int *v, int first, int last) {
2    int w[2];               /* weight vector */
3    int posf[2], posl[2];   /* first and last of each partition */
4    int pivot, i, j;
5
6    select(v, first, last, &pivot);
7    i = first; j = last;
8    partition(v, &i, &j, pivot);
9    /* setting each subproblem and the weight vector */
10   w[0] = (j-first+1); w[1] = (last-i+1);
11   posf[0] = first; posf[1] = i;
12   posl[0] = j; posl[1] = last;
13
14   forall(i = 0; i <= 1; w[i]; (v+posf[i], w[i]*sizeof(int)))
15     qs(v, posf[i], posl[i]);
16 }

Figure 9: The OTMP implementation of QuickSort

Figure 10: QuickSort, size of the vector: 1M. Speed-up vs. number of processors on Digital, SGI, SP2, Cray T3E and Cray T3D.

Phase 2
One processor gathers and sorts all the local samples. Finally, it selects p−1 pivot values from the list of regular samples. At this point, each processor knows the final pivot list and partitions its sorted data into p disjoint pieces; the pivot values are the separators between the pieces.

Phase 3
Each processor i keeps its i-th partition and assigns the j-th partition to processor j. That is, each processor keeps one partition and reassigns p−1 partitions to the other processors.

Phase 4
Each processor merges its p partitions into a single list. The concatenation of every processor's list is the final sorted list with n elements.
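The pivot selection of Phase 2 is just an indexing rule over the gathered samples; a hypothetical sketch mirroring lines 18-19 of figure 11 (pick_pivots and the stride parameter are our names):

    /* Hypothetical Phase 2 kernel: from the p*p sorted regular samples,
       select p-1 pivots at regularly spaced positions. */
    static void pick_pivots(const int samples[], int p, int stride,
                            int pivots[]) {
        int j;
        for (j = 1; j <= p - 1; j++)
            pivots[j-1] = samples[(j * p + stride) - 1];
    }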

The OTMP algorithm for PSRS is shown in figure 11. To solve the problem, it uses the global communication constructors (lines 14 and 23). This algorithm differs from the original proposal: every process has every sample, so no extra communications are needed to know the final pivots. The quicksort in line 17 can be replaced by a merge of NUMPROCESSORS sorted lists of samples.

1  void PSRS(int i, int *v, int size, int *V, int *sizeV) {
2    int j, k;
3    int *Samples, samplest;     /* Samples vector and stride in v to obtain samples */
4    int *Pivots, pivotst;       /* Pivots vector and stride in Samples to obtain pivots */
5    int *to_task;               /* First list to each process */
6    int *res_otask, *from_size; /* Lists of data received from other processes and their sizes */
7
8    int vLocal;                 /* Data list of the local process */
9
10   /* ================== Phase 1 ================= */
11   quicksort(v, size);
12   for ((k=0, j=i*NTASK); k<NTASK; (k+=samplest, j++))
13     Samples[j] = v[k];
14   result(i, Samples+(i*NTASK), NTASK*sizeof(int));
15
16   /* ================== Phase 2 ================= */
17   quicksort(Samples, NTASK*NTASK);
18   for (k=0, j=1; j<=(NTASK-1); k++, j++)
19     Pivots[k] = Samples[(j*NTASK + pivotst)-1];
20   dividelist(v, Pivots, to_task);
21
22   /* ================== Phase 3 ================= */
23   resultP(i, v, to_task, res_otask, from_size);
24
25   /* ================== Phase 4 ================= */
26   V = mergelist(vLocal, res_otask, from_size, sizeV);
27 }

Figure 11: The OTMP implementation of PSRS

Figure 12 shows the main function. The problem is solved by NUMPROCESSORS tasks. Each task works in parallel over SIZE/NUMPROCESSORS data. V is the output vector, and it is sorted. After the forall no further communication is necessary: every process has the totally sorted vector.

void main(int argc, char *argv[]) {
  int i, size, new_size[NUMPROCESSORS];
  int *v, *V;

  size = SIZE/NUMPROCESSORS;
  v = calloc(size, sizeof(int));
  V = calloc(SIZE, sizeof(int));

  initialize(v, size);
  forall(i = 0; i <= NUMPROCESSORS-1; (V[i], new_size[i]))
    PSRS(i, v, size, &(V[i]), &(new_size[i]));
}

Figure 12: The OTMP main function of PSRS

The OTMP PSRS is an algorithm that is easy to understand and follow. The computational results are not reported because we are still gathering and analyzing them.

5 Conclusions and Future Work

OTMP is a computation model that offers simplicity, ease of programming, performance (it does not introduce any overhead for the sequential case), portability, and an associated cost model. In addition to these characteristics, the model extends not only the sequential but also the MPI programming model.

Two properties deserve emphasis. One is the cost model, which allows the prediction of performance. The other is that it allows exploiting any nested levels of parallelism, taking advantage of situations where there are several small nested loops: although each loop does not produce enough work to parallelize, their union suffices. Perhaps the most paradigmatic examples of such a family of algorithms are recursive divide and conquer algorithms.

We have a prototype for the model. Many issues can be optimized. Even incorporating these improvements, the gains for distributed memory machines will never be equivalent to what can be obtained using raw MPI. The results obtained prove the feasibility of exploiting nested parallelism with the model. However, the combination of all its properties makes the research and development of tools oriented to this model worthwhile.

Acknowledgments

We wish to thank the Universidad Nacional de San Luis and the ANPCYT, from which we receive continuous support, and also the European centers EPCC, CIEMAT, CEPBA and CESCA.

References

[1] Aho, A. V., Hopcroft, J. E. and Ullman, J. D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts (1974).

[2] Ayguade, E., Martorell, X., Labarta, J., Gonzalez, M. and Navarro, N.: Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proc. of the 1999 International Conference on Parallel Processing, Aizu, Japan, September 1999.

[3] Blikberg, R. and Sørevik, T.: Nested parallelism: Allocation of processors to tasks and OpenMP implementation. Proceedings of the Second European Workshop on OpenMP (EWOMP 2000), Edinburgh, Scotland, UK (2000).

[4] Cooley, J. W. and Tukey, J. W.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), (1965) 297-301.

[5] Hoare, C. A. R.: Quicksort. Computer Journal, 5(1), (1962) 10-15.

[6] HPF Forum: HPF Language Specification, Version 2.0. http://dacnet.rice.edu/Depts/CRPC/HPFF/versions/hpf2/hpf-v20/index.html (1997).

[7] Keller, J., Kessler, C. and Larsson, J.: Practical PRAM Programming. John Wiley & Sons Inc. (2001).

[8] Leopold, C.: Parallel and Distributed Computing: A Survey of Models, Paradigms, and Approaches. John Wiley & Sons Inc. (2001).

[9] Li, X., Lu, P., Schaeffer, J., Shillington, J., Wong, P. and Shi, H.: On the Versatility of Parallel Sorting by Regular Sampling. Tech. Report TR 91-06, University of Alberta, Edmonton, Alberta, Canada (1992).

[10] MPI Forum: MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/mpi-20.ps.Z (1997).

[11] OpenMP Architecture Review Board: OpenMP Specifications: FORTRAN 2.0. http://www.openmp.org/specs/
