F. Piccoli, M. Printista
J.A. Gonzalez, C. León, J.L. Roda, C. Rodríguez, F. Sande
Universidad Nacional de San Luis, Ejército de los Andes 950, San Luis, Argentina.
Centro Superior de Informática, Universidad de La Laguna, Tenerife, Spain.
e-mail: mpiccoli@unsl.edu.ar
Abstract
Since the early stages of parallel computing, one of the most common solutions to introduce parallelism has been to extend a sequential language with some sort of parallel version of the for construct, commonly denoted as the forall construct. Although similar in syntax, these forall loops differ in their semantics and implementations. The High Performance Fortran (HPF) and OpenMP versions are likely among the most popular. This paper presents yet another forall loop extension for the C language.
In this work, we introduce a parallel computation model: the One Thread Multiple Processor Model (OTMP). This model proposes an abstract machine, a programming model and a cost model. The programming model defines another forall loop construct, the theoretical machine aims for both homogeneous shared and distributed memory computers, and the cost model allows the prediction of the performance of a program. OTMP not only integrates and extends sequential programming, but also includes and expands the message passing programming model. The model allows and exploits any nested levels of parallelism, taking advantage of situations where there are several small nested loops.
Keywords: Computation Model, Abstract Machine, Programming Model, Cost Model.
1 Introduction
When comparing parallel computing with sequential computing, the current situation is different: no single model of parallel computation has yet come to dominate developments in parallel computing in the way that the von Neumann model has dominated sequential computing. The aims of a parallel computing model are to describe classes of architectures in simple and realistic terms and to propose a design methodology for parallel algorithms. It provides an abstract view of both the technologies and the applications: algorithms are designed and analyzed in the abstract model and coded in the programming model.
The design of parallel programs requires fancy solutions that are not present in sequential programming. Thus, a designer of parallel applications is concerned with the problem of ensuring the correct behavior of all the processes that the program comprises.
Generally, a programming model describes a programmer's view of the parallel system. It has a different meaning in other areas of computer science but, in particular, it often stands for the complete semantics of a language. A good parallel programming model has to satisfy the following properties: programmability, reflectivity, cost model, architecture independence and flexibility [8].
In this paper, we present the OTMP computation model, its abstract machine and its differences with the current versions.
The next section presents the theoretical machine, the programming model (syntax and semantics) and the associated complexity model. Section three deals with load balancing issues and the mapping and scheduling policies used. Finally, section four shows several examples and computational results.
2 OTMP Model
One of the most common solutions to introduce parallelism has been to extend a sequential language with a short set of parallel constructs. Such constructs are a version of the for construct, commonly denoted as the forall construct. This model proposes another forall loop extension for the C language. These extensions are aimed at both homogeneous distributed memory and shared memory architectures.
This model has several differences with other models [6][11]. The most important are:
- The parallel programming model introduced has an associated complexity model that allows the analysis and prediction of performance.
- It allows and exploits any nested levels of parallelism, taking advantage of situations where there are several small nested loops: although each loop does not produce enough work to parallelize, their union suffices. Recursive divide and conquer algorithms are the most paradigmatic example.
- It not only integrates and extends the sequential programming model but also includes and expands the message passing programming model.
2.1 Abstract Machine
The OTMP theoretical computer is a machine composed of an infinite number of processors, each one with its own private memory and a network interface connecting them. The processors are organized in groups. At any time, the memory state of all the processors in the same group is identical. An OTMP computation assumes that they also have the same input data and the same program in memory. The initial group is composed of all the processors in the machine. Each processor is a RAM machine [1]; the only difference among them is an internal register, the NAME of the processor. This register is not available to the programmer. According to this definition, every real parallel machine is an OTMP machine.
2.2 Syntax and Semantics
The programming model being introduced extends the classic sequential imperative paradigm with a new construct: parallel loops. The model aims for both distributed and shared memory architectures; the implementation on the latter can be considered the "easy part" of the task.
With the parallel loop

    forall(i = first; i <= last; (r[i], s[i]))
        compound_statement_i

the programmer states that the different iterations i of the loop can be performed independently in parallel. (r[i], s[i]) is the memory area of the results of the i-th iteration, where r[i] points to the first position and s[i] is the size in bytes.
When reaching the former forall loop, each processor decides in terms of its NAME the iterations it has to execute. The rules ruling the division process, for any processor NAME, are:

    Number of iterations to do:  M = last - first + 1                      (1)
    Iteration to do:             i = first + NAME % M                      (2)
    Neighbors:                   for(j = 1; j < M; j++)
                                     neighbor[j] = Δ + (NAME + j) % M      (3)
                                 where Δ is given by: Δ = M * (NAME / M)
    New value of the register:   NAME = NAME / M                           (4)

These equations decide the mapping of processors to tasks and how the exchange of results will take place. After the forall is finished, the processors recover their former value of NAME.
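The division rules above can be evaluated in plain C. The helper below is purely illustrative (it is not part of the OTMP runtime, and its names are ours); it computes formulas 1-4 for a processor with register NAME executing forall(i = first; i <= last; ...):

```c
#include <assert.h>

/* Illustrative helper (not part of the OTMP runtime): evaluates the
   division rules (1)-(4).  neighbor[] must have room for M ints;
   slot 0 is unused, as in formula (3). */
void otmp_divide(int NAME, int first, int last,
                 int *iter, int *neighbor, int *new_name)
{
    int M = last - first + 1;            /* (1) number of iterations */
    *iter = first + NAME % M;            /* (2) iteration this processor runs */
    int delta = M * (NAME / M);          /* base of the current block of M processors */
    for (int j = 1; j < M; j++)          /* (3) one neighbor in each other subgroup */
        neighbor[j] = delta + (NAME + j) % M;
    *new_name = NAME / M;                /* (4) NAME inside the new group */
}
```

For NAME = 0 and forall(i = 1; i <= 3; ...) this yields iteration 1 and neighbors 1 and 2, matching the exchange described below for figure 1.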
Each time a forall loop is executed, the memory of the group contains, up to that point, exactly the same values. At such a point the memory is divided in two parts: the one that is going to be modified and the one that is not changed inside the loop. Variables in the last set are available inside the loop for reading. The others are partitioned among the new groups.
The last parameter in the forall has the purpose of informing of the new "ownership" of the part of the memory that is going to be modified. It announces that the group performing thread i "owns" (and presumably is going to modify) the memory areas delimited by (r[i], s[i]). r[i]+j is a pointer to the memory area containing the j-th result of the i-th thread.
To guarantee that, after returning to the previous group, the processors in the father group have a consistent view of the memory and that they will behave as the same thread, it is necessary to provide the exchange among neighbors of the variables that were modified inside the forall
loop. Let us denote by T_i the execution of the body of the i-th thread (compound_statement_i). The semantics imposes two restrictions:

1. Given two different independent threads T_i and T_k and two different result items r[i] and r[k], it holds:

       [r[i], s[i]] ∩ [r[k], s[k]] = ∅  for all i ≠ k

2. For any thread T_i and any result j, all the memory space defined by [r[i]+j, s[i]] has to be allocated previously to the execution of the thread body. This makes impossible the use of non-contiguous dynamic memory structures.
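The first restriction can be checked mechanically. The following helper is a hypothetical illustration (not part of the model) that tests whether the result areas [r[i], r[i]+s[i]) declared by n threads are pairwise disjoint:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checker (not part of OTMP): returns 1 when the n result
   areas [r[i], r[i]+s[i]) are pairwise disjoint, as restriction 1
   demands, and 0 otherwise. */
int results_disjoint(char *r[], size_t s[], int n)
{
    for (int i = 0; i < n; i++)
        for (int k = i + 1; k < n; k++)
            if (r[i] < r[k] + s[k] && r[k] < r[i] + s[i])
                return 0;               /* the two intervals overlap */
    return 1;
}
```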
The programmer has to be specially conscious of the first restriction: it is mandatory that the address of any memory cell written during the execution of T_i belongs to one of the intervals in the list of results for the thread. As an example, consider the code in figure 1.
1  forall(i=1; i<=3; (ri[i], si[i]))
2  { ...
3     forall(j=0; j<=i; (rj[j], sj[j])) {
4        int a, b;
5        ...
6        if (i % 2 == 1)
7           if (j % 2 == 0)
8              send(j+1, a, sizeof(int));
9           else receive(j-1, b, sizeof(int));
10       ...
11    }
12    ...
13 }
Figure 1: Two nested foralls
Applying equations 1, 2, 3 and 4, the parallel loop in line 1 of figure 1 divides the group in three. After the execution of the loop, and to keep the coherence of the memory, each processor exchanges the corresponding results with its two neighbors. Thus, processor 0 in the group executes iteration i=1 and sends its memory area (ri[1], si[1]) to processors 1 and 2. Furthermore, it receives (ri[2], si[2]) from processor 1 and (ri[3], si[3]) from processor 2. The same exchange is repeated among the other corresponding triplets ((3,4,5), (6,7,8), ...). Every new nested forall creates/structures the current subgroup according to an M-ary hypercube, where M is the number of iterations in the parallel loop and the neighborhood relation is given by formula 3. Thus, this first forall produces "the face" of a ternary hypercubic dimension, where every corner has two neighbors. The second nested forall at line 3 requests a different number of threads in the different groups. For example, the last group (i=3) executes a forall of size 4, and consequently the group is partitioned in 4 subgroups. In this 4-ary dimension, each processor is connected with 3 neighbors in the other subgroups. Therefore, at the end of the nested compound statement, processor 17 will send its data, rj[1], to processors 14, 20 and 23 and will receive from them their data rj[0], rj[2] and rj[3].
Figure 2: The mapping associated with the two nested foralls in figure 1
2.3 Other Clauses
The OTMP model admits several coherent extensions; in this section we present two of them: reduction clauses and global communication.
The reduction clause has syntax and semantics similar to those of the above forall; its simplified syntax is

    forallR(i = first; i <= last; (r[i], s[i]); f_r[i]; (rr[i], sr[i]))
    {
        compound_statement_i;
        f_r;
    }

The forallR works in the same way as forall: it divides the processors, executes compound_statement_i in parallel over its own data and exchanges the results, r. When all is done, each processor reduces the results through f_r. The result of f_r is rr and it has size sr. The f_r can be any function, but it is mandatory that it be commutative and associative.
The other OTMP clauses are the global communication clauses: result and resultP. Their difference is that result communicates the same data to every processor in the other groups, while resultP also communicates data to every processor in the other groups, but the data are different and depend on the receptor processor's group. The corresponding syntax is

    result(i, data[i], s[i])
    resultP(i, data_to[i], s_to[i], data_from[i], s_from[i])

where i identifies the task T_i. In the first clause, data[i] points to the data of T_i and s[i] is its size. In the second clause, data_to[i] is a pointer to the memory area containing the data to send to T_i, and s_to[i] contains its size. data_from[i] and s_from[i] have the same function, but for the data received from T_i. These clauses are similar to the gather and scatter functions in standard libraries such as MPI [10], but one of the differences is the neighborhood relation: the exchange of data takes place whenever it is necessary.
2.4 Cost Model
In this section, we analyze the cost of the forall loop. A similar analysis applies to the other clauses.
The complexity T(P) of any OTMP program P can be computed, in what refers to ordinary sequential constructs (while, for, if, ...), as in the RAM machine [1]. The cost of the forall loop is given by the recursive formula:

    T(forall) = A_0 + A_1·M + max_{i=e1..e2} T(T_i) + Σ_{i=e1..e2} (g·N_i) + L    (5)

where T_i is the code of compound_statement_i, M is the size of the loop and N_i is the size of the message transferred in iteration i:

    M = (e2 − e1 + 1),    N_i = Σ_{k=1..m} s_ki

A_0 is the constant time invested computing formulas 1, 2 and 4. A_1 is the time spent finding the neighbors (formula 3). Constant g is the inverse of the bandwidth, and L is the startup latency. Assume the rather common case in which the volume of communication of each iteration is roughly the same:

    N ≈ N_i ≈ N_j  for all i ≠ j, with i, j = e1, ..., e2

From formula 5 and from the fact that in current machines the scheduling and mapping time A_0 + A_1·M is dominated by the communication time Σ_{i=e1..e2} (g·N_i + L), it follows the result that establishes when an independent for loop is worth converting into a forall:

    T(forall) ≤ T(for)  ⟺  M·(g·N) + L + max_{i=e1..e2} T(T_i) ≤ Σ_{i=e1..e2} T(T_i)    (6)
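Test (6) can be evaluated numerically. The sketch below assumes measured values for g, L, N and the per-iteration sequential times T_i (all names and values here are illustrative placeholders, not measurements from the paper):

```c
#include <assert.h>

/* Numerical reading of test (6): returns 1 when converting an independent
   for loop of M iterations into a forall pays off, i.e. when
   M*(g*N) + L + max_i T[i] <= sum_i T[i].
   g is the inverse bandwidth, L the startup latency, N the bytes
   transferred per iteration. */
int worth_forall(double g, double L, double N, const double T[], int M)
{
    double sum = 0.0, max = 0.0;
    for (int i = 0; i < M; i++) {
        sum += T[i];
        if (T[i] > max)
            max = T[i];
    }
    return M * (g * N) + L + max <= sum;
}
```

With four balanced one-second iterations and cheap communication the conversion pays off; with a single iteration it never does, since the communication cost is pure overhead.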
A remarkable fact is that the OTMP machine not only generalizes the sequential RAM but also the Message Passing Programming Model. Lines 6-8 in figure 1 illustrate the idea. Each new forall "creates" a communicator (in the MPI sense). The execution of lines 7 and 8 in the fourth group (i=3) implies that thread j=0 sends a to thread j=1. This operation carries that, at the same time that processor 2 sends its replica of a to processor 5, processor 14 sends its copy to processor 17, and so on. To summarize: any send or receive is executed by the infinite couples involved. Still, the two aforementioned constraints have to be true: any variable modified inside the forall has to belong to the result memory areas declared for its thread.
3 Mapping and Scheduling
Unfortunately, an infinite machine such as that described in the previous section is only an idealization. Real machines have a restricted number of processors. Each processor has an internal register NUMPROCESSORS where it stores the number of available processors in its current group. Three different situations have to be considered in the execution of a forall construct with M = e2 − e1 + 1 threads:
- NUMPROCESSORS is equal to 1;
- M is larger than or equal to NUMPROCESSORS;
- NUMPROCESSORS is larger than M.
The first case is trivial: there is only one processor, which executes all the threads sequentially, and there is no opportunity to exploit any intrinsic parallelism.
The second case has been extensively studied as flat parallelism. The main problem that arises is load balancing. To deal with it, several scheduling policies have been proposed [11]. Many assignment policies are also possible: block, cyclic-block, guided or dynamic.
The third case was studied in the previous section, but the fact that the number of available processors is larger than the number of threads introduces several additional problems: the first is load balancing; the second is that the groups are no longer divided into subgroups of the same size.
If a measure of the work w_i per thread T_i is available, the processor distribution policy viewed in the previous section can be modified to guarantee an optimal mapping [3]. The syntax of the forall is revisited to include this feature:

    forall(i = first; i <= last; w[i]; (r[i], s[i]))
        compound_statement_i

If there is no weight specification, the same workload is assumed for every task. The semantics is similar to that proposed in [2]. Therefore, the mapping is computed according to a policy similar to that sketched in [3]. Figure 3 shows the applied mapping algorithm. There is, however, the additional problem of establishing the neighborhood relation. This time the simple one-to-(M−1) hypercubic relation of the former section does not hold. Instead, the hypercubic shape is distorted into a polytope holding the property that each processor in every group has one and only one incoming neighbor in any of the other groups.
1  for(i = 0; i < M; i++)
2     p_i = 1;
3  for(j = M+1; j < NUMPROCESSORS; j++)
4  {
5     Get p_k such that w_k/p_k = max_{i=1,...,M} w_i/p_i;
6     p_k = p_k + 1;
7  }
Figure 3: Optimal Mapping Algorithm
This algorithm satisfies the optimal mapping postulate.
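A runnable reading of the algorithm of figure 3 can be sketched as follows. Names are illustrative, and the second loop starts at j = M so that all NUMPROCESSORS − M surplus processors end up assigned:

```c
#include <assert.h>

/* Sketch of the mapping policy of figure 3: every one of the M tasks gets
   one processor, then each remaining processor goes to the task k with the
   largest per-processor load w[k]/p[k]. */
void map_processors(const double w[], int p[], int M, int nprocs)
{
    for (int i = 0; i < M; i++)
        p[i] = 1;
    for (int j = M; j < nprocs; j++) {   /* distribute the surplus */
        int k = 0;
        for (int i = 1; i < M; i++)
            if (w[i] / p[i] > w[k] / p[k])
                k = i;
        p[k]++;
    }
}
```

For three tasks with weights 4, 1 and 1 on six processors, the heavy task receives the four processors and the light tasks one each, equalizing the per-processor load.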
4 Examples
Several examples have been chosen to illustrate the use of the OTMP model: Matrix Multiplication, Fast Fourier Transform, QuickSort and Parallel Sorting by Regular Sampling. The results are presented for several machines, using the current software system, a prototype C compiler for the model.
4.1 Matrix Multiplication
The problem to solve is to compute tasks matrix multiplications (C_i = A_i × B_i, i = 0, ..., tasks−1). Matrices A_i and B_i have respectively dimensions m×q_i and q_i×m. Therefore, the product A_i × B_i takes a number of operations w[i] proportional to m²·q_i. Figure 4 shows the algorithm. Variables A, B and C are arrays of pointers to the matrices. The loop in line 1 deals with the different matrices, the loops in lines 5 and 7 traverse the rows and columns and, finally, the innermost loop in line 8 produces the dot product of the current row and column. Although all the for loops are candidates to be converted to forall loops, we will focus on two cases: the parallelization of only the loop in line 5 (labeled FLAT in the results, figure 5) and the one shown in figure 4 where, additionally, the loop at line 1 is also converted to a forall (labeled NESTED in figure 5). This example illustrates one of the common situations where you can take advantage
1  forall(i = 0; i < tasks; w[i]; (C[i], m * m))
2  {
3     q = ...;
4     Ci = C+i; Ai = A+i; Bi = B+i;
5     forall(h = 0; h < m; (Ci[h], m))
6     {
7        for(j = 0; j < m; j++)
8           for(r = &Ci[h][j], *r = 0.0, k = 0; k < q; k++)
9              *r += Ai[h][k] * Bi[k][j];
10    }
11 }
Figure 4: Exploiting 2 levels of parallelism
of nested parallelism: when neither the inner loop (lines 5-10) nor the external loop (line 1) has enough work to reach a satisfactory speedup, but the combination of both does. We will denote by SP_R(A) the speedup of an algorithm A with R processors and by T_P(A) the time spent executing algorithm A on P processors. Let us also simplify to the case when all the inner tasks take the same time, i.e. q_i = m. Under these assumptions, the previous statement can be rewritten more precisely:

Lemma 1
Let tasks < P, SP_P(inner) < SP_{P/tasks}(inner) and tasks·(g·m²) + L ≤ T_{P/tasks}(inner); then

    SP_P(FLAT) < SP_P(NESTED)
Figure 5 shows the results described above for this example. Both matrices have dimensions 45×45. The number of tasks is 8 and, when the number of processors is 2 or 4, there is processor virtualization. In all cases and for the different architectures, the nested parallelism works well. Moreover, when the size of the problems to solve is large, the speedup grows to be quasi-linear. We check that the speedup reached is linear, as figure 6 shows. In the next examples we will concentrate on the more interesting case where the nested loops are small and, according to the complexity analysis, there are few opportunities for parallelization.
4.2 Fast Fourier Transform
The Discrete Fourier Transform is widely used in solving problems in science and engineering. It is defined as

    A_j = Σ_{k=0..N−1} a_k · e^{2πijk/N},    j = 0, ..., N−1
[Plot omitted] Figure 5: Nested versus Flat parallelism. (a) Cray T3E: time (secs.) vs. number of processors (Matrix N=46, Tasks=4), curves FLAT and NESTED. (b) Origin 2000: time vs. NP, curves Flat and Nested.
[Plot omitted] Figure 6: Speedup reached for large size problems (Matrix N=500, Tasks=4, Cray T3E): speedup vs. number of processors.
The OTMP code in figure 7 implements the Fast Fourier Transform (FFT) algorithm developed by Cooley and Tukey [4]. Parameter W, with W_k = e^{2πik/N}, contains the powers of the N-th root of unity.
Figure 8 shows the computational results for such an implementation on an SGI Origin 2000 and a CRAY T3E. The speedup curves behave according to what a complexity analysis following formula 5 predicts. Moreover, when the size of the problem is small, the communications influence the performance.
4.3 QuickSort
The well-known quicksort algorithm [5] is a divide-and-conquer sorting method. As such, it is amenable to a nested parallel implementation. This example is specially interesting, since the sizes of the generated sub-vectors are used as weights for the forall (see figure 9, line 14). Remember that, depending on the goodness of the chosen pivot, the new subproblems may have rather different weights.
Figure 10 presents the speed-ups on a Digital AlphaServer, an Origin 2000, an IBM SP2, a Cray T3E and a Cray T3D.
1  void llFFT(Complex *A, Complex *a, Complex *W, unsigned N, unsigned stride, Complex *D) {
2     Complex *B, *C, Aux, *pW;
3     unsigned i, n;
4
5     if(N == 1) {
6        A[0].re = a[0].re;
7        A[0].im = a[0].im;
8     }
9     else {
10       n = (N >> 1);
11       forall(i = 0; i <= 1; (D+i*n, n))
12       {
13          llFFT(D+i*n, a+i*stride, W, n, stride<<1, A+i*n);
14       }
15       B = D;   /* Combination phase */
16       C = D + n;
17       for(i = 0, pW = W; i < n; i++, pW += stride) {
18          Aux.re = pW->re*C[i].re - pW->im*C[i].im;
19          Aux.im = pW->re*C[i].im + pW->im*C[i].re;
20          A[i].re = B[i].re + Aux.re;
21          A[i].im = B[i].im + Aux.im;
22          A[i+n].re = B[i].re - Aux.re;
23          A[i+n].im = B[i].im - Aux.im;
24       }
25    }
26 }
Figure 7: The OTMP implementation of the FFT
[Plot omitted] Figure 8: FFT. Different sizes. (a) Cray T3E: speed-up vs. processors for 64 Kb, 256 Kb and 1024 Kb. (b) Origin 2000: speedup vs. NP for sizes 64, 512 and 1024.
4.4 Parallel Sorting by Regular Sampling
Parallel Sorting by Regular Sampling (PSRS) is a parallel sorting algorithm proposed by Li et al. [9]. It is an example of a simple and synchronous algorithm, and it has been shown effective for a wide variety of MIMD architectures.
PSRS works with p processes and assumes that the input list has n unsorted elements. It arises in four synchronous phases:
Phase 1
Each of the p processors uses the sequential quicksort algorithm to sort a local ⌈n/p⌉ elements.
1  void qs(int *v, int first, int last) {
2     int w[2];               /* weight vector */
3     int posf[2], posl[2];   /* first and last of each partition */
4     int pivot, i, j;
5
6     select(v, first, last, &pivot);
7     i = first; j = last;
8     partition(v, &i, &j, pivot);
9     /* setting each subproblem and the weight vector */
10    w[0] = (j-first+1); w[1] = (last-i+1);
11    posf[0] = first; posf[1] = i;
12    posl[0] = j; posl[1] = last;
13
14    forall(i = 0; i <= 1; w[i]; (v+posf[i], w[i]*sizeof(int)))
15       qs(v, posf[i], posl[i]);
16 }
Figure 9: The OTMP implementation of QuickSort
[Plot omitted] Figure 10: QuickSort. Size of the vector: 1M. Speed-up vs. number of processors on Digital, SGI, SP2, Cray T3E and Cray T3D.
Then, each processor selects, from its sorted data, a sorted list as a regular sample.
Phase 2
One processor gathers and sorts all the local samples. Finally, it selects p−1 pivot values from the list of regular samples. At this point, each processor knows the final pivot list and partitions its sorted data into p disjoint pieces; the pivot values are the separators between the pieces.
Phase 3
Each processor i keeps its i-th partition and assigns the j-th partition to processor j. Each processor keeps one partition and reassigns the p−1 other partitions to the other processors.
Phase 4
Each processor merges its p partitions into a single list. The concatenation of every processor's list is the final sorted list with n elements.
The OTMP algorithm for PSRS is shown in figure 11. To solve the problem, it uses the global communication constructors (lines 14 and 23). This algorithm differs from the original proposal: every process has every sample and does not need extra communications to know the final pivots.
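The sampling and pivot-selection steps of phases 1 and 2 can be sketched sequentially. The function below is our illustration of the idea (not the OTMP code of figure 11) for p processes simulated on one array, assuming n divisible by p² for simplicity:

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Phases 1-2 of PSRS, sequentially: sort each of the p blocks of v,
   take p regular samples per block, sort the p*p samples and pick
   p-1 pivots from them. */
void psrs_pivots(int *v, int n, int p, int *pivots)
{
    int *samples = malloc((size_t)p * p * sizeof(int));
    int block = n / p;
    for (int i = 0; i < p; i++) {
        qsort(v + i * block, block, sizeof(int), cmp_int);   /* phase 1 */
        for (int j = 0; j < p; j++)                          /* regular sample */
            samples[i * p + j] = v[i * block + j * (block / p)];
    }
    qsort(samples, (size_t)p * p, sizeof(int), cmp_int);     /* phase 2 */
    for (int j = 1; j < p; j++)
        pivots[j - 1] = samples[j * p];
    free(samples);
}
```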
The quicksort in line 17 can be replaced by a merge of NUMPROCESSORS sorted lists of samples.
1  void PSRS(int i, int *v, int size, int *V, int *sizeV) {
2     int j, k;
3     int *Samples, samplest;     /* samples vector and stride in v to obtain the samples */
4     int *Pivots, pivotst;       /* pivots vector and stride in Samples to obtain the pivots */
5     int *to_task;               /* first list to each process */
6     int *res_otask, *from_size; /* lists of data received from other processes and their sizes */
7
8     int *vLocal;                /* data list of the local process */
9
10    /* ================== Phase 1 ================== */
11    quicksort(v, size);
12    for((k=0, j=i*NTASK); k<NTASK; (k+=samplest, j++))
13       Samples[j] = v[k];
14    result(i, Samples+(i*NTASK), NTASK*sizeof(int));
15
16    /* ================== Phase 2 ================== */
17    quicksort(Samples, NTASK*NTASK);
18    for(k=0, j=1; j<=(NTASK-1); k++, j++)
19       Pivots[k] = Samples[(j*NTASK + pivotst)-1];
20    divide_list(v, Pivots, to_task);
21
22    /* ================== Phase 3 ================== */
23    resultP(i, v, to_task, res_otask, from_size);
24
25    /* ================== Phase 4 ================== */
26    V = merge_list(vLocal, res_otask, from_size, sizeV);
27 }
Figure 11: The OTMP implementation of the PSRS
Figure 12 shows the main function. The problem is solved by NUMPROCESSORS tasks. Each task works in parallel over SIZE/NUMPROCESSORS data. V is the output vector; it is sorted. After the forall, no communication is necessary: every process has the total sorted vector.
1  void main(argc, argv) {
2     int i, size, new_size;
3     int *v, *V;
4
5     size = SIZE/NUMPROCESSORS;
6     v = calloc(size, sizeof(int));
7     V = calloc(SIZE, sizeof(int));
8
9     initialize(v, size);
10    forall(i=0; i<=NUMPROCESSORS-1; (V[i], new_size[i]))
11       PSRS(i, v, size[i], (V[i]), &(new_size[i]));
12 }
Figure 12: The OTMP main function of PSRS
The OTMP PSRS is an algorithm easy to understand and follow. The computational results are not reported because we are still gathering and analyzing them.
5 Conclusions and Future Work
OTMP is a computation model that offers simplicity, ease of programming, performance improvement (it does not introduce any overhead in the sequential case), portability, and an associated cost model. In addition to these characteristics, the model extends not only the sequential but also the MPI programming model.
One of its advantages is the prediction of performance. The other is that it allows exploiting any nested levels of parallelism, taking advantage of situations where there are several small nested loops: although each loop does not produce enough work to parallelize, their union suffices. Perhaps the most paradigmatic examples of such a family of algorithms are recursive divide and conquer algorithms.
We have a prototype for the model. Many issues can be optimized. Even incorporating these improvements, the gains for distributed memory machines will never be equivalent to what can be obtained using raw MPI. The results obtained prove the feasibility of exploiting nested parallelism with the model. However, the combination of all its properties makes worth the research and development of tools oriented to this model.
Acknowledgments
We wish to thank the Universidad Nacional de San Luis and the ANPCYT, from which we receive continuous support; also the European centers EPCC, CIEMAT, CEPBA and CESCA.
References
[1] Aho, A. V., Hopcroft, J. E. and Ullman, J. D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts (1974).
[2] Ayguadé, E., Martorell, X., Labarta, J., González, M. and Navarro, N.: Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999.
[3] Blikberg, R. and Sørevik, T.: Nested parallelism: Allocation of processors to tasks and OpenMP implementation. Proceedings of the Second European Workshop on OpenMP (EWOMP 2000), Edinburgh, Scotland, UK (2000).
[4] Cooley, J. W. and Tukey, J. W.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), (1965) 297-301.
[5] Hoare, C. A. R.: Quicksort. Computer Journal, 5(1), (1962) 10-15.
[6] HPF Forum: HPF Language Specification, Version 2.0. http://dacnet.rice.edu/Depts/CRPC/HPFF/versions/hpf2/hpf-v20/index.html (1997).
[7] Keller, J., Kessler, C. and Larsson, J.: Practical PRAM Programming. John Wiley & Sons Inc. (2001).
[8] Leopold, C.: Parallel and Distributed Computing: A survey of models, paradigms, and approaches. John Wiley & Sons Inc. (2001).
[9] Li, X., Lu, P., Schaeffer, J., Shillington, J., Wong, P. and Shi, H.: On the Versatility of Parallel Sorting by Regular Sampling. Tech. Report TR 91-06, University of Alberta, Edmonton, Alberta, Canada (1992).
[10] MPI Forum: MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/mpi-20.ps.Z (1997).
[11] OpenMP Architecture Review Board: OpenMP Specifications: FORTRAN 2.0. http://www.openmp.org/specs/