I/O and Networking Software in Sequoia 2000
The Sequoia 2000 project requires a high-speed network and I/O software for the support of global change research. In addition, Sequoia distributed applications require the efficient movement of very large objects, from tens to hundreds of megabytes in size. The network architecture incorporates new designs and implementations of operating system I/O software. New methods provide significant performance improvements for transfers among devices and processes and between the two. These techniques reduce or eliminate costly memory accesses, avoid unnecessary processing, and bypass system overheads to improve throughput and reduce latency.

Vol. 7, 1995
Joseph Pasquale
Eric W. Anderson
Kevin Fall
Jonathan S. Kay

In the Sequoia 2000 project, we addressed the problem of designing a distributed computer system that can efficiently retrieve, store, and transfer the very large data objects contained in earth science applications. By very large, we mean data objects in excess of tens or even hundreds of megabytes (MB).
Earth science research has massive computational requirements, in large part due to the large data objects often found in its applications. There are many examples: an advanced very high-resolution radiometer (AVHRR) image cube requires 300 MB, an advanced visible and infrared imaging spectrometer (AVIRIS) image requires 140 MB, and the common land satellite (LANDSAT) image requires 278 MB. Any throughput bottleneck in a distributed computer system becomes greatly magnified when dealing with such large objects. In addition, Sequoia 2000 was an experiment in distributed collaboration; thus, collaboration tools such as videoconferencing were also important applications to support.
Our efforts in the project focused on operating system I/O and the network. We designed the Sequoia 2000 wide area network (WAN) test bed, and we explored new designs in operating system I/O and network software. The contributions of this paper are twofold: (1) it surveys the main results of this work and puts them in perspective by relating them to the general data transfer problem, and (2) it presents a new design for container shipping. (For a complete discussion of container shipping, see Reference 1.) Since container shipping is a new design, this paper devotes more space to it in relation to the other surveyed work (whose detailed descriptions may be found in References 2 to 9).
In addition to this work, we conducted other network studies as part of the Sequoia 2000 project. These include research on protocols to provide performance guarantees and multicasting (see Reference 10 and following).
To support a high-performance distributed computing environment in which applications can effectively manipulate large data objects, we were concerned with achieving high throughput during the transfer of these objects. The processes or devices representing the data sources and sinks may all reside on the same workstation (single node case), or they may be distributed over many workstations connected by the network (multiple node case). In either case, we wanted applications, be they earth science distributed computations or collaboration tools involving multipoint video, to make full use of the raw bandwidth provided by the underlying communication system.
In the multiple node case, the raw bandwidth is from 45 to 100 megabits per second (Mb/s), because the Sequoia 2000 network used T3 links for long-distance communication and a fiber distributed data interface (FDDI) for local area communication. In the single node case, the raw bandwidth is approximately 100 megabytes per second, since the workstation of choice was one of the DECstation 5000 series or the Alpha-powered DEC 3000 series, both of which use the TURBOchannel as the system bus.
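
To give a feel for what these raw bandwidths mean for the object sizes quoted earlier, the following minimal sketch computes a lower bound on transfer time. It assumes 1 MB = 10**6 bytes and zero software overhead; the function and dictionary names are illustrative and do not come from the Sequoia 2000 software.

```python
# Lower-bound transfer times at the raw bandwidths quoted in the text.
# Assumption: 1 MB = 10**6 bytes and no protocol or software overhead.
MB = 10**6

# Object sizes (in MB) taken from the examples in the text.
OBJECT_SIZES_MB = {"AVHRR image cube": 300, "AVIRIS image": 140}

# Raw bandwidths from the text, expressed in bits per second.
RAW_BANDWIDTH_BPS = {
    "T3 link (45 Mb/s)": 45e6,
    "FDDI ring (100 Mb/s)": 100e6,
    "TURBOchannel (approx. 100 MB/s)": 100e6 * 8,
}


def min_transfer_seconds(size_mb, bandwidth_bps):
    """Best-case time to move size_mb megabytes at bandwidth_bps bits/s."""
    return size_mb * MB * 8 / bandwidth_bps


if __name__ == "__main__":
    for obj, size in OBJECT_SIZES_MB.items():
        for path, bps in RAW_BANDWIDTH_BPS.items():
            seconds = min_transfer_seconds(size, bps)
            print(f"{obj:18s} over {path:32s}: {seconds:6.1f} s")
```

Under these assumptions a 300 MB image cube needs at least about 53 seconds on a T3 link and 24 seconds on FDDI, so any software bottleneck that lowers the achieved throughput adds directly to delays that are already tens of seconds long.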
Our work focused only on software improvements, in particular how to achieve maximum system software performance given the hardware we selected. In fact, we found that the throughput bottlenecks in the Sequoia distributed computing environment were indeed in the workstation's operating system software, and not in the underlying communication system hardware (e.g., network links or the system bus). This problem is not limited to the Sequoia environment: given modern high-speed workstations (100+ millions of instructions per second [mips]) and fast networks (100+ Mb/s), performance bottlenecks are often caused by software, especially operating system software. System software throughput has not kept up with the throughputs of I/O devices, especially network adapters, which have improved tremendously in recent years. These technology improvements are being driven by a new generation of applications, such as interactive multimedia involving digital video and high-resolution graphics, that have high I/O throughput requirements. Supporting these applications and controlling these devices have taxed operating system technology, much of which was designed during times when intensive I/O was not an issue.
In the next section of this paper, we describe the Sequoia 2000 network, which served as an experimental test bed for our work. Following that, we analyze the data transfer problem, which serves as the context for the three subsequent sections. There we describe our solutions to the data transfer problem. Finally, we present our conclusions.
The Sequoia 2000 Network Test Bed

The Sequoia 2000 network is a private WAN that we designed to span five campuses at the University of California: Berkeley, Davis, Los Angeles, San Diego, and Santa Barbara. The topology is shown in Figure 1. The backbone link speeds are 45 Mb/s (T3) with the exception of the Berkeley-Davis link, which is 1.5 Mb/s (T1). At each campus, one or more FDDI local area networks (LANs) that operate at 100 Mb/s are used for local distribution. At some campuses, the configuration is a hierarchical set of rings. For example, at UC San Diego, one FDDI ring covered the campus and joined three separate rings: one at the Computer Systems Lab (our laboratory) in the Department of Computer Science and Engineering, one at the Scripps Institution of Oceanography, and one at the San Diego Supercomputer Center.

Figure 1 Sequoia 2000 Research Network
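
The raw bandwidth available between two hosts on such a network is bounded by the slowest link on the path. The sketch below illustrates this with the three link speeds given above. The example paths are hypothetical: the text does not list the individual links of the topology, so they stand for "some path that crosses a T3 link" and "some path that crosses the T1 link".

```python
# Raw end-to-end bandwidth is limited by the slowest link on the path.
# Link speeds (Mb/s) are the ones stated in the text.
LINK_MBPS = {"FDDI": 100.0, "T3": 45.0, "T1": 1.5}


def path_raw_bandwidth(link_types):
    """Return the bottleneck raw bandwidth (Mb/s) of a path of link types."""
    if not link_types:
        raise ValueError("a path needs at least one link")
    return min(LINK_MBPS[link] for link in link_types)


# Hypothetical paths, for illustration only.
campus_to_campus_over_t3 = ["FDDI", "T3", "FDDI"]
campus_to_campus_over_t1 = ["FDDI", "T1", "FDDI"]

if __name__ == "__main__":
    print(path_raw_bandwidth(campus_to_campus_over_t3), "Mb/s")
    print(path_raw_bandwidth(campus_to_campus_over_t1), "Mb/s")
```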
We used high-performance general-purpose computers as routers, originally DECstation 5000 series and later DEC 3000 series (Alpha-powered) workstations. Using workstations as routers running the ULTRIX or the DEC OSF/1 (now Digital UNIX) operating system provided us with a modifiable software platform for experimentation. The T3 (and T1) interface boards were specially built by David Boggs at Digital. We used off-the-shelf Digital products for FDDI boards, both models DEFTA, which supports both send and receive direct memory access (DMA), and DEFZA, which supports only receive DMA.
The Data Transfer Problem

Since a data source or sink may be either a process or device, and the operating system generally performs the function of transferring data between processes and devices, understanding the bottlenecks in these operating system data paths is key to improving performance. These data paths generally involve traversing numerous layers of operating system software. In the case of network transfers, the data paths are ...
To understand the performance problem we were trying to solve, consider a common client-server interaction in which a client has requested data from a server. The data resides on some source device, e.g., a disk, and must be read by the server so that it may send the data to the client over a network. At the client, the data is written to some sink device, e.g., a frame buffer for display.
Figure 2 shows a typical end-to-end data path where the source and sink end-point workstations are running protected operating system kernels such as UNIX. The source device generates data into the memory of its connected workstation. This memory is generally only addressable by the kernel; to allow the server process to access the data, it is physically copied into memory addressable in the server process's address space, i.e., user space. Physically copying data from one memory location to another (or more generally, touching the data for any reason) is a major bottleneck in modern workstations.

In travelling through the kernel, the data generally travels over a device layer and an abstraction layer. The device layer is part of the kernel's I/O subsystem and manages the I/O devices by buffering data between the device and the kernel. The abstraction layer comprises other kernel subsystems that support abstractions of devices, providing more convenient services for user-level processes. Examples of kernel abstraction-layer software include file systems and communication protocol stacks: a file system converts disk blocks into files, and a communication protocol stack converts network packets into datagrams or stream segments. Sometimes, a kernel implementation may cause physical copying of data between the device layer and the abstraction layer; in fact, copying may even occur within these layers.
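
The cost of moving data by physical copy, as opposed to passing a reference to the same memory, can be illustrated with a small user-level sketch. It is only an analogy: it is not the authors' kernel mechanism, and it assumes, as a worst case, one copy at each of the two boundaries named in the text. The names `BOUNDARIES`, `deliver_by_copy`, and `deliver_by_reference` are hypothetical.

```python
# Illustration: bytes touched when data is copied at every layer boundary
# versus handed on by reference. A user-level analogy, not kernel code.

# Assumed worst case: one physical copy at each boundary named in the text.
BOUNDARIES = (
    "device layer -> abstraction layer",
    "abstraction layer -> user space",
)


def deliver_by_copy(data, boundaries=BOUNDARIES):
    """Copy the payload once per boundary; return (buffer, bytes_copied)."""
    buf = data
    copied = 0
    for _ in boundaries:
        buf = bytearray(buf)  # bytearray(...) always allocates a new copy
        copied += len(buf)
    return buf, copied


def deliver_by_reference(data, boundaries=BOUNDARIES):
    """Hand on a view of the same memory; return (view, bytes_copied)."""
    view = memoryview(data)
    for _ in boundaries:
        view = view[:]  # slicing a memoryview shares the underlying buffer
    return view, 0


if __name__ == "__main__":
    payload = bytes(1_000_000)
    _, copied = deliver_by_copy(payload)
    view, shared = deliver_by_reference(payload)
    print("bytes copied with per-layer copies:", copied)
    print("bytes copied when passing a view:  ", shared)
```

With per-boundary copying, the number of bytes touched grows with both the object size and the number of boundaries, which is why very large objects magnify this overhead.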
From kernel space, the data may travel across several more layers in user space, such as the standard I/O layer and the application layer. The standard I/O layer buffers I/O data in large chunks to minimize the number of I/O system calls. The application layer generally has its own buffers where I/O data is copied. From the server process in user space, the data is then given to the network adapter; this may cause transfers across user process layers and then across the kernel layers. The data is then transferred over the network, which generally consists of a set of links connected by routers. If the routers have kernels whose software structure is like that described above, a similar (but typically simpler) intramachine data transfer path will apply.
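
The role of the standard I/O layer described above, trading an extra buffer for fewer system calls, can be sketched with Python's `io.BufferedWriter` standing in for a stdio-style layer. The `CountingSink` class and `count_write_calls` helper are hypothetical; each call to the sink's `write` method stands in for one write system call.

```python
import io


class CountingSink(io.RawIOBase):
    """Raw sink that counts write calls; each call models one system call."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.bytes_written += len(b)
        return len(b)


def count_write_calls(records, buffer_size=None):
    """Write all records; return (write_calls, bytes_written) at the sink."""
    sink = CountingSink()
    if buffer_size is None:
        for record in records:  # unbuffered: one call per record
            sink.write(record)
    else:
        writer = io.BufferedWriter(sink, buffer_size=buffer_size)
        for record in records:  # buffered: records are coalesced
            writer.write(record)
        writer.flush()
    return sink.calls, sink.bytes_written


if __name__ == "__main__":
    records = [b"x" * 100] * 1000
    print("unbuffered:", count_write_calls(records))
    print("buffered:  ", count_write_calls(records, buffer_size=64 * 1024))
```

The buffered variant issues far fewer calls to the sink, but every record is first copied into the buffer, which is one of the extra data copies on the path.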
Finally, the data arrives at the client's workstation. There, the data travels in a similar way as was described for the server's workstation: from the network adapter, across the kernel, through the client process's address space, and across the kernel again, finally reaching the sink device.

From this analysis, one can surmise why throughput bottlenecks often occur at the end points of the end-to-end data transfer path, assuming sufficiently fast hardware devices and communication links. At the end points, there may be significant data copying as the data traverses the various software layers, and there is protection-domain crossing (kernel to user to kernel), among other functions. The overheads caused by these functions, directly and indirectly, can be significant.

Consequently, we focused on improving operating system I/O and network software, including optimizations for the four possible process/device data transfer scenarios: process to process, process to device, device to process, and device to device, with special care in addressing cases where either source or sink device is a network adapter. In this paper, we use the term "data transfer problem" to refer to the problem of reducing these overheads to achieve high throughput between a source device and a sink device, either of which can be a network adapter within a single workstation.

Figure 2 An End-to-End Data Path from a Source Device on One Workstation to a Sink Device on Another Workstation
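
The four transfer scenarios and the protection-domain crossings on the server side of the path can be written down as a small model. This is a sketch under assumptions: the layer list follows the description in the text, the specific layer labels (for example, a file system on the disk side and a protocol stack on the network side) are illustrative choices, and none of the names below come from the Sequoia 2000 software.

```python
from itertools import product

# A data source or sink is either a process or a device.
ENDPOINT_KINDS = ("process", "device")


def transfer_scenarios():
    """Enumerate the four source/sink combinations named in the text."""
    return [f"{src} to {dst}" for src, dst in product(ENDPOINT_KINDS, repeat=2)]


# Illustrative server-side path: (layer, protection domain), source to adapter.
SERVER_PATH = [
    ("device layer (disk)", "kernel"),
    ("abstraction layer (file system)", "kernel"),
    ("standard I/O layer", "user"),
    ("application layer", "user"),
    ("abstraction layer (protocol stack)", "kernel"),
    ("device layer (network adapter)", "kernel"),
]


def domain_crossings(path):
    """Count adjacent layer pairs that lie in different protection domains."""
    domains = [domain for _, domain in path]
    return sum(1 for a, b in zip(domains, domains[1:]) if a != b)


if __name__ == "__main__":
    print(transfer_scenarios())
    print("server-side domain crossings:", domain_crossings(SERVER_PATH))
```

In this model the server-side path crosses protection domains twice (kernel to user, then user to kernel), matching the "kernel to user to kernel" sequence described above; a device-to-device transfer that stays in the kernel would avoid both crossings.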
Although the data transfer problem may also exist in intermediate routers, it does so to a much lesser degree than with end-user workstations (assuming modern router software and hardware technology). This is because of a router's simplified execution environment and its reduced needs for transfers across multiple protected domains. However, there is nothing that precludes the application of the techniques discussed in this paper to router software. In fact, since we used general-purpose workstations for routers to