10. Modular Bootstrapping Transit Architecture (MBTA)
The Modular Bootstrapping Transit Architecture (MBTA) is a series of small multiprocessors based around multistage routing networks composed of RN1 or METRO routing components. MBTA integrates a number of minimal processing nodes with a multistage network organized as described in Section 3.5.
10.1 Architecture
Figure 10.1 shows the network used for a 64-processor MBTA machine. Each processing node has two network inputs and two network outputs ($n_i = n_o = 2$) for fault tolerance. The network shown is composed of RN1-style routing components and uses the dilation-1 router configuration in the final stage so that two different routing components may provide network outputs from the network to each processing node. Since RN1 is a radix-4 routing component, the network is composed of $\log_4(64) = 3$ routing stages.
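The stage count quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the MBTA design itself; the function name `routing_stages` is hypothetical, and it assumes each stage multiplies the number of reachable endpoints by the router radix.

```python
def routing_stages(num_nodes: int, radix: int) -> int:
    """Smallest number of radix-`radix` stages whose fan-out reaches `num_nodes`.

    Assumption (for illustration only): every stage multiplies the number
    of reachable endpoints by `radix`, so we need ceil(log_radix(num_nodes)).
    Integer arithmetic avoids floating-point log rounding issues.
    """
    if num_nodes < 1 or radix < 2:
        raise ValueError("need num_nodes >= 1 and radix >= 2")
    stages = 0
    reach = 1
    while reach < num_nodes:
        reach *= radix
        stages += 1
    return stages


if __name__ == "__main__":
    # 64 processors with radix-4 RN1 components
    print(routing_stages(64, 4))
```

For the 64-processor machine with radix-4 components, this gives the three stages stated in the text.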
Figure 10.2 shows the architecture of the MBTA processing nodes. Each node is composed of a RISC microprocessor (e.g. Intel's 80960CA [MB88] [Int89]), fast, static memory, network interfaces, and support logic. Four logical network interfaces service the two connections into and the two connections out of the network. The processor performs computation, initiates network communications, and services non-primitive network operations. The processor is also responsible for the highest levels of MRP-ENDPOINT, which are not handled by the network interface. A single, high-speed memory bank serves to hold instructions and data for the processor, store data coming from and going to the network, and store connection status information. The basic node architecture also has provisions to support co-processors and alternate forms of memory. In order to interface MBTA machines with existing computers and data networks, there are provisions for some nodes to accommodate external interfaces.
10.2 Performance
The MBTA architecture has been balanced to support byte-wide network connections running at 100 MHz. The network interfaces send data from the fast, static memory and receive data into the memory, as well. Consequently, each network interface requires 100 megabytes/second (100 MB/s) of bandwidth into memory during sustained data transfers. The processor is running at 25 MHz and may read up to one word, or four bytes, per cycle during burst memory operations.
To prevent the processor from stalling, it, too, needs 100 MB/s of bandwidth into memory. To run all network interfaces and the processor simultaneously at full speed, we would need 500 MB/s of bandwidth into memory (four network interfaces plus the processor, at 100 MB/s each). To simplify the problem, we restrict operation so that only one network input may be feeding data into the network at a time. This restriction limits the contention in the network while giving us the fault tolerance benefits of having two connections into the network.
Figure 10.1: MBTA Routing Network
Shown here is the network for a 64-processor MBTA machine composed of RN1 routing components.
[Figure 10.2 diagram. Labels recoverable from the original: Memory (SRAM, DRAM), i960, Co-Proc, BCTL, Net In 0, Net In 1, Net Out 0, Net Out 1, Address Bus (30), Data Bus (64), Network Ports (byte wide + control), External Interface (optional).]
Figure 10.2: MBTA Node Architecture
Shown above is the architecture for each MBTA node. The units outside of the dotted box are common to all MBTA nodes. Within an MBTA machine, a few nodes would support external interfaces.
To provide the 400 MB/s of bandwidth required, we use 64-bit wide, 20 ns, synchronous SRAM for the high-speed memory on a pipelined bus. Each of the four units using the memory gets the opportunity to read or write one 8-byte value to or from memory every 80 ns. This allows each unit to sustain 100 MB/s data transfers without internal buffering as long as data can be transferred as contiguous double-words.
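The memory budget above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the constant and function names are hypothetical, megabytes are taken as decimal (10^6 bytes), and the four units are assumed to take turns on the bus in strict rotation, which is consistent with the 80 ns figure but is not spelled out in the text.

```python
BUS_WIDTH_BYTES = 8   # 64-bit wide SRAM data path
BUS_CYCLE_NS = 20     # one pipelined bus cycle
MEMORY_UNITS = 4      # units sharing the memory under the one-input restriction


def mb_per_s(bytes_moved: int, interval_ns: int) -> float:
    """Convert 'bytes per interval in ns' to decimal MB/s (1 byte/ns = 1000 MB/s)."""
    return bytes_moved * 1000 / interval_ns


# Total bandwidth the memory can supply: one 8-byte transfer every 20 ns.
memory_bandwidth = mb_per_s(BUS_WIDTH_BYTES, BUS_CYCLE_NS)

# Assumed strict rotation: each unit gets one bus cycle out of every four.
slot_period_ns = MEMORY_UNITS * BUS_CYCLE_NS
per_unit_bandwidth = mb_per_s(BUS_WIDTH_BYTES, slot_period_ns)

if __name__ == "__main__":
    print(memory_bandwidth, slot_period_ns, per_unit_bandwidth)
```

Under these assumptions the memory supplies 400 MB/s in total, and each unit's turn comes around every 80 ns, giving it 100 MB/s.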
If all nodes are busy sending data at 100 MB/s, a 64-processor network, like the one shown in Figure 10.1, can support a peak bandwidth of 6,400 MB/s = 6.4 GB/s. With one network input and both network outputs in operation, a single node can simultaneously transfer up to 300 MB/s.
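The peak figures in this paragraph follow directly from the per-connection rate. The sketch below restates that arithmetic; the function names are hypothetical, and the default of one active network input and two active network outputs reflects the operating restriction described earlier.

```python
LINK_MB_PER_S = 100  # one byte-wide connection running at 100 MHz


def peak_network_bandwidth(nodes: int, link: int = LINK_MB_PER_S) -> int:
    """Aggregate MB/s when every node is sending on one connection."""
    return nodes * link


def peak_node_bandwidth(active_inputs: int = 1,
                        active_outputs: int = 2,
                        link: int = LINK_MB_PER_S) -> int:
    """MB/s moved by one node across all of its active connections."""
    return (active_inputs + active_outputs) * link


if __name__ == "__main__":
    print(peak_network_bandwidth(64))  # whole 64-processor machine
    print(peak_node_bandwidth())       # one input plus both outputs
```

Passing a different `link` value (for example 200) shows how these figures would scale if the connection rate changed.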
With the METRO component described in Section 9.2 running at 100 MHz, it takes one cycle to traverse each router and one cycle to traverse each wire in the network. The unloaded latency through the network, $T_{unloaded}$, is 70 ns, arising from 10 ns of latency through each of the three routing components in any path through the network and 10 ns of latency through each of the four chip crossings between network endpoints. If our technology projections for METRO hold, we could implement a version with RN1-style pipelining and cut this latency in half. Alternatively, if we could cycle the pipelined memory bus twice as fast or increase the memory width to 128 bits and require 16-byte data transfers, we could support 200 MB/s network connections using the METRO router at full speed. This would cut the unloaded network latency in half to 35 ns. This change would also double the bandwidth figures above and cut the transmission time, $T_{transmit}$, in half. For this size of a network, the total time to communicate a message from one node to another,