4. CURRICULAR PROPOSAL
4.2 THEORETICAL FOUNDATION OF THE PROGRAM
For an addition operation, the 32-bit words containing the exponents are sent to the main ALU. There they are passed to the A and B ports, which feed the shifter module. These ports drive all the gate arrays in parallel.
The exponents are then loaded into the XALU and the shift-amount ALU (SALU), which computes the alignment shift amount sent to the shifter. The SALU also generates some 20 branch conditions for the microcode. These conditions indicate the size of the alignment shift and whether any source operand is zero or a reserved operand. They also help to optimize the microcode flow.
The XALU, which selects the larger exponent and saves it for later use, has a 12-bit datapath and a register to hold the exponent. The size of this datapath is sufficient for the F, D, and G formats plus a guard bit for overflow or underflow detection. An ALU is provided to perform arithmetic operations on the exponent. The SALU, with an 11-bit datapath, subtracts the exponents to determine the alignment shift amount, which is always positive. The sign manipulation logic also resides in the SALU.
Next, the fractional part of the smaller operand is aligned by the shifter. This operation involves either one CPU cycle for F format operands or two CPU cycles for the D and G formats. The shifter unit shifts in the floating point format and can do a full 64-bit shift. The logic that determines the round bits is related to the alignment shift operation but is physically located in the priority encoder gate array. This gate array also contains some of the shifter functionality.
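The exponent comparison and alignment steps described above can be sketched in software. This is a minimal illustration, not the hardware design: the function names, the plain-integer operands, and the single "bits lost" flag standing in for the round-bit logic are all assumptions made for the example.

```python
from typing import Tuple


def compare_exponents(exp_a: int, exp_b: int) -> Tuple[int, int, bool]:
    """Return (larger exponent, non-negative shift amount, a_is_larger).

    Hypothetical model: the larger exponent corresponds to what the XALU
    saves, the shift amount to what the SALU sends to the shifter.
    """
    diff = exp_a - exp_b
    a_is_larger = diff >= 0
    larger = exp_a if a_is_larger else exp_b
    shift = diff if a_is_larger else -diff  # always positive or zero
    return larger, shift, a_is_larger


def align(frac_small: int, shift: int, width: int = 64) -> Tuple[int, bool]:
    """Shift the smaller operand's fraction right by `shift` places.

    Returns the aligned fraction and a flag telling whether any non-zero
    bits were shifted out (a simplified stand-in for the round bits).
    """
    if shift >= width:
        return 0, frac_small != 0
    lost = frac_small & ((1 << shift) - 1)
    return frac_small >> shift, lost != 0
```

For example, `compare_exponents(130, 127)` selects 130 as the result exponent and asks for a 3-place alignment shift of the second operand's fraction.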
Floating Point in the VAX 8800 Family
Nine gate arrays are used for the shifter unit. Of those, eight make up the datapath; the ninth is the control device. The shifter can accept either a 64-bit operand on the A and B ports or a 32-bit operand on either port. The shifter generates a 32-bit result that can be either the high-order or the low-order part of the answer. The shifter datapath gate arrays are identical: each effectively constitutes a byte slice of the design and performs a bit shift of up to seven places. Byte shifting is then performed by sending the correct shifter output to the correct byte position. This operation is facilitated by having all the outputs wired to the OR gates at all possible byte positions and by enabling the correct output.
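The two-stage shift described above (a bit shift of up to seven places, followed by steering bytes to the right position) can be sketched as follows. This is a word-level model under stated assumptions: the function name is hypothetical, only a logical right shift is shown, and the bit shift is applied to the whole word rather than slice by slice with neighbor inputs as real byte-slice hardware would need.

```python
MASK64 = (1 << 64) - 1


def sliced_shift_right(value: int, amount: int) -> int:
    """Logical right shift of a 64-bit value, done in two stages."""
    if not 0 <= amount < 64:
        raise ValueError("amount must be in 0..63")
    bit_shift = amount % 8    # stage 1: bit shift of 0..7 places
    byte_shift = amount // 8  # stage 2: byte steering
    stage1 = (value & MASK64) >> bit_shift
    result = 0
    for src in range(8):
        dst = src - byte_shift
        if dst >= 0:
            byte = (stage1 >> (8 * src)) & 0xFF
            # Models enabling one source onto the OR-ed output at byte `dst`.
            result |= byte << (8 * dst)
    return result
```

Splitting the shift amount this way keeps each slice small: a slice only ever shifts by 0 to 7 bits, and every larger distance is handled by choosing which output drives each byte position.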
The shifter performs floating point, integer, and logical shifts, as well as a number of miscellaneous functions. These include conversions from decimal-format data into integer format and vice versa. The masking of the exponent field and the insertion of the hidden bit are also done by the shifter.
After the alignment shift, the output of the shifter is directed to the main ALU on the bypass bus. There, the output is added to or subtracted from the fraction of the larger operand. The output of the ALU operation is now ready to be normalized in the shifter. In most cases a small normalize shift of at most one bit position left or right will be sufficient. The specialized hardware in the shifter handles this case and then rounds the result. Should a larger shift be required, then microcode will first direct the ALU result to the priority encoder gate array. There, the position of the leading 1 is found and used to determine the normalize amount for the subsequent cycle.
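The priority-encoder step, finding the leading 1 and deriving a normalize amount from it, can be illustrated with a short sketch. The function names and the convention that a normalized fraction has its leading 1 in the top bit of a `width`-bit field are assumptions for the example, not details taken from the hardware.

```python
from typing import Optional, Tuple


def leading_one_position(value: int, width: int = 64) -> Optional[int]:
    """Return the bit index of the most significant 1, or None if zero."""
    value &= (1 << width) - 1
    if value == 0:
        return None
    return value.bit_length() - 1


def normalize(frac: int, width: int = 64) -> Tuple[int, int]:
    """Left-shift `frac` until its leading 1 reaches the top bit.

    Returns (normalized fraction, shift amount). A zero input is returned
    unchanged with a shift of 0.
    """
    pos = leading_one_position(frac, width)
    if pos is None:
        return 0, 0
    shift = (width - 1) - pos
    return (frac << shift) & ((1 << width) - 1), shift
```

In a full floating point add, the returned shift amount would also be subtracted from the result exponent; that bookkeeping is left out here.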
The rounding operation in the VAX 8800 CPU is unusual in that it is limited to the low-order eight bits. Therefore, a small 8-bit adder can be used for this operation. This adder is both faster and cheaper than the usual method of using a full 64-bit adder. The 8-bit adder is also sufficient to calculate the correct answer in over 99.5 percent of the addition operations. Should a carry-out be generated by this 8-bit rounding add, then clearly the result created is incorrect. In that case the computer is trapped and microcode invoked to correct the result.
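A small sketch can make the fast-path and trap-path split concrete. Everything here is illustrative: the function names are hypothetical, the rounding increment is passed in as a plain integer, and the full-width addition in `round_with_fixup` merely stands in for whatever the microcode actually does after the trap.

```python
from typing import Tuple


def round_low_byte(frac: int, increment: int) -> Tuple[int, bool]:
    """Add `increment` to the low-order 8 bits only.

    Returns (result, carry_out). When carry_out is True the result is
    wrong, because the carry was not propagated into the upper bits.
    """
    total = (frac & 0xFF) + (increment & 0xFF)
    carry_out = total > 0xFF
    rounded = (frac & ~0xFF) | (total & 0xFF)
    return rounded, carry_out


def round_with_fixup(frac: int, increment: int) -> int:
    """Fast 8-bit path, with a full-width add as the slow correction path."""
    rounded, carry_out = round_low_byte(frac, increment)
    if carry_out:
        return frac + increment  # stand-in for the microcode correction
    return rounded
```

Under the simplifying assumptions of a unit increment and uniformly distributed low-order bytes, only 1 of the 256 possible byte values produces a carry-out, which is consistent with the figure of over 99.5 percent quoted above.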
Multiplication
As mentioned earlier, the 8800 contains a high-performance, custom-designed multiplier and divider unit. A number of factors impelled us to use such a unit. First, multiplication is a very frequent operation that is used extensively in matrix manipulation. For example, in the LINPACK benchmark, the time-critical routine contains an even mix of addition and multiplication operations.
Second, it was not possible to succumb to the temptation of using the main ALU to provide the division operation. This desire was natural since division is an infrequent operation, and the use of an ALU in a repeated subtract and shift mode was appealing. For example, the VAX 8600 uses the ALU for just that purpose. In the 8800 the main ALU also computes the virtual address. Since this datapath is very time-critical (in the 8800 as well as in most other computer designs), it cannot be allowed to go any slower. Including an extra path to accommodate division would have slowed down this critical path by around 5 ns, resulting in a 10 percent performance degradation for all operations.
Moreover, the available space for the multiplier and divider unit was limited since floating point operations are integrated with the rest of the machine. Approximately one-third of a module (12 inches by 16 inches) was available. In contrast, the VAX 8650 CPU dedicates a full module to multiplication.
The custom design of the multiplier and divider unit is basically a byte slice of a large word-sized multiplier and divider unit. The multiplier handles 8 bits per cycle; the divider handles 1 bit. Figure 4 shows the complete 56-bit by 8-bit multiplier with its eight byte-slice custom chips. Eight chips are used to form the required word size of 64 bits (56 data bits plus 8 guard bits). This arrangement is sufficient to handle F, D, and G format operations. H format operations are performed by partitioning the problem into many smaller 56-bit multiplications under microcode control.
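The text does not say how the microcode partitions an H format multiply, so the following is only a generic sketch of the idea: split each wide operand into 56-bit pieces, multiply the pieces pairwise, and sum the partial products at the appropriate offsets. The names `split_limbs` and `partitioned_multiply`, and the default of three pieces per operand, are assumptions for illustration.

```python
from typing import List

LIMB_BITS = 56
LIMB_MASK = (1 << LIMB_BITS) - 1


def split_limbs(value: int, count: int) -> List[int]:
    """Split `value` into `count` 56-bit pieces, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def partitioned_multiply(a: int, b: int, limbs: int = 3) -> int:
    """Multiply two wide unsigned fractions using only 56-bit by 56-bit products.

    Operands must fit in `limbs` * 56 bits.
    """
    product = 0
    for i, x in enumerate(split_limbs(a, limbs)):
        for j, y in enumerate(split_limbs(b, limbs)):
            # Each x * y is one "smaller 56-bit multiplication".
            product += (x * y) << (LIMB_BITS * (i + j))
    return product
```

Three 56-bit pieces cover operands of up to 168 bits, so the sketch has headroom for fractions wider than the 56-bit data path.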
The multiplicand is loaded into the MD latch after passing through the mask logic, which clears the sign and the exponent field and inserts the hidden bit. The PR latch and the PRGB are cleared at the start of the multiply. The PRGB contains the guard bits for the PR latch. At the end of a multiply, this latch will hold the bits required for a possible normalization shift and also for a rounding operation. The least significant eight bits of the multiplier are loaded into the multiplier latch. The first multiply cycle is now ready to be performed.
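As an illustration of what the mask logic accomplishes, the sketch below extracts the fraction of a VAX F_floating operand and inserts the hidden bit. It assumes the standard F_floating longword layout (sign in bit 15, exponent in bits 14:7, high-order fraction in bits 6:0, low-order fraction in bits 31:16). The function name and the 24-bit output are hypothetical, and zero or reserved operands are not handled.

```python
def mask_f_floating(longword: int) -> int:
    """Return the 24-bit fraction (hidden bit set) of an F_floating longword.

    The sign and exponent fields are dropped, mirroring the "clear sign and
    exponent, insert hidden bit" step. Assumes a non-zero, non-reserved
    operand.
    """
    high7 = longword & 0x7F            # bits 6:0: most significant stored bits
    low16 = (longword >> 16) & 0xFFFF  # bits 31:16: least significant bits
    stored = (high7 << 16) | low16     # 23 stored fraction bits
    return (1 << 23) | stored          # insert the hidden bit above them
```

For instance, the F_floating encoding of 1.0 is the longword 0x00004080, whose stored fraction is zero, so the masked value is just the hidden bit, 0x800000.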
A 56-bit by 8-bit multiplication is performed between the contents of the MD and multiplier latches. The result is then added to the contents of the PR latch (which is initially zero) and then written back into it with a right shift of 8 bits. The PR latch is thus an accumulating latch and contains the 64-bit partial product of each multiplication operation. The next 8 bits of the multiplier are loaded into the multiplier latch, ready for the next cycle. This cycling continues until the multiplicand has been multiplied by all the multiplier bytes. This algorithm is similar to the one used in the VAX 8650 scheme, except that that processor has a narrower datapath of 32 bits.
Digital Technical Journal No. 4 February 1987
Figure 4 Multiplier and Divider Unit (figure labels: multiplicand input, multiplier input, 8-bit shift, PRGB, 64 bits, Booth recode, multiplier output)
Notice that the least significant byte of the partial product is discarded after each cycle and absorbed by the shifter unit. These bytes are required only for the H format multiply.
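The byte-at-a-time multiply cycle can be modelled in a few lines. This is a sketch under stated assumptions: operands are plain unsigned 56-bit fractions, the 8-bit step is an ordinary multiplication rather than the Booth recoding indicated in Figure 4, the PRGB guard bits are not modelled separately, and the function name and return shape are hypothetical.

```python
from typing import List, Tuple

MASK56 = (1 << 56) - 1


def bytewise_multiply(multiplicand: int, multiplier: int,
                      cycles: int = 7) -> Tuple[int, List[int]]:
    """Multiply 8 multiplier bits per cycle, accumulating in `pr`.

    Returns (pr, discarded): `pr` holds the high-order part of the product
    and `discarded` lists the low byte dropped on each cycle, first cycle
    first.
    """
    md = multiplicand & MASK56       # contents of the MD latch
    pr = 0                           # PR latch, cleared at the start
    discarded = []
    for cycle in range(cycles):
        mbyte = (multiplier >> (8 * cycle)) & 0xFF  # next multiplier byte
        total = pr + md * mbyte      # 56-bit by 8-bit product plus PR; fits in 64 bits
        discarded.append(total & 0xFF)
        pr = total >> 8              # write back with a right shift of 8 bits
    return pr, discarded
```

With seven cycles and a 56-bit multiplier, `pr` ends up holding the upper 56 bits of the 112-bit product, while the discarded bytes, reassembled in order, form the lower 56 bits.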
Once completed, the result is sent out