processing system is based on the Alpha 64-bit RISC microprocessor and is designed for fast CPU performance, low memory latency, and high memory and I/O bandwidth. The server's I/O subsystem contributes to the achievement of these goals by implementing several innovative design techniques, primarily in the system bus-to-PCI bus bridge. A partial cache line write technique for small transactions reduces traffic on the system bus and improves memory latency. A design for deadlock-free peer-to-peer transactions across multiple 64-bit PCI bus bridges reduces system bus, PCI bus, and CPU utilization by as much as 70 percent when measured in DIGITAL AlphaServer 4100 MEMORY CHANNEL clusters. Prefetch logic and buffering support very large bursts of data without stalls, yielding a system that can amortize overhead and deliver performance limited only by the PCI devices used in the system.
Samuel H. Duncan
Craig D. Keefer
Thomas A. McLaughlin
The AlphaServer 4100 is a symmetric multiprocessing system based on the Alpha 21164 64-bit RISC microprocessor. This midrange system supports one to four CPUs, one to four 64-bit-wide peer bridges to the peripheral component interconnect (PCI), and one to four logical memory slots. The goals for the AlphaServer 4100 system were fast CPU performance, low memory latency, and high memory and I/O bandwidth. One measure of success in achieving these goals is the AIM benchmark multiprocessor performance results. The AlphaServer 4100 system was audited at 3,337 peak jobs per minute, with a sustained number of 3,018 user loads, and won the AIM Hot Iron price/performance award in October 1996.1
The subject of this paper is the contribution of the I/O subsystem to these high-performance goals. In an in-house test, I/O performance of an AlphaServer 4100 system based on a 300-megahertz (MHz) processor shows a 10 to 19 percent improvement in I/O when compared with a previous-generation midrange Alpha system based on a 350-MHz processor. Reduction in CPU utilization is particularly beneficial for applications that use small transfers, e.g., transaction processing.
I/O Subsystem Goals
The goal for the AlphaServer 4100 I/O subsystem was to increase overall system performance by
• Reducing CPU and system bus utilization for all applications
• Delivering full I/O bandwidth, specifically, a bandwidth limited only by the PCI standard protocol, which is 266 megabytes per second (MB/s) on 64-bit option cards and 133 MB/s on 32-bit option cards
• Minimizing latency for all direct memory access (DMA) and programmed I/O (PIO) transactions
Our discussion focuses on several innovative techniques used in the design of the I/O subsystem 64-bit-wide peer host bus bridges that dramatically reduce CPU and bus utilization and deliver full PCI bandwidth:
• A partial cache line write technique for coherent DMA writes. This technique permits an I/O device to insert data that is smaller than a cache line, or block, into the cache-coherent domain without first obtaining ownership of the cache block and performing a read-modify-write operation. Partial cache line writes reduce traffic on the system bus and improve latency, particularly for messages passed in a MEMORY CHANNEL cluster.2
• Support for device-initiated transactions that target other devices (peers) across multiple (peer) PCI buses. Peer-to-peer transactions reduce system bus utilization, PCI bus utilization, and CPU utilization by as much as 70 percent when measured in MEMORY CHANNEL clusters. In testing, we ran a MEMORY CHANNEL application without peer-to-peer DMA and observed 85 percent CPU utilization; running the same application with peer-to-peer DMA enabled, we observed 15 percent CPU utilization. The peer-to-peer technique is successfully implemented on the AlphaServer 4100 system without causing deadlocks.
• Large bursts of PCI-device-initiated DMA data to or from system memory. I/O subsystem support for large bursts of DMA data enables efficient PCI bus utilization because fixed bus latency can be amortized over these large transactions.
• Prefetched read data and posted write data buffering designed to keep up with the highest performance PCI devices. When used in combination with the PCI delayed-read protocol, the buffering and prefetching approach allows the system to avoid PCI bus stalls introduced by the bridge during PCI-device-initiated transactions.
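The amortization argument in the list above can be made concrete with a small sketch. It assumes a hypothetical fixed overhead per PCI transaction (the text gives no cycle count) and 8 bytes per data cycle on a 64-bit bus; the function name and the overhead value are illustrative only.

```python
def burst_efficiency(burst_bytes: int, overhead_cycles: int,
                     bytes_per_cycle: int = 8) -> float:
    """Fraction of the bus cycles in one transaction that carry data.

    overhead_cycles is a hypothetical fixed cost per transaction
    (arbitration, address phase, target latency); the text gives no figure.
    bytes_per_cycle defaults to 8, the width of a 64-bit PCI bus.
    """
    if burst_bytes <= 0 or overhead_cycles < 0 or bytes_per_cycle <= 0:
        raise ValueError("sizes must be positive and overhead non-negative")
    data_cycles = -(-burst_bytes // bytes_per_cycle)  # ceiling division
    return data_cycles / (data_cycles + overhead_cycles)


if __name__ == "__main__":
    # Assumed overhead of 8 cycles, chosen only for illustration.
    for size in (32, 64, 512, 4096):
        print(f"{size:5d} bytes: {burst_efficiency(size, 8):.3f}")
```

Under this assumed overhead, efficiency rises toward 1.0 as the burst grows, which is the sense in which fixed bus latency is amortized over large transactions.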
The following overview of the system concentrates on the areas in which these techniques are used to enhance performance, that is, efficiency in the system bus and in the PCI bus bridge. In subsequent sections, we describe in greater detail the performance issues, other possible approaches to resolving the issues, and the techniques we developed. We conclude the paper with performance results.
AlphaServer 4100 System Overview
The AlphaServer 4100 system shown in Figure 1 includes four CPUs connected to the system bus, which comprises the data and error correction code (ECC) and the command and address lines. Also connected to the system bus are main memory and a single module with two independent peer PCI bus bridges. The single module, the PCI bridge module, provides the physical and the logical bridge between the system bus and the PCI buses. Each independent peer PCI bus bridge is constructed of a set of three application-specific integrated circuit (ASIC) chips: one control chip and two sliced data path chips.
Digital Technical Journal Vol. 8 No. 4 1996
The two independent PCI bus bridges are the interfaces between the system bus and their respective PCI buses. A PCI bus is 64 or 32 bits wide, transferring data at a peak of 266 MB/s or 133 MB/s, respectively. In the AlphaServer 4100 system, the PCI buses are 64 bits wide.
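The peak figures quoted above follow from the bus width and the bus clock. The sketch below assumes the nominal 33.33-MHz PCI clock, which the text does not state, and one data transfer per clock; the names are illustrative.

```python
# Assumption: nominal 33.33 MHz PCI clock (not stated in the text).
PCI_CLOCK_HZ = 33_333_333


def pci_peak_mb_per_s(bus_width_bits: int, clock_hz: int = PCI_CLOCK_HZ) -> float:
    """Peak transfer rate in MB/s, assuming one data transfer per clock."""
    if bus_width_bits not in (32, 64):
        raise ValueError("PCI buses are 32 or 64 bits wide")
    bytes_per_cycle = bus_width_bits // 8
    return bytes_per_cycle * clock_hz / 1_000_000


if __name__ == "__main__":
    for width in (32, 64):
        print(f"{width}-bit PCI: {pci_peak_mb_per_s(width):.1f} MB/s peak")
```

The results, truncated to whole numbers, match the 133 MB/s and 266 MB/s figures used in the paper.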
The PCI buses connect to a PCI backplane module with a number of expansion slots and a bridge to the Extended Industry Standard Architecture (EISA) bus. In Figure 1, each PCI bus is shown to support up to four devices in option slots.
The AlphaServer 4000 series also supports a configuration in which two of the CPU cards are replaced with two additional independent peer PCI bus bridges. In the quad PCI bus configuration, there are 16 option slots available for PCI devices, at the cost of bounding the system to a maximum of two CPUs and two logical memory slots. This quad PCI bus configuration is shown in Figure 2.
Most of the techniques described in this paper are implemented in the PCI bus bridge. The partial cache line write technique, presented next, is also designed into the protocol on the system bus and into the CPU cards.
Improvements in CPU and System Bus Utilization through Use of Partial Cache Line Writes
Inefficient use of system resources can limit performance on heavily loaded systems. System designers must be attentive to potential performance bottlenecks beyond the commonly addressed CPU speed, cache loop time, and CPU memory latency. Our focus in the I/O subsystem design was to balance system performance in the face of a wide range of I/O device behaviors. We therefore implemented techniques that minimize the load on the PCI bus, the system bus, and the CPUs. The technique described in this section, partial cache line writes, reduces the load on the system bus and improves overall system performance.
Many first- and second-generation PCI controller devices were designed to operate in platforms that support 32-byte cache lines and 16-byte write buffers. It is common for an older PCI device to limit the amount of DMA data it reads or writes to match this characteristic of computers that were on the market at the time those devices were designed. Some classes of devices will, by their nature, always limit the amount of data in a burst transaction.
As do most Alpha platforms, the AlphaServer 4100 system supports a 64-byte cache line that is twice that of other common systems. When a PCI device performs a memory write of less than a complete cache line, the system must merge the data into a cache line while maintaining a consistent (coherent) view of memory for all CPUs on the system bus. This merging of write data into the cache-coherent domain is typically done on the PCI bus bridge, which reads the cache line, merges the new bytes, and writes the cache line back out to memory. The read-modify-write must be performed as an atomic operation to maintain memory consistency. For the duration of the atomic read-modify-write operation, the system bus is busy. Consequently, a write of less than a cache line results in a read-modify-write that takes at least three times as many cycles on the system bus as a simple 64-byte aligned cache line write.
Figure 1
AlphaServer 4100 System with Four CPUs, Two 64-bit Buses
Figure 2
AlphaServer 4000 System with Two CPUs, Four 64-bit Buses
For example, if we had used an earlier DIGITAL implementation of a system bus protocol on the AlphaServer 4100 system, an I/O device operation on the PCI that performed a single 16-byte-aligned memory write would have consumed system bus bandwidth that could have moved 256 bytes of data, or 16 times the amount of data. We therefore had to find a more efficient approach to writing subblocks into the cache-coherent domain.
We first examined opportunities for efficiency gains in the memory system.3 The AlphaServer 4100 memory system interface is 16 bytes wide; a 64-byte cache line read or write takes four cycles on the system bus. The memory modules themselves can be designed to mask one or more of the writes and allow aligned blocks that are multiples of 16 bytes to be written to memory in a single system bus transaction. The problem with permitting a less than complete cache line write, i.e., less than 64 bytes, is that the write goes to main memory, but the only up-to-date/complete copy of a cache line may be in a CPU card's cache.
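The masking idea can be illustrated with a short sketch that maps an aligned sub-line write onto the four 16-byte cycles of a cache line transaction. This is a minimal model under stated assumptions, not the memory module's actual logic: the function name and the list-of-booleans mask representation are hypothetical.

```python
LINE_BYTES = 64   # cache line size, per the text
LANE_BYTES = 16   # width of the memory system interface, per the text


def lane_mask(offset: int, length: int) -> list[bool]:
    """Return which of the four 16-byte bus cycles carry write data.

    offset is the byte offset within the 64-byte line. Only writes that
    are 16-byte aligned and a multiple of 16 bytes long qualify.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be non-negative and length positive")
    if offset % LANE_BYTES or length % LANE_BYTES:
        raise ValueError("write must be an aligned multiple of 16 bytes")
    if offset + length > LINE_BYTES:
        raise ValueError("write must not cross a cache line boundary")
    first = offset // LANE_BYTES
    last = (offset + length) // LANE_BYTES
    return [first <= lane < last for lane in range(LINE_BYTES // LANE_BYTES)]


if __name__ == "__main__":
    # A 32-byte block starting 16 bytes into the line uses the middle two cycles.
    print(lane_mask(16, 32))
```

A cycle whose mask entry is false would be the one the memory module masks, leaving those 16 bytes in memory unchanged.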
To permit the more efficient partial cache line write operations, we modified the system bus cache coherency protocol. When a PCI bus bridge issues a partial cache line write on the system bus, each CPU card performs a cache lookup to see if the target of the write is dirty. In the event that the target cache block is dirty, the CPU signals the PCI bus bridge before the end of the partial write. On dirty partial cache line write transactions, the bridge simply performs a second transaction as a read-modify-write. If the target cache block is not dirty, the operation completes in a single system bus transaction.
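The decision flow just described can be sketched as a small simulation. It models only what the text states (a lookup in every CPU card, a dirty signal, and a fallback read-modify-write); the class and function names are hypothetical, and the handling of the dirty data itself is outside this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class CpuCache:
    """Hypothetical model of one CPU card's cache: the set of dirty line addresses."""
    dirty_lines: set[int] = field(default_factory=set)

    def is_dirty(self, line_addr: int) -> bool:
        return line_addr in self.dirty_lines


def partial_write(line_addr: int, caches: list[CpuCache]) -> list[str]:
    """Return the system bus transactions the bridge issues for one partial write."""
    transactions = ["partial_write"]
    # Every CPU card looks up the target line while the partial write is on the bus.
    if any(cache.is_dirty(line_addr) for cache in caches):
        # A CPU signaled dirty, so the bridge follows up with a read-modify-write.
        transactions.append("read_modify_write")
    return transactions


if __name__ == "__main__":
    caches = [CpuCache(), CpuCache({0x1000})]
    print(partial_write(0x2000, caches))  # clean target
    print(partial_write(0x1000, caches))  # dirty target
```

The common clean case costs one transaction; only the dirty case pays for the second one.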
Address traces taken during product development were simulated to determine the frequency of dirty cache blocks that are targets of DMA writes. Our simulations showed that, for the address trace we used, such blocks were extremely rare. Measurements taken from several applications and benchmarks confirmed that a dirty cache block is almost never asserted with a partial cache line write.
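Why the rarity of dirty targets matters can be seen from a rough expected-cost estimate. The sketch assumes that a clean partial write costs one unit (the same as a simple cache line write) and that the fallback read-modify-write costs three units, the lower bound given earlier in the text; both the cost model and the names are illustrative.

```python
# Assumption: cost units are multiples of one simple 64-byte line write.
RMW_COST = 3.0  # the text's lower bound for a read-modify-write


def expected_bus_cost(p_dirty: float, rmw_cost: float = RMW_COST) -> float:
    """Expected bus cost of one partial write when a fraction p_dirty hit dirty lines.

    Every partial write pays one unit; a dirty hit additionally pays the
    read-modify-write that the bridge issues as a second transaction.
    """
    if not 0.0 <= p_dirty <= 1.0:
        raise ValueError("p_dirty must be a probability")
    return 1.0 + p_dirty * rmw_cost


if __name__ == "__main__":
    for p in (0.0, 0.001, 0.01, 0.5):
        print(f"p_dirty={p}: {expected_bus_cost(p):.3f} units")
```

Under these assumptions the partial write scheme costs less than always performing a read-modify-write whenever fewer than two thirds of the writes hit dirty lines, and with dirty hits being extremely rare the expected cost stays close to one unit.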
The DMA transfer of blocks that are aligned multiples of 16 bytes but less than a cache line is four times more efficient in the 4100 system than in earlier DIGITAL implementations.
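The factor of four is consistent with the earlier 16-byte example, as the following arithmetic sketch shows. It takes the 256-byte figure from the text and assumes, as an illustration only, that a clean partial write occupies the system bus for as long as one full 64-byte line transaction.

```python
USEFUL_BYTES = 16        # size of the example DMA write in the text
EARLIER_BUS_BYTES = 256  # from the text: bus bandwidth that write consumed under the earlier protocol
LINE_BYTES = 64          # assumption: a clean partial write occupies the bus like one full line write


def bus_bytes_per_useful_byte(bus_bytes: int, useful_bytes: int) -> float:
    """Bus bandwidth consumed, in byte equivalents, per byte actually written."""
    if bus_bytes <= 0 or useful_bytes <= 0:
        raise ValueError("byte counts must be positive")
    return bus_bytes / useful_bytes


earlier = bus_bytes_per_useful_byte(EARLIER_BUS_BYTES, USEFUL_BYTES)
partial = bus_bytes_per_useful_byte(LINE_BYTES, USEFUL_BYTES)
improvement = earlier / partial

if __name__ == "__main__":
    print(f"earlier protocol: {earlier:.0f}x, partial write: {partial:.0f}x, "
          f"improvement: {improvement:.0f}x")
```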
Movement of blocks of less than 64 bytes is important to application performance because there are high-performance devices that move less than 64 bytes. One example is DIGITAL's MEMORY CHANNEL adapter, which moves 32-byte blocks in a burst.2 As MEMORY CHANNEL adapters move large numbers of blocks that are all less than a cache line of data, the I/O subsystem partial cache line write feature improves system bus utilization and eliminates the system bus as a bottleneck. Message latency across the fabric of an AlphaServer 4100 MEMORY CHANNEL cluster (version 1.0) is approximately 6 microseconds (µs). There are two DMA writes in the message: the first is a message, and the second is a flag to validate the message. These DMA writes on the target AlphaServer 4100 contribute to message latency. The improvement in latency provided by the partial cache line write feature is approximately 0.5 µs per write. With two writes per message, latency is reduced by approxi-