
processing system is based on the Alpha 64-bit RISC microprocessor and is designed for fast CPU performance, low memory latency, and high memory and I/O bandwidth. The server's I/O subsystem contributes to the achievement of these goals by implementing several innovative design techniques, primarily in the system bus-to-PCI bus bridge. A partial cache line write technique for small transactions reduces traffic on the system bus and improves memory latency. A design for deadlock-free peer-to-peer transactions across multiple 64-bit PCI bus bridges reduces system bus, PCI bus, and CPU utilization by as much as 70 percent when measured in DIGITAL AlphaServer 4100 MEMORY CHANNEL clusters. Prefetch logic and buffering support very large bursts of data without stalls, yielding a system that can amortize overhead and deliver performance limited only by the PCI devices used in the system.


Samuel H. Duncan

Craig D. Keefer

Thomas A. McLaughlin

The AlphaServer 4100 is a symmetric multiprocessing system based on the Alpha 21164 64-bit RISC microprocessor. This midrange system supports one to four CPUs, one to four 64-bit-wide peer bridges to the peripheral component interconnect (PCI), and one to four logical memory slots. The goals for the AlphaServer 4100 system were fast CPU performance, low memory latency, and high memory and I/O bandwidth. One measure of success in achieving these goals is the AIM benchmark multiprocessor performance results. The AlphaServer 4100 system was audited at 3,337 peak jobs per minute, with a sustained number of 3,018 user loads, and won the AIM Hot Iron price/performance award in October 1996.[1] The subject of this paper is the contribution of the I/O subsystem to these high-performance goals. In an in-house test, I/O performance of an AlphaServer 4100 system based on a 300-megahertz (MHz) processor shows a 10 to 19 percent improvement in I/O when compared with a previous-generation midrange Alpha system based on a 350-MHz processor. Reduction in CPU utilization is particularly beneficial for applications that use small transfers, e.g., transaction processing.

I/O Subsystem Goals

The goal for the AlphaServer 4100 I/O subsystem was to increase overall system performance by

• Reducing CPU and system bus utilization for all applications

• Delivering full I/O bandwidth, specifically, a bandwidth limited only by the PCI standard protocol, which is 266 megabytes per second (MB/s) on 64-bit option cards and 133 MB/s on 32-bit option cards

• Minimizing latency for all direct memory access (DMA) and programmed I/O (PIO) transactions

Our discussion focuses on several innovative techniques used in the design of the I/O subsystem 64-bit-wide peer host bus bridges that dramatically reduce CPU and bus utilization and deliver full PCI bandwidth:

• A partial cache line write technique for coherent DMA writes. This technique permits an I/O device to insert data that is smaller than a cache line, or block, into the cache-coherent domain without first obtaining ownership of the cache block and performing a read-modify-write operation. Partial cache line writes reduce traffic on the system bus and improve latency, particularly for messages passed in a MEMORY CHANNEL cluster.[2]

• Support for device-initiated transactions that target other devices (peers) across multiple (peer) PCI buses. Peer-to-peer transactions reduce system bus utilization, PCI bus utilization, and CPU utilization by as much as 70 percent when measured in MEMORY CHANNEL clusters. In testing, we ran a MEMORY CHANNEL application without peer-to-peer DMA and observed 85 percent CPU utilization; running the same application with peer-to-peer DMA enabled, we observed 15 percent CPU utilization. The peer-to-peer technique is successfully implemented on the AlphaServer 4100 system without causing deadlocks.

• Large bursts of PCI-device-initiated DMA data to or from system memory. I/O subsystem support for large bursts of DMA data enables efficient PCI bus utilization because fixed bus latency can be amortized over these large transactions.

• Prefetched read data and posted write data buffering designed to keep up with the highest performance PCI devices. When used in combination with the PCI delayed-read protocol, the buffering and prefetching approach allows the system to avoid PCI bus stalls introduced by the bridge during PCI-device-initiated transactions.
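The amortization argument in the large-burst item above can be illustrated with a small model. This is a minimal sketch, not the authors' analysis: it assumes each transaction pays a fixed number of overhead cycles and then moves one data word per cycle. The function name `bus_efficiency` and the value of `HYPOTHETICAL_OVERHEAD_CYCLES` are illustrative assumptions and do not come from the paper.

```python
# Illustrative model: fraction of bus cycles that carry data when a fixed
# per-transaction overhead is spread over a burst of a given length.
# ASSUMPTION: the overhead value below is hypothetical, not a measured figure.
HYPOTHETICAL_OVERHEAD_CYCLES = 8


def bus_efficiency(data_cycles: int, overhead_cycles: int) -> float:
    """Return data cycles divided by total cycles for one transaction."""
    if data_cycles <= 0 or overhead_cycles < 0:
        raise ValueError("data_cycles must be > 0 and overhead_cycles >= 0")
    return data_cycles / (data_cycles + overhead_cycles)


if __name__ == "__main__":
    for burst in (4, 16, 64, 256):
        eff = bus_efficiency(burst, HYPOTHETICAL_OVERHEAD_CYCLES)
        print(f"burst of {burst:3d} data cycles: {eff:.0%} of cycles carry data")
```

Under this model the efficiency approaches 100 percent as the burst grows, whatever the fixed overhead happens to be, which is the sense in which the overhead is amortized.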

The following overview of the system concentrates on the areas in which these techniques are used to enhance performance, that is, efficiency in the system bus and in the PCI bus bridge. In subsequent sections, we describe in greater detail the performance issues, other possible approaches to resolving the issues, and the techniques we developed. We conclude the paper with performance results.

AlphaServer 4100 System Overview

The AlphaServer 4100 system shown in Figure 1 includes four CPUs connected to the system bus, which comprises the data and error correction code (ECC) and the command and address lines. Also connected to the system bus are main memory and a single module with two independent peer PCI bus bridges. The single module, the PCI bridge module, provides the physical and the logical bridge between the system bus and the PCI buses. Each independent peer PCI bus bridge is constructed of a set of three application-specific integrated circuit (ASIC) chips, one control chip, and two sliced data path chips.

Digital Technical Journal Vol. 8 No. 4 1996

The two independent PCI bus bridges are the interfaces between the system bus and their respective PCI buses. A PCI bus is 64 or 32 bits wide, transferring data at a peak of 266 MB/s or 133 MB/s, respectively. In the AlphaServer 4100 system, the PCI buses are 64 bits wide.
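The peak figures quoted above follow from bus width multiplied by clock rate. The sketch below reproduces them, assuming the standard 33.33-MHz PCI clock and one transfer per clock; the text does not state the clock rate, so `PCI_CLOCK_HZ` is an assumption, and `pci_peak_mb_per_s` is a name chosen for illustration.

```python
# Peak PCI bandwidth = bytes per transfer x transfers per second.
# ASSUMPTION: 33.33 MHz PCI clock, one data transfer per clock cycle.
PCI_CLOCK_HZ = 100_000_000 / 3


def pci_peak_mb_per_s(bus_width_bits: int, clock_hz: float = PCI_CLOCK_HZ) -> float:
    """Return the peak transfer rate in MB/s, with 1 MB = 10**6 bytes."""
    if bus_width_bits % 8:
        raise ValueError("bus width must be a whole number of bytes")
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_hz / 1_000_000


if __name__ == "__main__":
    for width in (32, 64):
        print(f"{width}-bit PCI bus: {pci_peak_mb_per_s(width):.1f} MB/s peak")
```

With these assumptions the results are about 133.3 and 266.7 MB/s, which match the 133 and 266 MB/s in the text once the fractional part is dropped.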

The PCI buses connect to a PCI backplane module with a number of expansion slots and a bridge to the Extended Industry Standard Architecture (EISA) bus. In Figure 1, each PCI bus is shown to support up to four devices in option slots.

The AlphaServer 4000 series also supports a configuration in which two of the CPU cards are replaced with two additional independent peer PCI bus bridges. In the quad PCI bus configuration, there are 16 option slots available for PCI devices, at the cost of bounding the system to a maximum of two CPUs and two logical memory slots. This quad PCI bus configuration is shown in Figure 2.

Most of the techniques described in this paper are implemented in the PCI bus bridge. The partial cache line write technique, presented next, is also designed into the protocol on the system bus and into the CPU cards.

Improvements in CPU and System Bus Utilization through Use of Partial Cache Line Writes

Inefficient use of system resources can limit performance on heavily loaded systems. System designers must be attentive to potential performance bottlenecks beyond the commonly addressed CPU speed, cache loop time, and CPU memory latency. Our focus in the I/O subsystem design was to balance system performance in the face of a wide range of I/O device behaviors. We therefore implemented techniques that minimize the load on the PCI bus, the system bus, and the CPUs. The technique described in this section, partial cache line writes, reduces the load on the system bus and improves overall system performance.

Many first- and second-generation PCI controller devices were designed to operate in platforms that support 32-byte cache lines and 16-byte write buffers. It is common for an older PCI device to limit the amount of DMA data it reads or writes to match this characteristic of computers that were on the market at the time those devices were designed. Some classes of devices will, by their nature, always limit the amount of data in a burst transaction.

Figure 1

AlphaServer 4100 System with Four CPUs, Two 64-bit Buses

[Block diagram: four CPU cards and memory on the system bus (command/address, data and ECC); a PCI bridge module with two PCI bus bridges; a PCI backplane module with standard I/O ports, one dedicated PCI slot, and three shared PCI/EISA slots.]

Figure 2

AlphaServer 4000 System with Two CPUs, Four 64-bit Buses

[Block diagram: two CPU cards and memory on the system bus (command/address, data and ECC); two PCI bridge modules, each with two PCI bus bridges; a PCI backplane module with standard I/O ports.]

As do most Alpha platforms, the AlphaServer 4100 system supports a 64-byte cache line that is twice that of other common systems. When a PCI device performs a memory write of less than a complete cache line, the system must merge the data into a cache line while maintaining a consistent (coherent) view of memory for all CPUs on the system bus. This merging of write data into the cache-coherent domain is typically done on the PCI bus bridge, which reads the cache line, merges the new bytes, and writes the cache line back out to memory. The read-modify-write must be performed as an atomic operation to maintain memory consistency. For the duration of the atomic read-modify-write operation, the system bus is busy. Consequently, a write of less than a cache line results in a read-modify-write that takes at least three times as many cycles on the system bus as a simple 64-byte aligned cache line write.

For example, if we had used an earlier DIGITAL implementation of a system bus protocol on the AlphaServer 4100 system, an I/O device operation on the PCI that performed a single 16-byte-aligned memory write would have consumed system bus bandwidth that could have moved 256 bytes of data, or 16 times the amount of data. We therefore had to find a more efficient approach to writing subblocks into the cache-coherent domain.

We first examined opportunities for efficiency gains in the memory system.[3] The AlphaServer 4100 memory system interface is 16 bytes wide; a 64-byte cache line read or write takes four cycles on the system bus. The memory modules themselves can be designed to mask one or more of the writes and allow aligned blocks that are multiples of 16 bytes to be written to memory in a single system bus transaction. The problem with permitting a less than complete cache line write, i.e., less than 64 bytes, is that the write goes to main memory, but the only up-to-date, complete copy of a cache line may be in a CPU card's cache.
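The masking idea can be pictured as one enable bit per 16-byte chunk of the cache line. The following is a minimal sketch under that assumption; the bit encoding and the name `chunk_write_mask` are hypothetical and are not taken from the AlphaServer 4100 memory design. Only the 64-byte line and the 16-byte interface width come from the text.

```python
CACHE_LINE_BYTES = 64   # cache line size, from the text
BUS_WIDTH_BYTES = 16    # memory system interface width, from the text
CYCLES_PER_LINE = CACHE_LINE_BYTES // BUS_WIDTH_BYTES  # four data cycles per line


def chunk_write_mask(offset: int, length: int) -> int:
    """Return a 4-bit mask; bit i set means 16-byte chunk i is written.

    ASSUMPTION: this encoding is illustrative, not the real module interface.
    """
    if offset % BUS_WIDTH_BYTES or length % BUS_WIDTH_BYTES:
        raise ValueError("offset and length must be multiples of 16 bytes")
    if length <= 0 or offset < 0 or offset + length > CACHE_LINE_BYTES:
        raise ValueError("write must lie within a single 64-byte cache line")
    first_chunk = offset // BUS_WIDTH_BYTES
    chunk_count = length // BUS_WIDTH_BYTES
    return ((1 << chunk_count) - 1) << first_chunk


if __name__ == "__main__":
    # A 32-byte write starting 16 bytes into the line enables chunks 1 and 2.
    print(format(chunk_write_mask(16, 32), "04b"))
```

A full-line write sets all four bits, while a partial write leaves the masked chunks of main memory untouched, which is why the coherence question raised above has to be settled before such writes are safe.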

To permit the more efficient partial cache line write operations, we modified the system bus cache coherency protocol. When a PCI bus bridge issues a partial cache line write on the system bus, each CPU card performs a cache lookup to see if the target of the write is dirty. In the event that the target cache block is dirty, the CPU signals the PCI bus bridge before the end of the partial write. On dirty partial cache line write transactions, the bridge simply performs a second transaction as a read-modify-write. If the target cache block is not dirty, the operation completes in a single system bus transaction.
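The decision flow just described can be sketched as a toy simulation. This is an illustrative model only, not the hardware protocol: the classes `CpuCard` and `PciBridge`, the dictionary-based memory, and the way the read-modify-write obtains the dirty line from the owning cache are all assumptions made for the example.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64    # cache line size, from the text
CHUNK_BYTES = 16   # system bus and memory interface width, from the text


@dataclass
class CpuCard:
    """Toy write-back cache: line address -> dirty 64-byte bytearray."""
    dirty_lines: dict = field(default_factory=dict)

    def is_dirty(self, line_addr: int) -> bool:
        # Stands in for the cache lookup each CPU card performs.
        return line_addr in self.dirty_lines


class PciBridge:
    """Hypothetical bridge model that counts system bus transactions."""

    def __init__(self, memory: dict, cpus: list):
        self.memory = memory          # line address -> bytearray(64)
        self.cpus = cpus
        self.bus_transactions = 0

    def partial_write(self, addr: int, data: bytes) -> str:
        offset = addr % LINE_BYTES
        line_addr = addr - offset
        if addr % CHUNK_BYTES or len(data) % CHUNK_BYTES or not data:
            raise ValueError("write must be a non-empty aligned multiple of 16 bytes")
        if offset + len(data) > LINE_BYTES:
            raise ValueError("write must not cross a cache line boundary")

        # First transaction: the partial write is issued and every CPU looks up the line.
        self.bus_transactions += 1
        owners = [cpu for cpu in self.cpus if cpu.is_dirty(line_addr)]
        if not owners:
            # Clean case: the write completes in this single transaction.
            line = self.memory.setdefault(line_addr, bytearray(LINE_BYTES))
            line[offset:offset + len(data)] = data
            return "single transaction"

        # Dirty case: a second transaction performs the read-modify-write.
        # ASSUMPTION: the up-to-date line comes from the owning cache, which gives it up.
        self.bus_transactions += 1
        line = owners[0].dirty_lines.pop(line_addr)
        line[offset:offset + len(data)] = data
        self.memory[line_addr] = line
        return "read-modify-write"
```

For example, a 16-byte write to a line that no CPU holds dirty costs one counted transaction, while the same write to a line present in some `CpuCard.dirty_lines` costs two and leaves the merged line in memory.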

Address traces taken during product development were simulated to determine the frequency of dirty cache blocks that are targets of DMA writes. Our simulations showed that, for the address trace we used, such blocks were extremely rare. Measurements taken from several applications and benchmarks confirmed that a dirty cache block is almost never asserted with a partial cache line write.

The DMA transfer of blocks that are aligned multiples of 16 bytes but less than a cache line is four times more efficient in the 4100 system than in earlier DIGITAL implementations.
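One way to see why rare dirty hits matter is to compute an expected bus cost per partial write. The sketch below is a rough model under stated assumptions: it takes the 256 bytes of displaced bandwidth quoted earlier as the cost of a read-modify-write, takes the factor of four from the text to get the clean-case cost, and assumes that a dirty hit adds a full read-modify-write on top of the first transaction. The function `expected_cost_bytes` and any dirty probability passed to it are illustrative, not measured values.

```python
OLD_COST_BYTES = 256   # from the text: bandwidth displaced by a 16-byte write under the earlier protocol
EFFICIENCY_GAIN = 4    # from the text: partial writes are four times more efficient
CLEAN_COST_BYTES = OLD_COST_BYTES / EFFICIENCY_GAIN  # derived cost of a clean partial write


def expected_cost_bytes(p_dirty: float) -> float:
    """Expected displaced bandwidth per partial write for a given dirty-hit probability.

    ASSUMPTION: a dirty hit costs the clean attempt plus one full read-modify-write.
    """
    if not 0.0 <= p_dirty <= 1.0:
        raise ValueError("p_dirty must be between 0 and 1")
    return CLEAN_COST_BYTES + p_dirty * OLD_COST_BYTES


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.1):  # hypothetical dirty-hit probabilities
        print(f"p_dirty={p}: about {expected_cost_bytes(p):.1f} bytes of bus bandwidth per write")
```

In this model the new protocol is worse than the old one only when dirty hits exceed 75 percent of partial writes, far from the "almost never" reported above.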


Movement of blocks of less than 64 bytes is important to application performance because there are high-performance devices that move less than 64 bytes. One example is DIGITAL's MEMORY CHANNEL adapter, which moves 32-byte blocks in a burst.[2] As MEMORY CHANNEL adapters move large numbers of blocks that are all less than a cache line of data, the I/O subsystem partial cache line write feature improves system bus utilization and eliminates the system bus as a bottleneck. Message latency across the fabric of an AlphaServer 4100 MEMORY CHANNEL cluster (version 1.0) is approximately 6 microseconds (μs). There are two DMA writes in the message: the first is a message, and the second is a flag to validate the message. These DMA writes on the target AlphaServer 4100 contribute to message latency. The improvement in latency provided by the partial cache line write feature is approximately 0.5 μs per write. With two writes per message, latency is reduced by approxi-
