

A 10 Gb/s Ethernet controller’s combined data memory requirements represent competing architectural requirements. Specifically, control data accesses by the processors and assist units require low latency to avoid excessive processor and assist-transaction stalls. Section 3.5.2 highlights the importance of low-latency access to transaction descriptor data (a subset of control data) when increasing frame rates. On the other hand, frame data requires capacity sufficient to store all in-flight frames and bandwidth sufficient to transfer 10 Gb/s streams simultaneously to and from the network and to and from the host. No single memory technology can provide a solution that is large, low latency, high bandwidth, power efficient, and cost effective. These considerations motivate the exploration of a memory organization consisting of heterogeneous components.

Hierarchical Organizations

Figure 4.1 illustrates how a traditional hierarchical memory organization might attempt to satisfy the various memory throughput and latency requirements for 10 Gb/s Ethernet; this organization is similar to that of the Tigon-2 architecture. For 10 Gb/s Ethernet, the memory structure holding frame contents must be a DRAM, since DRAM is the only technology with the requisite capacity and bandwidth. In the pictured memory hierarchy, all global data sharing occurs through the main memory, which in this case holds frame data and transaction descriptors. However, DRAM’s high initial access latencies would heavily stall the processors and assist units whenever they access main memory to read or write transaction information. Section 3.5.2 shows that increased stalls on transaction information significantly reduce overall frame throughput. Furthermore, because the technology used in that study (SRAM) has a much smaller access latency, a 10 Gb/s design using DRAM would have even worse performance.

[Figure 4.1 labels: Main Memory; Processors; DMA Channels; Medium Access; Local Memory; Instructions; Transaction Info; Frame Contents]

Figure 4.1 A traditional hierarchical memory organization.

The only other hierarchical organization that could satisfy the low-latency requirements of the processors and assists would be one in which the processors and assists first access low-latency, globally coherent caches. While the coherence overheads involved may be individually smaller, such overheads are imposed per-access, and thus private references to local memory would be penalized as well. Mitigating these per-access penalties could complicate the processor core and cache controller architectures significantly and thus increase power consumption. Cost and design complexities further serve as disincentives for a cache-coherent design.

A New Approach: A Heterogeneous, Content-Specific Organization

A key observation regarding Figure 4.1 is that the competition between transaction information and frame data is artificial. Specifically, there are no data sharing patterns that require transaction information and frame contents to reside in the same memory structure. Figure 4.2 introduces a new memory organization that separates memory contents into different memory structures according to data type. With such an organization, frame contents may reside in DRAM while transaction information and other processor control data may reside in a low-latency structure, such as an SRAM. Furthermore, separating instruction memory from the other types of memory reduces contention for those other structures and reduces processor stalls due to instruction access.

[Figure 4.2 labels: Frame Memory; Processors; DMA Channels; Medium Access; Control Data Memory; Instruction Memory; Transaction Info; Instructions; Frame Contents]

Figure 4.2 A content-specific memory organization.

Control data accessed by the NIC firmware totals about 100 KB, including transaction descriptors. This amount of data can easily fit in an on-chip memory. A single programmer-addressable scratch pad memory operating at 200 MHz with one 32-bit port would deliver 6.4 Gb/s of data throughput, which is slightly more than the required 5.79 Gb/s. Additional memory banks are required to satisfy the additional bandwidth requirements of any dispatch and synchronization code, which the baseline 5.79 Gb/s requirement does not include. Furthermore, additional banks may be necessary to reduce latencies stemming from bank conflicts and to support future programmable functionality without reducing NIC processing throughput.
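The scratch pad bandwidth figure follows from simple arithmetic, which the short sketch below reproduces. The clock rate, port width, and 5.79 Gb/s requirement come from the text; the function names and the 8.0 Gb/s load used to illustrate bank counts are hypothetical, and the model assumes one word per port per cycle with no bank conflicts.

```python
import math

REQUIRED_GBPS = 5.79  # baseline control-data requirement quoted in the text


def port_throughput_gbps(clock_mhz: float, port_width_bits: int) -> float:
    """Peak throughput of one memory port, assuming one word per cycle."""
    return clock_mhz * 1e6 * port_width_bits / 1e9


def banks_needed(load_gbps: float, per_bank_gbps: float) -> int:
    """Minimum number of banks for a given load, ignoring bank conflicts."""
    return math.ceil(load_gbps / per_bank_gbps)


single_bank = port_throughput_gbps(200, 32)   # 200 MHz, one 32-bit port
headroom = single_bank - REQUIRED_GBPS        # margin over the baseline

# Hypothetical total load once dispatch and synchronization code is included;
# the 8.0 Gb/s value is an assumption for illustration only.
hypothetical_total_gbps = 8.0
banks = banks_needed(hypothetical_total_gbps, single_bank)

print(f"single bank: {single_bank:.2f} Gb/s, headroom: {headroom:.2f} Gb/s")
print(f"banks for {hypothetical_total_gbps} Gb/s: {banks}")
```

The narrow margin over the baseline is why any load beyond the baseline requirement calls for more than one bank.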

As suggested by Figures 4.1 and 4.2, processors do not need to access frame data when sending and receiving frames. Separating frame data from other memory data as depicted in Figure 4.2 frees the frame data memory solution from any artificial low-latency requirements. Notice that frame data is always accessed as four 10 Gb/s sequential streams, with each stream coming from one assist unit. Current graphics DDR SDRAM can provide sufficient bandwidth for all four of these streams. For example, the Micron MT44H8M32 can operate at speeds up to 600 MHz, yielding a peak bandwidth per pin of 1.2 Gb/s [32]. Each of these DRAMs has 32 data pins, or 38.4 Gb/s per device. Hence, two of them together can provide 76.8 Gb/s peak bandwidth, compared with the 40 Gb/s that the four streams require.
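The DRAM bandwidth numbers can be checked the same way. The sketch below uses the 600 MHz clock, 32 data pins, and two devices from the text, and assumes two transfers per clock per pin (double data rate); the function name is hypothetical.

```python
def ddr_peak_gbps(clock_mhz: float, data_pins: int, chips: int = 1) -> float:
    """Peak DDR bandwidth: two transfers per clock on every data pin."""
    transfers_per_clock = 2  # double data rate
    per_pin_gbps = clock_mhz * 1e6 * transfers_per_clock / 1e9
    return per_pin_gbps * data_pins * chips


per_pin = ddr_peak_gbps(600, data_pins=1)          # 1.2 Gb/s per pin
peak = ddr_peak_gbps(600, data_pins=32, chips=2)   # two 32-pin devices
required = 4 * 10.0                                # four 10 Gb/s streams
utilization = required / peak                      # fraction of peak needed

print(f"per pin: {per_pin:.1f} Gb/s, peak: {peak:.1f} Gb/s")
print(f"required: {required:.1f} Gb/s ({utilization:.0%} of peak)")
```

Under these assumptions the four streams need a little over half of the peak, which leaves room for the inefficiencies that keep real DRAM below its peak rate.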

The streaming nature of the NIC’s hardware assists makes it possible to achieve near peak bandwidth from such DRAM. Providing enough buffering for two maximum-sized frames in each assist ensures that data can be transferred up to 1518 bytes at a time between the assists and DRAM. These transfers are to consecutive memory locations, so using a simple DRAM arbitration scheme that allows the assists to sustain such bursts will incur very few DRAM page activations and will enable peak bandwidth throughput during these bursts.
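A rough way to see why sustained bursts matter is to count DRAM row (page) activations under a deliberately simplified model. In the sketch below, the 1518-byte frame size comes from the text; the 2048-byte row size, the 64-byte chunk size, and both function names are assumptions for illustration, and the interleaved case models a worst case in which another stream opens a different row in the same bank between every chunk.

```python
ROW_BYTES = 2048         # assumed DRAM row (page) size; hypothetical
MAX_FRAME_BYTES = 1518   # maximum-sized Ethernet frame, from the text


def activations_for_burst(start_addr: int, length: int,
                          row_bytes: int = ROW_BYTES) -> int:
    """Rows opened when one assist moves `length` consecutive bytes
    without interruption: one activation per distinct row touched."""
    first_row = start_addr // row_bytes
    last_row = (start_addr + length - 1) // row_bytes
    return last_row - first_row + 1


def activations_when_interleaved(length: int, chunk_bytes: int) -> int:
    """Worst-case model: the row is closed between every chunk because
    another stream used the bank, so each chunk costs an activation."""
    return -(-length // chunk_bytes)  # ceiling division


aligned = activations_for_burst(0, MAX_FRAME_BYTES)
straddling = activations_for_burst(1024, MAX_FRAME_BYTES)
interleaved = activations_when_interleaved(MAX_FRAME_BYTES, 64)

print(f"uninterrupted, row-aligned frame: {aligned} activation(s)")
print(f"uninterrupted, frame straddling a row: {straddling} activation(s)")
print(f"interleaved in 64-byte chunks (worst case): {interleaved} activations")
```

The model ignores multiple banks and refresh, so it only illustrates the trend: letting an assist finish a whole frame keeps activations to one or two per frame.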

Composing these various memory pieces provides a heterogeneous organization with specialized memories for each type of data. The resultant organization consists of banked on-chip control data scratch pads, a wide on-chip memory with small instruction caches, and off-chip DRAMs for frame data. Partitioning the memory address space according to data type provides a feasible, consistent mapping that is visible to the programmer.
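A partitioned address space of this kind can be pictured as a small address map that routes each access to the memory holding that data type. The sketch below is illustrative only: the base addresses, region sizes, and names are hypothetical and do not come from the design described here.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    base: int
    size: int


# Hypothetical layout: every base address and size here is an assumption.
ADDRESS_MAP = [
    Region("instruction memory", 0x0000_0000, 128 * 1024),
    Region("control data scratch pads", 0x1000_0000, 128 * 1024),
    Region("frame data DRAM", 0x8000_0000, 64 * 1024 * 1024),
]


def region_for(addr: int) -> str:
    """Return the name of the memory structure that serves `addr`."""
    for region in ADDRESS_MAP:
        if region.base <= addr < region.base + region.size:
            return region.name
    raise ValueError(f"address {addr:#x} is not mapped")


print(region_for(0x1000_0040))  # a transaction descriptor, for example
print(region_for(0x8000_1000))  # a frame buffer, for example
```

Because the region is determined by the address alone, the programmer sees one consistent mapping while each access still reaches a memory suited to its data type.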