
Figure 6.3: Example granularity solution configuration (diagram not reproduced; labels: CPU1 Core, CPU2 Core, FSM, f1(B), f2(B), and a grid of RMs)

Disadvantages

The disadvantages of the granularity solution start with the decomposition of the reconfigurable components into smaller components that each fit into one RM. The decomposition and the different signal interface prevent the reuse of reconfigurable components from a standard PR design. The decomposition is also not a simple task, and it is not guaranteed that all components can be divided into smaller parts.

Another disadvantage is the interconnection network. It has to span the whole FPGA to connect all RMs, which requires additional FPGA space. The number of RMs and the space used for interconnection and management have to be balanced to get a good design. The path delay of the interconnection lines between the RMs can be another problem: the lines might not be fast enough to support the connection speeds required within reconfigurable components.

6.2 Granularity Problem and Hybrid Hardware

The granularity problem occurs in any runtime RS in which multiple reconfigurable components of different sizes are to be used. This is also the case in the scenario of coupling processor cores and reconfigurable hardware, introduced in Section 1.2. The standard methods of coupling processors with reconfigurable hardware are datapath accelerators, bus accelerators, and multicore reconfiguration. Datapath accelerators commonly use a very small area, bus accelerators are medium sized, and multicore reconfiguration requires much space on an FPGA. Figure 6.4 gives a graphical overview of these space requirements.

Figure 6.4: Area requirements of the different usage patterns (diagram not reproduced; panels: (a) Datapath Accelerator, with labels reconf and CPU Core; (b) Bus Accelerator, with labels CPU Core, reconf, reconf; a third panel with three CPU Core labels)

Each pattern has its unique type of use. Datapath accelerators are used to increase instruction flexibility: they allow different instructions to be appended to the processor's ISA. Bus accelerators are the most common usage pattern at the moment. They allow different kinds of accelerators to be configured into the reconfigurable area and connected to the processor through a bus. With the multicore reconfiguration pattern, the reconfigurable area is used to instantiate multiple processor cores. These cores can run on their own or form a multicore system. In this work, all these connection methods shall be combined into one system, leading to the granularity problem.

7 Multicore Reconfiguration Platform Description

After introducing the basics of reconfiguration and NOCs and describing the granularity problem of runtime reconfigurable design flows, this chapter presents the main part of this thesis, the Multicore Reconfiguration Platform (MRP).

The MRP is a hybrid hardware system. In contrast to existing research systems and commercially available systems, the MRP uses the Xilinx PR design flow to implement its reconfigurability. The use of dynamic or runtime reconfiguration helps to solve the granularity problem by applying the granularity solution presented in Section 6.1.2. This granularity solution enables the MRP to support multiple reconfigurable components of different sizes without taking component sizes into account at the initial floorplanning stage.

Inter-FPGA connections are another new feature of the MRP. A packet switched network, called OCSN, can interconnect multiple FPGAs. Figure 7.1 displays an overview of an example MRP system consisting of three FPGAs. By adding more FPGAs to the OCSN, the reconfiguration area of the MRP is easily extensible. This extensibility helps if applications require more reconfiguration space during runtime.

Figure 7.1: Example MRP System Overview (diagram not reproduced; labels: reconfiguration platform, reconfiguration platform, support platform, softcore, three OCSN links, and a connection to the host system)

As Figure 7.1 shows, an MRP system is divided into support and reconfiguration platforms. The former provide access to system resources through the OCSN, such as BRAM, DDR RAM, General Purpose Input Output (GPIO), USB controllers, and mass storage, and the latter provide many RMs. This setup allows a maximum of reconfigurable space while still supporting additional hardware resources. The number of platforms is limited only by the address space of the OCSN.

The platforms and the host system, such as a server or workstation, are also connected through the OCSN. To support a high speed connection between the MRP and its host system, the connection is implemented using 1Gbit Ethernet as its physical layer. As an alternative to a full featured host system, the support platform can provide a soft-core SoC connected to the OCSN. This SoC can control the MRP and distribute hardware


Except for the Convey HC1, most of the other hybrid systems suffer from a lack of direct operating system support. The MRP is directly integrated into the Linux OS. Its device drivers provide a network API to communicate with all OCSN components and to configure the RMs.

The remainder of this chapter introduces the OCSN in Section 7.1, the support platform in Section 7.2 and the reconfiguration platform in Section 7.3. Furthermore, it describes the OS support in Section 7.4 and the design flow for working with the MRP in Section 7.5.

7.1 On Chip Switching Network

The requirements for a NOC that interconnects the support and reconfiguration platforms are diverse.

First, the NOC has to support the interconnection of multiple FPGAs with different physical connections and variable signal lengths. FPGA boards can be interconnected by Ethernet, CAN, simple wires using some kind of serial protocol like SPI or RS232, or other interconnection schemes.

Scalability is another very important requirement. Adding another platform or component should not require reconstructing the whole NOC.

The network should support broadcast and unicast connections because information has to be distributed through the network very fast and certain components require a lot of data transfer.

Because many components participate in this network, the hardware requirements for connecting one component to the network should be as small as possible.

Most networks cannot satisfy all these requirements. For example, a bus is not scalable and does not permit multiple components to communicate concurrently. A static indirect packet switched network, however, fulfils all the requirements.

The OCSN is a static indirect packet switched network. It supports the interconnection of multiple FPGA boards by using bridges across different physical connections and different protocols. It is scalable to a limited degree: components can be added to network switches, and the number of switches can be increased. Broadcast and unicast packet transmission is supported; a network switch forwards every broadcast packet on all of its outgoing connections. Using network switches for most of the network organisation reduces the interface size in the network devices.

The OCSN uses the OSI model to divide functionality into layers, which eases adaptation to different hardware and software and standardises the interconnection points. Therefore, the OCSN description starts with the definition of the physical layer and walks up to the application layer. All these layers are implemented in hardware, without additional micro-controllers, to save configuration space on the FPGAs.


Clock     Bit-width   Speed
200MHz    8           1.267Gbit/s
200MHz    12          2.235Gbit/s
200MHz    26          4.843Gbit/s
100MHz    8           0.634Gbit/s
100MHz    12          1.118Gbit/s
100MHz    26          2.421Gbit/s

Table 7.1: Variable speed of the OCSN

7.1.1 Physical Layer

At the physical layer, network interfaces are always connected in pairs. Each interface transmits a full OCSN frame of 39 bytes in one transfer. Using such large frames in one transfer often leads to transmission errors. In this case, however, the network mostly spans a single FPGA, which reduces the error probability to approximately zero. The simple approach of transmitting a full frame at once reduces the area usage of each network interface, and here this advantage outweighs the disadvantage. The 39 bytes of each transfer are divided into a configurable number of bits that are transmitted concurrently at each clock tick. The allowed bit-widths are {x : 312 mod x = 0} bits because 39 bytes × 8 bits = 312 bits. Full duplex mode, using dedicated transmission and reception lines, is also supported. The typical clock rates at this layer are 100MHz and 200MHz, resulting in the maximum network speeds displayed in Table 7.1.
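The bit-width rule can be checked with a few lines of Python. This is only an illustrative sketch: the function names are invented here, and the only facts taken from the text are the frame size of 39 bytes and the divisibility condition.

```python
from typing import List

FRAME_BYTES = 39
FRAME_BITS = FRAME_BYTES * 8  # 312 bits per OCSN frame


def allowed_bit_widths(frame_bits: int = FRAME_BITS) -> List[int]:
    """Return every bit-width x with frame_bits mod x == 0."""
    return [x for x in range(1, frame_bits + 1) if frame_bits % x == 0]


def ticks_per_frame(bit_width: int, frame_bits: int = FRAME_BITS) -> int:
    """Clock ticks needed to move one frame at the given bit-width."""
    if bit_width <= 0 or frame_bits % bit_width != 0:
        raise ValueError("bit-width must divide the frame size evenly")
    return frame_bits // bit_width


if __name__ == "__main__":
    print(allowed_bit_widths())
    for width in (8, 12, 26):  # the bit-widths listed in Table 7.1
        print(width, "bits ->", ticks_per_frame(width), "ticks per frame")
```

All three bit-widths of Table 7.1 divide 312 evenly, so a frame always completes on a clock tick boundary.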

7.1.2 Data-link Layer

The data-link layer of the OCSN is responsible for detecting and identifying the remote device. To prevent the receive buffer from overflowing, it implements hardware flow control between the two directly coupled interfaces. If the receive buffer of one interface reaches an upper bound, it signals the other interface to stop transmitting. Once the buffer has drained to a lower bound after the transmission has stopped, the interface requests that the transmission continue.
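The two-threshold flow control described above can be modelled in software as follows. This is a behavioural sketch only, not the hardware implementation: the class name, the buffer capacity, and the threshold values are assumptions made for illustration.

```python
class ReceiveBuffer:
    """Behavioural model of a receive buffer with stop/continue signalling."""

    def __init__(self, capacity: int, upper: int, lower: int) -> None:
        if not 0 <= lower < upper <= capacity:
            raise ValueError("need 0 <= lower < upper <= capacity")
        self.capacity = capacity
        self.upper = upper  # fill level at which the sender is stopped
        self.lower = lower  # fill level at which the sender may resume
        self.fill = 0
        self.sender_may_transmit = True

    def frame_received(self) -> None:
        """A frame arrives from the remote interface."""
        if self.fill >= self.capacity:
            raise OverflowError("receive buffer overflow")
        self.fill += 1
        if self.fill >= self.upper:
            self.sender_may_transmit = False  # signal: stop transmitting

    def frame_consumed(self) -> None:
        """The local side removes one frame from the buffer."""
        if self.fill == 0:
            raise IndexError("receive buffer is empty")
        self.fill -= 1
        if not self.sender_may_transmit and self.fill <= self.lower:
            self.sender_may_transmit = True  # signal: continue transmitting


if __name__ == "__main__":
    buf = ReceiveBuffer(capacity=8, upper=6, lower=2)  # hypothetical sizes
    for _ in range(6):
        buf.frame_received()
    print(buf.sender_may_transmit)
```

Using two distinct bounds gives hysteresis: the sender is not toggled on and off for every single frame around one threshold.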

The data-link layer of the OCSN does not provide any error detection or correction because the error probability, when the network is configured onto an FPGA, is very small. This feature can easily be added if required.

7.1.3 Network Layer

The network layer defines everything required for routing OCSN frames through the network to the correct destination. Figure 7.2 displays the structure of one OCSN frame. It is built from source and destination addresses, additional source and destination port fields, a frame type field, and the payload of the frame. For the network layer, the 16bit source and destination addresses are of interest.


Figure 7.2: OCSN frame description (diagram not reproduced; fields: SRC Address (16bit), DST Address (16bit), SRC Port, DST Port, Frame Type, DATA (31 byte))

The network infrastructure components of the OCSN are OCSN switches. They are organised in a tree structure to reduce routing complexity. A grid network would be faster and more flexible because different routes between two components exist, but it would increase the routing overhead. A big disadvantage of a tree is its bisection width of one: regardless of how a network organised in a tree structure is divided, the maximum number of connections between the two halves is always one. This leads to a big bottleneck if components on one side have to communicate intensely with components on the other side. This disadvantage can be reduced by interconnecting all switches of one level in a ring, but this is not applicable in this network because the tree spans multiple FPGAs. Furthermore, most of the components in this network will communicate with their direct neighbours, and this communication will usually take place over one switch. All of these OSI layers have to be implemented in hardware, without additional micro-controllers. Because this hardware has to be generated with a very small area footprint, the advantages of simple routing outweigh the bandwidth disadvantages in this case.

An example OCSN consisting of OCSN switches only is displayed in Figure 7.3. The example network is organised as a binary tree, but more outgoing edges per OCSN switch are also possible. Switches are only specialised network devices. This flexible design allows switches to be replaced by any other component, and switch ports to be used for switches and devices alike, without reconfiguring the system.

Figure 7.3: OCSN network structure overview (diagram not reproduced; a tree of OCSN switches with root switch 1.0.0.0.0.0, below it 1.1.0.0.0.0 and 1.2.0.0.0.0, and below those 1.1.1.0.0.0, 1.1.2.0.0.0, 1.2.1.0.0.0 and 1.2.2.0.0.0)


The addresses correspond to the tree structure of the network. Therefore, the addresses are divided into the six parts shown in Figure 7.4. To support broadcast and unicast in the network, the first bit (r) of an address selects broadcast or unicast mode. The remaining bits are partitioned into five groups of three bits each. In the figure, these groups correspond to the coloured characters a1a2a3 ... e1e2e3. If the value of r is one, the address 1.0.0.0.0.0 identifies the root node of the tree. Looking at Figure 7.3, the root node is the top switch. The switches generate the tree, while devices are the leaves of the tree. Switches always own an address starting with a zero at their group.

The second group, consisting of the bits a1a2a3, addresses all tree components directly connected to the root switch. They are the second level components of the tree. The bits b1b2b3 identify all components directly connected to switches of the second level, as shown in Figure 7.3. This scheme continues up to group e1e2e3, which identifies all components connected to switches of the fifth level. The sixth level cannot hold any switches because there are no addresses left. This limitation can easily be removed by extending the address space.

This addressing scheme enables all switches in the network to identify their uplink and downlink ports by checking the addresses of all connected devices. One advantage of a tree is that only one route exists from one component to another. This reduces the routing decision to identifying the uplink of a switch and calculating to which of the connected switches the address belongs. Frames with a broadcast destination are transmitted to all ports except the incoming one.
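The addressing scheme and the resulting routing decision can be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the text: r is taken to be the most significant bit of the 16 bit address, a downlink port is named after the group value of the child it leads to, and all function names are hypothetical. The meaning of r (0 for broadcast, 1 for unicast) follows Figure 7.4.

```python
from typing import List, Tuple

GROUPS = 5          # groups a, b, c, d, e
BITS_PER_GROUP = 3  # three bits per group


def parse_dotted(text: str) -> int:
    """Turn 'r.a.b.c.d.e', e.g. '1.2.1.0.0.0', into a 16 bit address."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != GROUPS + 1:
        raise ValueError("expected six dot-separated parts")
    r, groups = parts[0], parts[1:]
    if r not in (0, 1) or any(not 0 <= g <= 7 for g in groups):
        raise ValueError("address part out of range")
    value = r
    for g in groups:
        value = (value << BITS_PER_GROUP) | g
    return value


def unpack(addr: int) -> Tuple[int, List[int]]:
    """Split a 16 bit address into r and the five group values."""
    shifts = range(BITS_PER_GROUP * (GROUPS - 1), -1, -BITS_PER_GROUP)
    return (addr >> 15) & 1, [(addr >> s) & 0b111 for s in shifts]


def route(switch_addr: int, dest_addr: int) -> str:
    """Decide where a switch forwards a frame (assumed port naming)."""
    r, dest = unpack(dest_addr)
    if r == 0:
        return "broadcast"  # all ports except the incoming one
    _, own = unpack(switch_addr)
    # Depth of the switch = number of leading non-zero groups.
    depth = own.index(0) if 0 in own else GROUPS
    if depth == GROUPS:
        raise ValueError("a sixth level component cannot be a switch")
    if dest == own:
        return "local"
    if dest[:depth] != own[:depth]:
        return "uplink"  # destination lies outside this subtree
    if dest[depth] == 0:
        raise ValueError("malformed destination address")
    return "downlink:%d" % dest[depth]


if __name__ == "__main__":
    switch = parse_dotted("1.1.0.0.0.0")
    print(route(switch, parse_dotted("1.1.2.0.0.0")))
    print(route(switch, parse_dotted("1.2.1.0.0.0")))
```

Because the tree offers exactly one path between two components, the decision needs only a prefix comparison and one group lookup, which keeps a hardware implementation small.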

Because all frames in the OCSN have the same size of 39bytes, no framing or padding is required.

7.1.4 Transport Layer

To access the interconnected components, the network has to transport frames. In this scenario, the network is required to transmit configuration data, request status information, or access some kind of RAM. Because of the small error probability and the fact that frames cannot be reordered while they are transmitted through the network, no connection oriented transport protocol is required. Instead, a connectionless, UDP-like protocol is responsible for the data transport within the OCSN. The protocol features 8bit source and destination ports (Figure 7.2) and an 8bit frame-type field to identify the service at the destination. The maximum payload length is 31 bytes. The frames are routed from source to destination using the network layer. If a service is listening at the destination on the destination port, the payload is processed and an answer is transmitted.

Figure 7.4 (address layout, diagram not reproduced): r a1 a2 a3 b1 b2 b3 c1 c2 c3 d1 d2 d3 e1 e2 e3; r=0 broadcast address, r=1 unicast address
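The frame fields named in Sections 7.1.3 and 7.1.4 can be packed with Python's struct module, as in the sketch below. The field order follows Figure 7.2 and the big endian byte order follows Section 7.1.6. Note that the field sizes given in the text (two 16bit addresses, two 8bit ports, an 8bit frame type, and 31 bytes of payload) add up to 38 bytes, while a frame is 39 bytes long. Where the remaining byte sits is not stated, so this sketch appends one zero byte at the end purely as a placeholder. That placement and all names are assumptions.

```python
import struct
from typing import NamedTuple

# > big endian | H 16bit address x2 | B 8bit port x2 | B 8bit frame type
# 31s payload | x one placeholder byte (assumed position, see lead-in)
FRAME_FORMAT = ">HHBBB31sx"
FRAME_BYTES = 39
MAX_PAYLOAD = 31
assert struct.calcsize(FRAME_FORMAT) == FRAME_BYTES


class Frame(NamedTuple):
    src_addr: int
    dst_addr: int
    src_port: int
    dst_port: int
    frame_type: int
    payload: bytes


def pack_frame(frame: Frame) -> bytes:
    """Serialise a frame; short payloads are zero-padded to 31 bytes."""
    if len(frame.payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 31 bytes")
    return struct.pack(FRAME_FORMAT, *frame)


def unpack_frame(raw: bytes) -> Frame:
    """Parse exactly one 39 byte frame."""
    if len(raw) != FRAME_BYTES:
        raise ValueError("an OCSN frame is always 39 bytes long")
    return Frame(*struct.unpack(FRAME_FORMAT, raw))


if __name__ == "__main__":
    frame = Frame(0x8000, 0x9200, 1, 2, 3, b"status?")
    raw = pack_frame(frame)
    print(len(raw), unpack_frame(raw).dst_port)
```

Because every frame has the same fixed size, the receiver needs no length field and no delimiter, which matches the statement that no framing or padding is required at the network layer.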

7.1.5 Session Layer

The session layer sets up and tears down connections in a connection oriented protocol. Because the transport layer of the OCSN only specifies a connectionless protocol, the session layer is not required.

7.1.6 Presentation Layer

As in the TCP/IP suite, the presentation layer is merged into the application layer. The main purpose of the merged presentation layer is to ensure that all information in an OCSN frame is in big endian byte order.

7.1.7 Application Layer

Accessing components in the OCSN requires different application layer protocols. The main distinction between these protocols is whether they require an answer frame or not. Usually it is enough to send one frame to a destination device to set registers or to request information. Still, the application layer defines the structure of the payload. For communication with an OCSN connected RAM, the access mode (read, write), the access size (byte, word, double-word, ...), and the data for a write operation have to be encoded into the payload of an OCSN frame. In a frame sent to a BRAM connected to the OCSN, the first byte of the payload identifies the operation to perform. Bytes eight downto five encode the RAM address and bytes twelve downto nine encode the data word. In the answer frame from the BRAM, the first byte signals what kind of answer the frame holds and bytes eight downto five encode the first data word. If more data words are requested from the BRAM, they are encoded after the first word.
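As a rough illustration of the BRAM request payload, the sketch below places the operation byte, the RAM address, and the data word at the byte positions named above. Much of it is assumption rather than specification: payload bytes are counted from zero, each 4 byte field is stored big endian with its lowest-numbered byte most significant, the operation codes are invented, and the payload bytes that the text does not describe are left at zero.

```python
import struct
from typing import Tuple

PAYLOAD_BYTES = 31

# Hypothetical operation codes; the real values are not given in the text.
OP_READ = 0x01
OP_WRITE = 0x02


def encode_bram_request(op: int, address: int, data: int = 0) -> bytes:
    """Build a 31 byte request payload for an OCSN connected BRAM."""
    payload = bytearray(PAYLOAD_BYTES)
    payload[0] = op                             # first byte: operation
    payload[5:9] = struct.pack(">I", address)   # bytes 5 to 8: RAM address
    payload[9:13] = struct.pack(">I", data)     # bytes 9 to 12: data word
    return bytes(payload)


def decode_bram_request(payload: bytes) -> Tuple[int, int, int]:
    """Recover (operation, address, data word) from a request payload."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be 31 bytes long")
    (address,) = struct.unpack(">I", payload[5:9])
    (data,) = struct.unpack(">I", payload[9:13])
    return payload[0], address, data


if __name__ == "__main__":
    request = encode_bram_request(OP_WRITE, 0x10, 0xDEADBEEF)
    print(request.hex())
```

A real implementation would have to take the exact byte numbering and operation codes from the hardware description of the BRAM device.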

7.2 Support Platform

The support platform combines all system resources of one FPGA board, including off-board extensions, into one platform. Using a distinct FPGA board reduces the space requirements of the reconfigurable platforms because no additional hardware is required on them. The reconfigurable platforms can concentrate on providing reconfigurability. Figure 7.5 presents an example support platform with all supported FPGA resources. These resources are connected through an interface to the OCSN. At the moment the following components are supported:

• GPIO
• BRAM
• DDR RAM


Figure 7.5: Example support platform (diagram not reproduced; labels: FPGA support platform, OCSN Switch, Uplink Ethernet/UART, Downlink Ethernet/UART, BRAM, DDR RAM, GPIO, Softcore SoC)

In addition, an uplink and a downlink device exist to connect a host system or other platforms to this FPGA. Two alternative devices are available: a UART based bridge and an Ethernet based bridge.

7.2.1 GPIO

The GPIO component is very helpful for querying debug data from the OCSN and inserting debug data into it. Outgoing GPIO signals can be set to certain values and drive, for example, Light Emitting Diodes (LEDs). By sending status request frames, the settings of a connected Dual Inline Package (DIP) switch can be checked using the polling approach. It would also be possible to implement interrupts by sending out an OCSN frame whenever a DIP switch changes its status.


7.2.2 BRAM

The FPGA used for the support platform has BRAM resources left after much of it has been used for buffers in the OCSN. These BRAMs can be combined to form a BRAM OCSN device, which allows access to the RAM from the OCSN with different access modes. The following access modes are supported at the moment:

READ{length} read a data word of length bytes

WRITE{length} write a data word of length bytes

SWAP{length} atomic swap of a data word of length bytes

The supported values for length are 4, 8, 16, 32, 64 and 128 bytes. For initialising the RAM, two commands are available:

INIT ZERO initialise a number of 4 byte words, beginning at a given start address, with “00000000000000000000000000000000”

INIT ONE initialise a number of 4 byte words, beginning at a given start address, with “11111111111111111111111111111111”
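The access modes and initialisation commands listed above can be mimicked by a small software model. This is a behavioural sketch under assumptions, not the hardware design: the class and method names are invented, addresses are taken to be byte addresses, and the swap is trivially atomic only because the model is single-threaded.

```python
ALLOWED_LENGTHS = (4, 8, 16, 32, 64, 128)  # data word sizes in bytes
WORD_BYTES = 4                             # INIT commands use 4 byte words


class BramModel:
    """Software stand-in for the BRAM OCSN device."""

    def __init__(self, size_bytes: int) -> None:
        self.mem = bytearray(size_bytes)

    def _check(self, address: int, length: int) -> None:
        if length not in ALLOWED_LENGTHS:
            raise ValueError("unsupported data word length")
        if address < 0 or address + length > len(self.mem):
            raise IndexError("access outside the RAM")

    def read(self, address: int, length: int) -> bytes:
        self._check(address, length)
        return bytes(self.mem[address:address + length])

    def write(self, address: int, data: bytes) -> None:
        self._check(address, len(data))
        self.mem[address:address + len(data)] = data

    def swap(self, address: int, data: bytes) -> bytes:
        """Store data and return the previous content in one step."""
        old = self.read(address, len(data))
        self.write(address, data)
        return old

    def _fill(self, start: int, words: int, value: int) -> None:
        count = words * WORD_BYTES
        if start < 0 or words < 0 or start + count > len(self.mem):
            raise IndexError("initialisation range outside the RAM")
        self.mem[start:start + count] = bytes([value]) * count

    def init_zero(self, start: int, words: int) -> None:
        self._fill(start, words, 0x00)  # every bit 0

    def init_one(self, start: int, words: int) -> None:
        self._fill(start, words, 0xFF)  # every bit 1


if __name__ == "__main__":
    ram = BramModel(256)
    ram.init_one(0, 2)
    print(ram.swap(0, b"\x01\x02\x03\x04").hex())
```

Returning the old value from swap is what makes the operation useful as a building block for synchronisation between several OCSN devices.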

The following commands are planned as future extensions to support concurrent access to the RAM from different OCSN devices.

LOCK lock the device for use by the source of this command only

UNLOCK unlock the device for use by everyone; only possible from the same device that sent the lock command, or from some master device to prevent a deadlock

LOCK RANGE lock part of the address space for use by the source of this command only

UNLOCK RANGE unlock a previously locked address space

LIST LOCKS list all enforced locks

7.2.3 DDR3 RAM

This component uses the same interface and access model as the BRAM device. The
