Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation


Fernando Adolfo Escobar Juzga
Electric and Electronic Engineering Department

APPROVED:
Antonio García Rozo, Ph.D.
Mauricio Guerrero, MSc.
Alain Gauthier, Ph.D., Dean of Faculty

To my MOTHER, FATHER and SISTERS, with love.

by Fernando Adolfo Escobar Juzga

Thesis Presented to the Academic Faculty of the Graduate School of Universidad de los Andes, Bogotá, in Partial Fulfilment of the Requirements for the Degree of Master Of Electronic Engineering.

Electric and Electronic Engineering Department
Universidad de los Andes
January 2011

Acknowledgements

I wish to thank my advisers Antonio García and Mauricio Guerrero for their guidance and support throughout the project; it was their experience and knowledge that helped me choose and love this research area years ago. I thank my parents and sisters, who have unconditionally supported me at all times and without whom I would not have got here. Additionally, I want to thank all my friends, who continuously inspire me and show me how far one can go with hard work, dedication and passion. This thesis would not have been possible without the support of the OSCI TLM working group and all its members. Finally, I want to thank the CMUA group for providing me with the necessary resources and tools.

Abstract

Complex systems that include a great variety of modules on the same die require higher-level design techniques that yield accurate models suitable for testing hardware as well as software at early stages; multiprocessor Systems-on-Chip (MPSoCs) are scaling to levels where it is possible to embed tens and up to hundreds of cores on the same chip. Such architectures cannot be integrated with traditional bus structures, as these are not scalable; as a solution, a new paradigm called Network on Chip (NoC) has gained strength. SystemC, an IEEE standard for electronic system-level (ESL) design, is used here to build a NoC functional model; to simplify hardware details and speed up simulations, the new Transaction Level Modelling standard (TLM 2.0) is also adopted. Relying on different design constraints, variables such as router and network interface architectures, routing algorithms, message and flit size, etc., are evaluated. At a final stage, a VHDL synthesis is done and compared with other implementations. Results show this design flow to be adequate and helpful for this kind of system, given its size and complexity.

Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures
Introduction
1 Networks On Chip Design
  1.1. Parallel Computing Memory Model
  1.2. Networks On Chip
    1.2.1. Physical Layer: Topology
    1.2.2. Data Link Layer: Flow Control
    1.2.3. Network Layer: Switching Policy and Routing Algorithm
    1.2.4. Transport Layer: Network Interface Card
  1.3. SystemC and Transaction Level Modelling TLM 2.0
  1.4. Open Core Protocol
2 NoC Implementation
  2.1. Flit and Message structure
  2.2. Router Architecture
    2.2.1. Router TLM Model
    2.2.2. Traffic Evaluation and Routing Algorithm Testing
    2.2.3. Router VHDL Model
  2.3. Network Interface Card Architecture
    2.3.1. Network Interface TLM Model
    2.3.2. Network Interface VHDL Model
  2.4. Software Performance Results
    2.4.1. 4 × 4 Matrix Multiplication
    2.4.2. 8 × 8 Matrix Multiplication
3 Concluding Remarks
  3.1. Significance of the Result
  3.2. Future Work
References

List of Tables

1.1. Flow Control Techniques for NoCs
1.2. Generic Payload Attributes
1.3. Basic OCP Signals
1.4. Burst OCP Signals
2.1. Flit fields explanation
2.2. Router Arbitration Techniques
2.3. TLM 2.0.1 Phases Interpretation for Routers
2.4. Router Area Consumption on Virtex5 (XC5VFX30T-1FF665)
2.5. VHDL-SystemC equivalence of NIC blocks

List of Figures

1. OSI Protocol Stack
1.1. Shared Memory Model
1.2. Distributed Memory Model
1.3. Common Network Topologies
1.4. Ad-Hoc Network Topologies
1.5. Packet Switching on NoCs
1.6. Guidelines for selecting a Routing Algorithm
1.7. Turn model for Adaptive Routing
1.8. West First Routing Examples
1.9. North Last Routing Examples
1.10. Negative First Routing Examples
1.11. Transaction Level Modelling Use Cases, Coding Styles and Mechanisms
1.12. TLM Transaction Flow
1.13. TLM Base Protocol Phases
1.14. OCP Read Transaction
1.15. OCP Burst Write Transaction
2.1. Head Flit Structure
2.2. Message Structure
2.3. Torus Topology NoC
2.4. Block diagram of a Router with Virtual Channels
2.5. Virtual Channel connections to Router
2.6. General Router Block Diagram
2.7. Link Utilisation for Hotspot 10% Traffic going Right - Down
2.8. Link Utilisation for Hotspot 10% Traffic going Left Up
2.9. Timing statistics for Hotspot 10% Traffic
2.10. Link Utilisation for Matrix Transpose Traffic going Right - Down
2.11. Link Utilisation for Matrix Transpose Traffic going Left Up
2.12. Timing statistics for Matrix Transpose Traffic
2.13. VHDL Router Black Box
2.14. VHDL Block Diagram for the XY Routing Module
2.15. VHDL Block Diagram for Input Port Module
2.16. VHDL Block Diagram for Output Port Module
2.17. VHDL Block Diagram for the Router
2.18. TLM Phases in a NIC Read Operation
2.19. TLM Phases in a NIC Write Operation
2.20. VHDL Block Diagram for Network Interface Card
2.21. State Machine for the Handshaking Control
2.22. State Machine for the NIC End to End Flow Control
2.23. NoC Performance for a 4 × 4 Matrix Multiplication
2.24. NoC Performance for a 8 × 8 Matrix Multiplication
2.25. Total simulation time at each node with North Last Routing
2.26. Total simulation time at each node with West First Routing

Introduction

Multiprocessor systems on chip (MPSoCs) are becoming ubiquitous platforms in current devices; large-scale integration of modules has permitted the successful creation of multi-core architectures (2 to 20 cores) in the past, and now provides the means and technology for developing the so-called many-core ones (hundreds of processors). These platforms, however, require better design practices in order for both hardware and software entities to be ready for release on time.

One of the most recent proposals for creating complex embedded systems and many-core platforms is known as Networks on Chip (NoC); instead of using bus systems, interconnections between components are made through routers and Network Interface Cards (NIC). Whilst routers are in charge of transporting data throughout the chip, NICs gather information from/to end-modules (ports, memories, cores, etc.) and send it to the router network for delivery.

In spite of being hardware, a NoC, due to its complexity, can be better designed from higher levels of abstraction than traditional RTL/HDL; newer languages such as SystemC are more appropriate for this task, as they can be used to rapidly create software representations of hardware (known as virtual platforms) [3]. SystemC is currently an IEEE standard for high-level modelling with which C++ descriptions of hardware platforms can be made; it includes a release called Transaction Level Modelling (TLM 2.0) that was designed to speed up the development of embedded platforms, as it abstracts away unnecessary physical communication details such as clocks or pin-out specifications. In addition to all C++ features, SystemC and TLM 2.0 provide libraries that ease the emulation of the real platform with as much timing detail as needed.
To this date, however, most NoC designs are conceived from lower abstraction levels; according to [1], simulation and synthesis are the most common ways to evaluate them; only a few have been synthesised as ASICs and the rest are implemented on FPGAs. High-level models of NoCs have also been proposed: in [6] a 4x4 mesh network is built and simulated with SystemC; [7] creates C++ libraries to make a simulator for NoCs; reference [8] designs a 6x6 mesh network and tests it with SystemC and VHDL before implementing it on an FPGA; a basic, low-level SystemC description of a scalable NoC is presented in [9] and validated with an MPEG encoder. Many more proposals can be found in [1] and [2], yet none of them uses the TLM 2.0 standard for design.

As suggested by some of the previous references, and especially by [5], higher abstraction levels are needed when constructing these architectures; to see this, consider the specifications for network design defined by the OSI reference model [4]. Although not all of OSI's layers have a direct equivalent in NoCs, most of its principles can be extrapolated to this field. In Figure 1, the OSI protocol stack is shown; for this work, the three upper layers can be merged into a single one. From the figure, the dependency between hardware and software is evident when reading it from top to bottom; although it is possible to test each group separately, SystemC descriptions and a correct application of the TLM standard can help co-design the NoC as a whole and iteratively improve it in both aspects.

Low-level layers of the protocol define everything related to routers: architecture, routing algorithms, flow control, switching techniques, etc.; higher ones determine the NIC's structure. Because bus systems are the medium through which processors and most peripherals transfer information, NICs use them to interchange data with end modules. Due to the great variety of bus specifications, this work considers only the most common ones, that is, AMBA from ARM [11] and OCP [10] from the OCP-IP group; the latter was selected for its simplicity and strong support from the ESL community.
Figure 1: OSI Protocol Stack. Networks are usually defined according to the layers shown.

On the other hand, and as previously stated, the top layers of the OSI model can be summarized into a larger group that refers to the software models necessary to access the network; there are mainly two approaches in the field of multi-processor programming: shared or distributed memory. As its name indicates, shared memory implies that processing units can access the same physical or logical memory spaces at any time; a well-known API implementation of this model is OpenMP [12], which is led by OpenMP-ARB, a non-profit corporation composed of several companies and researchers. The distributed model, on the contrary, assigns separate physical memory sectors to each unit, and data is shared between modules through message passing. One of the first API implementations for this model is MPI [13], which is still developed by the MPI Forum. Message passing has been widely applied in computer networks and appears better suited to them, as computers do not share the same memory. For the purposes of this work, some of the MPI specifications were adopted for the NIC design.

The following sections provide better insight into all the topics already mentioned; a functional TLM 2.0 model of a Network On Chip is proposed and validated through simulations; traffic patterns and other NoC parameters are analysed with the design; and finally a VHDL synthesis is presented to evaluate area consumption.

Chapter 1
Networks On Chip Design

To properly approach NoC design, several aspects need to be defined in terms of the aforementioned OSI model. Through the evaluation of each layer, all aspects needed for our high-level model will be defined.

1.1. Parallel Computing Memory Model

Parallel computing has faced big challenges since its creation: task dependencies, race conditions, mutual exclusion and parallel slowdown are concerns that cannot be ignored. Whether memory is shared or distributed, these are software aspects that have to be solved at a high level and so represent an additional task for the programmer. Because none of these issues favours one memory model over the other, it is necessary to consider elements that clearly affect the decision: portability and scalability.

A shared memory configuration is shown in Figure 1.1; if a program is to be run on such a platform, either a compiler aware of all system resources has to be provided along with the hardware, or the programmer has to know low-level details in order to write an application for it. Apart from that, if the number of cores is modified, the cost at the software level should not be high and, again, will depend on either the compiler or the programmer.

Distributed memory is shown in Figure 1.2; as indicated, processors interchange information through messages. In contrast to the previous approach, no additional compiler or deep knowledge of the hardware is needed; only network access methods are required. If the number of cores changes, a correctly parametrized software description would solve the problem. Many more pros and cons of each configuration could be mentioned, but that goes beyond the scope of this work; it suffices to state that the distributed memory model better suits the NoC's behaviour and is the one implemented here.

Figure 1.1: Shared Memory Model. All processing elements share a big memory area; each core may have as many caches as desired, yet the main memory is common to all of them.

Figure 1.2: Distributed Memory Model. Interconnection between cores is done through a network; if data has to be shared, it is sent via message passing.

1.2. Networks On Chip

Designing Networks On Chip is a process that requires consideration of several variables in order to separate communication from computation. The OSI model shown in Figure 1 can be taken as a reference for these systems; to understand this association, each level of the stack can be defined as follows:

1. Physical Layer: Defines voltage levels, length and width of wires, timing details and topology, among others.
2. Data Link Layer: It is the one in charge of safe data delivery; it specifies flow control mechanisms between hardware modules.
3. Network Layer: Controls message delivery from one node to another. It is responsible for storing data and implementing routing algorithms.
4. Transport Layer: Is in charge of establishing connections between end-nodes and providing the information for them. This module (un)packages data and sends (receives) it to (from) the routers.
5. Session, Presentation and Application Layers: Can be condensed into a single Application group for NoCs, which refers to higher-level aspects of the communication such as software.

By following the above scheme, it is possible to define a functional and synthesizable NoC model that considers all its aspects. Although a high-level SystemC model of the network is constructed, hardware details are considered for future implementation. The following sections show the specifications on each layer for the model developed.

1.2.1. Physical Layer: Topology

Some aspects of the physical layer depend on the technology to be used for fabrication and cannot be specified from the beginning; operating frequency and voltage levels are examples of such limitations. Because synthesis is not the main target of this work, these items were set aside. Bus width was selected to match most standard processors nowadays, that is, 32-bit ones.

Another important issue, and perhaps the most relevant on this layer, is the topology; contrary to computer networks, NoCs have a fixed structure that cannot be modified for the rest of the chip's lifetime. On this subject, several configurations have been proposed: Figure 1.3 illustrates the most common topologies for networks; SPIN [14], Mesh, Torus, Folded Torus, Octagon and Trees are a few examples. According to [1], Mesh and Torus topologies constitute 62% of the overall designs; trees represent 12% and the rest, smaller percentages. There are also specific ad-hoc implementations, which can be seen in Figure 1.4; they differ in the addition of links and the combination of basic structures. Despite reducing worst-case paths or improving latency, their cost in area consumption and in the creation of new routing algorithms might be too high.

The guidelines for picking a topology were its scalability and the availability of routing algorithms. As mentioned before, Mesh and Torus structures are used by the majority of researchers, mainly because of their scalability: the cost of adding one or two cores to a grid is low, as it does not critically change the structure; additionally, routing algorithms need not be modified. The difference between them lies in the Torus's wrap-around (turnaround) links, which can significantly reduce some of the worst-case conditions. In reference [18], both structures show similar behaviour in power consumption, throughput and saturation, but Torus topologies perform better with adaptive routing algorithms which, as will be seen in the next section, are needed. After considering all previous restrictions and the results of the cited references, the Torus topology was selected for this work.

1.2.2. Data Link Layer: Flow Control

Routers are complex modules that use simple handshaking protocols to transfer data. Whether interacting with another router or with a network interface card (NIC), the mechanism is the same.
Again, some differences exist when compared to computer networks; modules inside the same chip transmit data in a much more reliable way than physically separated ones, so it suffices to control when to send and receive information, assuming that it is properly transmitted. Some router implementations such as Æthereal [19] or MANGO [20] offer Quality of Service (QoS) guarantees, but this requires highly specialized work that goes beyond the scope of this thesis.

Figure 1.3: Common Network Topologies. The most common structures, for both computer networks and NoCs, are shown in the graphic.

Figure 1.4: Ad-Hoc Network Topologies. Academic proposals for NoC topologies: Mesh Connected Crossbars [16] (left), Spidergon [17] (center), and Diagonal Mesh [15] (right).

Flow control techniques are shown in Table 1.1; most implementations use the Credit Based approach; STALL/GO has never been implemented, and the rest of the literature uses handshaking and ACK/NACK-like solutions. The handshaking approach is adopted for our design. It is important to note that flow control in the NoC's SystemC model is abstracted with the TLM 2.0 standard and may correspond to any of the available techniques when the design is ready for synthesis.

Table 1.1: Flow Control Techniques for NoCs [2].

Name | Description
Credit Based | Every router keeps an internal counter of the spaces available for data storage (credits); once a new space is free, a credit is sent back to inform its availability.
Handshaking Signal Based | A VALID signal is sent whenever a flit is transmitted. The receiver acknowledges by asserting a VALID signal after consuming it.
ACK/NACK | A copy of a data packet is kept in a buffer until an ACK is received; if asserted, the flit is deleted. If a NACK signal is asserted, the flit is scheduled for retransmission.
STALL/GO | Two wires are used for flow control; when a buffer space is available, a GO signal is activated. When no space is available, a STALL signal is asserted.
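The handshaking row of Table 1.1 can be illustrated with a small behavioural sketch. It is written in Python rather than SystemC purely for brevity, and everything in it is an assumption made for illustration: the class and function names, the cycle-based loop, and the simplification that the receiver acknowledges a flit as soon as it has buffer space for it. It is not the thesis's router model.

```python
from collections import deque


class HandshakeReceiver:
    """Receiver side: a small input buffer that acknowledges each accepted flit."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.buffer = deque()

    def offer(self, flit):
        """Sender asserts VALID with `flit`; True (ACK) only if it was stored."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(flit)
            return True
        return False

    def consume(self):
        """Downstream logic drains one flit, freeing buffer space."""
        return self.buffer.popleft() if self.buffer else None


def send_all(flits, receiver, drain_every=2):
    """Send flits in order, retrying the same flit every cycle until it is
    acknowledged. `drain_every` models a consumer that frees one buffer slot
    every N cycles. Returns (cycles spent sending, flits delivered in order)."""
    cycles = 0
    pending = deque(flits)
    delivered = []
    while pending:
        cycles += 1
        if receiver.offer(pending[0]):      # VALID asserted, ACK received
            pending.popleft()
        if cycles % drain_every == 0:       # slow consumer frees a slot
            flit = receiver.consume()
            if flit is not None:
                delivered.append(flit)
    while receiver.buffer:                  # drain whatever is still buffered
        delivered.append(receiver.consume())
    return cycles, delivered
```

With a 1-flit buffer and a consumer that drains every second cycle, three flits need five sending cycles; if the consumer drains every cycle, they need three. Order is preserved in both cases because an unacknowledged flit is simply offered again.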

1.2.3. Network Layer: Switching Policy and Routing Algorithm

The switching policy determines the way information is transmitted; it can be either packet or circuit switched. Circuit switching is the least implemented; it requires that a path from source to destination be reserved before transmitting data and released only after the message has been fully delivered. This policy is expensive in time and may increase network congestion, because messages can be blocked for a long time if the data is large; such a situation may easily lead to deadlock.

Packet switching is widely used in both computer networks and on-chip ones; it can be implemented in any of the following three versions:

1. Wormhole: Packets are split into smaller units called flits (Flow Control Units). The head flit contains the address information, which each router uses to forward it towards the destination; body flits follow it in a worm-like way. Only a 1-flit space is necessary on each router input.
2. Store and Forward: Routers accept and send data only when there is enough capacity to store the whole packet. A minimum space equal to the maximum packet length is required per router.
3. Virtual Cut Through: Data is transmitted per flit but is only accepted when there is enough buffer space to save the whole packet; all routers must be able to store at least the maximum packet length.

Figure 1.5 illustrates how information is transmitted through packet switching techniques; around 80% of proposed NoCs implement the wormhole one because of its low area requirements; wormhole switching was also selected for this work given those advantages.

Figure 1.5: Packet Switching on NoCs: Wormhole (left), Store and Forward (center) and Virtual Cut Through (right) [19]. Only the wormhole technique significantly reduces area consumption.

Another item addressed by the Network Layer that highly affects the platform's performance is the routing algorithm; because of the Torus's resemblance to Mesh arrangements, most algorithms that work for the latter may as well operate on Torus networks with minor modifications. A good guideline for selecting an appropriate algorithm, irrespective of the structure, is the scheme shown in Figure 1.6; several router implementation details can be established from that graph. Router complexity increases with the number of destinations it can deliver information to; due to area restrictions and the possibility of solving it at the software level, multicast routing is discarded for the current work. Routing decisions also determine the chip's design: centralized routing requires a controlling entity, aware of all nodes and traffic throughout the network, to decide how the information should traverse it; source routing might increase the packet's size for long paths; and multiphase routing shares some of the previous problems. Distributed routing is by far the best suited for NoCs and facilitates the adoption of the algorithms proposed.

As for implementation, both lookup tables and FSMs are feasible; the area cost of both options is similar and does not affect the design drastically. One variable that could determine which to choose is whether the algorithm is deterministic (always the same path between two nodes) or adaptive (relies on network congestion). Because a high-level model of the network will be created, tests are to be carried out with both deterministic and adaptive algorithms; adaptive ones can be backtracking (fault tolerant), mis-routing (can route away from the destination if necessary) and partial (do not consider all possible routing paths).

Figure 1.6: Guidelines for selecting a Routing Algorithm [2].

For grid-like structures, the most common deterministic algorithm is the XY one, where information travels in the X direction until it reaches the X coordinate of the destination; it then travels in the Y direction. Adaptive routing is more complex, as it attempts to send data through less congested paths that are not always minimal; because of that, two conditions that usually restrict an algorithm's adaptability are deadlock, where several messages block each other's path and can never advance, and livelock, where data keeps travelling throughout the chip without ever reaching the target.

A few widely adopted semi-adaptive, deadlock-free and livelock-free algorithms are known as turn model solutions [21], [22]; of all possible 90° turns, two are prohibited in order to avoid deadlock. Figure 1.7 shows three algorithms derived from this theory. To better understand each one, a brief explanation, taken from [23], is presented:

- West-First: Packets should start by going west if necessary; they are then routed adaptively south, east and north. The prohibited turns are the two to the west. Figure 1.8 shows some path examples with this algorithm.
- North-Last: When going north, packets cannot turn anywhere else; packets may only go north when that is the last direction to take. Examples are shown in Figure 1.9.
- Negative-First: The prohibited turns are the two from a positive direction to a negative one; if a packet has to go in a negative direction, it must start in that direction. Figure 1.10 exemplifies this behaviour.

Any of the aforementioned algorithms can be used with the SystemC model of the network, as describing them does not require much development time; studies shown in [18] demonstrate that no significant difference among them exists. Other algorithms have been proposed in [24], [25], [26], [27] and many more references, but they are left for future work.

1.2.4. Transport Layer: Network Interface Card

Up to this point, most design specifications affected the router's final structure; this layer, however, has more implications for the Network Interface Card. The problems to be solved at this level are end-to-end flow control and (un)packing of information.

In order to control packet injection into the network, our NIC design is based on the message passing model previously mentioned. The way processors communicate with each other can be summarized in two activities: sending and receiving data; for each message transmitted by a core (write operation), another core should be expecting it (read operation).
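The (un)packing task named above, combined with the wormhole flits of Section 1.2.3, can be sketched as follows. This is only an illustration under stated assumptions: the flit layout (a head flit carrying just a destination identifier, and a final flit tagged as a tail) and the names `Flit`, `pack` and `unpack` are hypothetical and do not reproduce the flit format defined later in the thesis; only the 32-bit word width comes from Section 1.2.1. Python is used instead of SystemC for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    kind: str      # "HEAD", "BODY" or "TAIL" (the TAIL marker is an assumption)
    payload: int   # destination id for HEAD, one 32-bit data word otherwise


def pack(dest, words):
    """Sending NIC: one head flit with the destination, then one flit per word."""
    if not words:
        raise ValueError("message must contain at least one word")
    flits = [Flit("HEAD", dest)]
    for i, word in enumerate(words):
        kind = "TAIL" if i == len(words) - 1 else "BODY"
        flits.append(Flit(kind, word & 0xFFFFFFFF))   # keep words 32 bits wide
    return flits


def unpack(flits):
    """Receiving NIC: rebuild (destination, words) from flits arriving in order."""
    head, *rest = flits
    if head.kind != "HEAD" or not rest or rest[-1].kind != "TAIL":
        raise ValueError("malformed packet")
    return head.payload, [f.payload for f in rest]
```

Because only the head flit carries routing information, the data flits can follow it through the network in a worm-like way, and the receiving NIC needs nothing more than in-order arrival to restore the message.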

Figure 1.7: Turn model for Adaptive Routing. Two turns are prohibited on each model to avoid deadlocks; minimal and non-minimal paths are possible from all options [22].

Figure 1.8: West First Routing Examples [23].

Figure 1.9: North Last Routing Examples [23].
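As a complement to the figures, the deterministic XY algorithm and the minimal form of West-First routing can be sketched in a few lines. The sketch rests on assumptions that are not taken from the thesis: nodes are (x, y) pairs on a plain mesh without the Torus wrap-around links, x grows eastwards and y grows northwards, and the function names are hypothetical. Python is used instead of SystemC for brevity.

```python
# Unit moves for each output direction (assumed axes: x grows east, y grows north).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_hop(cur, dst):
    """Deterministic XY: correct the X coordinate first, then the Y coordinate."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "LOCAL"


def west_first_candidates(cur, dst):
    """Minimal West-First: if the destination lies to the west, west is the only
    choice; otherwise any productive direction among east, north and south may
    be taken, so the packet never has to turn towards the west later on."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ["W"]
    out = []
    if dx > cx:
        out.append("E")
    if dy > cy:
        out.append("N")
    if dy < cy:
        out.append("S")
    return out or ["LOCAL"]


def route(src, dst, choose):
    """Follow `choose(cur, dst)` hop by hop and return the list of visited nodes."""
    path, cur = [src], src
    while cur != dst:
        ox, oy = STEP[choose(cur, dst)]
        cur = (cur[0] + ox, cur[1] + oy)
        path.append(cur)
    return path
```

For example, `route((0, 0), (2, 1), xy_next_hop)` visits (1, 0) and (2, 0) before turning north, whereas `west_first_candidates((0, 0), (2, 1))` offers both east and north, leaving the choice to whatever congestion measure the router uses. A real adaptive router would pick among the candidates using link state; here the caller supplies that choice.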

Figure 1.10: Negative First Routing Examples [23].

It is clear that processors will not be synchronized at all times and that, at a certain point, two or more cores could send messages to another one that is not ready yet; this would only increase network congestion, require retransmission protocols and message discard support, and might also lead to a deadlock at a high level if not properly solved.

Considering the indicated problems, and especially area constraints, the proposed Network Interface Card implements end-to-end flow control with the following protocol: when a core requests data, that is, performs a read operation, its NIC sends a 1-flit packet to the core that is intended to write to it; upon reception, the second NIC sends the information only if the second core has a pending write transaction that matches the requester's address; if the second core does not expect that specific request, its NIC discards it and the first one has to retry after some time. On the other hand, when a NIC receives a write transaction, it starts packing data so that, when a request arrives, most if not all of the information is ready to be transmitted; if an application is properly written, the number of read requests should match the number of write statements.

The cost of such an implementation is that, for every read/write pair, at least one flit has to be sent between two nodes in order to “establish” a connection; this is nonetheless far more efficient than allowing all cores to send their packets at any time and obliging NICs to constantly delete them if they do not correspond to expected transactions.
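The request-then-send protocol described above can be sketched behaviourally as follows. This is an illustrative model under assumptions, not the thesis's NIC: the class `Nic`, the function `read`, the retry limit and the `between_retries` hook are hypothetical names, the network is replaced by a direct method call, and a NIC is assumed to hold at most one pending write per destination. Python is used instead of SystemC for brevity.

```python
class Nic:
    """Behavioural stand-in for a network interface card."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.pending_writes = {}   # destination id -> words waiting for a request

    def write(self, dest, words):
        """Core-side write: data is packed and held until `dest` asks for it."""
        self.pending_writes[dest] = list(words)

    def on_request(self, requester):
        """Network side: a 1-flit read request arrived from `requester`.
        Returns the data if a matching write is pending; otherwise returns
        None, which models the request being discarded."""
        return self.pending_writes.pop(requester, None)


def read(reader, writer, max_retries=3, between_retries=None):
    """Core-side read at `reader`: request data from `writer` and retry when the
    request is discarded. Returns (data or None, number of requests sent)."""
    for attempt in range(1, max_retries + 1):
        data = writer.on_request(reader.node_id)
        if data is not None:
            return data, attempt
        if between_retries is not None:
            between_retries()              # time passes before the next retry
    return None, max_retries
```

A read issued before the matching write exists is simply retried, and a request from a node that the writer is not targeting leaves the pending data untouched, which mirrors the discard rule in the text.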

Other important items regarding the NIC's end-to-end flow control behaviour are:

1. No read transaction requested by a processing element is accepted by the NIC while another read is in progress; violating the sequence of the algorithm can lead to incorrect results.
2. If data is being transmitted, the NIC can accept a read transaction from the processor but will not send the request until the previous transaction has terminated.
3. If a NIC is receiving packets from the network (read transaction), a write transaction can be started from processor to NIC; data can be stored in a send buffer but will not be sent until a request from the correct module is received.
4. A write transaction starts when a processing element sends data to the NIC for transmission. For the processing element, it ends when all the information has been transferred to the NIC; for the latter, when all flits have been injected into the network.
5. A read transaction starts when a processing element requests data from the NIC; it ends when all the information requested is successfully delivered from the NIC to the processing element.
6. Irrespective of the type of transaction a NIC is performing, under no circumstances can it skip the execution order when another read/write transaction is received.
7. The buffer size for storing incoming and outgoing transactions was defined to be 64 words. Separate buffers are implemented to improve performance.

As stated before, the protocol used for communication between the NIC and the processing elements is the OCP-IP one; because it belongs to another section, it is not explained here.

1.3. SystemC and Transaction Level Modelling TLM 2.0

Transaction Level Modelling (TLM) is a standard developed by the Open SystemC Initiative (OSCI) that provides tools to rapidly create virtual descriptions of embedded platforms; its main objective is to decouple computation from communication at a high abstraction level so that complex systems can be modelled. According to the OSCI group [28], simulations run from 10X up to 1000X faster than the corresponding HDL descriptions.

The TLM 2.0 standard allows two coding styles: loosely timed (LT) and approximately timed (AT). When a quick model with little timing detail is required, the loosely timed approach can be adopted; LT transactions are modelled as a single function call (read or write) that either returns after some delay, or returns immediately with an additional delay argument so that the caller reacts after that time. AT descriptions, on the contrary, provide mechanisms for specifying as much timing detail as desired, so they are better suited for architectural analysis and hardware verification. The Network-on-Chip model developed here only uses AT descriptions, and therefore the explanation focuses on them. Figure 1.11 shows the wider context in which TLM 2.0.1 descriptions are worth applying.

The basic unit in all TLM transactions is the object interchanged, the generic payload; it is a C++ class whose members include the minimum elements needed to execute a transaction: command, address and data; apart from those, additional variables such as byte enables, streaming width, bus width, response status, etc., are included to model more complex protocols. Generic payload objects also support user-defined extensions that can carry an unlimited number of attributes if required. Table 1.2 explains the basic attributes mentioned above.
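To make the role of the command, address, data and response status attributes concrete, the following sketch uses a simplified stand-in for the generic payload. It is not the OSCI class: `PayloadSketch`, `Command`, `Response` and `execute` are names assumed for illustration, and the target is assumed to be a plain byte-addressed memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Command { Read, Write };
enum class Response { Incomplete, Ok, AddressError };

// Simplified stand-in for the generic payload (not the OSCI class itself).
struct PayloadSketch {
    Command command = Command::Read;
    std::uint64_t address = 0;
    std::uint8_t* data = nullptr;  // data pointer: buffer owned by the initiator
    std::size_t length = 0;        // data length in bytes
    Response response = Response::Incomplete;
};

// A target (here a byte-addressed memory) executes the transaction in place
// and records the outcome in the response status attribute.
inline void execute(PayloadSketch& trans, std::vector<std::uint8_t>& memory) {
    assert(trans.data != nullptr);
    if (trans.address > memory.size() ||
        trans.length > memory.size() - trans.address) {
        trans.response = Response::AddressError;  // access falls outside memory
        return;
    }
    std::uint8_t* cell = memory.data() + trans.address;
    if (trans.command == Command::Write)
        std::memcpy(cell, trans.data, trans.length);  // initiator -> target
    else
        std::memcpy(trans.data, cell, trans.length);  // target -> initiator
    trans.response = Response::Ok;
}
```

The initiator keeps ownership of the data buffer; the target only follows the pointer, which is why the same object can travel through interconnect modules without being copied.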
All TLM 2.0.1 transactions are carried out between an Initiator and at least one Target; the channel through which they communicate is called a socket, and the only module allowed to start transactions is the Initiator; Target modules can only reply to in-progress transactions; Interconnect modules (such as routers or buses) can also be integrated with the previous ones. Figure 1.12 shows an example of one Initiator, one Interconnect component and a Target.

Figure 1.11: Transaction Level Modelling Use Cases, Coding Styles and Mechanisms [28].

Table 1.2: Generic Payload Attributes according to [29].

Generic Payload Attribute | Meaning
Command | Can be either Write or Read.
Address | Target address at which the transaction is executed.
Data Pointer | Pointer to the data array. Data should be read from or written to this variable.
Data Length | Length of the data to be transferred, computed as BUSWIDTH/4.
Byte Enable Pointer | Used to enable access to specific data bytes.
Byte Enable Length | Specifies the number of valid elements of the byte enable pointer.
Streaming Width | States the number of words per burst transfer.
DMI Allowed | Marks whether the Direct Memory Interface can be used or not.
Response Status | Used for storing the status of the transaction.

AT transactions can be split into 4 phases, as shown in Figure 1.13; communication takes place through functions named non-blocking forward transport (nb_transport_fw) and non-blocking backward transport (nb_transport_bw); both functions have three parameters:

1. Trans: Pointer to the generic payload object.
2. Phase: Current transaction phase; it can be any of those shown in Figure 1.13.
3. Delay: Time that a module has to wait before responding to a transaction.

Initiators call nb_transport_fw, with BEGIN_REQ as the phase argument, to start transmitting data; they use phase END_RESP to conclude a transaction. Targets call nb_transport_bw with phase END_REQ to acknowledge the reception of a transaction and use phase BEGIN_RESP to indicate its correct execution, regardless of whether it is a read or a write.

In some cases it is unnecessary to use all four phases to model a platform's behaviour, e.g. when a write transaction is performed: an initiator (CPU) sends data to a target (memory) which can execute the order immediately; in this case, the target can reply to the initiator with a phase update, changing it from BEGIN_REQ to BEGIN_RESP and adding some delay; each agent becomes aware of such status updates by checking the return value of an nb_transport call. Return values can be TLM_ACCEPTED (no change in phase), TLM_UPDATED (phase updated) or TLM_COMPLETED (transaction executed).

Specific rules concerning each module's permission to modify the generic payload attributes, the possible return values of each nb_transport call, and a detailed explanation of the whole standard can be found in [29].

Figure 1.12: TLM Transaction Flow [28]. The generic payload object is created by the Initiator but is only referenced by interconnect modules or targets. Socket arrows indicate how the information flows.

As previously mentioned, extensions can be added to the generic payload object for routing purposes and can be either global or instance specific, that is, each module can add attributes to the transaction object and be the only one able to access them; this work adds two extensions to the generic payload object: a global one for end-to-end verification purposes, and an instance-specific one for router operation. The next chapter gives more details about this.
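The shortened write exchange described above, where a target answers BEGIN_REQ by moving the phase straight to BEGIN_RESP, can be sketched without the SystemC library. The enumerations below only mirror the standard's phase and return-value names; `target_forward`, `run_short_write` and the 10 ns latency are assumptions made for illustration.

```cpp
#include <cassert>

enum class Phase { BeginReq, EndReq, BeginResp, EndResp };
enum class Sync { Accepted, Updated, Completed };

// Hypothetical target-side forward call. A "fast" target executes the request
// at once and moves the phase to BEGIN_RESP on the return path; a slow one
// only accepts the call and would answer later through the backward path
// (not modelled here).
inline Sync target_forward(Phase& phase, unsigned& delay_ns, bool fast_target) {
    if (phase == Phase::BeginReq) {
        if (!fast_target) return Sync::Accepted;  // phase left unchanged
        phase = Phase::BeginResp;                 // END_REQ is implied
        delay_ns += 10;                           // assumed execution latency
        return Sync::Updated;
    }
    assert(phase == Phase::EndResp);              // only other phase sent forward
    return Sync::Completed;
}

// Initiator side: counts the forward calls needed when the target is fast.
// Returns -1 if the exchange did not complete.
inline int run_short_write(unsigned& delay_ns) {
    Phase phase = Phase::BeginReq;
    int calls = 1;
    Sync status = target_forward(phase, delay_ns, true);
    if (status == Sync::Updated && phase == Phase::BeginResp) {
        phase = Phase::EndResp;                   // conclude the transaction
        status = target_forward(phase, delay_ns, true);
        ++calls;
    }
    return status == Sync::Completed ? calls : -1;
}
```

With a fast target the whole write needs two forward calls and no backward call, because the initiator learns about the phase change from the return value alone.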

Figure 1.13: TLM Base Protocol Phases [28]; the initiator is the vertical line on the left and the target the one on the right.

1.4. Open Core Protocol

The Open Core Protocol International Partnership (OCP-IP) is a community in charge of “proliferating a common standard for intellectual property core interfaces, or sockets that facilitate “plug and play” System-on-Chip design” [10]. Its specification for interconnecting modules is a bus model as complete as ARM's AMBA-AXI and can be fully described with OSCI's TLM 2.0.1 standard. Because of the amount of detail in the OCP, a light version of it is used in this work; all basic signals shown in Table 1.3 are used, and burst support is added. The standard OCP burst extension requires 8 additional signals, of which all but MBurstLength can be skipped; to see how this can be done, consider Table 1.4: MAtomicLength is used when the length of the data is bigger than the word size, which is not the case here; MBurstPrecise indicates that the length of the burst is known at the start of the transmission, as it always is in our design; MBurstSeq specifies how the addresses of the burst are emitted, which in this work are assumed to be incrementing; MBurstSingleReq implies that only one request is made per burst transfer; MDataLast, MReqLast and SRespLast are unnecessary because each module keeps track of the number of data words transferred.

Table 1.3: Basic OCP Signals extracted from [10]. Signal MDataValid is skipped in our implementation. Width measured in bits.

Name | Width | Driver | Function
Clk | 1 | varies | OCP Clock
MAddr | configurable | master | Transfer address
MCmd | 3 | master | Transfer command
MData | configurable | master | Write data
MDataValid | 1 | master | Write data valid
MRespAccept | 1 | master | Master accepts response
SCmdAccept | 1 | slave | Slave accepts transfer
SData | configurable | slave | Read data
SDataAccept | 1 | slave | Slave accepts write data
SResp | 2 | slave | Transfer response

Table 1.4: Burst OCP Signals [10]. MBurstLength alone is enough for this work's NIC.

Name | Width | Driver | Function
MAtomicLength | configurable | master | Length of atomic burst.
MBurstLength | configurable | master | Burst length.
MBurstPrecise | 1 | master | Burst length precise.
MBurstSeq | 3 | master | Address sequence.
MBurstSingleReq | 1 | master | Single request/multiple data protocol.
MDataLast | 1 | master | Last data in burst.
MReqLast | 1 | master | Last request in burst.
SRespLast | configurable | slave | Last response in burst.

To better understand how transfers with the OCP protocol work, consider Figure 1.14; only signal MRespAccept is missing from the diagram, yet the behaviour is practically the same. Figure 1.15 shows a scenario for burst transfers; handshaking is carried out in the same way.

Figure 1.14: OCP Read Transaction [10]; signal behaviour when performing a read request: when the master issues the command, it has to wait for SCmdAccept to be asserted before changing the MCmd line. After some time the slave indicates valid data on the SData bus by issuing a Data Valid response on the SResp line.

Figure 1.15: OCP Burst Write Transaction [10]; signals MBurstSeq and MBurstPrecise never change. Handshaking between master and slave is basically the same as in the previous non-burst example.

Chapter 2

NoC Implementation

Once the implementation details and design flow have been clarified, as in the previous chapter, it is now possible to describe the router and the NIC at any level of abstraction. Although most items regarding each structure are well defined, some aspects still lack specification and are analysed hereafter. The code of each description can be found in the Appendix.

2.1. Flit and Message structure

In order to determine the structure of both the router and the NIC, it is necessary to define the units they are going to deal with: flits and messages. Messages are composed of one or more flits, which are the units injected into the router network; because wormhole routing is used, one of the flits must include information about the origin and destination of the whole message; NICs, however, require additional data fields to properly implement end-to-end flow control. As a start, the explanations provided in Section 1.2.4 and the constraints mentioned in Section 2.3 must be reviewed in order to define all constraints.

If more TCP-like control parameters are needed for high-level control, those parameters must be set by the processing elements and are transmitted to the NIC as ordinary data; NICs only support the minimum number of control fields needed to ensure correct functionality. The head flit structure is displayed in Figure 2.1 and the message structure in Figure 2.2. Flit fields are explained in Table 2.1.

Figure 2.1: Head Flit Structure.

Figure 2.2: Message Structure. Payload can be up to 64 bytes long.

Table 2.1: Flit fields explanation.

Field | Use
Type | Flits can be either Head, Body, Tail or Single; single flits are used to ask for data and for barrier operations.
Source X | Flit's origin X coordinate.
Source Y | Flit's origin Y coordinate.
Destination X | Flit's destination X coordinate.
Destination Y | Flit's destination Y coordinate.
Length | Message length. Maximum 64 words.
Single | Indicates whether the flit is a single-flit transaction or not.
Message Number | Message number stated by the source module.
Broadcast | States whether the message is broadcast or not.
BarrierID | Stores a BarrierID according to the source.
ReadWrite | If the message is single-flit, this bit is set when it is a barrier write.
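As an illustration of how the fields of Table 2.1 can share a single head flit, the sketch below packs and unpacks them with shifts and masks. The bit widths and the field order are assumptions made for this example only (the actual layout is the one drawn in Figure 2.1); they were chosen so that a 4x4 network and a 64-word message fit, and `HeadFlit`, `put`, `get`, `pack` and `unpack` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class FlitType : std::uint8_t { Head = 0, Body = 1, Tail = 2, Single = 3 };

// Field widths below are assumed for illustration, not taken from Figure 2.1.
struct HeadFlit {
    FlitType type;                            // 2 bits
    std::uint8_t src_x, src_y, dst_x, dst_y;  // 2 bits each (assumed 4x4 network)
    std::uint8_t length;                      // 7 bits, enough for 0..64 words
    bool single;                              // 1 bit
    std::uint8_t message_number;              // 8 bits (assumed)
    bool broadcast;                           // 1 bit
    std::uint8_t barrier_id;                  // 4 bits (assumed)
    bool read_write;                          // 1 bit
};

// Append `width` bits of `value` at bit position `pos` of the word.
inline void put(std::uint64_t& word, unsigned& pos, std::uint64_t value, unsigned width) {
    assert(value < (1ull << width));  // the value must fit in its field
    word |= value << pos;
    pos += width;
}

// Extract the next `width` bits starting at bit position `pos`.
inline std::uint64_t get(std::uint64_t word, unsigned& pos, unsigned width) {
    std::uint64_t value = (word >> pos) & ((1ull << width) - 1);
    pos += width;
    return value;
}

inline std::uint64_t pack(const HeadFlit& f) {
    std::uint64_t w = 0;
    unsigned p = 0;
    put(w, p, static_cast<std::uint64_t>(f.type), 2);
    put(w, p, f.src_x, 2);
    put(w, p, f.src_y, 2);
    put(w, p, f.dst_x, 2);
    put(w, p, f.dst_y, 2);
    put(w, p, f.length, 7);
    put(w, p, f.single, 1);
    put(w, p, f.message_number, 8);
    put(w, p, f.broadcast, 1);
    put(w, p, f.barrier_id, 4);
    put(w, p, f.read_write, 1);
    assert(p <= 34);  // must fit in one flit (34 bits in the hardware model)
    return w;
}

inline HeadFlit unpack(std::uint64_t w) {
    unsigned p = 0;
    HeadFlit f{};
    f.type = static_cast<FlitType>(get(w, p, 2));
    f.src_x = static_cast<std::uint8_t>(get(w, p, 2));
    f.src_y = static_cast<std::uint8_t>(get(w, p, 2));
    f.dst_x = static_cast<std::uint8_t>(get(w, p, 2));
    f.dst_y = static_cast<std::uint8_t>(get(w, p, 2));
    f.length = static_cast<std::uint8_t>(get(w, p, 7));
    f.single = get(w, p, 1) != 0;
    f.message_number = static_cast<std::uint8_t>(get(w, p, 8));
    f.broadcast = get(w, p, 1) != 0;
    f.barrier_id = static_cast<std::uint8_t>(get(w, p, 4));
    f.read_write = get(w, p, 1) != 0;
    return f;
}
```

With these assumed widths the header uses 32 bits, so every control field travels in the first flit and body flits can carry payload only.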

Through SystemC descriptions and simulations it was possible to establish the correct behaviour of the platform. Now that the units needed by both the router and the NIC are defined, their designs can be presented.

2.2. Router Architecture

The studies presented in Chapter 1 yielded the following conclusions regarding router implementation:

• Topology: Torus. Displayed in Figure 2.3. Taken from [30].
• Switching Policy: Wormhole Packet Switched.
• Flow Control Technique: Handshaking Signals.
• Routing Algorithms: Deterministic XY and Adaptive Turn Model.

Only two aspects of the router's structure are still undefined: arbitration technique and number of Virtual Channels. When two or more inputs attempt to use the same router output, a mechanism is needed to assign output control. Table 2.2 lists the usual solutions to this problem; most implementations listed in [2] use Round-Robin or First Come - First Served techniques for Best Effort routers and priority approaches for Guaranteed Traffic (GT) ones, such as [19] and [14]. Specialized routers are required when GT services are to be provided; just a few NoCs, like the ones cited, have implemented GT services. Best-effort Round Robin arbitration is used in this work.

Figure 2.3: Torus Topology NoC.

Table 2.2: Router Arbitration Techniques [2].

Arbitration Technique | Policy
Round Robin | Output is assigned equally starting from the first element.
First Come - First Served | Output control is assigned in request order.
Priority Based | All packets are assigned a priority and get output control according to their importance.
Priority Based Round Robin | Round Robin is implemented but a priority proportional to the frequency of usage is assigned.

On the other hand, Virtual Channels (VCs) are buffers added to the router's inputs (or outputs) to alleviate congestion on the network; although the same physical paths are used, the additional buffers decrease the probability of deadlock and improve performance, because delayed messages can wait in routers and still advance towards their destinations. Area is the main cost of adding Virtual Channels and is also one of the most critical issues in embedded system design; because of that, an optimal placement and integration of buffers is required. Figure 2.4 shows a router with input VCs which are, in principle, connected to all possible outputs. The studies in [31] show that, for unicast routing, having one VC per output at each input can reduce area consumption significantly; with this result, the router and VC integration can be seen in Figure 2.5.

Figure 2.4: Block diagram of a Router with Virtual Channels. Area constraints have to be considered to choose an appropriate number of buffers.

Now that all specifications related to the router's behaviour are defined, a high-level block diagram of it can be constructed; no major implementation details are shown, since it is an abstraction of the real hardware and all functional blocks are described in software. Figure 2.6 shows the general block diagram that is used to describe the TLM model of the router.

Figure 2.5: Virtual Channel connections to Router. A single VC per output is available at each input so as to decrease area consumption. Extracted from [31].

Figure 2.6: General Router Block Diagram. Four virtual channels are placed at each input to reach all possible outputs; no packets are routed back through the same input.
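The best-effort Round Robin policy selected for output arbitration can be sketched as a small rotating-priority arbiter for one output port. This is a generic illustration, not the thesis implementation: the class name `RoundRobinArbiter`, its `grant` method and the convention of returning -1 when no input requests the output are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Round-robin arbiter for one output port (illustrative names).
template <std::size_t Inputs>
class RoundRobinArbiter {
    static_assert(Inputs > 0, "at least one input is required");

public:
    // Returns the index of the granted input, or -1 if nobody is requesting.
    // The search starts just after the last winner, so every requester is
    // served after at most `Inputs` grants.
    int grant(const std::array<bool, Inputs>& request) {
        assert(last_ < Inputs);
        for (std::size_t i = 1; i <= Inputs; ++i) {
            std::size_t candidate = (last_ + i) % Inputs;
            if (request[candidate]) {
                last_ = candidate;
                return static_cast<int>(candidate);
            }
        }
        return -1;
    }

private:
    std::size_t last_ = Inputs - 1;  // so that input 0 is examined first
};
```

In a wormhole router the grant would be held until the tail flit of the packet leaves, and only then would `grant` be called again; that holding logic is left out of this sketch.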

2.2.1. Router TLM Model

SystemC's Transaction Level Modelling is a standard for decoupling communication from computation in high-level designs; most mechanisms offered by the standard map easily onto bus models because buses are the traditional way to interconnect Systems-on-Chip (SoCs). As routers use different flow control techniques from traditional bus systems, a different interpretation of the TLM 2.0.1 phases is required for the model to remain faithful to the hardware. Table 2.3 explains the meaning of each phase for inter-router, packet-based communication.

Table 2.3: TLM 2.0.1 Phases Interpretation for Routers.

Phase | Flow Direction | Meaning
BEGIN_REQ | Init. Router to Target Router | Flit is being transmitted.
END_REQ | Target Router to Init. Router | Flit is stored, can be erased on initiator.
BEGIN_RESP | Target Router to Init. Router | A new space is free. Can send more flits.
END_RESP | Init. Router to Target Router | Final reply.

Another addition to the TLM 2.0.1 base protocol described in the previous chapter is the use of routing extensions; as mentioned before, extensions can be locally or globally accessed. The proposed model uses both for debugging and verification purposes; a local extension is added to every transaction by each router it traverses. It has the following fields:

(a) Port: Stores the number of the incoming port through which the transaction entered.
(b) Port VC: Stores the number of the outgoing port through which the transaction will go out.
(c) TimesBlocked: Counter that is increased by one each time a router attempts to transmit the transaction without success. This allows deadlock situations to be recognised.

The global extension is created by the initiator, can be accessed by all modules and adds the following information to the transaction object:

(a) MainInitiator: Stores the ID of the module that first issued the transaction.
(b) FinalTarget: Stores the ID of the module where the transaction is to be delivered.
(c) TransID: Records the transaction number for debugging purposes.
(d) FlitType: Stores the type of flit of the current transaction.
(e) TransCounter: Incremented every time a transaction passes through a router.
(f) TransPath: Array for storing the path the flit follows. Used for debugging.

At this point it is necessary to clarify that there are four types of flits: head flits, which contain routing information; body flits, which are the data itself; tail flits, which mark the end of a packet (and may or may not contain data); and full flits, which are single-flit messages used for (a) sending read requests from one core to another (end-to-end flow control) and (b) single-flit writes used for barrier operations.

The SystemC implementation of the router is composed of five functions that act on each port:

I. Non-blocking Transport Forward: A standard mandatory function that receives three parameters: a transaction pointer, a TLM phase argument and a time value called delay. When a module wants to send a flit, it calls this function with those parameters and BEGIN_REQ as the phase argument; the delay is the time after which the target has to react to this call. The function checks the type of flit and the space availability, computes the output, returns TLM_ACCEPTED and tells the simulator to execute Forward Payload Event Queue at the time indicated by the delay. The detailed behaviour of this method is shown in Algorithm 2.1.
II. Non-blocking Transport Backward: Also a mandatory function; it receives the same three parameters, but the valid phase arguments are END_REQ and BEGIN_RESP. If END_REQ is received, the Backward Payload Event Queue method is notified for execution after the delay time; if BEGIN_RESP is received, the Transaction Update method is notified.
III. Forward Payload Event Queue: Function invoked by nb_transport_fw; it takes the transaction object, stores it in the corresponding Virtual Channel and notifies the Transaction Update method to be executed after an internal delay time. It also returns phase END_REQ to the initiator to acknowledge the correct storage of the transaction.
IV. Backward Payload Event Queue: Function invoked by nb_transport_bw, in charge of double checking that the transaction is correct. It notifies the Transaction Update method for immediate execution.
V. Transaction Update: Considered the brain of the router; it starts transactions previously stored in the VCs, deletes transactions already sent, notifies modules of the availability of new spaces, if there are any, and implements output arbitration. Algorithm 2.2 describes the method thoroughly.

2.2.2. Traffic Evaluation and Routing Algorithm Testing

MPSoC platforms are generic systems that can implement any algorithm whose inter-module traffic can be known once task partitioning is done; because it is uncertain which application will be executed on such platforms, it is necessary to test synthetic traffic patterns on the chip to establish its performance under random circumstances. There

Algorithm 2.1 Non-blocking Transport Forward

Require: Transaction object, phase, delay

if phase = BEGIN_REQ then
    if (FlitType = Head) or (FlitType = Full) then
        OutPort = Value returned by Routing Algorithm
        if (VC Empty) then
            Reserve Virtual Channel
            Set response status to TLM_OK_RESPONSE
            if (OutPort Free) then
                Take control of OutPort
            end if
        else
            Set response status to TLM_GENERIC_ERROR_RESPONSE
            Return TLM_ACCEPTED
        end if
        Notify Forward Payload Event Queue to execute after delay time
        Decrease VC Space
        Return TLM_ACCEPTED
    else if (VC has space) then
        Set response status to TLM_OK_RESPONSE
        Decrease VC Space
        Notify Forward Payload Event Queue to execute after delay time
        Return TLM_ACCEPTED
    else
        Set response status to TLM_GENERIC_ERROR_RESPONSE
        Return TLM_ACCEPTED
    end if
else if phase = END_RESP then
    Return TLM_ACCEPTED
else
    Abort Execution
end if
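The branches of Algorithm 2.1 can be mirrored in plain C++ for a single virtual channel. The sketch below is library-free and heavily simplified: the routing computation and the payload event queue are omitted, the VC depth of 4 flits is assumed, and `VirtualChannel`, `Outcome` and `forward` are hypothetical names. Since every legal path of the algorithm returns TLM_ACCEPTED, only the response status is reported, folded into `Outcome`.

```cpp
#include <cassert>

enum class Phase { BeginReq, EndReq, BeginResp, EndResp };
enum class FlitType { Head, Body, Tail, Full };

// Stored   : response status OK, VC space decreased, event queue notified.
// Rejected : generic error response, nothing stored.
// Ignored  : END_RESP received, nothing to do.
enum class Outcome { Stored, Rejected, Ignored };

struct VirtualChannel {
    int capacity = 4;       // assumed depth in flits
    int free_slots = 4;
    bool reserved = false;  // true while a packet owns this VC
    bool empty() const { return free_slots == capacity; }
};

// `out_port_owner` is -1 while the output port is free.
inline Outcome forward(VirtualChannel& vc, int& out_port_owner, int input_id,
                       Phase phase, FlitType type) {
    if (phase == Phase::EndResp) return Outcome::Ignored;
    assert(phase == Phase::BeginReq);  // the algorithm aborts on other phases
    if (type == FlitType::Head || type == FlitType::Full) {
        if (!vc.empty()) return Outcome::Rejected;  // VC holds another packet
        vc.reserved = true;                         // reserve the virtual channel
        if (out_port_owner < 0) out_port_owner = input_id;  // take the output
    } else if (vc.free_slots == 0) {
        return Outcome::Rejected;                   // no room for body/tail flit
    }
    --vc.free_slots;                                // flit will be stored
    return Outcome::Stored;
}
```

A rejected flit is not lost in the model: the sender keeps it and tries again once BEGIN_RESP announces that a slot has been freed.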

Algorithm 2.2 Transaction Update Method Implemented by Routers

Require: Virtual Circuit to Update
Require: InputPort, OutputPort

if A transaction on VC can be started then
    Call non-blocking forward method on the next module with phase BEGIN_REQ
end if
for i = 0 to VCSize do
    if A transaction can be freed then
        Delete transaction
        Increase VC space
        Call non-blocking backward method on the previous module with phase BEGIN_RESP to indicate that a new space is available
        if Transaction is type "Tail" then
            Free Virtual Circuit
            Stop controlling Output Port
        end if
    end if
end for
if Output Port is not busy then
    for i = 0 to Number of Router Inputs do
        NewInput = InputPort + 1
        if NewInput is ready to use OutputPort then
            Give NewInput control of OutputPort
            Execute again from the start
        end if
    end for
end if

are a few typical tests conducted on NoC designs that help assess the performance of routing algorithms:

Uniform Traffic: Nodes communicate with each other with the same probability.

Matrix Transpose Traffic: Each node sends messages only to the destination whose address has the upper and lower halves of its own address swapped.

Hotspot Traffic: Each node sends messages to the other nodes with equal probability, except for a specific node (called Hotspot) which receives messages with a greater probability. The percentage of additional messages that the Hotspot node receives compared to the other nodes is indicated after the Hotspot name, e.g. Hotspot 15%.

Complement Traffic: Each node sends messages only to the node whose address is the one's complement of its own.

Several scenarios were tested under some of these traffic conditions, and three routing algorithms were implemented: West-First (adaptive), North-Last (adaptive) and XY (deterministic). Additionally, an aspect that had not been studied yet, VC depth, was also considered; the results can be seen in the following figures.

Figures 2.7 and 2.8 show link utilisation under Hotspot 10% (on node 7) traffic conditions; two groups of figures are provided because all links transmit information in both directions (up-down or right-left), and plotting them separately makes link congestion easier to distinguish. XY routing in Figure 2.7(a) has 4 high-traffic links (higher bars); West First in Figure 2.7(b) presents only two congested links, and North Last in Figure 2.7(c) just one. In the other direction, Figure 2.8(a) shows XY with 2 congested links, Figure 2.8(b) presents West First behaviour with 1 high-traffic link, and Figure 2.8(c) has 2 congested links with North Last routing. From these figures, West First and North Last routing appear to spread traffic better along the network, as each has only 3 high-traffic links in total across both graphs, against 6 for XY. No turnaround links show significant utilisation despite the adaptiveness of those algorithms.

In order to check the overall behaviour of all routing algorithms under this traffic pattern, plots for average flit latency and total simulation time are shown in Figure 2.9. From Figure 2.9(a) it can be seen that the greater the Virtual Channel depth, the higher the flit latency on the network; this is because NICs can inject more packets into the network at any given time; the adaptive algorithms show lower values than XY, indicating that information is forwarded faster with them. Figure 2.9(b) gives more information about routing performance; for large messages, XY and North Last perform better than West First regardless of the Virtual Channel depth. For shorter transmissions, West First decreases simulation time. In general, the results for this traffic pattern are very close to each other and might need a deeper analysis before a routing decision can be made.

A second pattern was studied under the same conditions as before; Matrix Transpose traffic was implemented and the results are shown in Figures 2.10, 2.11 and 2.12. From graphs 2.10(a) and 2.11(a), 8 congested links can be distinguished when using XY routing; West First behaviour, shown in 2.10(b) and 2.11(b), presents only 2 high-traffic links, as does North Last routing in 2.10(c) and 2.11(c).

Although, as before, the adaptive algorithms attempt to distribute traffic better throughout all available paths, the results in Figure 2.12 show that long messages with low Virtual Channel depth reach their destination faster with XY routing, which is also the one with the lowest flit latency; medium-size messages are better suited to West First routing under this pattern.
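The two deterministic synthetic patterns listed earlier in this section, Matrix Transpose and Complement, reduce to simple bit operations on the node address. The sketch below assumes nodes are numbered with `addr_bits` bits (4 bits for the 16-node network simulated here); the function names are illustrative.

```cpp
#include <cassert>

// Matrix Transpose: swap the upper and lower halves of the node address.
// The address width is assumed to be even so that it splits into two halves.
inline unsigned transpose_destination(unsigned node, unsigned addr_bits) {
    assert(addr_bits % 2 == 0 && addr_bits < 32);
    unsigned half = addr_bits / 2;
    unsigned mask = (1u << half) - 1u;
    unsigned low = node & mask;
    unsigned high = (node >> half) & mask;
    return (low << half) | high;
}

// Complement: one's complement of the node address, limited to addr_bits.
inline unsigned complement_destination(unsigned node, unsigned addr_bits) {
    assert(addr_bits < 32);
    unsigned mask = (1u << addr_bits) - 1u;
    return ~node & mask;
}
```

For example, with 4-bit addresses node 7 (0111) sends to node 13 (1101) under Matrix Transpose and to node 8 (1000) under Complement, while nodes whose two halves are equal, such as node 5 (0101), map to themselves under Matrix Transpose.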

Figure 2.7: Link Utilisation for Hotspot 10% Traffic going Right - Down. (a) XY Routing. (b) West First Routing. (c) North Last Routing.

Figure 2.8: Link Utilisation for Hotspot 10% Traffic going Left - Up. (a) XY Routing. (b) West First Routing. (c) North Last Routing.

Figure 2.9: Timing statistics for Hotspot 10% Traffic. (a) Average Flit Latency. (b) Total Simulation Time. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.

Figure 2.10: Link Utilisation for Matrix Transpose Traffic going Right - Down. (a) XY Routing. (b) West First Routing. (c) North Last Routing.

Figure 2.11: Link Utilisation for Matrix Transpose Traffic going Left - Up. (a) XY Routing. (b) West First Routing. (c) North Last Routing.

Figure 2.12: Timing statistics for Matrix Transpose Traffic. (a) Average Flit Latency. (b) Total Simulation Time. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.

Although more traffic patterns could have been evaluated, those presented are enough, within the scope of this study, to show the capabilities of the constructed model. Each of the timing graphs (3D plots) considered 8 message sizes and 8 Virtual Channel depths, that is, 64 simulations in total. Each run used 100 messages on each of the 16 nodes, for a total of 1600 messages, which corresponds to between 6400 and 51200 flits being transported along the NoC.

2.2.3. Router VHDL Model

Once the router's behaviour had been validated with the high-level model presented, a detailed HDL design was implemented; three main blocks compose this design: Input Port, Output Port and Multiplexers. Input ports are in charge of data reception control, routing and Virtual Channel storage; output ports control data transmission and round robin channel arbitration; multiplexers interconnect all input buffers with the router's outputs. Because the flit size is 34 bits, there are 34 lines for data transmission and 34 for data reception; in addition, two lines are used for handshaking transmission control, Tx and Tx Ack, and two lines for reception control, Rx and Rx Ack. In summary, each port has 36 inputs and 36 outputs. Figure 2.13 shows the router's black box.

The Input Port module is composed of 4 FIFOs, a routing unit, one multiplexer and one de-multiplexer; deterministic XY routing was chosen for the state machine implementation. Flow control was implemented according to the studies presented in section ??, where handshaking signals were selected. The control module receives incoming requests from external modules through the Rx input; it sends a request signal to the routing unit, which replies when the output has been computed; after getting a response from the routing module, the flit is stored in the corresponding queue if space is available. When no space is available, the Rx Ack line remains de-asserted.
The routing module was designed as a state machine that uses external comparators to determine whether the coordinates of the destination are larger or smaller than the router's. Depending on the results of all comparisons, an output is computed and stored in an internal register; the block diagram of this module is presented in Figure 2.14 and the one for the whole Input Port is shown in Figure 2.15.

Figure 2.13: VHDL Router Black Box. Five ports are needed for a torus or (for most routers on) a mesh configuration.

The Output Port module is composed of a control unit in charge of arbitrating outputs and negotiating data transmission with another router or NIC; two de-multiplexers, a flit decoder and a multiplexer are also included in this block. The flit decoder is in charge of notifying the control unit when a tail or single flit has been transmitted, so that it assigns the output to another input. A block diagram of this module is shown in Figure 2.16. Finally, the full block diagram of the router is shown in Figure 2.17. The VHDL code is attached at the end, in the Appendix.

Due to the number of inputs/outputs of this module, its implementation is only possible on an ASIC; however, FPGA synthesis allowed us to obtain some information about area consumption. The studies in [32] list statistics about the number of slices consumed on a Virtex-II FPGA; a 5 input/output router consumed 397 slices. Also, [8] obtained a consumption of 1762 CLBs on a Virtex-II 8000 FPGA.

Figure 2.14: VHDL Block Diagram for the XY Routing Module.

Figure 2.15: VHDL Block Diagram for Input Port Module. Four FIFOs are needed to route information to each output port.
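The comparator-based decision of the XY routing module can be written as a short function. This is a behavioural sketch only, not a translation of the VHDL state machine: the port names, the assumption that Y grows downwards, and the omission of torus wrap-around links are choices made for this illustration.

```cpp
#include <cassert>

enum class Port { Local, East, West, North, South };

// Deterministic XY routing: correct the X coordinate first, then Y.
// (x, y) is the router position, (dst_x, dst_y) the flit destination.
inline Port xy_route(int x, int y, int dst_x, int dst_y) {
    assert(x >= 0 && y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > x) return Port::East;
    if (dst_x < x) return Port::West;
    if (dst_y > y) return Port::South;  // assumed: Y grows downwards
    if (dst_y < y) return Port::North;
    return Port::Local;                 // destination reached
}
```

Because the X offset is always removed before any Y movement is made, a packet never turns from the Y dimension back into X, which is what makes the algorithm deterministic and deadlock-free on a mesh.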

(57) Figure 2.16: VHDL Block Diagram for Output Port Module. Our router model was synthesised on Virtex5. Virtual Channels were generated as FIFO memories with Xilinx’s IP Core Generator with 16-flit depth. Resource utilisation is shown in Table 2.4. Because low-level detailed designed was not the objective of this work, HDL simulations are skipped on this document yet VHDL code is attached at the Appendices section. It is important to note that pin-out of all previous modules was not enough for synthesizing a single router; however, if more of this modules are embedded, a small network of them can be constructed and the chips could be plugged to external processors with FPGAs outputs.. 2.3. Network Interface Card Architecture. Network Interface design is intended to support and validate message passing transactions which are composed of two tasks for communication, send and receive, and one for syn-. 47.

(58) Figure 2.17:. VHDL Block Diagram for the Router. Multiplexers shown on diagram are the same as the ones shown in Figure 2.16 for data selection.. 48.

(59) Table 2.4: Router Area Consumption on Virtex 5 (XC5VFX30T-1FF665) Device Utilisation Logic Utilisation. .. Used. Available. Utilization. Number of Slice Registers. 730. 20480. 3%. Number of Slice LUTs. 846. 20480. 4%. Number of fully used LUT-FF pairs. 230. 1346. 17 %. Number of bonded IOBs. 372. 360. 103 %. Number of Block RAM/FIFOs. 10. 68. 14 %. Number BUFG/BUFGCTRLs. 1. 32. 3%. chronization called barrier. Those functions were taken from the MPI standard and suffice the functionality required.. 2.3.1. Network Interface TLM Model. SystemC TLM Model of the NIC has one target socket for receiving the core’s transaction, one initiator socket for sending data to the local router and another target socket to get data from it; for each target socket there is a corresponding nb forward transport function and for the initiator socket, a nb backward transport method is provided. On sockets connected to routers, TLM phases are interpreted the same way as stated on Table 2.3, however, on sockets connected to processing elements (end-modules), phases are considered as specified by the standard.in Section 1.3. In order for the system to react at the appropriate time (because of transaction delays), there are three payload event queues linked to each nb transport function. Other methods are in charge of standard operations such as storing data on send or receive buffers, arbitrate output control, reply to processing elements, etc. Next a list of all NIC method’s functionality is presented. I. BuildHeadFlit: Method in charge of creating a transaction’s header flit. It stores 49.

message number, type of flit, and initiator and target addresses in a single word.
II. GetHeaderInfo: Extracts all header information from a head flit.
III. CheckIfExpectedTransaction: Method invoked when a new request arrives; it establishes whether the request corresponds to a write transaction started by the local processor. If the transaction doesn't match the expected one, it is stored in an incoming requests buffer.
IV. StoreAtSendBuffer: Function used for storing flits in the send buffer (for a write transaction) or in the send request buffer (for a read transaction).
V. StoreAtReceiveBuffer: Stores flits in the receive buffer (for a read transaction) or in the receive request buffer (for a write transaction) when the request doesn't match the processor's request.
VI. RESPONSE TransactionUpdate: Dynamic-event-triggered method used for sending phase BEGIN_RESP back to the router when a tail flit has been received; it also sends BEGIN_RESP to the local processor to indicate that data is ready to be transmitted.
VII. REQUEST TransactionUpdate: Method invoked to send a new write request when the timer has expired.
VIII. RRESPONSE TransactionUpdate: Returns phase BEGIN_RESP to the router when a read request has been received.
IX. SEND TransactionUpdate: Acts as the central control unit of the NIC module. This method checks whether the IncomingRequestQueue holds a valid transaction that matches the one specified by the processor and, if so, grants output access to the send queue. It then sends the first flit in the send buffer and updates debugging information; afterwards, it frees already transmitted flits from the queue and checks if

it was a tail flit; if so, it releases the output port. This function performs more tasks, which are better described by Algorithm 2.3.

Apart from reading and writing, all cores are capable of executing barrier operations for synchronization. A barrier is implemented differently depending on the core's ID: there must always be one master core and one or more slave cores; the master core waits for the slaves to send a barrier message and, once it has received everyone's, issues a command for all of them to resume executing their tasks. Because it is necessary to address all nodes when issuing barrier transactions from the master core, and because routers are unaware of this, NICs were designed to support a broadcast command that sends the same data to every node. This functionality is also useful when processors need to share information stored at one of them; however, a node will not start transferring data until it has received request flits from all the others. This approach might prove useful in some scenarios but can also decrease overall performance in others. In order to improve performance and reduce processor computation, the NIC implements barrier operations as follows:

Slave cores: Send a normal write request transaction to the master core and expect a one-flit write.
Master core: Builds a single-flit write transaction and stores it in the send buffer; when requests from all modules have been received, it sends that flit to all the modules. When all flits have been transmitted, the NIC replies to the core.

The mechanical computation implied by the barrier function is done at the NIC so that the core can perform other operations; the cost is an increase in area consumption.

TLM phases for read operations can be seen in Figure 2.18. These transactions take a long time to complete because, once the NIC is notified of a read transaction, it sends a request-data flit to the appropriate module and has to wait for the information to arrive;

Algorithm 2.3: Transaction Update pseudo-algorithm implemented by NICs.

    if Write Pending and not Read In-progress then
        Check Request queue.
        if Transaction Requested is expected then
            Give Output control to Send Queue.
        end if
    end if
    if Send Queue controls Output then
        Send first flit on Queue.
        if Flit accepted then
            Mark Flit as accepted.
            Notify method for later execution to delete Flit.
        end if
        if Write is Unicast then
            Delete transmitted flits in Send Queue.
        end if
        if Write is Broadcast then
            if Write is Burst and Burst Completed then
                Create new Head Flit.
                Notify method for later to start transmission to next node.
                Reset Transmission counters.
            end if
        end if
        if Write is Burst and not all data packed then
            Store next flit at Send Queue.
        end if
    end if
    if All data is transmitted then
        Send phase BEGIN_RESP back to Initiator to release transactions.
    end if

it is only after all packets have arrived that the processing element is notified that the data is available and the transaction is concluded. Write transactions between the NIC and the processing element, on the other hand, can be finalised faster. A phase diagram for write transactions can be seen in Figure 2.19.

Figure 2.18: TLM Phases in a NIC Read Operation. CPUs ask for data; NICs send a request to the corresponding module and wait for the data to arrive. After all information has been received, phase BEGIN_RESP is issued to the CPU to indicate the end of the transaction.
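The read sequence summarised in the caption of Figure 2.18 can be illustrated with a small behavioural sketch. It is written in plain Python rather than SystemC so that it runs on its own; the function name nic_read, the event tuples and the assumption that head flits carry no data are all hypothetical and do not come from the thesis code.

```python
def nic_read(target_node, arriving_flits):
    """Model one NIC read: request, collect flits up to the tail, respond."""
    # The NIC first asks the remote node for the data (request flit).
    events = [("to_router", "REQUEST", target_node)]
    receive_buffer = []
    for kind, payload in arriving_flits:
        if kind == "head":
            continue  # assumption: the header carries routing information only
        receive_buffer.append(payload)
        if kind == "tail":
            # Only now is the processing element told that the data is ready.
            events.append(("to_cpu", "BEGIN_RESP", list(receive_buffer)))
            break
    return events


# The CPU is answered only once the tail flit has arrived.
print(nic_read(5, [("head", None), ("body", 10), ("tail", 20)]))
```

While the tail flit is still missing, the sketch returns only the request event, which mirrors why read transactions stay open for a long time.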

Figure 2.19: TLM Phases in a NIC Write Operation. Processing elements send all data to the NICs and finalize the transaction after transmitting all the information. NICs await a read request and send packets when the corresponding one is received.
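The write path of Figure 2.19, together with the broadcast and barrier behaviour described earlier in this section, can be condensed into one stateless sketch: nothing leaves the NIC until a request has arrived from every destination, after which the stored flits are sent to each node in turn and BEGIN_RESP releases the initiator. This is a simplified illustration in plain Python, not the SystemC model; nic_write, its parameters and the event tuples are hypothetical names, and the per-node head flit rebuilt by Algorithm 2.3 is abstracted into the destination field.

```python
def nic_write(destinations, flits, arrived_requests):
    """Return the NIC's output events for one write, or [] while it must wait."""
    waiting_for = set(destinations) - set(arrived_requests)
    if waiting_for:
        return []  # output control is not granted to the send queue yet
    events = []
    for dst in destinations:  # unicast: one node; broadcast: every node in turn
        for flit in flits:
            events.append(("to_router", dst, flit))
    # All data transmitted: release the transaction at the initiator.
    events.append(("to_cpu", "BEGIN_RESP"))
    return events


# Master-side barrier on a 4-node network (node 0 is the master):
# the single release flit goes out only after nodes 1, 2 and 3 have all asked.
print(nic_write([1, 2, 3], ["release"], arrived_requests=[3, 1]))
print(nic_write([1, 2, 3], ["release"], arrived_requests=[3, 1, 2]))
```

Treating the barrier as a one-flit broadcast keeps the synchronisation logic inside the NIC, which is the design choice the text motivates: the core is free to do other work, at the cost of extra area.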

2.3.2. Network Interface Hardware Design

The Network Interface hardware design was derived from the SystemC high-level description; this module turned out to have high HDL complexity, as it has to implement part of the router's functionality, solve end-to-end flow control and communicate with the processing element through the OCP-IP bus model. Several control units were necessary for this design to support all the features implemented in SystemC and listed in the previous section; because of space constraints, only a general block diagram of the overall module is shown in Figure 2.20; control signal paths are shown in red and data paths in yellow.

To better understand the figure, a correspondence between the TLM 2.0 model and the VHDL one is presented in Table 2.5; although the equivalence is not exact, it matches the main aspects. The functions shown in the table are also described in the previous section.

One of the most complex modules of the NIC was the OCP-Handshaking Control, which required careful design in order to support transactions and respect their execution order; from the diagram in Figure 2.20 it can be seen that another control unit (End-to-End Flow Control) was necessary. State diagrams of both are shown in Figures 2.21 and 2.22. When a new transaction is started by the processing element, the handshaking control verifies whether it is possible to initiate it internally; if so, an appropriate header flit is stored in the corresponding queue and the information is packed (for write transactions). The end-to-end flow control is notified of the operation in progress and commands the transmission and reception units to carry out the operations needed for the transaction: if a read is to be performed, a request flit is sent to the corresponding module; if a write is requested, the reception control must report when a request matching the write address is received.

The VHDL implementation of the NIC is left for future work, as it does not constitute a common test metric in the NoC field.

Table 2.5: VHDL-SystemC equivalence of NIC blocks

VHDL Block                  TLM Methods/Objects                                     Function
OCP-Handshaking Control     nb_transport_fw, RESPONSE TransactionUpdate             Transfer data from (to) processing elements
End to End Flow Control     Target Payload Event Queue, CheckIfExpectedTransaction  Execute transactions in order
Tx Control                  nb fw router, SEND TransactionUpdate                    Initiate transactions with router
Rx Control                  nb transport bw router, StoreAtReceiveBuffer            Receive transactions from router
FIFO DataIN and Requests    Double-ended queue                                      Store data read and incoming requests
Bank DataOUT                Double-ended queue                                      Store data out and read requests
Rest of Blocks              BuildHeadFlit, GetHeaderInfo                            Set and retrieve head flit information
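The last row of Table 2.5 maps BuildHeadFlit and GetHeaderInfo to the blocks that set and retrieve head flit information. A minimal sketch of such packing and unpacking is given below in plain Python. The text only states that message number, flit type, and initiator and target addresses share a single word; the field order and the bit widths used here (14, 2, 8 and 8 bits, 32 in total) are assumptions made for illustration, as are the function names build_head_flit and get_header_info.

```python
# Hypothetical layout, most significant field first: (name, width in bits).
FIELDS = (("message_number", 14), ("flit_type", 2), ("initiator", 8), ("target", 8))


def build_head_flit(**values):
    """Pack the header fields into one integer word."""
    word = 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word = (word << bits) | value
    return word


def get_header_info(word):
    """Unpack a head flit word back into its fields."""
    info = {}
    for name, bits in reversed(FIELDS):  # least significant field first
        info[name] = word & ((1 << bits) - 1)
        word >>= bits
    return info


flit = build_head_flit(message_number=5, flit_type=0, initiator=3, target=12)
print(hex(flit), get_header_info(flit))
```

Keeping the layout in a single table means the packing and unpacking routines cannot drift apart, which is the same property a shared VHDL record or constant package would give the hardware blocks.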

Figure 2.20: VHDL Block Diagram for Network Interface Card.

Figure 2.21: State Machine for the Handshaking Control.

Figure 2.22: State Machine for the End-to-End Flow Control.

2.4. Software Performance Results

After validating both the router and NIC TLM models, software applications were programmed to analyse the performance of the whole NoC. Matrix multiplication was implemented because it is straightforward to parallelise; performance graphs like the previous ones could also be obtained but are not shown because of space constraints.

2.4.1. 4 × 4 Matrix Multiplication

The first test scenario was a 4 × 4 matrix multiplication split across 16 cores (1 master, 15 slaves), where each one performed a row-column product and returned its result to the master. MPI directives such as MPI Send, MPI Receive and MPI Broadcast were used for data sharing between modules. Figure 2.23 shows the full Network-on-Chip performance in terms of operation time for three routing algorithms and several Virtual Channel depths. In accordance with