We synthesised our Verilog design for two different targets: a ZedBoard XC7Z020-CLG484 FPGA and a silicon manufacturing process. The Verilog design consists of two processors connected by a switch. In this section we briefly discuss the results of the synthesis process and highlight the main findings.

4.3.1 FPGA Synthesis

Component        LUTs  Registers  LUTs as logic  LUTs as memory
Entire Design  18,646     11,102         16,022           2,624
Switch             63         50             63               0
Core0           9,294      5,526          7,982           1,312
Core1           9,289      5,526          7,977           1,312

Table 4.3: FPGA resources used by the design. The first row shows the total resources consumed. The bottom three rows show how many of these resources the switch and each of the two cores use individually.

An FPGA uses Look-Up Tables (LUTs) to implement logic. A LUT can be described as having an arbitrary number of inputs and one output. It can then be programmed to assume a certain output value

4.3. SYNTHESIS RESULTS

depending on the inputs. It can essentially model any kind of logic gate or a combination of logic gates. The FPGA used in this project has 53,200 LUTs in total, of which the entire design uses 18,646, as can be seen in Table 4.3. In other words, our design utilises 35% of the FPGA’s LUTs. The design only uses 10% of the available registers. Most LUTs are used as logic (86%) and only 14% are used as memory. As can be seen in the second row, the switch barely consumes any of the resources available on the FPGA. The cores, on the other hand, make up most of the design. It might look a bit curious that Core0 uses 5 more LUTs than Core1. It can be seen from Table 4.3 that these LUTs are used as logic and as such are an artifact of the synthesis process, which is non-deterministic. As the switch uses a negligible number of resources, the proportion of resources used by the individual cores is basically the same as that of the entire design.
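The idea that one LUT can stand in for any gate can be illustrated with a small software model. This is only a sketch, not part of the OpenTransputer design: the helper name `make_lut` and the bit ordering (first input is the most significant bit) are assumptions made for illustration. A LUT with k inputs is modelled as a table of 2^k stored output bits, and "programming" it means choosing those bits.

```python
from itertools import product


def make_lut(truth_table_bits, num_inputs):
    """Return a function that models a LUT with `num_inputs` inputs.

    `truth_table_bits` holds 2**num_inputs output bits. Entry i is the output
    produced when the inputs, read as a binary number with the first input as
    the most significant bit, equal i. (Hypothetical helper, for illustration.)
    """
    if len(truth_table_bits) != 2 ** num_inputs:
        raise ValueError("truth table must have 2**num_inputs entries")
    table = tuple(truth_table_bits)

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            # Build the table index one input bit at a time.
            index = (index << 1) | (1 if bit else 0)
        return table[index]

    return lut


# The same 2-input LUT structure programmed as two different gates.
and_gate = make_lut([0, 0, 0, 1], 2)
xor_gate = make_lut([0, 1, 1, 0], 2)

for a, b in product([0, 1], repeat=2):
    print(a, b, and_gate(a, b), xor_gate(a, b))
```

Only the stored bits differ between the two gates; the lookup mechanism is identical, which is why a synthesiser can map arbitrary small logic functions onto the same physical resource.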

Component   LUTs  Registers  LUTs as logic  LUTs as memory
Core0      9,294      5,526          7,982           1,312
Datapath   7,372      5,334          6,060           1,312
Decoder    1,916        160          1,916               0
Fetch          8         32              8               0

Table 4.4: FPGA resources used by a single core. The bottom three rows show the number of LUTs consumed by the individual components of Core0.

If we examine Table 4.4, we observe that the datapath consumes most of the resources inside the core: it uses 7,372 LUTs and 5,334 registers, or 79% of all the LUTs and 97% of all the registers used by the processor. It also contains the entirety of the memory used within a core. The decoder uses 1,916 LUTs and 160 registers, which means it uses 21% of all the LUTs and only about 3% of the registers allocated to the processor. Finally, the fetch unit uses only 8 LUTs and 32 registers, corresponding to the instruction buffer.
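The shares quoted for the components of a core follow directly from the figures in Table 4.4. The following sketch recomputes them; the dictionary layout and the function names `share` and `component_shares` are illustrative assumptions, while the numbers are copied from the table.

```python
# Figures copied from Table 4.4 (LUTs and registers only).
CORE0 = {"luts": 9294, "registers": 5526}
COMPONENTS = {
    "Datapath": {"luts": 7372, "registers": 5334},
    "Decoder": {"luts": 1916, "registers": 160},
    "Fetch": {"luts": 8, "registers": 32},
}


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def component_shares(components, total):
    """Map each component name to its (LUT %, register %) share of `total`."""
    return {
        name: (
            share(res["luts"], total["luts"]),
            share(res["registers"], total["registers"]),
        )
        for name, res in components.items()
    }


shares = component_shares(COMPONENTS, CORE0)
for name, (lut_pct, reg_pct) in shares.items():
    print(f"{name:<9} {lut_pct:5.1f}% of LUTs  {reg_pct:5.1f}% of registers")
```

The same two functions can be reused for Table 4.3 or Table 4.5 by swapping in the corresponding totals and rows.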

Component          LUTs  Registers  LUTs as logic  LUTs as memory
Datapath          7,372      5,334          6,060           1,312
Autonomous links  4,061      4,599          4,061               0
RAM               1,484          0            172           1,312
Register file     1,408        503          1,408               0
Other               419        232            419               0

Table 4.5: FPGA resources used by the major components within the datapath.

As the datapath is such a major component of the processor, Table 4.5 shows which of its parts use the largest number of resources in the FPGA. The autonomous link controllers, which provide external communication and support for the I/O interface, use over half of all the LUTs and almost 90% of the registers in the datapath. The RAM, on the other hand, contains all the LUTs used as memory within the processor, yet very little logic. The register file, which contains the registers accessible to the programmer, uses 1,408 LUTs and 503 registers, or 19% of all the LUTs and 9% of all the registers in the datapath.

It is surprising to see the datapath, and in particular the autonomous link controllers, consume that many FPGA resources. In an earlier iteration of the design, the control unit written in behavioural Verilog had taken up the majority of the resources, as it was synthesised into large collections of multiplexers. In this iteration, it uses hardly any resources, which is a testament to how well the Verilog generated by the microcode assembler integrates into the processor design.

The link controllers use so many resources because of their sheer number: there are one output and 16 input controllers per processor, more than twice as many as in the Inmos Transputer. Each link encompasses microcode sequencing logic similar to that of the processor’s control logic and three different interfaces: to the DMA controller, the physical link and the CPU. As mentioned

CHAPTER 4. CRITICAL EVALUATION

before, this should be the subject of optimisations in future versions of the OpenTransputer.

4.3.2 Synthesis Timing Analysis

The Verilog design runs at 41 MHz on the FPGA. The maximum possible clock frequency is limited by logic paths with the longest delay in the circuit, also known as critical paths. We expected the critical path to be within the control unit, which is implemented by large amounts of behavioural Verilog. Nevertheless, the synthesis timing reports show that the critical paths are mainly associated with the autonomous links. In particular, the critical paths that cross the interfaces to the CPU are the ones with the longest delay.
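The relation between clock frequency and critical-path delay can be made concrete with a short calculation. This is a simplified sketch: it treats the clock period as equal to the critical-path delay and ignores setup time, clock skew and other margins that a real timing report accounts for. The function names and the 20 ns figure are hypothetical; only the 41 MHz value comes from the text.

```python
def clock_period_ns(frequency_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


def max_frequency_mhz(critical_path_delay_ns):
    """Highest clock frequency (MHz) that a given critical-path delay allows.

    Simplification: assumes the period equals the path delay exactly.
    """
    return 1000.0 / critical_path_delay_ns


# Period corresponding to the reported 41 MHz clock.
period = clock_period_ns(41.0)
print(f"period at 41 MHz: {period:.2f} ns")

# Hypothetical what-if: a critical path shortened to 20 ns.
print(f"f_max for a 20 ns path: {max_frequency_mhz(20.0):.0f} MHz")
```

Under this simplification, the slowest path in the design has roughly 24 ns available per cycle, so shortening it is the only way to raise the clock frequency.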

An example critical path involves an intermediate register in the output link controller that is used to implement the interface with the CPU, and another register that stores the current working values of the controller. When a process wishes to perform an output operation, the CPU places the relevant information in the intermediate register, and the value is then stored in the controller’s work register in a later clock cycle. The long delay associated with this path is explained by the fact that these registers are placed physically far from each other by the place-and-route algorithm used by the synthesiser. The algorithm places some of the intermediate registers that are logically part of the links nearer to the core processor while putting other registers closer to the links, increasing the distance between them.

4.3.3 Manufacturing Process

To put our design in perspective, it is useful to compare it to a manufactured processor. For this purpose, we synthesised our Verilog design for an actual silicon target with 180 nm technology using Synopsys Design Vision and compared it to the original Transputer.

Once more, it is important to highlight that the microarchitecture of both designs is radically different despite implementing the same architecture. In some respects, the OpenTransputer is more complex than the Inmos Transputer since it uses a wide datapath with multiple instances of the same logic modules. On top of this, the OpenTransputer is still in its early stages of development and is not fully optimised, but due to technological advancements in the last two decades we expect to observe some sort of relation.

                       OpenTransputer  Transputer
Area                   3.69 mm²        64 mm²
Manufacturing process  180 nm          1000 nm

Table 4.6: Comparison of chip area and manufacturing process of the OpenTransputer and the original Transputer after synthesis.

Moore’s law refers to the observation that, in the history of modern computing, the number of transistors in an integrated circuit doubles every two years [19]. Keeping this in mind, we can make some interesting observations about the synthesis results listed in Table 4.6. We see that the area of the OpenTransputer is 3.69 mm² while the Transputer in 1985 had an area of approximately 64 mm²; in other words, the area has decreased by a factor of 17.3. Since the OpenTransputer is more complex than the original Transputer, i.e. it is made up of more hardware components and, by extension, more transistors, an explanation for the area reduction is that the individual components have shrunk in size. As we see in the second row of Table 4.6, this has indeed been the case: the OpenTransputer is targeted at 180 nm technology, while the Transputer was targeted at 1000 nm. This implies a reduction in the size of transistors by a factor of 5.6.

If the OpenTransputer and Transputer were completely identical in terms of transistor count, then, according to Moore’s law, the decrease in area by a factor of 17.3 would only be due to a reduction of the target technology by the square root of this factor, i.e. √17.3 or roughly 4.2. This implies that (to some extent) the area reduction of the OpenTransputer with regard to the Inmos Transputer follows Moore’s law. The difference between the linear scaling factor implied by the area reduction (4.2) and the actual technology reduction factor (5.6) can be attributed to the lack of optimisations and the more complex microarchitecture of the OpenTransputer.
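The scaling argument can be reproduced numerically from the two rows of Table 4.6. The sketch below uses only those four figures; the variable names are illustrative, and the "ideal area" line rests on the stated assumption that an unchanged design would shrink with the square of the feature-size ratio.

```python
import math

# Figures from Table 4.6.
AREA_OPENTRANSPUTER_MM2 = 3.69
AREA_TRANSPUTER_MM2 = 64.0
PROCESS_OPENTRANSPUTER_NM = 180.0
PROCESS_TRANSPUTER_NM = 1000.0

# Ratio of the two chip areas.
area_factor = AREA_TRANSPUTER_MM2 / AREA_OPENTRANSPUTER_MM2

# Linear shrink that would explain the area ratio on its own.
implied_linear_factor = math.sqrt(area_factor)

# Actual ratio of the two feature sizes.
process_factor = PROCESS_TRANSPUTER_NM / PROCESS_OPENTRANSPUTER_NM

# Assumption: an identical design shrinks in area with the square of the
# feature-size ratio.
ideal_area_mm2 = AREA_TRANSPUTER_MM2 / process_factor ** 2

print(f"area factor:           {area_factor:.1f}")
print(f"implied linear factor: {implied_linear_factor:.1f}")
print(f"process factor:        {process_factor:.1f}")
print(f"ideal shrunk area:     {ideal_area_mm2:.2f} mm^2")
print(f"actual / ideal area:   {AREA_OPENTRANSPUTER_MM2 / ideal_area_mm2:.2f}")
```

The last line expresses the gap between the two factors as an area ratio: under the stated assumption, the OpenTransputer occupies noticeably more silicon than a pure process shrink of the original would, which is consistent with its larger and less optimised microarchitecture.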

Chapter 5

Conclusion

5.1 Current Project Status

We have developed a new implementation of the Transputer architecture that we call OpenTransputer. We have designed a radically different microarchitecture that takes advantage of state-of-the-art manufacturing techniques and current developments in the field of computer architecture. The project can be broadly divided into three major components as per our initial aims and objectives: CPU, external communication and I/O interface.

With regard to the OpenTransputer CPU, we replaced the original microcode ROM with hardwired logic generated from a microprogram assembler. We also implemented a wide datapath that replaces the original bus system used in the Inmos Transputer. Furthermore, module replication was heavily used, enabling the OpenTransputer to perform more simultaneous operations within the same clock cycle than the 1980s design. This greatly decreases the average number of cycles that most instructions take to execute, as described in Chapter 4. On the other hand, there are still a number of instructions that our design does not recognise.

We introduced significant changes to the external communication mechanism used by the Transputer. The Inmos design was equipped with four serial communication links that could be used to connect processors together and assemble parallel processing networks of arbitrary size. To make the OpenTransputer easier to use as a building block for any sort of system, we have replaced the four serial links with a single bidirectional parallel connection to a network of switches. These networks are arranged in a Beneš fashion, providing rearrangeably non-blocking communication between all OpenTransputer nodes. We have also developed a new message routing mechanism that uses the addresses of the Occam channels to describe the path between two nodes in the network.
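To give a sense of how such a switch network grows with the number of nodes, the sketch below computes the size of a textbook Beneš network. It assumes the standard construction from 2x2 switches with a power-of-two number of endpoints, which has 2·log2(N) − 1 stages of N/2 switches each; the OpenTransputer switch may be built differently, and the function name `benes_dimensions` is hypothetical.

```python
def benes_dimensions(num_endpoints):
    """Return (stages, switches_per_stage, total_switches).

    Assumes the textbook Benes construction from 2x2 switches, with a
    power-of-two number of endpoints. Not taken from the OpenTransputer
    design, whose switch radix may differ.
    """
    if num_endpoints < 2 or num_endpoints & (num_endpoints - 1):
        raise ValueError("num_endpoints must be a power of two >= 2")
    k = num_endpoints.bit_length() - 1  # exact log2 for a power of two
    stages = 2 * k - 1
    switches_per_stage = num_endpoints // 2
    return stages, switches_per_stage, stages * switches_per_stage


for n in (2, 4, 8, 16):
    stages, per_stage, total = benes_dimensions(n)
    print(f"{n:>2} endpoints: {stages} stages x {per_stage} switches = {total}")
```

The switch count grows as roughly N·log2(N), which is far cheaper than a full crossbar's N² crosspoints while still allowing any permutation of connections to be realised by rearranging existing routes.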

The OpenTransputer includes drastically different input and output controllers for communication that implement the concept of virtual channels. This approach enables the processor to keep track of a virtually unbounded number of simultaneous output operations, whereas the original Transputer only supports up to four. Once more, the effect is to improve the usability of the processor as a building block. Moreover, the Beneš network configuration makes it possible to efficiently map a broad range of networks, including neural networks, to a parallel processing system built of OpenTransputers.

Since we envision the OpenTransputer being used as part of mass consumer products within the IoT realm, an I/O interface that is compatible with a range of hardware components, such as sensors and actuators, is required. Despite not fully achieving this goal due to the time constraints of this project, we have developed a basic I/O interface built upon the channel communication functionality. This means that even the simplest Occam program that outputs an integer to a channel can drive hardware peripherals. Currently, the OpenTransputer I/O interface does not provide enough flexibility to implement communication protocols that are commonly used by hardware peripherals. However, the interface has been designed in such a way that new features can be easily added, as described in Section 3.3.

Our implementation was fully developed in Verilog HDL using the Xilinx Vivado Design Suite. By the end of the project, a dual-core system with a single communication switch was successfully synthesised for an FPGA target. This enabled us to develop a simple demo application to showcase the capabilities of our implementation. The synthesised design runs at 41 MHz and utilises approximately 35% of the target FPGA. To our surprise, the majority of the logic resources are consumed by the autonomous link controllers rather than by the core CPU components. This is possibly the result of introducing support for virtual channels and extending the number of input controllers from 4 to 16. We consider that future work must focus on optimising these components of the OpenTransputer system to reduce the area of the design as a whole.
