In the design flow, it is often desirable to develop and optimize software while the hardware is still being designed, manufactured and tested. Efficient simulation is one way to explore, during the development phase, how software techniques can make use of various hardware platforms. As technology trends point towards chips with an increasing number of cores, simulation becomes essential for observing the results and the challenges of multicore architectures. Although there are many multicore network-on-chip simulators, such as NocSim [23], ORION [24] and PhoenixSim [25], the Graphite multicore simulator proposed in [9] is used for the simulations in this thesis, for the reasons mentioned in section 1.6. Unlike NocSim, Graphite can model both electrical and photonic NoC architectures. ORION models only electrical NoCs, and its router and power modelling techniques are not effective enough for the purpose of this thesis. PhoenixSim implements photonic networks, but its electrical network modelling is based on the inefficient ORION simulator and lacks a proper interconnection between the two types of networks. Graphite offers a more effective implementation of both electrical and photonic architectures, together with efficient interconnects between them. For each simulation, it produces output values based on a set of input parameters, which can be used to quantify the performance of a system for that specific configuration. It uses the Pin tool from Intel for dynamic binary translation, which instruments the instructions of the application at run time so that they can be handled by the simulator.
Figure 24: High-level architecture of the Graphite simulator [9]
As shown in Figure 24, each application thread is assigned to a tile of the target architecture, and these threads are then distributed across different host processes as host threads, which execute on one or more host machines. The actual thread scheduling and execution on the host system is handled by the host operating system [9].
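The distribution of target tiles over host processes can be illustrated with a minimal sketch. The round-robin policy and the function name `assign_tiles_to_processes` are assumptions made for illustration only; they are not taken from [9] and do not claim to reproduce Graphite's actual mapping.

```python
def assign_tiles_to_processes(num_tiles, num_processes):
    """Map target tiles to host processes in round-robin order.

    Hypothetical policy for illustration; the scheduling of the resulting
    host threads is left to the host operating system.
    """
    if num_tiles <= 0 or num_processes <= 0:
        raise ValueError("num_tiles and num_processes must be positive")
    mapping = {process: [] for process in range(num_processes)}
    for tile in range(num_tiles):
        # Tile i is simulated by host process i mod num_processes.
        mapping[tile % num_processes].append(tile)
    return mapping


if __name__ == "__main__":
    # Eight target tiles spread over three host processes.
    print(assign_tiles_to_processes(8, 3))
```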
Core modeling: Core modeling covers the instruction fetch and decode units, the load and store units, and the execution units of the cores in Graphite [9]. As per [9], the core model is a pure performance model that maintains a simulated clock local to each tile. It follows a producer-consumer design, in which different parts of the system generate information and the model consumes it. It primarily processes the instructions of the application threads, as delivered by the binary translation of the Pin tool, along with pseudo-instructions that update the local clock [9]. The main assumption is a constant cost for instruction execution, except for memory and branch operations. Graphite implements an in-order core model, meaning that instructions are fetched, decoded and executed in program order, on top of an out-of-order memory system, in which memory operations may complete out of order [9].
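The constant-cost assumption of the core model can be sketched as follows. This is a simplified illustration, not Graphite's implementation: the class names, the base cost of one cycle and the misprediction penalty are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    kind: str                   # "alu", "load", "store" or "branch"
    memory_latency: int = 0     # cycles reported by the memory system (producer)
    mispredicted: bool = False  # outcome reported by a branch predictor (producer)


class InOrderCoreModel:
    """Consumes instructions in program order and advances a tile-local clock."""

    def __init__(self, base_cost=1, mispredict_penalty=10):
        self.base_cost = base_cost                    # assumed constant cost
        self.mispredict_penalty = mispredict_penalty  # assumed penalty
        self.cycle_count = 0                          # local clock of this tile

    def consume(self, instr):
        cost = self.base_cost
        if instr.kind in ("load", "store"):
            # Memory operations add the latency supplied by the memory model.
            cost += instr.memory_latency
        elif instr.kind == "branch" and instr.mispredicted:
            # Mispredicted branches add a fixed penalty.
            cost += self.mispredict_penalty
        self.cycle_count += cost
        return cost


if __name__ == "__main__":
    core = InOrderCoreModel()
    for instr in [Instruction("alu"), Instruction("load", memory_latency=5)]:
        core.consume(instr)
    print(core.cycle_count)
```

Only memory and branch instructions deviate from the base cost, which mirrors the assumption stated above.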
Network modeling: This describes the modeling of the on-chip networks in the simulator. Graphite supports both electrical and optical networks. It supports five networks, categorized as two for user-level messages, two for shared-memory messages and one for system messages. Our experiments focus only on the shared-memory networks, where the assignment of threads to cores is performed by the simulator, and not on message passing, where the transfer of messages between cores can be directed by the user. Graphite is not a cycle-accurate simulator [52]: the simulation is faster, but it is not accurate to the individual cycles of operation, so simulations can show slight variations in the virtual execution time. Graphite uses synchronization schemes to approximate in-order simulation to an extent. The lax barrier synchronization scheme is used for all the experiments in this thesis, as it provides the most accurate results compared to the lax and lax p2p synchronization schemes [9]. In the lax barrier scheme, all executing threads wait at a barrier after a fixed number of simulated cycles, and this process repeats until the task completes execution.
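The idea behind the lax barrier scheme can be sketched with a small sequential simulation. This is an illustrative model under stated assumptions, not Graphite code: the function name, the per-thread cost lists and the quantum value are hypothetical, and real host threads are replaced by a simple loop.

```python
def simulate_lax_barrier(instruction_costs, quantum):
    """Advance per-thread local clocks with a barrier every `quantum` cycles.

    instruction_costs: one list of cycle costs per thread (hypothetical input).
    Returns (number of barriers reached, final local clocks).
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    clocks = [0] * len(instruction_costs)
    positions = [0] * len(instruction_costs)
    next_barrier = quantum
    barriers = 0
    while any(pos < len(costs) for pos, costs in zip(positions, instruction_costs)):
        for t, costs in enumerate(instruction_costs):
            # A thread runs freely until its local clock reaches the barrier.
            while positions[t] < len(costs) and clocks[t] < next_barrier:
                clocks[t] += costs[positions[t]]
                positions[t] += 1
        # All threads have reached the barrier (or finished); release them.
        barriers += 1
        next_barrier += quantum
    return barriers, clocks


if __name__ == "__main__":
    print(simulate_lax_barrier([[3, 3, 3, 3], [5, 5]], quantum=5))
```

In this sketch a thread can overshoot a barrier by at most the cost of one instruction, so the local clocks stay close to each other without being synchronized on every cycle.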
Graphite supports two models for electrical mesh networks: Emesh hop counter and Emesh hop by hop. The Emesh hop by hop model is used in our experiments, as it provides the most accurate modelling, at the cost of simulation performance, compared to the Emesh hop counter model. The Emesh hop by hop model uses XY routing. In this routing scheme, a packet first travels along the X dimension (horizontally) until it reaches the column containing the destination core, and then travels along the Y dimension [52]. Another advantage is that only the hop by hop model offers contention delay modeling for both the user and memory networks, and contention delay is one of the output values observed in our experiments.
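The XY routing scheme described above can be sketched as follows. The row-major numbering of tiles and the function name `xy_route` are assumptions made for this example and are not taken from Graphite.

```python
def xy_route(src, dst, mesh_width):
    """Return the (x, y) coordinates visited by a packet under XY routing.

    Tiles are assumed to be numbered in row-major order on a mesh that is
    `mesh_width` tiles wide.
    """
    x, y = src % mesh_width, src // mesh_width
    dst_x, dst_y = dst % mesh_width, dst // mesh_width
    path = [(x, y)]
    # First travel along the X dimension to the destination column.
    step = 1 if dst_x > x else -1
    while x != dst_x:
        x += step
        path.append((x, y))
    # Then travel along the Y dimension to the destination row.
    step = 1 if dst_y > y else -1
    while y != dst_y:
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route(0, 15, mesh_width=4)
    print(route, "hops:", len(route) - 1)
```

The number of hops equals the Manhattan distance between source and destination, which is what a hop-by-hop model traverses one router at a time.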
Power modeling: Power modeling in the Graphite simulator is carried out using the Design Space Exploration of Networks Tool (DSENT) [53]. DSENT can be used in two ways for power modeling: while an application is being simulated, or standalone, to calculate power traces for an architecture. We are interested in the former case, as we estimate the total energy consumption in the network while executing the application with a specific set of configuration values. The experiments in this thesis trace the static and dynamic energy consumption in the memory network, as provided by DSENT. Static power is the non-data-dependent power, which includes the laser power, ring heating and thermal tuning power. Dynamic power covers all data-dependent power, such as that of the routing data-path, electrical links and receiver networks.
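The relation between the static and dynamic components and the total network energy can be sketched as below. The function and parameter names are hypothetical and the numbers are placeholders, not measured results; the sketch does not reproduce DSENT's interface or output format.

```python
def network_energy(static_power_w, completion_time_s, dynamic_energy_j):
    """Combine static and dynamic components into a total network energy.

    Static energy accrues over the whole run regardless of traffic, whereas
    dynamic energy is accumulated per data-dependent event.
    """
    static_energy_j = static_power_w * completion_time_s
    total_energy_j = static_energy_j + dynamic_energy_j
    return static_energy_j, total_energy_j


if __name__ == "__main__":
    # Placeholder values for illustration only.
    static_j, total_j = network_energy(2.0, 0.5, 0.25)
    print(f"static = {static_j} J, total = {total_j} J")
```

Because the static component scales with the completion time, a configuration that shortens the run also lowers the static energy, even when the static power itself is unchanged.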