
The following figures show the behavioral statistics of the three prefetch engines at the different levels of aggressiveness, obtained with our modified version of gem5. The baseline for the

3.4 Performance study 57

Hardware specifications                  Values
ISA                                      x86
CPU model                                TimingSimple
Tile number                              16
L1 Data/Instruction cache                16 KB per tile
L2 size                                  8 MB
Network                                  Garnet
Topology                                 Mesh
Virtual networks                         3
Virtual channels per virtual network     2
FlitBuffer size                          6 flits
Bandwidth factor                         16 bytes
Data packet size                         64 bytes
Ctrl packet size                         8 bytes
Prefetcher cache level                   L1
Simulated cycles                         350 million cycles

Table 3.3 Simulation environment specifications.

Prefetcher   Parameter   Low Aggr   Medium Aggr   High Aggr
TAGGED       AGGR        1          2             4
RPT          AGGR        2          4             8
GHB          WIDTH       2          4             8
GHB          DEPTH       2          4             8

Table 3.4 Detailed definition of the level of aggressiveness for each prefetch mechanism.
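For scripting experiments, the levels in Table 3.4 can be written down as a small lookup structure. The following is only an illustrative sketch: the dictionary name and the helper `params_for` are hypothetical and are not part of gem5 or of the framework described here. Only the numeric values come from the table.

```python
# Aggressiveness levels transcribed from Table 3.4.
# The names AGGRESSIVENESS and params_for are hypothetical (illustration only).
AGGRESSIVENESS = {
    "TAGGED": {"low": {"AGGR": 1}, "medium": {"AGGR": 2}, "high": {"AGGR": 4}},
    "RPT": {"low": {"AGGR": 2}, "medium": {"AGGR": 4}, "high": {"AGGR": 8}},
    "GHB": {
        "low": {"WIDTH": 2, "DEPTH": 2},
        "medium": {"WIDTH": 4, "DEPTH": 4},
        "high": {"WIDTH": 8, "DEPTH": 8},
    },
}


def params_for(prefetcher: str, level: str) -> dict:
    """Return a copy of the parameter set for one prefetcher and level."""
    return dict(AGGRESSIVENESS[prefetcher][level])
```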

experiments was the same system with no prefetching mechanism. Figure 3.11 displays the behavior of the memory module, showing the MPKI (Misses Per Kilo-Instruction) in each scenario. This metric gives the average number of L1 misses per 1000 instructions. In a simulator in which the effect of the prefetcher is represented only in the memory model, a variation of the MPKI with respect to the baseline configuration would proportionally modify the speedup in the experiment. However, this is not necessarily the case if the NoC effect is also considered. As can be seen, in most cases, as the aggressiveness of the prefetcher increases, the chance of bringing in useful data also increases, thus reducing the MPKI. GHB and RPT show this behavior, although increasing the aggressiveness of RPT from medium to high does not provide any further benefit. In the case of the Tagged prefetcher, however, the MPKI increases when the aggressiveness is increased.
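As a reminder of how the metric is defined, the MPKI can be computed from two raw counters. This is a minimal sketch; the function names are illustrative and do not correspond to gem5 statistic names.

```python
def mpki(l1_misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: L1 misses normalised to 1000 instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * l1_misses / instructions


def mpki_change(baseline: float, with_prefetch: float) -> float:
    """Relative MPKI change versus the no-prefetch baseline.

    A negative value means the prefetcher removed misses.
    """
    return (with_prefetch - baseline) / baseline
```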

Figure 3.12 provides a detailed illustration of this, showing the number of prefetch operations issued per kilo-instruction, categorized using the statistics generated from the

Fig. 3.11 Average MPKI calculated with the gem5 simulator and the prefetching module.

prefetcher-profiling module described in Section 3.3.4. While the Tagged prefetcher increases the number of prefetch operations issued in proportion to the increase in aggressiveness, the RPT and GHB prefetchers do not. The reason is that the Tagged prefetcher works in a simpler way. Because the Tagged prefetcher with medium or high aggressiveness issues many prefetch operations with low accuracy, the cache fills with useless data (pollution), which evicts useful data. This effect increases the number of misses at this cache level.
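The pollution effect can be reproduced with a toy model. The sketch below is not the gem5 implementation used in this study. It assumes a tiny fully associative LRU cache and a tagged sequential prefetcher that fetches the next `degree` lines on a demand miss or on the first demand hit to a prefetched line. All class and function names are hypothetical.

```python
from collections import OrderedDict


class TaggedPrefetchCache:
    """Toy fully associative LRU cache with a tagged sequential prefetcher."""

    def __init__(self, num_lines: int, degree: int):
        self.num_lines = num_lines
        self.degree = degree          # lines prefetched per trigger (0 = no prefetching)
        self.lines = OrderedDict()    # line address -> True while prefetched and unused
        self.misses = 0
        self.issued = 0
        self.useful = 0

    def _insert(self, line: int, prefetched: bool) -> bool:
        """Insert a line, evicting the LRU line if full. Returns False if already present."""
        if line in self.lines:
            return False
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)
        self.lines[line] = prefetched
        return True

    def _prefetch_after(self, line: int) -> None:
        for i in range(1, self.degree + 1):
            if self._insert(line + i, prefetched=True):
                self.issued += 1

    def access(self, line: int) -> None:
        if line in self.lines:
            self.lines.move_to_end(line)
            if self.lines[line]:      # first demand hit on a prefetched line
                self.lines[line] = False
                self.useful += 1
                self._prefetch_after(line)
        else:
            self.misses += 1
            self._insert(line, prefetched=False)
            self._prefetch_after(line)


def run(trace, num_lines: int, degree: int) -> TaggedPrefetchCache:
    cache = TaggedPrefetchCache(num_lines, degree)
    for line in trace:
        cache.access(line)
    return cache
```

On a purely sequential trace the prefetcher removes almost every miss. On a trace that repeatedly touches four widely spaced lines in an 8-line cache, the working set fits without prefetching, but with `degree=4` the prefetched lines evict it before it is reused, so every access misses.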


Figure 3.13 shows the consequences of this increase in requests on the NoC. The figure reports the average latency (in cycles) of an L1 miss. Comparing Figure 3.12 with Figure 3.13 reveals a strong correlation between the increase in the number of operations issued by the prefetchers and the increase in memory latency. The higher the level of aggressiveness, the higher the L1 miss latency. A possible explanation is that the more memory requests there are, the greater the number of conflicts in the buffers and the longer the waiting time in the routers. As a result, memory requests spend more time traversing the network.
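A first-order way to see why miss count and miss latency must be read together is to multiply them. The sketch below assumes a simple blocking core in which every L1 miss stalls for the full average miss latency. This is a simplification introduced here for illustration, not a model taken from the study, and the function name is hypothetical.

```python
def miss_stall_cycles_per_kilo_instruction(mpki: float, avg_miss_latency: float) -> float:
    """Approximate stall cycles per 1000 instructions caused by L1 misses.

    Assumes each miss stalls the core for the average miss latency (in cycles).
    """
    return mpki * avg_miss_latency
```

Under this assumption, a configuration with fewer misses can still stall for longer if each miss becomes sufficiently slower: for example, 8 misses at 60 cycles each cost more than 10 misses at 40 cycles each (these numbers are arbitrary and are not measurements).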

Fig. 3.13 Average L1 miss latency (cycles) calculated with the gem5 simulator and the prefetching module.

This information shows the benefits and the drawbacks of a prefetcher in a multi-core system. To show how these benefits and drawbacks combine, Figure 3.14 shows the IPC speedup for each aggressiveness level. The IPCs were calculated by dividing the number of x86 instructions executed by all the cores by the number of simulated cycles. It can be seen that there is no direct link between the speedup and the MPKI. This conclusion is especially important because it would not be reached with simulators that do not take the NoC into account. For example, the Tagged prefetcher with medium aggressiveness reduces the MPKI, yet it degrades performance in comparison with a system without prefetching. One conclusion to draw from these numbers is that performance studies of prefetching techniques (or of any other mechanism that affects the traffic in the network) that do not take the NoC effect into account may reach erroneous conclusions. For example, if one were to select the best mechanism and aggressiveness for the framework described in this study by considering only the results from the memory hierarchy, that is, the MPKI, the wrong choice would be made. As Figure 3.11 shows, the mechanism with the greatest reduction in MPKI is GHB with high aggressiveness. However, Figure 3.14 shows that the prefetch mechanism which provides the greatest speedup is GHB with low aggressiveness. Taking more than just the MPKI reduction into account therefore leads to more suitable and better-founded decisions.
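The selection problem described above can be sketched in a few lines. The IPC and speedup definitions follow the text. The numeric values in `results` and `baseline_ipc` are hypothetical placeholders chosen only to mirror the qualitative ordering reported here; they are not the measured values from Figures 3.11 or 3.14.

```python
def ipc(total_instructions: int, simulated_cycles: int) -> float:
    """Instructions executed by all cores divided by the simulated cycles."""
    return total_instructions / simulated_cycles


def speedup(ipc_with_prefetch: float, ipc_baseline: float) -> float:
    """IPC speedup relative to the system without prefetching."""
    return ipc_with_prefetch / ipc_baseline


# HYPOTHETICAL placeholder numbers, not results from this study.
baseline_ipc = 0.50
results = {
    "GHB-low": {"mpki": 6.0, "ipc": 0.55},
    "GHB-high": {"mpki": 5.0, "ipc": 0.52},
    "Tagged-medium": {"mpki": 7.5, "ipc": 0.48},
}

# Ranking by memory-hierarchy statistics alone versus by end-to-end speedup.
best_by_mpki = min(results, key=lambda name: results[name]["mpki"])
best_by_speedup = max(
    results, key=lambda name: speedup(results[name]["ipc"], baseline_ipc)
)
```

With these placeholder values the two criteria disagree, which is the situation the text warns about: ranking by MPKI alone would pick a different configuration from ranking by speedup.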

Fig. 3.14 Average IPC Speedup calculated with the gem5 simulator and the prefetching module.

The statistics provided by gem5 with the prefetching framework extension clearly show that the Tagged prefetcher works by brute force. As the aggressiveness increases, the number of generated requests grows in the same proportion. In contrast, the RPT prefetcher uses a very strict confidence strategy that greatly reduces the number of prefetch requests it generates. Even when the aggressiveness is increased in the same way as with the Tagged prefetcher, the number of requests generated by RPT with high aggressiveness is less than 50% of the number generated by the Tagged prefetcher at the lowest level of aggressiveness, and the percentage of useful prefetches is higher. This means that the RPT prefetcher is less aggressive and more accurate than the Tagged prefetcher. As mentioned earlier, GHB is a correlation prefetcher, which does not generate a large number of requests until it has stored enough information to do so. For this reason, the number of requests generated by GHB grows as the aggressiveness increases, but to a smaller degree than for the Tagged prefetcher. It is worth noting that GHB is a prefetch engine capable of capturing more complex memory patterns (at higher hardware and computational costs), and thus the number of useful prefetches that it is able to produce is higher than for the other prefetchers.
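To illustrate why a confidence requirement suppresses requests, the sketch below shows a simplified per-PC stride predictor. It is an assumption-laden stand-in, not the RPT implementation evaluated in this study: it uses a plain saturating confidence counter rather than the RPT state machine, and the class name, `threshold` parameter and table layout are hypothetical.

```python
class StridePrefetcher:
    """Simplified stride predictor with a confidence counter (illustration only)."""

    def __init__(self, degree: int, threshold: int = 2):
        self.degree = degree        # number of addresses prefetched once confident
        self.threshold = threshold  # matching strides required before prefetching
        self.table = {}             # pc -> [last_addr, stride, confidence]

    def observe(self, pc: int, addr: int) -> list:
        """Record one access and return the addresses to prefetch (possibly none)."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return []
        last_addr, stride, conf = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, self.threshold)
        else:
            conf = 0                # stride changed: restart training
            stride = new_stride
        self.table[pc] = [addr, stride, conf]
        if conf >= self.threshold:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []
```

In this sketch an irregular access stream never reaches the confidence threshold and so generates no requests at any degree, whereas a tagged sequential prefetcher would issue requests on every miss.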

To conclude this first analysis, it is important to highlight that the number of operations generated by the prefetcher has a direct impact on the speedup, even when the MPKI is reduced. For this reason, the MPKI should not be the only statistic considered when assessing performance: statistics concerning both speedup and MPKI should be considered at the same level of detail. In this way, better-informed decisions can be taken. Tools such as the framework developed in this study are therefore of considerable use for this purpose.
