Recalling Section 3.3.2.1, the average branch penalty per packet can be calculated using statistics gathered from simulation. With no branch prediction scheme employed, the penalty in cycles lost per packet for each application is shown in Table 5.8. For a minimum-sized 40-Byte packet encrypted using the AES algorithm, 293 processor cycles are lost due to taken branches, while for a maximum-sized 1500-Byte packet the branch penalty is almost 3000 processing cycles. Recalling the number of instructions per packet outlined in Table 5.6, the number of instructions required to encrypt a minimum-sized packet is 3128, similar to the branch penalty incurred for a 1500-Byte packet. With Internet traffic following a complex distribution, in which large packets make up a large proportion of IP traffic and most applications operate in a greedy mode, a penalty equivalent to one minimum-sized packet for each 1500-Byte packet would be difficult to ignore. For header applications the branch penalty is smaller but remains a significant loss of processing capability. For TRIE-based forwarding, the branch penalty is at least 48 cycles per packet, or 7% of the instruction count (Table 5.6).
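The τtk columns of Table 5.8 can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the penalty is the taken rate multiplied by the number of branches per packet and by a fixed cost of 5 cycles per taken branch. That relationship is inferred from the tabulated values rather than quoted from Section 3.3.2.1, and the names branch_penalty, penalties and TABLE_5_8_SAMPLE are hypothetical.

```python
# Sketch: per-packet branch penalty, ASSUMING
#   penalty = taken_rate * branches_per_packet * cycles_per_taken_branch
# The 5-cycle cost per taken branch is inferred from Table 5.8.

CYCLES_PER_TAKEN_BRANCH = 5  # assumption, see lead-in


def branch_penalty(taken_rate_pct: int, branches: int,
                   cycles: int = CYCLES_PER_TAKEN_BRANCH) -> int:
    """Cycles lost per packet, rounded half up to a whole cycle.

    The taken rate is passed as an integer percentage so that the
    arithmetic stays exact and avoids floating-point rounding.
    """
    return (taken_rate_pct * branches * cycles + 50) // 100


# (taken rate in %, branches in a 40-byte packet, branches in a 1500-byte packet)
TABLE_5_8_SAMPLE = {
    "AES": (65, 90, 2942),
    "CAST": (12, 95, 2119),
    "TRIE": (22, 44, 48),
}


def penalties(app: str) -> tuple:
    """Return (penalty for minimum packet, penalty for maximum packet)."""
    rate, n_min, n_max = TABLE_5_8_SAMPLE[app]
    return branch_penalty(rate, n_min), branch_penalty(rate, n_max)


if __name__ == "__main__":
    for app in TABLE_5_8_SAMPLE:
        print(app, penalties(app))
```

Only three rows are included here; because the tabulated taken rates are rounded to two decimal places, not every entry in the table is guaranteed to be reproduced exactly by this formula.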
Table 5.8: Branch Penalty Per Packet (ρtk = 5)

Application   ρtk    Nbr(min)   Nbr(max)   τtk(min)   τtk(max)
AES           0.65         90       2942        293       9562
CAST          0.12         95       2119         57       1271
RC4           0.03         53       1521          8        228
SHA           0.30        290        893        435       1340
MD5           0.10        944       3046        472       1523
FRAG          0.27         10       1775         14       2396
CRC           0.12         54       1526         32        916
RS            0.38        559      19631       1062      37299
TRIE          0.22         44         48         48         53
HASH          0.37        234        240        433        444
HYPER         0.29        150        353        218        512
RFC           0.60         11         32         33         96
TBM           0.47          8          9         19         25
TCM           0.55         11         17         30         47
DRR           0.73         14         33         51        120
STAT          0.35         35         60         61        105

5.5 Conclusions
While previous research has examined NP workloads, with comparisons to general purpose applications, the analysis presented in this chapter attempted to determine and quantify those factors which determine PE utilisation. With maximum performance in a pipelined PE achieved when the pipeline remains full, pipeline bubbles or stalls can significantly decrease performance. Using the SimNP simulator outlined in Chapter 4, the analysis in this section determined what applications can realistically be supported on a programmable PE platform for various network line rates. While software-based implementations of security algorithms would require massive degrees of parallelism in order to support high bitrates, header applications are typically small enough (in terms of instruction count) to be implemented in software. In general, NP applications require a high proportion of memory instructions (relative to ALU instructions), highlighting possible limitations in the degree to which parallelism can be used to increase NP performance. Even for a PE design with zero-latency local SRAM, the need to access external control and packet memory will significantly affect PE utilisation. It is clear that high levels of parallelism can only ensure future performance gains with corresponding improvements in both bus and memory technologies. Without significant improvements in these components, high degrees of parallelism will quickly result in PE under-utilisation. In the case of the configurations examined in this chapter, 8 to 16 PEs was found to be the optimum for header-based applications, while up to 32 PEs can be configured in parallel for payload-based functions.
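The relationship between line rate, instruction count and the degree of parallelism required can be illustrated with a simple cycle budget. The sketch below is a back-of-the-envelope model and not the SimNP analysis: the 10 Gb/s line rate and 1 GHz PE clock are hypothetical values chosen for illustration, one instruction per cycle is assumed, and framing overhead, memory latency and bus contention are all ignored. Only the 3128-instruction figure for AES on a 40-Byte packet is taken from the text.

```python
# Sketch of a per-packet cycle budget under stated assumptions:
#  - packets arrive back to back at the line rate (no framing overhead)
#  - each PE retires one instruction per cycle (no stalls, no memory latency)
# The line rate and clock frequency used below are hypothetical.


def cycle_budget(packet_bytes: int, line_rate_bps: int,
                 clock_hz: int, n_pes: int) -> float:
    """Aggregate PE cycles available per packet across n_pes parallel PEs."""
    packet_bits = packet_bytes * 8
    return n_pes * clock_hz * packet_bits / line_rate_bps


def pes_needed(instructions: int, packet_bytes: int,
               line_rate_bps: int, clock_hz: int) -> int:
    """Smallest number of PEs whose combined budget covers the instruction count."""
    numerator = instructions * line_rate_bps
    denominator = clock_hz * packet_bytes * 8
    return -(-numerator // denominator)  # integer ceiling division


if __name__ == "__main__":
    LINE_RATE = 10_000_000_000  # 10 Gb/s, hypothetical
    CLOCK = 1_000_000_000       # 1 GHz, hypothetical
    print(cycle_budget(40, LINE_RATE, CLOCK, n_pes=16))
    print(pes_needed(3128, 40, LINE_RATE, CLOCK))
```

Under these idealised assumptions the budget scales linearly with the number of PEs; the conclusions above explain why real configurations fall short of that once external memory and bus contention are taken into account.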
While memory access latency and bus contention represent two external sources which reduce PE utilisation, the effect of branch instructions within NP applications represents an ‘internal’ PE performance limitation. Both the analysis in this work and previous workload analyses have highlighted the high percentage of conditional operations within NP applications. For a deeply pipelined PE, these conditional operations result in a large number of wasted cycles after only a short period of time. Unlike general purpose systems, which have input sources as varied as network interfaces and keyboards, an NP platform operates on packets only, with the same application remaining in place for long periods of time. With this in mind, it should be possible to minimise the number of processing cycles lost due to branch operations. By taking into account some network traits, it is believed that prediction methods specific to PEs should be able to significantly reduce this branch penalty.
Branch Prediction in Process Engines
6.1 Introduction
Following on from the workload and branch analysis presented in Chapter 5, this chapter presents a detailed examination of existing branch prediction schemes when applied to network workloads. In each case, the existing predictor architectures were implemented as simulation models within the SimNP simulator. In Chapter 3 the metrics used to evaluate branch prediction architectures were discussed. In general it is possible to model branch prediction at a relatively high functional level, with branch prediction evaluation and analysis well suited to the SimNP platform. Examining existing prediction schemes, it is found that no current method fully exploits the unique nature of NP applications, providing scope for a new NP-specific prediction mechanism. Whereas existing prediction schemes aggregate branch history via a number of saturating counters in order to guide future predictions, the field-based scheme proposed in this chapter attempts to incorporate a number of NP-specific traits as a means of improving prediction performance. This new field-based prediction architecture is described before a detailed performance evaluation of the scheme is presented. Design considerations such as prediction rate, silicon area and latency are examined, and the field-based scheme is found to outperform existing prediction schemes in terms of prediction hit rate while requiring a similar amount of area to traditional schemes. In all cases the results presented in this chapter were obtained by the author via simulations based on either the SimpleScalar/ARM or SimNP simulators, using the network traces used in previous chapters.
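As background to the saturating-counter approach mentioned above, the following is a minimal sketch of a textbook bimodal predictor built from 2-bit saturating counters. It is a generic illustration, not one of the SimNP predictor models and not the field-based scheme; the class name, table size, initial counter state and the assumption of 4-byte aligned instructions are all choices made for this example.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        # Start every counter at 1 (weakly not taken); an arbitrary choice.
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumes 4-byte aligned instructions, so the low two bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


def hit_rate(predictor, trace) -> float:
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


if __name__ == "__main__":
    # Hypothetical trace: one branch at address 0x100, taken eight times.
    print(hit_rate(BimodalPredictor(), [(0x100, True)] * 8))
```

Because each counter must see two consecutive outcomes in the opposite direction before its prediction flips, a single atypical outcome does not disturb a strongly biased branch, which is the property such schemes rely on when aggregating history.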