The previously simulated machine does not incorporate a virtual paging system. Embedded systems frequently omit virtual paging, whereas most PC operating systems support it, and virtual paging can significantly change the distribution of memory accesses.
In Section 3.2.1.4, a preliminary virtual paging system is introduced and implemented in the simulator. Two page allocation algorithms, sequential allocation and random allocation, are implemented. With sequential allocation, pages are allocated sequentially from the lowest address to the highest address. Random allocation places pages across all available physical pages completely at random. The impact of a virtual paging system with these two page allocation algorithms on SDRAM address mapping is studied here.
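The two allocation policies can be illustrated with a minimal sketch. The class name `PageAllocator`, the first-touch allocation behaviour and the `seed` parameter are assumptions made for illustration; they are not taken from the simulator described in Section 3.2.1.4.

```python
import random

PAGE_SIZE = 4 * 1024  # 4KB virtual pages, as used in this study


class PageAllocator:
    """Hypothetical allocator: assigns a physical frame on first touch."""

    def __init__(self, num_frames, policy="sequential", seed=0):
        frames = list(range(num_frames))
        if policy == "random":
            # Random allocation: frames are handed out in a random order.
            random.Random(seed).shuffle(frames)
        elif policy != "sequential":
            raise ValueError("policy must be 'sequential' or 'random'")
        # Reverse so that pop() returns frames in the intended order.
        self.free_frames = frames[::-1]
        self.page_table = {}  # virtual page number -> physical frame

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            self.page_table[vpn] = self.free_frames.pop()
        # The offset inside the page is never changed by translation.
        return self.page_table[vpn] * PAGE_SIZE + offset


seq = PageAllocator(num_frames=8, policy="sequential")
rnd = PageAllocator(num_frames=8, policy="random", seed=1)
for vpn in (5, 2, 7):
    vaddr = vpn * PAGE_SIZE
    print(vpn, seq.translate(vaddr) // PAGE_SIZE, rnd.translate(vaddr) // PAGE_SIZE)
```

With the sequential policy the first page touched lands in frame 0, the second in frame 1, and so on; with the random policy the frame order depends on the seed.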
The simulated machine for the virtual paging studies has the same configuration as the baseline machine shown in Table 3.1, except for the size of main memory. The 2GB main memory is replaced by 512MB, which consists of two ranks of eight 256Mbit (32Mx8) SDRAM devices. The main memory size is reduced to limit the size of the page table. The virtual page size is 4KB. Page swapping is unnecessary because 512MB of memory is large enough for all simulated benchmarks and only one benchmark is simulated at a time.
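The capacity figures above can be checked with a few lines of arithmetic. This is only a sanity check of the stated configuration; the variable names are illustrative, and the number of page frames is used here as a rough proxy for page table size.

```python
# Capacity and page-frame arithmetic for the configuration described above.
MBIT = 2**20                      # bits in one Mbit
KB, MB, GB = 2**10, 2**20, 2**30  # bytes

ranks = 2
devices_per_rank = 8
device_bits = 256 * MBIT          # one 256Mbit (32Mx8) SDRAM device
page_size = 4 * KB                # 4KB virtual page

capacity = ranks * devices_per_rank * device_bits // 8  # total bytes
frames_512mb = capacity // page_size
frames_2gb = (2 * GB) // page_size

print(capacity // MB)   # 512
print(frames_512mb)     # 131072
print(frames_2gb)       # 524288
```

Reducing main memory from 2GB to 512MB therefore cuts the number of 4KB page frames to be tracked by a factor of four.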
Figure 4.14 shows the average execution times of the simulated address mapping techniques under different virtual paging systems. Execution times are normalized to page interleaving with no virtual paging, to show the impact of both the address mapping techniques and the virtual paging system. When virtual paging is absent, bit-reversal reduces the execution time by 14% over page interleaving. With sequential allocation virtual paging, bit-reversal reduces the execution time by 6%. However, when random allocation virtual paging is used, all simulated address mapping techniques, including the flat, differ in performance by less than 2%.
4.4 Address Mapping Working under Other Techniques
Figure 4.14: Address mapping techniques under virtual paging systems (normalized execution time, on a scale from 0.0 to 1.0, of the Page, Permu, Intel925, Rank and BitRev mappings with no virtual paging, sequential allocation and random allocation)
translation, depending upon the page allocation algorithm used. A random allocation algorithm places pages randomly, and thus evenly, over all SDRAM banks. Therefore only spatial locality inside virtual pages is preserved. Any spatial locality beyond the size of a virtual page (4KB) is completely destroyed by the random page allocation algorithm, leaving little room for SDRAM address mapping to improve performance.
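The loss of locality above the page size can be made concrete with a small sketch. It counts how many consecutive virtual pages end up in physically adjacent frames under each policy. The page count of 1024 and the seed are arbitrary assumptions, and the helper `contiguous_pairs` is a hypothetical name.

```python
import random


def contiguous_pairs(frames):
    """Count consecutive virtual pages mapped to adjacent physical frames."""
    return sum(1 for a, b in zip(frames, frames[1:]) if b == a + 1)


num_pages = 1024  # arbitrary number of virtual pages touched in order

# frames[i] is the physical frame assigned to virtual page i.
sequential = list(range(num_pages))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

print("sequential:", contiguous_pairs(sequential), "of", num_pages - 1)
print("random:    ", contiguous_pairs(shuffled), "of", num_pages - 1)
```

Under sequential allocation every neighbouring pair of virtual pages stays physically adjacent, while under random allocation almost none do, so an address mapping can only exploit locality within a single 4KB page.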
Bit-reversal address mapping has been shown to work well with no virtual paging as well as with virtual paging using sequential page allocation. SDRAM address mapping has little effect on performance with the random allocation virtual paging system. Page placement in a modern operating system will likely fall somewhere between random and sequential allocation. Therefore SDRAM address mapping, especially bit-reversal, will still be able to improve performance, although the gain will be smaller and non-deterministic, depending upon the actual page placement.
An intelligent page allocation algorithm that is aware of the main memory structure and its nonuniform access latency could achieve better performance than sequential or random allocation. This requires cooperation with the compiler and/or the operating system, and will be part of the future work of this thesis.
Chapter 5
Access Reordering Mechanisms
Conventional memory controllers serve memory requests in the order in which they are received. While in-order scheduling is easy to implement, it is inefficient considering the nonuniform characteristics of main memory and the fact that modern processors usually have a set of outstanding accesses to choose from. Access reordering mechanisms attempt to schedule outstanding memory accesses in an order that increases row locality, thereby reducing overall execution time. This chapter presents the proposed burst scheduling, which clusters row hits into bursts to maximize the utilization of the SDRAM buses. The design space of burst scheduling is explored and optimizations are proposed. The performance of burst scheduling is examined and compared with existing access reordering mechanisms. Finally, the combination of access reordering mechanisms and SDRAM address mapping techniques is studied.
5.1 Philosophy of Burst Scheduling
In a packet switching network, data are encapsulated in packets, which are commonly composed of a header and a payload (data). The effective throughput of the network is usually less than the theoretical network bandwidth because a fraction of the bandwidth is used to transmit packet headers. One way to increase the throughput is to use large packets. Because the fraction of overhead due to the packet header shrinks as packet size increases, large packets usually result in high throughput.
Figure 5.1: Creating bursts from row hits (diagram labels: Access0 to Access4; P0, R0, C0 to C4; Burst (payload); Overhead)
Consider a main memory access. If the bank precharge, row activate and column access transactions are regarded as the overhead of a packet, and the actual data transaction is regarded as the payload, then the above reasoning about packet switching networks also applies to memory access scheduling: a larger payload results in higher data bus utilization.
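The payload versus overhead argument can be sketched as a simple ratio. The model below is a deliberate simplification that considers a single bank with no overlap between banks, and the cycle counts (12 overhead cycles, 4 data cycles per access) are hypothetical values chosen for illustration, not timing parameters from this thesis.

```python
def data_bus_utilization(num_row_hits, overhead_cycles, data_cycles_per_access):
    """Fraction of cycles carrying data when one overhead is shared by a burst.

    Simplified model: overhead (precharge, activate, column access latency)
    is paid once, then every row hit in the burst transfers its data
    back to back.
    """
    if num_row_hits < 1:
        raise ValueError("a burst needs at least one access")
    payload_cycles = num_row_hits * data_cycles_per_access
    return payload_cycles / (overhead_cycles + payload_cycles)


# Hypothetical timing: 12 overhead cycles, 4 data cycles per access.
for burst_size in (1, 2, 4, 8):
    print(burst_size, round(data_bus_utilization(burst_size, 12, 4), 3))
```

In this model a single access uses the data bus for only a quarter of the cycles, and the fraction rises toward one as more row hits share the same overhead.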
The data transaction size of each memory access is usually equal to the line size of the lowest level cache, so a large payload can be created by combining the data transactions of several accesses. As shown in Figure 5.1, accesses directed to the same row of the same bank are selected from all outstanding accesses and clustered together to form a burst. With an OP controller policy, the data transactions of the accesses inside a burst can be performed in back-to-back cycles, resulting in a large payload and high data bus utilization. A burst can grow as newly arrived accesses join an existing burst that is being scheduled. The larger a burst is, the higher its data bus utilization.
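Clustering row hits into bursts amounts to grouping outstanding accesses by their (bank, row) pair. The sketch below assumes a hypothetical `Access` record and arbitrary bank and row numbers; it illustrates the grouping step only and is not the scheduler's actual data structure.

```python
from collections import namedtuple

# Hypothetical record for an outstanding memory access.
Access = namedtuple("Access", ["name", "bank", "row"])


def form_bursts(outstanding):
    """Group accesses that target the same row of the same bank.

    Accesses keep their arrival order inside each burst, and a newly
    arrived access simply joins the burst that matches its bank and row.
    """
    bursts = {}
    for access in outstanding:
        bursts.setdefault((access.bank, access.row), []).append(access)
    return bursts


outstanding = [
    Access("access0", bank=0, row=7),
    Access("access1", bank=1, row=3),
    Access("access2", bank=0, row=7),
    Access("access3", bank=0, row=9),  # same bank, different row: new burst
    Access("access4", bank=1, row=3),
]

for (bank, row), burst in form_bursts(outstanding).items():
    print(f"bank {bank} row {row}:", [a.name for a in burst])
```

Accesses to a different row of the same bank form a separate burst, because they would require a precharge and a new row activate before their data could be transferred.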
One drawback of using large packets in a packet switching network is slow response time. Especially when variable sized packets are allowed, small packets experience long latency if they follow large packets. This issue becomes worse in burst scheduling, where the size of a burst can increase dynamically. Small bursts may starve when new accesses keep joining a burst that is being scheduled. The solution is to interleave bursts.
Figure 5.2: Interleaving bursts from different banks. (a) Without burst interleaving: Access0, Access5, Access6, Access7, Access1, Access2, Access3, Access4. (b) With burst interleaving: Access0, Access1, Access3, Access5, Access2, Access4, Access6, Access7. Legend: Burst A of Bank0, Burst B of Bank1, Burst C of Bank2.
As shown in Figure 5.2, three bursts are created from eight accesses directed to three different banks. Without burst interleaving, access5, access6 and access7, which arrive later, join burstA and are scheduled earlier than the accesses in burstB and burstC, resulting in long latencies for the older accesses in the small bursts, as illustrated in Figure 5.2(a). BurstB and burstC could starve if burstA keeps growing. To reduce the latency of old accesses and prevent starvation, the three bursts are interleaved as shown in Figure 5.2(b). The latency of the older accesses (access1 to access4) is reduced. While burst interleaving does not affect data bus utilization, it allows bursts of different sizes from different banks to be served with roughly equal opportunity, preventing starvation.
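The two access orders of Figure 5.2 can be reproduced with a short sketch. It assumes that interleaving takes one access from each non-empty burst in turn (a round-robin over banks). That assumption matches the order shown in Figure 5.2(b), but the function names and the round-robin rule are illustrative, not the scheduler's actual implementation.

```python
from collections import deque


def schedule_without_interleaving(bursts):
    """Drain each burst completely before moving to the next one."""
    return [access for burst in bursts for access in burst]


def schedule_with_interleaving(bursts):
    """Take one access from each non-empty burst in turn (round-robin)."""
    queues = deque(deque(burst) for burst in bursts if burst)
    order = []
    while queues:
        current = queues.popleft()
        order.append(current.popleft())
        if current:  # burst still has accesses: back of the line
            queues.append(current)
    return order


# Bursts of Figure 5.2, identified by access number.
burst_a = [0, 5, 6, 7]  # Bank0, grown by late arrivals 5, 6 and 7
burst_b = [1, 2]        # Bank1
burst_c = [3, 4]        # Bank2

print(schedule_without_interleaving([burst_a, burst_b, burst_c]))
# [0, 5, 6, 7, 1, 2, 3, 4]
print(schedule_with_interleaving([burst_a, burst_b, burst_c]))
# [0, 1, 3, 5, 2, 4, 6, 7]
```

With interleaving, access1 to access4 all complete within the first six slots, instead of waiting behind the late arrivals of burstA.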
Burst interleaving needs to be performed carefully, as bubble cycles may be introduced. For example, DDR2 devices require a rank-to-rank turnaround cycle to be inserted between two data transactions from different ranks [30]. Therefore interleaving bursts between different ranks will introduce many rank-to-rank turnaround cycles and degrade performance. Also, read accesses and write accesses usually have different profiles. Additional timing constraints, such as tWTR as shown in Table 2.1, need to be met when mixing reads
by burst scheduling.
Existing access reordering mechanisms, including row hit scheduling and Intel’s out of order scheduling, also attempt to combine row hits to exploit row locality and improve bus utilization. However, their row-hit-first policy is closer to a best effort at creating a large burst. There is no guarantee that the data transactions of the selected row hits are transferred in back-to-back cycles.
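For contrast, a generic row-hit-first selection rule can be sketched as follows. This is a simplified, assumed form of such a policy (oldest row hit first, otherwise the oldest access); it is not the exact algorithm of row hit scheduling or of Intel's scheduler, and the function name and data layout are hypothetical.

```python
def pick_row_hit_first(queue, open_rows):
    """Return the index of the next access to issue, or None if queue is empty.

    queue:     list of (bank, row) tuples, oldest access first.
    open_rows: dict mapping a bank to its currently open row.
    The choice is made one access at a time, so nothing forces the selected
    row hits to be transferred in back-to-back cycles.
    """
    for index, (bank, row) in enumerate(queue):
        if open_rows.get(bank) == row:
            return index          # oldest row hit wins
    return 0 if queue else None   # no row hit: fall back to the oldest access


queue = [(0, 5), (1, 2), (0, 7)]
print(pick_row_hit_first(queue, {0: 7, 1: 9}))  # 2
print(pick_row_hit_first(queue, {}))            # 0
```

Because the decision is re-evaluated for every access, row hits are preferred opportunistically rather than being grouped into an explicit burst.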