We conclude this section with an analysis of CPU trends in practice. Table 2.1 shows the evolution of Intel CPUs over the last decades. A few observations can be made. Most notable is the enormous increase in both the number of transistors and the CPU bandwidth (i.e. throughput). CPU latency shows a steady decrease, while clock rates are going up. However, latency and clock speed evolve at a much slower rate than transistor count and CPU throughput. This implies that most of the bandwidth gains do not come from improvements in sequential instruction throughput, but rather from an ever increasing amount of parallelism.
Back in 1965, Gordon Moore already predicted that the number of transistors per chip would double every year [Moo65], a rate he later revised to a doubling every two years. These added transistors led to more and more complex cores, with all kinds of infrastructure being added to keep pipelines filled (i.e. wide-issue CPUs with reorder buffers and out-of-order execution), resulting in higher and higher sequential CPU throughput.
20 CHAPTER 2. HARDWARE OVERVIEW
When parallelization techniques for processing pipelines started running dry, CPU vendors shifted focus towards increasing thread-level parallelism, by replicating partial or entire CPU cores. The consequence is that, besides exposing instruction-level parallelism, application programmers now also become responsible for writing explicitly parallelized programs in order to benefit from the added CPU power. In general, we can claim that the more complexity CPU vendors add over time, the higher the degree of parallelism becomes, and consequently the more difficult it becomes for programmers to utilize the available CPU resources to their (near) theoretical maximum.
The following subsections provide a brief chronological overview of techniques introduced by CPU vendors to increase potential performance, together with some notes about their impact on application programmers.
Aggressive Pipelining
Table 2.1 shows a clear trend towards deeper (i.e. more stages, visible as latency in clock cycles) and wider (i.e. issue width) pipelines. Increases in clock speed are tightly related to the initial rise in the number of pipeline stages (“super-pipelining”). By splitting the processing of instructions into ever smaller stages, CPU vendors were able to keep increasing CPU frequencies. As a marketing strategy, this worked very well, with customers buying into ever increasing clock speeds. A notable peak in this trend was Intel’s Pentium 4 Prescott architecture (not shown in the table). It had a 31-stage pipeline, accompanied by a maximum frequency of 3.8GHz.
Boosting CPU speeds through deeper pipelines does not necessarily lead to equal speedups in program execution times, though. This only holds for “CPU-friendly” code. With deeper pipelines, the relative cost of pipelining hazards increases as well, as more and more bubbles have to be inserted. Eventually, CPU vendors realized this, which is marked by a shift back towards shorter pipelines after Intel’s Pentium 4. These 14-16 stage pipelines still incur a significant penalty for pipeline bubbles, though, which stresses the importance of writing code that is free of hazards.
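The trade-off between a shorter clock cycle and a higher hazard penalty can be made concrete with a toy model. The sketch below is not taken from Table 2.1: the function name, the 10 ns of total logic delay, the 0.1 ns per-stage latch overhead and the hazard rates are all hypothetical, and it pessimistically assumes that every hazard costs a full pipeline refill.

```python
def time_per_instruction_ns(stages, hazard_rate,
                            logic_delay_ns=10.0, latch_overhead_ns=0.1):
    """Toy model of average time per instruction (all parameters hypothetical).

    stages            -- number of pipeline stages
    hazard_rate       -- fraction of instructions that cause a pipeline flush
    logic_delay_ns    -- total logic delay that is divided over the stages
    latch_overhead_ns -- fixed overhead added to every stage
    """
    # Splitting the same logic over more stages shortens the clock cycle.
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    # Assumption: each hazard inserts (stages - 1) bubbles before the
    # pipeline is full again, so the penalty grows with pipeline depth.
    cycles_per_instruction = 1.0 + hazard_rate * (stages - 1)
    return cycle_ns * cycles_per_instruction


def speedup_of_deeper_pipeline(shallow, deep, hazard_rate):
    """How much faster the deep pipeline is than the shallow one (>1 is faster)."""
    return (time_per_instruction_ns(shallow, hazard_rate)
            / time_per_instruction_ns(deep, hazard_rate))


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.2):
        s = speedup_of_deeper_pipeline(14, 31, rate)
        print(f"hazard rate {rate:.2f}: 31 vs 14 stages speedup = {s:.2f}x")
```

Under these assumptions, hazard-free code gets nearly the full benefit of the faster clock, while the advantage shrinks as the hazard rate grows and eventually turns into a slowdown.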
While Table 2.1 shows a stabilizing trend with respect to pipeline length (i.e. number of stages), the issue width, or pipeline width (i.e. number of parallel pipelines), keeps increasing. Intel’s forthcoming Haswell architecture even pushes the number to eight parallel pipelines. For programs that exhibit high data- and instruction-level parallelism, wider pipelines can contribute to improved instruction throughput (IPC). Furthermore, they enable more aggressive speculative execution. For example, the parallel pipelines could be utilized to execute both the if and else paths of an if-then-else construct, and then commit only the branch whose condition turns out to be satisfied. This way we utilize abundant CPU resources to avoid pipeline bubbles.
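The idea of computing both paths and committing only one has a well-known software analogue, often called predication or branch-free selection. The sketch below only illustrates the logic and is not the hardware mechanism itself; the function names are hypothetical, and in Python the branch-free variant is not expected to run faster. In a compiled language the same pattern can remove a hard-to-predict branch.

```python
def abs_diff_branching(a, b):
    """Reference version with an explicit if-then-else."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_both_paths(a, b):
    """Compute both paths unconditionally, then 'commit' one via a bit mask."""
    if_result = a - b      # result of the 'if' path
    else_result = b - a    # result of the 'else' path
    # mask is -1 (all bits set) when the condition holds, 0 otherwise.
    mask = -int(a > b)
    # Keep if_result where mask is all ones, else_result where it is zero.
    return (if_result & mask) | (else_result & ~mask)
```

Both functions return the same value for any pair of integers; the second one contains no conditional jump in its data path.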
Given these trends of deeper and wider pipelines, application programmers and compiler writers are becoming more and more responsible for coming up with clever ways to expose sufficient data- and instruction-level parallelism. Several such techniques were discussed earlier in this chapter.
Improved Lithography
Even though CPU vendors have moved back to shorter pipelines, clock speeds keep increasing. Stock Intel i7 CPUs currently ship at speeds up to 3.9GHz [Int]. Recently, chip vendor AMD even announced breaking the 5GHz barrier with its FX-9590 chip [Adv]. These gains in CPU speed can be attributed mostly to improvements in photolithography (CPU manufacturing technology). Due to ever smaller transistor sizes (current Intel i7s have 22nm wide circuit lines [Int]), the physical sizes of CPU cores keep decreasing. Smaller transistors and cores mean less heat dissipation and lower power consumption. Also, the smaller a core becomes, the shorter the distances that electrical signals have to travel while flowing through the instruction pipeline. This makes it possible to raise clock frequencies further [DGR+74].
2.4. CPU ARCHITECTURE 21
[Figure 2.7 plots, years ’82 to ’10, logarithmic y-axes. Left panel “Latency”: clock (MHz), latency (nsec), 1/latency_nsec. Right panel “Bandwidth”: transistors (thousands), MIPS.]
Figure 2.7: Visualization of trends in processor evolution (logarithmic scales).
However, as the left graph in Figure 2.7 shows, the exponential growth in clock frequencies is tapering off. This tapering off is likely to continue, as it is becoming increasingly difficult (and expensive) to scale manufacturing technologies further down [AHKB00, FDN+01]. Although single-atom transistors have been shown to be feasible [FMM+12], such technologies will be difficult to make cost-effective for mainstream use. A recent estimate based on economic factors expects a minimum lithographic process of around 7nm [Col13].
On the other hand, comparing the CPU latency in the left graph of Figure 2.7 with the bandwidth line on the right (measured in million instructions per second, or MIPS), we notice that bandwidth does not show any flattening and keeps increasing at a steady exponential rate. We can also see that bandwidth correlates nicely with the number of transistors per chip. How CPU vendors utilize these transistors to keep improving CPU throughput is discussed in the following sections.
Simultaneous Multithreading
With the advent of simultaneous multithreading (SMT) [TEL95] in later models of Intel’s Pentium 4, we see the start of a new trend in transistor utilization. Rather than adding constructs that try to exploit instruction-level parallelism (ILP) within a single thread of execution, entire parts of the CPU are replicated to improve CPU utilization by allowing multiple threads of execution to run simultaneously on a single CPU, thereby increasing thread-level parallelism (TLP). The most notable hardware changes required for SMT are the ability to fetch instructions from multiple streams within a single cycle, and larger register files to allow each thread to store its own program state.
To benefit from SMT, a software developer cannot simply rely on smart compiler and hardware techniques to utilize the added processing power. There is some automatic benefit, in that other processes or the OS itself can now run on the additional “virtual” core, potentially leaving more resources available for the program being optimized. But in general, programmers now have to introduce explicit parallelism into their code, for example by using multiple threads of execution (i.e. multi-threading).
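What introducing explicit parallelism means for the programmer can be sketched with a small example in which the work is divided over threads by hand. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, and because of CPython's global interpreter lock the threads below will not actually speed up this pure-Python loop. The point is the structure that the programmer now has to write.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(start, stop, step):
    """Work done by one thread: squares of start, start + step, ... below stop."""
    return sum(i * i for i in range(start, stop, step))


def parallel_sum_of_squares(n, workers=None):
    """Sum i*i for i in 0..n-1, with the work split explicitly over threads."""
    # Default to the number of logical CPUs (SMT threads are included).
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Thread k handles indices k, k + workers, k + 2 * workers, ...
        futures = [pool.submit(partial_sum_of_squares, k, n, workers)
                   for k in range(workers)]
        # Combining the partial results is also the programmer's job.
        return sum(f.result() for f in futures)
```

Compared to the sequential one-liner `sum(i * i for i in range(n))`, the programmer has to decide how many threads to use, how to divide the index range, and how to merge the partial results.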
Multi-core CPUs
Initial multiprocessing systems were built as symmetric multiprocessing (SMP) systems, where two or more identical but discrete CPUs share the same main memory. Reduced transistor sizes, combined with a move towards smaller cores that allow for faster clock rates, have eventually led vendors to spend the ever growing number of transistors on replicating entire processing cores on a single chip, leading to what we call multi-core CPUs, or chip multi-processors (CMP). Note that it is perfectly viable to combine SMP and CMP to build a system consisting of multiple multi-core CPUs, with each core often supporting SMT as well. CMP went mainstream in 2006 with Intel’s Core Duo, containing two cores, and current Intel CPUs have between 2 and 6 cores. In 2014, Intel is expected to release an 8-core Core i7 CPU based on its latest Haswell architecture. The corresponding server line, named Xeon, is expected to go as far as 15 cores.
For computation-intensive tasks, we can even detect a shift towards many-core processors, which have an order of magnitude more cores than the typical 2-8 found in regular desktop and server CPUs. These CPUs are aimed at the “super-computer” market, where a lot of computational power is needed to solve various simulation, modeling and forecasting style problems. Intel recently released its Xeon Phi processor line, with up to 62 cores, 244 hardware threads and 1.2 teraFLOPS of computing power [Int12b]. An alternative can be found in Tilera’s TILE processor line [Til13], where up to 72 identical, general-purpose 64-bit cores can be found on a single die.
With CPU latency crawling towards its physical limits, it can be expected that chip vendors will keep focusing on “innovation through increased CPU bandwidth”, by putting more and more cores on a single chip. This will put an increasing responsibility on developers to build software in ways that explicitly expose such thread-level parallelism. Such parallel programs are not only harder to write, they are also harder to optimize, as one needs to solve problems such as how to distribute work over (a varying number of) cores in such a way that potential contention points like communication buses and caches are utilized most efficiently. This topic is revisited in more detail in Section 2.5.4, where the so-called non-uniform memory access (NUMA) topology is discussed.
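One small part of the work distribution problem, dividing a fixed number of items over a varying number of cores, can be sketched as follows. The helper name `split_range` is hypothetical. It uses contiguous chunks on the assumption that a worker touching neighboring data tends to make better use of its caches, and it deliberately ignores NUMA placement and load imbalance between items.

```python
def split_range(n_items, n_workers):
    """Split indices 0..n_items-1 into n_workers contiguous (start, stop) ranges.

    Chunk sizes differ by at most one item, so no worker gets much more
    work than any other (assuming every item costs about the same).
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # The same workload divided over different core counts.
    for cores in (2, 4, 6):
        print(cores, split_range(10, cores))
```

For example, `split_range(10, 4)` returns `[(0, 3), (3, 6), (6, 8), (8, 10)]`. Because the number of workers is a parameter, the same code adapts when the program runs on a machine with a different number of cores.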
Heterogeneous Architectures
Increasing the number of cores by simple copying of existing cores is not the only approach taken by vendors. With hardware becoming more and more complex, common processing tasks are being delegated to specialized hardware.