3.2.1 The Power Wall
In the past, the ever-increasing performance of microprocessors was mainly achieved through two techniques: raising the processor’s clock rate and increasing the number of transistors. Higher clock rates result in higher performance without any modification of the processor’s microarchitecture and thus without the need to recompile or even modify applications to exploit the increased performance. Yet, high clock rates have significant implications for the power consumption of a microprocessor. In general, the power consumption p of a processor can be approximated by
\[ p = c \cdot v^{2} \cdot f, \tag{3.1} \]
with c being the capacitive load, v the voltage, and f the switching frequency of the transistors [153]. The capacitive load mainly depends on the number of transistors and the technology used, whereas the switching frequency is a function of the clock rate. Compared to the voltage, which contributes quadratically to the processor’s power consumption, the influence of the clock rate does not seem severe at first sight. However, increasing the clock rate, and thus the switching frequency of the transistors, also requires increasing the voltage to allow for stable operation. Because a higher clock rate thus raises both f and v, the clock rate is the parameter with the highest impact on the power consumption of a microprocessor.
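To make the combined effect concrete, the following minimal sketch evaluates Equation 3.1 for a clock rate increase with and without an accompanying voltage increase. The function names, the normalized baseline values, and in particular the assumption that the voltage has to rise in proportion to the frequency are illustrative and not taken from the text; real processors follow a more complicated voltage-frequency curve.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Approximate power consumption p = c * v^2 * f (Equation 3.1)."""
    return c * v ** 2 * f


def relative_power(freq_scale: float, volt_scale: float) -> float:
    """Power after scaling f and v, relative to a normalized baseline.

    The baseline c = v = f = 1.0 is arbitrary; only the ratio matters.
    """
    baseline = dynamic_power(1.0, 1.0, 1.0)
    scaled = dynamic_power(1.0, volt_scale, freq_scale)
    return scaled / baseline


if __name__ == "__main__":
    # Frequency alone: power grows linearly with the clock rate.
    print(relative_power(freq_scale=1.2, volt_scale=1.0))
    # ASSUMPTION: the voltage must rise in proportion to the frequency,
    # so the overall growth becomes cubic in the scaling factor.
    print(relative_power(freq_scale=1.2, volt_scale=1.2))
```

Under the proportional-voltage assumption, a 20% higher clock rate costs roughly 73% more power rather than 20%, since 1.2 cubed is 1.728.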
Although power consumption is critical for mobile computers that run on battery power, for workstations and servers the problems arise not from the power consumption itself but from the associated heat dissipation. For desktop computers and workstations, the heat can be dissipated by airflow. Though not very costly or technically challenging, the cooling fans required for this task produce noise, which can be annoying for users in an office environment. In large server installations, however, more elaborate and expensive cooling techniques are required to dissipate the heat produced by the microprocessors. Typically the same amount of energy is
required for cooling the computer system as was previously fed into the processor in the form of electrical energy. With steadily increasing prices for electrical power, the operational costs over a server’s lifetime can easily exceed its purchase price.
At some point, for which the expression power wall has been coined, the power consumption no longer allows the clock rate to be increased any further, for both economic and ecological reasons. A prominent example is Intel’s Pentium 4 processor series, which peaked at 3.8 GHz with the introduction of the Prescott core (based on the NetBurst microarchitecture). The thermal design power (TDP) of this processor was specified as 115 W, which had to be dissipated through a die surface of only 109 mm². Tejas, the planned successor to the Prescott processor, was canceled by Intel as it would have reached TDP values of over 150 W.
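As a rough illustration of what these figures mean, the power density of the Prescott die follows directly from the two values quoted above. This takes the TDP as an upper bound for the dissipated power and assumes, for simplicity, that the heat is spread evenly across the die:

\[
\frac{115\,\mathrm{W}}{109\,\mathrm{mm}^{2}} \approx 1.06\,\mathrm{W/mm^{2}} \approx 106\,\mathrm{W/cm^{2}}
\]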
3.2.2 From Single- to Multi-Core
Since the power wall prevents microprocessor manufacturers from steadily increasing computational power by raising the clock rate, they have to concentrate on the second technique: increasing the number of transistors. Fortunately, Moore’s law, which states that the number of transistors in a microprocessor doubles every 18 to 24 months, still holds. Before the power wall was hit, these additional transistors were mainly used to enlarge the processor’s cache and to increase the microarchitecture’s complexity by adding more functional units, either specialized or redundant, to the processor core. Though bigger caches can increase the performance of many applications by buffering accesses to frequently used memory regions, they do not increase the processor’s performance itself. They only alleviate the impact of slow accesses to main memory, whose performance has grown at a substantially lower rate in recent years than that of the microprocessors.
Having several independent functional units within a processor core allows for parallel execution of multiple instructions, so-called instruction-level parallelism (ILP). The processor exploits ILP transparently, i.e., the concurrent execution of machine instructions is hidden from the programmer, who only needs to provide sequential code. This requires elaborate techniques such as pipelining, out-of-order execution, multiple issue, dynamic scheduling, and speculative execution, to name but a few. These techniques increase the microarchitecture’s complexity tremendously as they induce additional administration overhead. Furthermore, data and control dependencies that cannot be resolved by the microprocessor often prevent parallel execution. However, such dependencies can often be detected at compile time, and hence the compiler can reorganize the resulting machine code to facilitate ILP.
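How data dependencies limit ILP can be sketched with a toy model. The following example is not a description of any real scheduler: the function `min_cycles`, the instruction names, the one-cycle latency, and the unlimited number of functional units are all assumptions made for illustration. Under these assumptions, the minimum execution time equals the length of the longest dependency chain.

```python
from typing import Dict, List


def min_cycles(deps: Dict[str, List[str]]) -> int:
    """Length of the longest dependency chain in a toy instruction set.

    `deps` maps each instruction name to the instructions whose results
    it needs. ASSUMPTIONS: every instruction takes one cycle, there are
    unlimited functional units, and the dependency graph has no cycles,
    so only data dependencies limit parallel execution.
    """
    depth: Dict[str, int] = {}

    def earliest_cycle(instr: str) -> int:
        if instr not in depth:
            # An instruction completes one cycle after its last producer.
            depth[instr] = 1 + max(
                (earliest_cycle(d) for d in deps[instr]), default=0
            )
        return depth[instr]

    return max((earliest_cycle(i) for i in deps), default=0)


# Four independent additions: all can execute in the same cycle.
independent = {"a": [], "b": [], "c": [], "d": []}

# A chain such as x = ((p + q) + r) + s: each step needs the previous one.
chained = {"t1": [], "t2": ["t1"], "t3": ["t2"], "t4": ["t3"]}

if __name__ == "__main__":
    print(min_cycles(independent))
    print(min_cycles(chained))
```

In this model, the four independent instructions finish in one cycle, whereas the chain of four needs four cycles regardless of how many functional units are available.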
Besides exploiting instruction-level parallelism, redundant functional units in a microprocessor can also be used to exploit data-level parallelism, i.e., to execute the same operation on multiple input values concurrently. While it is difficult for the processor to detect and exploit data-level parallelism automatically, it is typically much easier for the compiler. For example, adding two vectors of length n requires n add operations that are independent of each other and can therefore be executed in parallel. As this computation is typically implemented as a
loop over the vectors, the individual iterations can easily be identified by a compiler as data parallel and hence be marked for parallel execution. For that purpose, most current superscalar microprocessors provide SIMD (single instruction, multiple data) instructions that operate on vectors instead of scalars. The best-known examples of SIMD instruction sets are the MMX and SSE extensions of Intel’s and AMD’s x86 processors and the AltiVec instructions as implemented in PowerPC processors. These instructions typically operate on vectors of two or four integer or floating-point values and hence allow for parallelization at a very fine-grained level. In the case of the above vector example, two iterations of the add loop can be fused into a single SIMD operation and be executed concurrently. Since the data and control dependencies are resolved automatically by the compiler, this approach requires no additional circuits for dependency detection within the processor and no interaction by the programmer.
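The vector addition example can be modeled as follows. This is only a conceptual sketch: Python executes both variants sequentially, so the code merely counts how many add instructions a scalar loop and a vectorized loop would issue. The lane width `LANES = 4` and the function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

LANES = 4  # ASSUMPTION: four values per SIMD register


def scalar_add(a: Sequence[float], b: Sequence[float]) -> Tuple[List[float], int]:
    """Add element by element; returns the result and the number of adds."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = [x + y for x, y in zip(a, b)]
    return result, len(result)


def simd_style_add(
    a: Sequence[float], b: Sequence[float], lanes: int = LANES
) -> Tuple[List[float], int]:
    """Model of a vectorized loop: each step handles `lanes` elements."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result: List[float] = []
    operations = 0
    for start in range(0, len(a), lanes):
        # One modeled SIMD instruction processes a whole chunk; a shorter
        # final chunk stands in for the remainder handling of a real loop.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        operations += 1
    return result, operations


if __name__ == "__main__":
    x = list(range(10))
    print(scalar_add(x, x)[1])      # number of scalar add operations
    print(simd_style_add(x, x)[1])  # number of modeled SIMD operations
```

Both variants compute the same result; for n elements, the scalar loop issues n additions, while the modeled vectorized loop issues n divided by the lane width, rounded up.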
Although parallelism can be hidden from the programmer to a certain degree by the techniques discussed above, many algorithms and programs exhibit a more coarse-grained type of parallelism that requires different approaches in order to be exploited. Server applications, for example, have natural parallelism among the queries that are submitted by client applications. These queries are usually independent of each other and can therefore be processed concurrently. This higher-level parallelism is known as thread-level parallelism (TLP) as it can be processed in separate threads of execution. Note that the term “thread” in this context does not necessarily correspond to the concept of a thread in operating systems terminology. Each thread of execution can be executed either as a separate process or as a separate thread (i.e., a light-weight process).
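The server scenario can be sketched with Python's standard `concurrent.futures` module. The handler `handle_query` and the query strings are hypothetical placeholders for real server work; only the structure, independent queries dispatched to separate threads of execution, reflects the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def handle_query(query: str) -> str:
    """Hypothetical request handler; stands in for real server work."""
    return query.upper()


def serve(queries: List[str], workers: int = 4) -> List[str]:
    """Process independent queries in separate threads of execution."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, although the queries may
        # be handled concurrently and finish in any order.
        return list(pool.map(handle_query, queries))


if __name__ == "__main__":
    print(serve(["get /index", "get /about", "get /news"]))
```

Because the queries share no state, no synchronization is needed. Note that in CPython, threads mainly benefit I/O-bound work because of the global interpreter lock; `ProcessPoolExecutor` from the same module provides the process-based variant of the same pattern.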
TLP can be used to increase the utilization of the functional units of a superscalar microprocessor: simultaneous multithreading (SMT) allows for interleaving the individual instructions of multiple threads into a single stream that can be processed in parallel by applying the techniques for exploiting ILP listed above. The processor presents itself to the operating system as a multiprocessor (with two or more logical processors), which enables the operating system to schedule threads or processes on the logical processors for parallel execution as on any other (real) multiprocessor system. Since the instructions of one thread are independent of any other thread’s instructions, SMT can help to keep the processor’s pipeline filled when it would otherwise stall, e.g., in case of unresolvable data or control dependencies, or while waiting for memory accesses or I/O operations to finish. Intel introduced its SMT implementation, Hyper-Threading Technology (HTT), in 2003 with the Pentium 4 and claimed that HTT increases the performance of many applications by up to 30% while requiring only 5% of the processor’s die size [93]. Although SMT can obviously increase the performance of microprocessors at little cost, it is only a tool to keep the complex superscalar processor core busy. The straightforward approach to exploiting TLP is certainly to execute the individual threads in parallel on multiple processors. However, while multiprocessor systems are common in HPC and in enterprise server infrastructures, the desktop and workstation market has been dominated by single-processor machines. Yet, this changed with the introduction of chip multiprocessors (CMP), which are better known as multi-core processors: the steadily increasing number of transistors available on a processor die (thanks to Moore’s law) not only allows for increasing the number of functional