Moore’s law led to rapid increases in the transistor budgets available to hardware designers. During the period of Dennard scaling, these extra transistors were used to add features to increasingly complex single-core processors which were optimised for high clock speeds and single-threaded performance. This is exemplified by Intel’s Netburst architecture, which was in use between 2000 and 2006 and featured in their Pentium 4 processors [59].

A large proportion of Netburst’s transistor budget went towards creating highly superscalar processors [78]. These processors featured extremely deep instruction pipelines up to 36 stages in length, along with extensive hardware to support the dynamic scheduling, speculative execution and instruction reordering required to keep these pipelines full. This approach became unsustainable after Dennard scaling ended in 2006, when thermal problems caused by high power consumption led to the abandonment of the Netburst architecture and the end of the Pentium product line [98].

Hardware designers have since developed a number of architectural features to improve the energy efficiency of their processors in the post-Dennard era. Each of the features listed below seeks to minimise some subset of the parameters in the CMOS power equations listed above.

3.4.1 Multi-core Processors

The most notable change to processor design in recent years has been the introduction of Multi-core Central Processing Units (CPUs). Multi-core architectures replace the monolithic processors of the past with a collection of smaller, simpler interconnected cores. These cores operate independently while sharing access to common hardware like last-level caches and main memory.

Smaller cores have fewer components and shorter wire lengths, both of which lead to reduced load capacitance and leakage current. Multiple cores also amplify the effects of other energy efficiency features listed below.

Multi-core processors prioritise throughput over single-threaded performance. Performance engineers must therefore accept the overhead of parallelising their codes in order to see the benefit of this architectural approach.

3.4.2 Clock Gating

Of all the subsystems in modern processors, the clock tree has the potential to be the most power hungry. Clock trees distribute the signal from a central clock across all areas of a processor, which inevitably means they have long wire lengths and high load capacitance. Furthermore, their activity factor is maximal by definition; circuits carrying the clock signal will change state with every clock cycle.

Clock gating reduces power consumption by disconnecting or gating those parts of the clock tree which are connected to idle logic. The activity factor of gated subtrees drops to zero, meaning they only incur leakage power costs.

Performance engineers can maximise the benefit of clock gating by batching similar operations together. In the case of multi-core processors, similar logic can also be pinned to particular cores. Both of these approaches result in longer idle periods for the affected subsystems, increasing the likelihood of clock gating.
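The effect of gating on the activity factor can be illustrated with a minimal sketch. It assumes the usual dynamic power form (activity factor × load capacitance × supply voltage squared × clock frequency) for the CMOS power equations referred to above; the `ClockSubtree` type, the subtree names and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClockSubtree:
    """One branch of a clock tree (hypothetical model, illustrative values)."""
    name: str
    capacitance_f: float  # load capacitance in farads
    leakage_w: float      # leakage power in watts
    gated: bool = False


def dynamic_power_w(subtree, voltage_v, freq_hz):
    # A clock net toggles every cycle, so its activity factor is 1.0
    # unless the subtree is gated, in which case it drops to 0.0.
    activity = 0.0 if subtree.gated else 1.0
    return activity * subtree.capacitance_f * voltage_v ** 2 * freq_hz


def total_power_w(subtrees, voltage_v, freq_hz):
    # Gated subtrees still pay their leakage cost.
    return sum(dynamic_power_w(s, voltage_v, freq_hz) + s.leakage_w
               for s in subtrees)


subtrees = [
    ClockSubtree("integer_units", 2e-10, 0.05),
    ClockSubtree("fp_units", 3e-10, 0.08),
    ClockSubtree("vector_units", 4e-10, 0.10),
]

before = total_power_w(subtrees, voltage_v=1.0, freq_hz=2e9)

# Suppose the floating point and vector logic are idle and get gated.
subtrees[1].gated = True
subtrees[2].gated = True
after = total_power_w(subtrees, voltage_v=1.0, freq_hz=2e9)

print(f"ungated: {before:.2f} W, gated: {after:.2f} W")
```

In this toy model the gated subtrees contribute only their leakage term, which is why longer idle periods for a subsystem translate directly into lower clock tree power.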

3.4.3 Dynamic Voltage and Frequency Scaling

Equation 3.2 shows that dynamic power consumption grows quadratically with supply voltage. Small reductions in supply voltage therefore have the potential to deliver significant reductions in power consumption. Unfortunately, the switching speed of transistors also decreases when they operate at lower voltages. This increases the propagation delay of CMOS logic, which can lead to timing errors if this delay exceeds the clock period.
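As a worked illustration of the quadratic dependence, assume Equation 3.2 takes the standard form for dynamic power, with activity factor \alpha, load capacitance C_L, supply voltage V_{dd} and clock frequency f (the symbol names here are assumptions and may differ from those used in the equation itself). A 10% reduction in supply voltage at fixed frequency then gives:

\[
P_{dyn} = \alpha C_L V_{dd}^{2} f
\qquad\Rightarrow\qquad
\frac{P'_{dyn}}{P_{dyn}} = \frac{\alpha C_L (0.9\,V_{dd})^{2} f}{\alpha C_L V_{dd}^{2} f} = 0.9^{2} = 0.81
\]

Under this assumed form, a 10% drop in voltage yields a 19% drop in dynamic power, provided the logic still meets timing at the unchanged clock period.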

Dynamic Voltage and Frequency Scaling (DVFS) gets around this issue by scaling supply voltage and clock frequency in tandem. Lower supply voltages are paired with slower clock speeds in order to give CMOS logic enough time to finish operating. These matched supply voltage and clock frequency pairs are called P-States. DVFS allows processors to choose from a set of predefined P-States based on their current workload.
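The idea of choosing from a predefined table of P-States can be sketched as follows. This is a hypothetical selection policy, not a description of any real governor or hardware mechanism: the `PState` type, the table values and the `select_p_state` function are all assumptions made for illustration.

```python
from typing import NamedTuple


class PState(NamedTuple):
    """A matched supply voltage and clock frequency pair (illustrative)."""
    name: str
    voltage_v: float
    freq_hz: float


# Hypothetical P-State table; real tables are defined by the processor vendor.
P_STATES = [
    PState("P0", 1.2, 3.0e9),
    PState("P1", 1.0, 2.5e9),
    PState("P2", 0.8, 2.0e9),
]


def select_p_state(demand_hz, states=P_STATES):
    """Pick the slowest P-State whose frequency still covers the demand.

    `demand_hz` is an assumed estimate of the cycles per second the
    current workload needs. If no state is fast enough, fall back to
    the fastest one available.
    """
    adequate = [s for s in states if s.freq_hz >= demand_hz]
    if not adequate:
        return max(states, key=lambda s: s.freq_hz)
    return min(adequate, key=lambda s: s.freq_hz)


for demand in (1.0e9, 2.2e9, 5.0e9):
    state = select_p_state(demand)
    print(f"demand {demand / 1e9:.1f} GHz -> {state.name} "
          f"({state.voltage_v} V, {state.freq_hz / 1e9:.1f} GHz)")
```

Choosing the slowest adequate state reflects the goal of running at the lowest supply voltage that the workload can tolerate.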

Under DVFS, dynamic power consumption falls roughly with the cube of clock frequency: supply voltage contributes a quadratic term to Equation 3.2, clock frequency is itself a parameter in that equation, and the two are scaled together. The relationship between DVFS and energy is less obvious because reduced power consumption can be offset by longer runtimes. DVFS is most effective when performance does not depend on clock speed; a processor may enter lower power states while it waits for data, for example.
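The trade-off between power and runtime can be explored with a small model. This is a minimal sketch under several assumptions: dynamic power follows the standard activity × capacitance × voltage squared × frequency form, voltage is scaled in proportion to frequency, static power is constant, and the processor draws the same power while stalled on memory as while computing. The P-State table, workload figures and function names are hypothetical.

```python
# Hypothetical P-States as (supply voltage in volts, clock frequency in hertz).
# Voltage is proportional to frequency here, which gives cubic power scaling.
P_STATES = [(1.2, 3.0e9), (1.0, 2.5e9), (0.8, 2.0e9)]


def dynamic_power_w(voltage_v, freq_hz, activity=1.0, capacitance_f=1e-9):
    # Assumed form of the dynamic power equation.
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def runtime_s(freq_hz, compute_cycles, stall_s):
    # Compute time shrinks with clock speed; time spent waiting for data
    # is modelled as independent of clock speed.
    return compute_cycles / freq_hz + stall_s


def energy_j(voltage_v, freq_hz, compute_cycles, stall_s, static_w=0.5):
    # Simplification: the same total power is drawn for the whole runtime.
    power_w = dynamic_power_w(voltage_v, freq_hz) + static_w
    return power_w * runtime_s(freq_hz, compute_cycles, stall_s)


# Illustrative workloads as (compute cycles, stall time in seconds).
WORKLOADS = {
    "compute-bound": (3.0e9, 0.0),
    "memory-bound": (0.3e9, 0.9),
}

for name, (cycles, stall) in WORKLOADS.items():
    for voltage_v, freq_hz in P_STATES:
        t = runtime_s(freq_hz, cycles, stall)
        e = energy_j(voltage_v, freq_hz, cycles, stall)
        print(f"{name:14s} {freq_hz / 1e9:.1f} GHz  "
              f"runtime {t:.2f} s  energy {e:.2f} J")
```

In this model the memory-bound workload loses very little runtime at the slowest P-State, while the compute-bound workload slows down in direct proportion to the clock. Raising the assumed `static_w` shifts the balance further: with a large enough static term, the longer runtime of the compute-bound workload outweighs the dynamic power saving.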

3.4.4 Heterogeneous Computing

Heterogeneous computing takes two main forms. The first and most common form of heterogeneity is the inclusion of accelerators or other special-purpose hardware within compute nodes. These accelerators augment the capabilities of general-purpose CPUs by allowing them to offload specific tasks.

A second form of heterogeneity involves building special-purpose compute cores directly into processors. ARM’s “big.LITTLE” concept is one example of this kind, in which smaller, more energy efficient cores are twinned with larger, more performant ones [97]. Work migrates between these cores as required to meet performance and energy efficiency targets. Advanced energy-aware scheduling techniques are necessary to take advantage of this type of heterogeneous architecture [123].

The Green500 list of energy efficient supercomputers is dominated by heterogeneous architectures [114]. All but one of the top thirty machines in the November 2016 list make extensive use of accelerators, while the remaining machine is based on a custom heterogeneous processor design [2].