0 0.2 0.4 0.6 0.8 1
Power (uW)
Frequency (GHz)
ideal DVFS Single supply Dual supply Triple supply Dual supply, dithering Triple supply, dithering
Figure 3.7: Pipelined Adder with register (or relay station) based distributed clock gating
reduce effective frequency, we are actually increasing internal activity; hence power increase. When we choose a schedule of duty less than 50%, we enable the pipeline only once after two or more clock cycles. Hence we are decreasing activity by prolonging withdrawal of disable signal. Thus we see a decline in power consumption. A sudden fall in power in dual and triple supply cases is due to change in supply voltage. Since schedules in range 2/3 to 1 are not used in the dithering cases, resulting curves do not show any peak. Nonetheless the power behavior is far from an ideal DVFS.
From this first set of experiments, we conclude that we can rule out distributed clock-gating mechanism based on register (or relay stations). Thus, for next set we only considered global clock-gating and latch-based distributed clock-gating.
3.5 A Pipelined NoC Switch
We evaluated power consumed by a pipelined switch that we designed for a NoC used as interconnect fabric for a GALS system. As it is shown in Fig. 3.8, our switch is connected to four other switches located on north, east, south and west. All the switches are arranged to form a mesh topology. Since the switch is also connected to a processing element or a memory block, it has in total five input and five output ports, each 64-bits wide. Flow control uses wormhole technique whereas routing uses the XY dimension ordered method. It has been proven that it avoids deadlock [15]. Elementary unit of
flow control is a 64-bit flit which can be sent or received in one clock cycle. Multiple flits (in variable number) form a packet. Figure 3.8shows the switch’s internal architecture which is derived from the architecture in [16] and consists of three pipeline stages. First stage is for link traversal and FIFO buffer write. Second one is for arbitration and routing pre-computation. And third one is for switch traversal (through a non-blocking crossbar). The pipelined nature of the switch makes it an ideal candidate for application of the methodologies described in the previous sections.
Arbitration Switch
Crossbar
3rd stage: switch traversal 2nd stage: switch allocation
1st stage: link traversal & FIFO write Input FIFO
Figure 3.8: A NoC Switch
DVFS controller of switch schedules pulses of a master clock via clock-gating to obtain an effective clock frequency. It selects appropriate supply voltage through pMOS headers. It interfaces through 64-bits links with a local computational resource and with four other switches (located on north, west, east and south), each potentially running at different effective frequency and voltage. Resynchronizing FIFOs and voltage level shifters are used to permit correct interfacing. Figure 3.8shows switch connections and reveals its internal structure that consists of three pipeline stages. First stage is for link traversal and FIFO buffer write. Second one is for arbitration and routing pre-computation. And third one is for switch traversal (through a non-blocking crossbar).
In global clock-gating implementation, scheduler disables clock simultaneously for all pipeline registers. In distributed case, scheduler applies its enable signal to third stage.
Local clock-gating controller of third stage forwards enable signal to second stage after one clock cycle and in turn second one sends it to first stage, after two clock cycles. In both global and distributed cases, input FIFO write pointer has an autonomous local clock-gating circuit to avoid dropping incoming input packets when scheduler disables clock. When switch is clock gated, an output invalidation mechanism prevents surround-ing switches and local resource from samplsurround-ing its outputs (the most significant bit of output data is used as validity bit).
DVFS controller schematized in figure 3.8 selects among only two voltages, Vhigh and Vlow. However, we evaluated power consumption also for case of three voltages, with and without dithering, other than the single voltage and the ideal DVFS case. We tested both global and distributed clock-gating and for this last case, we only considered the latch-based implementation, having previously ruled out relay station method.
We implemented this switch in 45 nm technology used for previous set of experi-ments. Procedure outlined in previous section allowed us to estimate clock frequency of 0.923GHz at the highest voltage of 1.15V. It is then possible to scale voltage down to 0.89V for an effective clock frequency half the value at 1.15V (0.923/2=0.4615GHz), and to 0.75V for an effective frequency one third of the highest value (0.923/3=0.3077GHz).
Trends in figure 3.9 and 3.10 are in accordance with our expectations; and similar to those of simple pipelined adder case. Results highlight though a less effective scaling of power at a single supply voltage with respect to previous case: Ratio between power at full speed and at zero activity is 2.5x whereas it was 6x for simple pipelined adder.
Especially in upper schedule range, from 2/3 to 1, scaling frequency is not really effective.
Main reason is larger impact of leakage power in this more complex circuit. Two voltage and three voltage cases without dithering suffer from same problem in that upper range, because voltage used for those schedules is still 1.15V. Once again, dithering is the best option resulting in power curves that track well with the ideal DVFS case. Difference between global and latch based distributed clock-gating is small. Since there is no significant overhead for slightly more complex distributed approach, we recommend use of this method.
Cost of controller is small 0.9% in power and 0.4% in area of the switch. Therefore our approach to DVFS is very cost efficient.
3.6 Summary
We have discussed a simplified DVFS mechanism that uses voltage dithering among few levels (two or three voltages) for dynamic voltage scaling, combined with a frequency scaling method based on clock scheduling. This method is intended for applications in GALS systems using an aggressive, per block power management strategy.
To validate our technique, we have designed some circuits in a 45 nm CMOS technology and characterized them with standard synthesis and power evaluation tools. We have compared power versus frequency curves with those obtained by an ideal DVFS system ( one that chooses voltage and frequency in a continuous range).
0 5 10 15 20 25 30
0 0.2 0.4 0.6 0.8 1
Power (uW)
Frequency (GHz) ideal DVFS
Single supply Dual supply Triple supply
Dual supply, dithering Triple supply, dithering
Figure 3.9: Switch with global clock-gating
0 5 10 15 20 25 30
0 0.2 0.4 0.6 0.8 1
Power (uW)
Frequency (GHz) ideal DVFS
Single supply Dual supply Triple supply
Dual supply, dithering Triple supply, dithering
Figure 3.10: Switch with latch based distributed clock-gating
We have evaluated two different implementations of clock scheduling. First method cou-ples clock scheduler with a global clock-gating circuit. Second one uses a distributed clock-gating approach. For this method, we have also evaluated two alternative im-plementations, register-based and latch-based. We have found that the register-based gating method is not effective in reducing power. Latch based distributed clock-gating and global method produce similar results and are able to track well with power versus frequency curve of an ideal DVFS system.