28-03-2012, 03:50 PM
Power Minimization of Functional Units by Partially
Guarded Computation
Power Minimization of Functional Units by Partially Guarded Computation.pdf (Size: 89.06 KB / Downloads: 153)
ABSTRACT
This paper deals with power minimization problem for datadominated
applications based on a novel concept called partially
guarded computation. We divide a functional unit into two parts –
MSP (Most Significant Part) and LSP (Least Significant Part) -
and allow the functional unit to perform only the LSP
computation if the range of output data can be covered by LSP.
We dynamically disable MSP computation to remove unnecessary
transitions thereby reducing power consumption. We also propose
a systematic approach for determining optimal location of the
boundary between the two parts during high-level synthesis.
Experimental results show about 10~44% power reduction with
about 30~36% area overhead and less than 3% delay overhead in
functional units.
Keywords
Low Power, Partially Guarded Computation
1. Introduction
Recently, electronics systems market has proliferated rapidly
toward portable computing and communication systems thereby
increasing demands for considering low power during VLSI
design [1, 2]. From the viewpoint of long battery life and high
reliability, power dissipation has become one of the major
objectives during synthesis procedure. In CMOS circuits, most of
the power dissipation is caused by charging and discharging load
capacitance of gates. Therefore, it is crucial to minimize the
number of signal transitions in circuits for low power design.
In high-level synthesis domain, there have been quite a few
studies devoted to minimize transitions in functional units,
registers, multiplexers, and buses [3 - 11]. Many of them focus on
minimizing transition activity in functional units because they are
the main source of power dissipation in data dominated
applications [3 - 8]. The most effective method to reduce the
number of transitions in functional units is increasing the
correlation of input data. Therefore, many of the previous work
focus on increasing input data correlation by changing operation
binding [3, 8], loop pipelining [7], loop interchange, operand
reordering, operand sharing, unrolling [5], and guarded evaluation
[11].
In this paper, we propose yet another technique which we call
partially guarded computation. The technique disables a part of a
functional unit based on dynamic range of input operands. We
divide a functional unit into two parts – MSP (Most Significant
Part) and LSP (Least Significant Part) - and allow only the LSP
computation when the range of input operands is covered by the
range of the LSP. For the division of a functional unit, we propose
a systematic method that finds the location of boundary that
maximizes power reduction. We also propose an effective
operation and operand binding algorithm for high level synthesis
in order to maximize the effect of the proposed technique.
This paper is organized as follows. In section II, we present the
basic concept of partially guarded computation and explain the
application to adders and multipliers. Section III proposes an
algorithm for dividing a functional unit into two parts. In section
IV, we propose an effective operation binding algorithm for
maximizing power reduction by our partially guarded
computation technique. We show experimental results in section
V and conclude our work in section VI.
2. Partially Guarded Computation
2.1 Basic Concept
In designing signal processing applications, we determine the
word lengths of functional units based on dynamic range of input
data such that maximum range of the data does not exceed the
word length of functional units. However, real data is generally
limited to small range in most cases, and the case of maximum
range rarely occurs. Figure 1 shows our motivational example
which is composed of a short segment of a speech data and
associated range information. As shown in Fig. 1 (b), the
maximum range of the speech data is 14 bits. However, about
60 % of the data do not have range larger than 8 bits. The extra 6
bits belong to sign extension region of the data. Generally, we
don’ t need to perform expensive computation for the sign
extension region because we can compute sign bit by looking at
only the LSB side. If we assume that the speech data is used as
one input of an adder and dynamic range of the other input does
not exceed 8 bits at all, we don’ t need to perform computation for
the sign extension bits during 60% of total execution cycles. By
performing only 8 bit addition for such cases, we can reduce
unnecessary transitions in the sign extension part of the adder
thereby reducing power consumption. Disabling the computation
of the sign extension part can be achieved by preventing
propagation of input data transition to the sign extension part of a
functional unit.
(a) (b)
Fig. 1. Example speech data and associated range.
Figure 2 illustrates an RTL implementation of our partially
guarded computation and timing diagram for the circuit operation.
2
We assume that input registers are reserved for each functional
unit, where the registers play the role of guard latches for guarded
evaluation technique [11]. We divide the output data of a
functional unit into two parts - MSP and LSP – such that the
range of the output data does not exceed the maximum range of
LSP in as many cases as possible while keeping large range of
MSP. The bit lengths (or maximum ranges) of MSP and LSP are
denoted as NM and NL, respectively. Corresponding to the division
of the output data, the functional unit is also divided into two
parts: MSP and LSP. The functional unit performs computation in
only the LSP when the detection logic signals that the range of the
output data will not exceed NL. Otherwise, the functional unit
performs computation in both MSP and LSP. To disable the
computation in MSP, we use guarded evaluation technique by
inserting guard latches to the inputs of the MSP. The inputs are
composed of the MSB side of primary inputs and carry inputs
propagated from the LSP. When the functional unit performs only
LSP computation, sign extension logic produces correct output
data by extending sign bit from the output data of LSP.
Detection Logic CLK Reg 1 Reg 2
CLK
CLK
SELA
SELB
input1 input2
output
MSP LSP
latch-C
latch-P
Sign Extension Logic
Load
(a) RTL implementation
C LK
L oa d
in p u t1
SE L A
SE L B
in p u t2
o u tp u t
(b) Timing diagram.
Fig. 2. RTL implementation of partially guarded computation
circuitry and timing diagram.
The output signal of detection logic, which is denoted as SELA,
is connected to the enable signal of guard latches. The inputs of
the detection logic are not connected to the outputs of input
registers but to the inputs of input registers to have SELA asserted
before the next input data is loaded into the guard latches, which
is crucial for correct guarding. However, SELB, which is
connected to the sign extension logic, is asserted after the rising
edge of clock. This is to guarantee the output of the functional
unit to be loaded correctly into the input register of other
functional units. The detection logic asserts zero when both
input1 and input2 generate output data ranges not exceeding NL.
The detection logic can be implemented simply by using NM input
AND gates or NM input NOR gates1 and D-latches.
To reduce unnecessary power dissipation in detection logic, we
implement the circuit such that it is enabled only when the register
load signal is enabled. When the detection logic is disabled, it
holds the previous value of SELA by using internal latch. Since it
takes time for the input1 and input2 to be stable, there may occur
glitches on SELA. There are two types of glitches (1 ® 0 ® 1
glitch and 0 ® 1 ® 0 glitch) according to the initial state of
SELA. The first type glitch (1 ® 0 ® 1) does not induce much
power dissipation because 1 ® 0 and 0 ® 1 transitions in SELA
generate no transitions in the already enabled MSP. However, the
second type (0 ® 1 ® 0) causes unnecessary transitions in the
previously disabled MSP. Therefore, we remove the second type
glitch by delaying SELA until CLK’ s rising edge when the value
of SELA is 1. In this case, SELA may be asserted after input data is
loaded. However, it does not break our guarding scheme because
SELA needs to be asserted before CLK only when SELA changes
from 1 to 0.
2.2 Implementation of Functional Units
Figure 3 illustrates the implementation of partially guarded
circuitry for ripple carry adder. Since there may occur overflow
when we add two numbers, we set the input bit length of LSP as
NL-1. Sign extension logic can be simply implemented by using
multiplexers.
FA FA FA FA FA FA FA FA
1 0
D-latch
1 0 1 0
Input MSP Input LSP
SELB
Sign Extension Logic
Output MSP Output MSP
NM+1 NL-1
NM NL
D-latch D-latch D-latch
Fig. 3. Partially guarded circuitry for ripple carry adder.
Figure 4 illustrates the implementation of partially guarded
circuitry for signed array multiplier. We insert the guard latches
between the MSP and LSP of the full adder array. Let N1
S and
N2
S be the bit length of sign extension region for input1 and
input2, respectively. The bit length of the sign extension region of
the output is computed as N1
S+N2
S. If it exceeds the range of
MSP, then we disable the latches and the MSP and activate only
the LSP. However, dynamically extracting the values of N1
S and
N2
S from the input patterns and computing the values N1
S+N2
S to
see if it exceeds the range of the MSP require too much
computation resulting in high circuit overhead and power
consumption. To avoid such problem we disable the MSP only
1 We need to check leading successive one’ s and zero’ s for both
input1 and input2.
3
when N1
S and N2
S independently exceed their own pre-determined
bounds N1
M and N2
M, respectively. There can be one or more
combinations of bounds N1
M and N2
M such that N1
M+N2
M=NM. In
the next section, we present an algorithm that determines the
values of N1
M and N2
M from the statistics of the inputs such that
the resulting power reduction is maximized. Once the values are
determined, the guard latches are inserted between
(NM=N1
M+N2
M) th bit position and (NM+1) th bit position from the
MSB. The sign extension logic is implemented in the same way as
ripple carry adder. We use two types of multiplier FA’ and FA,
where FA’ is a slightly modified version of FA for correct
computation of sign bit [15].
N1
N L 1
M
Input1 MSP Input1 LSP
FA’ FA FA FA
FA’ FA’ FA FA
FA’ FA’ FA’ FA
FA’ FA’ FA’ FA’
FA’ FA’ FA’ FA’
Output MSP Output LSP
D-latch D-latch D-latch D-latch
NL=N1
L+N2
NM=N L 1
M+N2
M
D-latch
D-latch
D-latch
Input2 MSP Input2 LSP
N2
N L 2
M
0
0 0 0 0
1 0 1 0 1 0
SELB
1 0 Sign Extension Logic
Fig. 4. Partially guarded circuitry for signed array multiplier.
Due to the overhead of the augmented circuitry, we do not
expect much power reduction in the ripple carry adder. However,
in case of the array multiplier, we can obtain much power
reduction because it contains much more computing elements
than the adder.
3. Boundary Positioning Algorithm
3.1 Problem Formulation
We denote the two inputs of a functional unit as in1 and in2,
respectively and assume they have the same bit length which is
denoted as N. The j th bit of ini is denoted as ini. The MSB and
the LSB of ini correspond to ini
0 and ini
N-1, respectively. We
denote the location of the boundary between the MSP and LSP of
input data ini as Di. Di has a value between 0 and N-1. If j < Di,
ini
j belongs to the MSP. Otherwise, it belongs to the LSP. Note
that, in case of multiplier, there can be one or more combinations
of D1 and D2 satisfying D1 + D2 = NM. We formulate the boundary
positioning problem as follows:
Given streams of input data to a functional unit, determine
the set of boundary location DI’ s such that power reduction
by partially guarded computation is maximized.
We obtain the input data streams to the functional unit by
performing behavioral simulation with the simulation vectors
given by the user.
3.2 Algorithm
In the case of multiplier, the bit length of output MSP (NM) is
computed as D1 + D2. Therefore, there can be more than one
combinations of boundary points D1 and D2 for the same value of
NM. In this section, we consider the case where only single
combination of D1 and D2 is selected. Extended problem for
finding multiple combinations of D1 and D2 will be treated in the
next section. We solve the single boundary positioning problem
by selecting an optimal combination of D1 and D2 which gives the
largest power reduction among all possible combinations. We
denote the power cost per single operation of the MSP of a
functional unit Ft and the power cost per single operation of
detection logic as P1
t,M(D1, D2) and PDET(D1, D2), respectively.
We denote total power reduction in the MSP of Ft as Pt,M(D1, D2)
which is computed by accumulating P1
t,M(D1, D2) whenever the
MSP is disabled. The total power reduction in Ft, which is
denoted as Pt(D1, D2), is computed as
Pt(D1, D2) = Pt,M(D1, D2) - b×I×PDET(D1, D2) (1)
where b and I represent weighting factor and the size of input data
stream, respectively. The value of b is determined experimentally.
We compute Pt(D1, D2) for all combinations of D1 and D2. In the
case of adder, D1 and D2 must have the same value because the
range of the output MSP is determined by the minimum of D1 and
D2. Therefore, there are only N combinations. However, in the
case of multiplier, D1 and D2 may have different values because
the range of the output MSP is computed as the sum of D1 and D2.
Therefore, there are N´N combinations of boundary points. We
define the power cost of the MSP as weighted number of full
adder cells in the part.
1. PositionBoundary
2. INP: input data stream;
3. begin
4. Initialize Pt(D1, D2) and Pt,M(D1, D2) to 0;
5. for each input data from INP loop
6. for D1 =0 to N-1 loop
7. for D2 =0 to N-1 loop
8. if both D1 and D2 are sign bits then
9. Pt(D1, D2) += P1
t,M(D1, D2);
10. end if;
11. Pt(D1, D2) -= b×PDET(D1, D2);
12. end for;
13. end for;
14. end for;
15. Sel_D1 = Sel_D2 = 0;
16. for D1 =0 to N-1 loop
17. for D2 =0 to N-1 loop
18. if Pt(D1, D2) > Pt(Sel_D1, Sel_D2) then
19. Sel_D1 =D1 ; Sel_D2 = D2 ;
20. end if;
21. end for;
22. end for;
23. end;
Fig. 5. Pseudo code of boundary positioning algorithm.
4
The power cost PM
ADD(D1, D2) of ripple carry adder is defined
as
1 2 1 2
1
, P (D ,D ) D D ADD M ADD ADD =a × =a × (2)
where aADD is a weighting factor to reflect the effect of sign bit
transition2 whose value is obtained experimentally. The power
cost of the detection logic is computed as
PDET(D1, D2) = D1+D2 (3)
The power cost P1
MULT,M(D1, D2) of array multiplier is defined as
P1
MULT,M(D1, D2) =aMULT ×(D1+D2-1)×(D1+D2)/2
if (D1+D2) £ N
P1
MULT,M(D1, D2) = aMULT×(N×(N-1)-(2N-D1-D2)×(2N-D1-D2-1)/2)
otherwise (4)
Figure 5 shows the pseudo code of our boundary positioning
algorithm. In the first loop including the loops nested inside, we
compute Pt(D1, D2) for all combinations of D1 and D2. We
accumulate P1
t,M(D1, D2) to Pt(D1, D2) when both D1 and D2
belong to the sign extension regions of the current input data
whereas b×PDET(D1, D2) is subtracted from Pt(D1, D2) for every
input data. After we obtain Pt(D1, D2) in the first loop, we select
D1 and D2 which induce the largest value of Pt(D1, D2) in the
second loop. The complexity of the algorithm is O(N2×I) , where I
is the number data in the input data streams.
Figure 6 illustrates an example of 16 bit array multiplier in 11th
FIR filter, where b is assumed to be 0.4. After we apply the
algorithm in Fig. 5, we obtain the 16 ´ 16 table PMULT(D1, D2)
whose value corresponds to z-axis in the figure. The table
indicates that power reduction is maximized if we choose 3 and
12 as D1 and D2, respectively.
FRVWGDW
PDWUL[
063ELW RI ,QSXW
063ELW RI ,QSXW
&RVW
Fig. 6. Example of Boundary Positioning.
2 We give more weight when there is signal transition in sign bit.
3.3 Extension to Multiple Boundary
Combinations
If we check multiple combinations of boundary points, we can
increase the duration in which MSP of the multiplier is disabled.
Though checking multiple boundary points helps reducing MSP
power, it increases detection logic overhead. However, overhead
in detection logic is not linearly proportional to the number of
boundary points because there are lots of common components
among detection logics with different boundary points.
The problem of determining multiple boundary points is
formulated as selecting nD combinations of D1,i and D2,i such that
total power reduction is maximized and D1,i+D2,i=NM for i=0, ... ,
nD-1. We adopt the same scheme for positioning multiple
boundary points as proposed in the previous section. We
exhaustively evaluate all the candidate solutions using the cost
function PMULT(D1,i, D2,i) and select the best one. If we assume nD
is given by user, the complexity of the multiple boundary
positioning algorithm can be computed as ÷ ÷
ø
ö
ç çè
æ
× ×
D n
N
M N , where
the worst case complexity is O(M×N1+N/2) which is almost
intractable. However, from the experimental results, we find that
nD=3 is sufficient in most cases because we cannot obtain
significant power improvement even if we increase the value of nD
larger than 3. The graph in Fig. 7 shows the effect of the number
of boundary points on power improvement. Note that the curves
begin to saturate from the point where nD =3.
Fig. 7. Effect of multiple boundary points on power
improvement.
4. Operation Binding Algorithm
Since partially guarded computation technique reduces power
consumption based on dynamic range of the input data to a
functional unit, obtainable power improvement strongly depends
on the status of operation and operand binding. In general,
increased input correlation reduces power consumption in a
functional unit. However, it does not always hold when the
functional unit supports partially guarded computation. For
example, let us compare two input data sequences: sequence A
(111101 ® 101110) and sequence B (111101 ® 000010). From
the viewpoint of input data correlation, sequence A is superior to
B. However, if we use partially guarded computation technique,
sequence B can result in lower power design than sequence A.
Note that the boundary point of sequence A is located at the 4th bit
5
from the MSB, whereas that of sequence B is located at the 1st bit.
Thus, we need to devise a new operation and operand binding
algorithm that can consider tradeoff between data correlation and
dynamic range.
For this purpose, we propose a greedy algorithm that
incrementally binds an operation to a functional unit while
evaluating the power cost of the current design. After all the
operations are scheduled, we perform operation binding such that
power cost is minimized. We compute the power cost by
multiplying power reduction ratio to the estimated power of the
current design obtained using DBT model [16]. The power
reduction ratio can be computed using equation (1) ~ (4). We
perform operation binding for each control step starting from
initial control step. For each control step, we compute power costs
of all the feasible bindings for each operation scheduled at current
control step. Among the feasible bindings we select the one with
minimum power cost. We repeat above procedure until all the
operations at current control step are bound. When we compute
power costs of all the feasible bindings, we consider swapping of
input operands and select a solution with minimum cost.