29-04-2014, 12:25 PM
A VLSI Inner Product Macrocell
Abstract
Microcontrollers for embedded computer applications require a
library of dedicated macrocells for specific applications.
Arithmetic and basic digital signal processor (DSP) computations
may be too inefficient when computed by software on the core
central processing unit (CPU) of the microcontroller. This paper
defines and develops the architecture of a VLSI macrocell for the
8-bit ST9 microcontroller that computes the inner (scalar)
product of two vectors of integer numbers, based on the
multiply/accumulate algorithm. The arithmetic core of the
macrocell is an integer pipeline. The macrocell fully interfaces
to the ST9 environment and is optimized to achieve the maximum
performance compatible with the bandwidth of the ST9 bus and the
minimum consumption of silicon area. The macrocell is implemented
in CMOSM5H technology (0.7 µm channel width), and its
performance, measured in terms of silicon area and throughput, is
evaluated.
INTRODUCTION
Microcontrollers for embedded computer applications require a
library of dedicated macrocells for specific applications. Among
the most frequently required functions are digital signal
processor (DSP) and image processor (IP) algorithms for
telecommunication applications. Such algorithms are based on a
few types of fundamental arithmetic computations: addition,
scaling (i.e., multiplication by a constant), multiplication,
discrete convolution, inner (scalar) product of vectors, and
matrix product. Some nonlinear operations (e.g., comparison) may
be required too; these are normally reducible to arithmetic
operations. The algorithms work in integer arithmetic, because
data (samples) are frequently of integer type at the source and,
moreover, floating point algorithms are easily reducible to
algorithms working in integer arithmetic.
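
One common way to reduce a floating point computation to integer arithmetic is fixed-point scaling. The sketch below is only an illustration of that general idea, not a method taken from the paper; the format (8 fractional bits) and the names FRAC_BITS, to_fixed and fixed_inner_product are assumptions made for the example.

```python
FRAC_BITS = 8  # assumed fixed-point format: stored integer = value * 2**8

def to_fixed(value):
    """Convert a real number to its scaled integer representation."""
    return round(value * (1 << FRAC_BITS))

def fixed_inner_product(x, y):
    """Inner product computed entirely on integers, rescaled once at the end."""
    acc = 0
    for a, b in zip(x, y):
        acc += to_fixed(a) * to_fixed(b)  # each product carries 2*FRAC_BITS fractional bits
    return acc / (1 << (2 * FRAC_BITS))

print(fixed_inner_product([0.5, -1.25], [2.0, 0.75]))  # 0.0625
```

The values in the usage line are exactly representable with 8 fractional bits, so the result is exact; in general the conversion introduces a rounding error that depends on the number of fractional bits chosen.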
Arithmetic computations may be too inefficient when done by
software on the core central processing unit (CPU) of the
microcontroller. A dedicated macrocell must then be integrated in
the microcontroller. This paper defines and develops the
architecture of a VLSI macrocell for the ST9 microcontroller (see
Fig. 1) [1], [2], dedicated to the computation of the inner
product of two vectors of integer numbers, based on the
multiply/accumulate algorithm.
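
As a plain software reference for the multiply/accumulate algorithm mentioned above, a minimal sketch follows. It shows only the arithmetic (one multiplication and one addition per pair of elements) and says nothing about how the macrocell pipelines these steps; the function name inner_product is chosen for illustration.

```python
def inner_product(x, y):
    """Inner product by multiply/accumulate over two integer vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    acc = 0                # accumulator, cleared before the first element
    for a, b in zip(x, y):
        acc += a * b       # multiply one pair, then accumulate
    return acc

print(inner_product([1, -2, 3], [4, 5, -6]))  # -24
```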
Specifications
The size and the type of the data processed by the VCU must be
programmable: the elements of the source vectors can be 8- or
16-bit unsigned or signed (two's complement) integers, in all
possible combinations, and the result of the inner product must
be represented over 32 bits, as an unsigned or signed integer
depending on the type of the operands. All these arithmetic
situations occur frequently in applications. The vectors can span
the whole memories or register file of the core CPU, with
arbitrary base and stride; this is also required for performing
matrix multiplication efficiently. Finally, the VCU must detect
overflow and support suspend and resume.
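
The requirements above (programmable element size and signedness, arbitrary base and stride, 32-bit result, overflow detection) can be sketched as a small software model. This is a rough illustration under stated assumptions, not the VCU's actual behaviour: the byte order (little-endian), the rule that the result is signed when either operand is signed, the sticky overflow flag, and all names (FORMATS, load, strided_inner_product, the descriptor keys) are assumptions made for the example.

```python
import struct

# (signedness, bits) -> struct format code; the keys are hypothetical names.
FORMATS = {("u", 8): "B", ("s", 8): "b", ("u", 16): "H", ("s", 16): "h"}

def load(mem, addr, sign, bits):
    """Read one element from a flat byte memory (little-endian assumed)."""
    return struct.unpack_from("<" + FORMATS[(sign, bits)], mem, addr)[0]

def strided_inner_product(mem, n, x, y):
    """x and y describe a vector each: base address, stride in bytes, sign, bits."""
    # Assumption: the 32-bit result is signed if either operand type is signed.
    signed = x["sign"] == "s" or y["sign"] == "s"
    lo, hi = (-2**31, 2**31 - 1) if signed else (0, 2**32 - 1)
    acc = 0
    overflow = False
    for i in range(n):
        a = load(mem, x["base"] + i * x["stride"], x["sign"], x["bits"])
        b = load(mem, y["base"] + i * y["stride"], y["sign"], y["bits"])
        acc += a * b
        if not lo <= acc <= hi:
            overflow = True            # sticky flag (assumed behaviour)
    result = acc & 0xFFFFFFFF          # keep the low 32 bits
    if signed and result >= 2**31:
        result -= 2**32                # reinterpret as two's complement
    return result, overflow

# Two 2x2 byte matrices stored row by row: A at address 0, B at address 4.
mem = bytes([1, 2, 3, 4, 5, 6, 7, 8])
row0_of_a = {"base": 0, "stride": 1, "sign": "u", "bits": 8}
col0_of_b = {"base": 4, "stride": 2, "sign": "u", "bits": 8}
print(strided_inner_product(mem, 2, row0_of_a, col0_of_b))  # (19, False)
```

The usage lines show why arbitrary stride matters for matrix multiplication: a row of A is read with stride 1 and a column of B with stride equal to the row length, so each element of the product matrix is one inner product with no data rearrangement.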
The VCU is connected to the core CPU of the microcontroller as a
peripheral unit, and is programmable by means of a file of
sixteen 8-bit registers, allocated in the peripheral addressing
space of the core CPU (see Table I).
CONCLUSION
Comparisons with software (SW) solutions show that the VCU has a
far higher time efficiency. For instance, compared with the best
known SW routine written in ST9 machine language for computing
the inner product of two vectors with 8-bit elements (unsigned or
signed, but not mixed) located in memory, the VCU achieves a
speedup factor of 19. For vectors located in the register file, a
speedup factor of about 50 can be reached. A further improvement
occurs when the two vectors are of heterogeneous arithmetic type:
in this case the VCU directly handles the multiplication of mixed
unsigned/signed factors, because it contains a suitable parallel
multiplier, whereas the ST9 machine language lacks an appropriate
instruction and requires sequences of instructions to perform
such multiplications.
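
To illustrate why a mixed signed/unsigned multiplication costs extra steps in software, the sketch below builds one from an unsigned multiplication plus a correction. It assumes, purely for the example, that only an unsigned 8x8 multiply is available; it is not the ST9 routine referred to above, and the name mul_signed_unsigned is hypothetical.

```python
def mul_signed_unsigned(a_bits, b):
    """Multiply a signed 8-bit value by an unsigned 8-bit value.

    a_bits is the raw two's complement bit pattern of the signed factor
    (0..255); b is the unsigned factor (0..255).
    """
    product = a_bits * b      # unsigned 8x8 multiply, 16-bit result
    if a_bits & 0x80:         # sign bit set: the signed value is a_bits - 256
        product -= b << 8     # correction step: subtract 256 * b
    return product

print(mul_signed_unsigned(0xFD, 200))  # 0xFD is -3, so this prints -600
```

The test and the correction are the extra work per element pair; a multiplier that accepts mixed operand types directly, as described above, avoids them.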
It must be noted that the placement and routing of the layout
were performed automatically. Manual optimization should reduce
the consumption of silicon area; a reduction of about 50% is
considered possible.
Future research may concentrate on further optimization of the
pipeline stages, for instance reduction and accumulation, and on
upscaling the VCU for larger vector elements (32 bits).