25-08-2017, 09:32 PM
Using Embedded Multipliers in Spartan-3 FPGAs
Introduction
Spartan-3 FPGAs have a number of features to fortify the chip’s arithmetic capabilities. Carry
logic and dedicated carry routing continue to be provided as in past generations. Dedicated
AND gates in the CLBs accelerate array multiplication operations. The newest and most
significant addition is the dedicated 18x18 two’s-complement multiplier block. With 4 to 104 of
these dedicated multipliers in each device, fast arithmetic functions can be implemented with
minimal use of the general-purpose resources. In addition to the performance advantage,
dedicated multipliers require less power than CLB-based multipliers.
The embedded multipliers offer fast, efficient means to create 18-bit signed by 18-bit signed
multiplication products. The multiplier blocks share routing resources with the Block
SelectRAM™ memory, allowing for increased efficiency for many applications. Cascading of
multipliers can be implemented with additional logic resources in local Spartan-3 slices.
Applications such as signed-signed, signed-unsigned, and unsigned-unsigned multiplication;
logical, arithmetic, and barrel shifters; and two’s-complement and magnitude return are easily
implemented.
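As a rough illustration of how shifters and magnitude return map onto a multiplier, the sketch below models them in Python as multiplications by a one-hot factor or by +1/-1. The function names and the 16-bit width chosen for the right shift are assumptions made for this example, not details taken from the application note.

```python
def shift_left(value: int, k: int) -> int:
    """Left shift of an 18-bit signed value as a multiply by the one-hot factor 2^k."""
    assert -(1 << 17) <= value < (1 << 17)
    assert 0 <= k <= 16  # 2^17 does not fit as a positive 18-bit signed factor
    return value * (1 << k)


def shift_right_logical(value: int, k: int) -> int:
    """Logical right shift of a 16-bit unsigned value.

    Multiply by 2^(16-k) and keep product bits [31:16] (assumed bit slice).
    """
    assert 0 <= value < (1 << 16) and 0 <= k <= 16
    product = value * (1 << (16 - k))
    return (product >> 16) & 0xFFFF


def magnitude(value: int) -> int:
    """Magnitude return: multiply by -1 when the input is negative, by +1 otherwise."""
    assert -(1 << 17) <= value < (1 << 17)
    return value * (-1 if value < 0 else 1)
```

In each case the second multiplier input is just a decoded shift amount or a sign-dependent constant, which is why little extra logic is needed around the multiplier block.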
The 18-bit x 18-bit multipliers can be quickly created using the CORE Generator™ system, or
they can be instantiated (or inferred) using VHDL or Verilog.
Data Flow
Each embedded multiplier block (MULT18X18 primitive) supports two independent dynamic
data input ports: 18-bit signed or 17-bit unsigned. The two inputs are referred to as the
multiplicand and the multiplier, or the factors, while the output is the product. The MULT18X18
primitive is illustrated in Figure 1.
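The data flow can be approximated with a small behavioral model. This is a Python sketch, not the primitive's simulation model; the function names are hypothetical, and it assumes unsigned operation is obtained by holding the 18th (sign) bit of each input at zero.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mult18x18(a_bits: int, b_bits: int) -> int:
    """Take two 18-bit input patterns and return the 36-bit product pattern."""
    product = to_signed(a_bits, 18) * to_signed(b_bits, 18)
    return product & ((1 << 36) - 1)  # wrap into the 36-bit P output


def mult17x17_unsigned(a: int, b: int) -> int:
    """Unsigned use: 17-bit operands with the sign bit of each port at zero."""
    assert 0 <= a < (1 << 17) and 0 <= b < (1 << 17)
    return mult18x18(a, b)
```

For example, `to_signed(mult18x18(0x3FFFF, 2), 36)` interprets the first input as -1 and yields -2.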
Timing Specification
The result is generated faster for the LSBs than the MSBs, since the MSBs require more levels
of addition, so timing specifications are different for each of the 36 multiplier outputs. Designs
should use only as many output bits as are necessary. For example, if two unsigned numbers
will never have a product of 2^35 or higher, the P[35] output is always zero. For any pair of signed
numbers of n bits, if you will never have -2^(n-1) x -2^(n-1), then the MSB is always identical to the
next lower-order bit (P[2n-1] = P[2n-2]). Also consider that if some outputs must have longer
routing delays, they should be put on the output LSBs to balance with the MSB delays.
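The claim about the redundant MSB can be checked exhaustively for small word sizes. The following Python sketch uses hypothetical helper names and does not model timing; it only confirms that P[2n-1] equals P[2n-2] for every pair of n-bit signed inputs except the case where both inputs are -2^(n-1).

```python
from itertools import product


def top_two_bits(a: int, b: int, n: int) -> tuple:
    """Return (P[2n-1], P[2n-2]) of the 2n-bit two's-complement product a * b."""
    p = (a * b) & ((1 << (2 * n)) - 1)
    return (p >> (2 * n - 1)) & 1, (p >> (2 * n - 2)) & 1


def msb_is_redundant(n: int) -> bool:
    """True if the top two product bits agree for all pairs but the excluded one."""
    most_negative = -(1 << (n - 1))
    values = range(most_negative, 1 << (n - 1))
    for a, b in product(values, repeat=2):
        if a == most_negative and b == most_negative:
            continue  # the single pair the text excludes
        msb, next_bit = top_two_bits(a, b, n)
        if msb != next_bit:
            return False
    return True
```

When the excluded pair cannot occur in a design, P[2n-1] can be left unconnected and P[2n-2] used as the sign bit, which avoids the slowest output.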
For the same reason, the data input setup time for the pipelined multiplier will be shorter for the
MSBs than the LSBs, but the timing parameters do not differentiate between pins for setup
time. For additional safety margin in a design, slower inputs should be put on the MSBs. The
Reset and Clock Enable inputs have much faster setup times than any of the data inputs, and
all have zero hold times. The timing parameter name "tMULIDCK" (MULtiplier Input Data to
ClocK) is used for both the data and control inputs, but will have different values for each type.
Multipliers in the Spartan-3 Architecture
The multipliers are located adjacent to the block RAM, making it convenient to store inputs or
results in the block memory (see Figure 4). There are two or four columns of multipliers in each
device. Where there are two columns, they have two rows of CLBs between them and the edge,
allowing the multiplier to be easily driven by CLB or IOB logic. There are four CLBs, or 16 slices
and 32 LUTs, on either side of a given multiplier block, allowing 32 input and output signals to
be connected immediately adjacent to the multiplier block. One possible high-speed layout is to
put A[15:0] on one side, B[15:0] on the other side, and intersperse the P[31:0] outputs on both
sides. For a full-size 18x18 multiplier, the extra inputs and outputs can connect to the next CLB
column. For best performance, pipeline the inputs with registers in the adjacent CLBs.
Expanding Multipliers
Multiplication using inputs with more than 18 bits is possible by decomposing the multiplication
process into smaller subprocesses. The binary representation of either input can be split at any
point, provided the proper weighting and sign of the MSBs are taken into account. Splitting off the
18 MSBs of the input makes the best use of the 18-bit signed multipliers.
For example, Figure 5 shows how a 22x16 multiplier could be implemented. The 22-bit value is
decomposed into an 18-bit signed value and a 4-bit unsigned value from the LSBs. Two partial
products are formed. The first is a 20-bit signed product, which is the result of multiplying the
16-bit signed value by the 4-bit unsigned section. The second is a 34-bit signed product, formed
by multiplying the 16-bit signed value by the 18-bit signed section. The addition process
restores the weighting of the products (note the least significant bits of the first product bypass
the addition) and forms the final 38-bit product. Since the first product is signed, the 20-bit value
needs to be sign-extended before addition. The adder itself only needs to be 34 bits, requiring
17 slices.
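The decomposition described above can be sketched in Python to check the bit widths. The function and helper names are hypothetical; the code models only the arithmetic, with the 22-bit input split into an 18-bit signed upper section and a 4-bit unsigned lower section as in the text.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a `bits`-wide two's-complement number."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def mult_22x16(a: int, b: int) -> int:
    """Multiply a 22-bit signed `a` by a 16-bit signed `b` using two partial products."""
    assert fits_signed(a, 22) and fits_signed(b, 16)

    a_hi = a >> 4    # upper 18 bits, signed (arithmetic shift keeps the sign)
    a_lo = a & 0xF   # lower 4 bits, unsigned
    assert fits_signed(a_hi, 18)

    p_lo = b * a_lo  # first partial product: 20-bit signed
    p_hi = b * a_hi  # second partial product: 34-bit signed
    assert fits_signed(p_lo, 20) and fits_signed(p_hi, 34)

    # The four LSBs of p_lo bypass the adder; the remaining 16 bits are
    # sign-extended (Python's >> on a negative int does this) and added to p_hi.
    upper = p_hi + (p_lo >> 4)
    assert fits_signed(upper, 34)  # a 34-bit adder is sufficient

    result = (upper << 4) | (p_lo & 0xF)
    assert fits_signed(result, 38)
    return result
```

The assertions mirror the widths quoted in the text: a 20-bit and a 34-bit partial product, a 34-bit adder, and a 38-bit result.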
Design Entry
There are many options for including the Spartan-3 multiplier in a design. The library primitives
MULT18X18 and MULT18X18S described earlier can be instantiated in the schematic or HDL
code. Synthesis tools, including Xilinx XST, Synplicity Synplify, and Mentor LeonardoSpectrum,
can infer a multiplier block from the multiply operator. They infer the MULT18X18S
when the operation is controlled by a clock, producing a synchronous multiplier.
LeonardoSpectrum features a pipelined multiplier that inserts levels of registers in the
logic to introduce parallelism and, as a result, uses CLB resources instead of the dedicated
multipliers. A specific construct in the input RTL source code is required for the
pipelined multiplier feature to take effect. See the Synthesis and Simulation Design Guide for
more information.
System Generator
The Multiplier Generator is used by the System Generator for DSP when the MULT block is
used. System Generator presents a high-level, abstract view of the design, but also
exposes key features in the underlying silicon, making it possible to build extremely
high-performance FPGA implementations. The System Generator also provides blocks for compiling
MATLAB® M-code into synthesizable HDL code. The System Generator uses the embedded
multiplier when a parallel multiplier is selected and the use of the dedicated multiplier is
checked in the System Generator interface.
MAC Cores
The CORE Generator system and the System Generator can also implement more complex
functions using the multiplier as a building block. The Multiply Accumulator (MAC) core
supports up to 32-bit inputs and optional user-defined pipelining. The options of an Embedded
or LUT Based implementation control whether the dedicated multipliers or CLB resources are
used for the function. The MAC implementation uses relatively few CLB resources beyond the
dedicated multipliers and provides flexibility that is key to matching a design to the lowest
density and lowest cost solution possible.
The MAC and MAC-based FIR filters include automatic pipeline control based on the
required system clock performance. Pipeline levels are inserted automatically, as the
design requirement dictates, to balance speed against area.