23-08-2012, 12:29 PM
A 0.18μm VLSI Technology Based 64 Points Fast Fourier Transform Kernel
Abstract
In this report, we present a thorough VLSI
implementation of a 64-point FFT/IFFT IP core with signed
fixed-point 16-bit word length accuracy, primarily for IEEE
802.11a wireless Local Area Network applications. Such a kernel
could also be integrated into a vast range of modern Imaging
Radar Systems and Real-time Signal Processing Systems. At the
algorithm level, our 64-point FFT is computed by
decomposing it into a 2-D structure of 8-point FFTs.
Compared with a traditional radix-2 64-point FFT, this
decomposition greatly reduces the workload of the complex
multiplier unit and yields much better system performance
with respect to processing speed and power consumption.
Complex multiplications are realized with shifters and
adders at double precision, and no RAM cell is required for
coefficient storage. The proposed FFT kernel is based on 0.18 μm
CMOS technology, simulated in the Synopsys VCS environment, and
compiled and synthesized in the design_vision environment.
The simulated core area of the chip is 2.0 mm². Dynamic power
consumption is 15 mW at a 68 MHz operating frequency and a 1.8 V
supply voltage. In summary, our design greatly
outperforms the original target specifications, and the FFT
kernel's overall performance is satisfactory.
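The 2-D decomposition mentioned in the abstract can be illustrated with a minimal floating-point sketch in Python. It is not the authors' RTL: the index mapping n = 8*n1 + n2, k = k1 + 8*k2 and the names `dft` and `fft64_2d` are assumptions chosen for illustration, and the 8-point transforms are evaluated directly rather than with a hardware butterfly network.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT; serves as the 8-point block and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fft64_2d(x):
    """64-point DFT built from two stages of eight 8-point DFTs.

    Assumed index mapping (illustrative): n = 8*n1 + n2, k = k1 + 8*k2.
    """
    assert len(x) == 64
    # Stage 1: for each n2, an 8-point DFT over n1 -> stage1[n2][k1]
    stage1 = [dft([x[8 * n1 + n2] for n1 in range(8)]) for n2 in range(8)]
    # Inter-stage twiddle factors W64^(n2*k1)
    tw = [[stage1[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / 64)
           for k1 in range(8)] for n2 in range(8)]
    # Stage 2: for each k1, an 8-point DFT over n2 -> output bin k1 + 8*k2
    out = [0j] * 64
    for k1 in range(8):
        col = dft([tw[n2][k1] for n2 in range(8)])
        for k2 in range(8):
            out[k1 + 8 * k2] = col[k2]
    return out
```

In this mapping, only the 49 inter-stage factors with n2 >= 1 and k1 >= 1 differ from 1, which gives one way to see where the reduction in complex multiplier work comes from.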
INTRODUCTION
In most of today’s wireless communication standards,
Orthogonal Frequency Division Multiplexing (OFDM) is used
to cope with the multipath fading wireless channel.
OFDM is based on the Fast Fourier Transform (FFT), which is
computationally intensive, especially for a large number of
inputs. At the algorithm level, the complexity of the FFT is
O(N log N). As a result, baseband processors
must be equipped with a dedicated FFT processing unit that
is both fast and low in power consumption. Power is of primary
importance because of the mobility requirements of wireless
receivers and of many other handheld real-time signal
processing and imaging devices.
In this work, we have chosen a particularly low-power FFT
unit from the literature and implemented it in RTL. The FFT
unit is that of [1], which requires only 23 clock cycles to
compute and occupies only 6.8 mm² of core area. Compared to
other hardware FFT implementations, the work of [1] offers
the most attractive specifications for wireless communication
applications and for many other signal processing
applications.
Pipelining vs. Parallel Working
Note that in this architecture we distribute the work evenly
between pipelined and parallel units, rather
than sharing a single physical functional unit and leaving
everything else to a pipelining register bank.
A competing proposal for the FFT implementation integrates
only one butterfly unit and relies on a large register bank
to carry the pipelining workload. After discussion, we
rejected that proposal for two reasons.
(1) Thermal behavior: the scheme might work well for a
16-point FFT unit, but for a 64-point FFT a large share of the
workload falls on the pipelining unit alone, making it very
hot while the rest of the core stays cool. We consider this a
circuit design pitfall that should be avoided.
(2) Scalability: the workload and complexity of the
pipelining unit grow dramatically when the implemented IP
cores are later integrated to form more complicated
cores.
Computing Accuracy
Initially, our multiplications were carried out
with 16 bits of accuracy, the same word length as the
data passing through the FFT kernel. Simulations, however,
showed unsatisfactory errors in the core's output when compared
with the expected results from the MATLAB 7.0 simulator.
To address this, we doubled the bit length of each word
as it enters a complex multiplication block, and then
truncated the 32-bit word back to 16 bits before the output port.
With this mechanism, the accuracy turns out to be
quite satisfactory. Further demonstration is given in the
Testing and Verification chapter that follows.
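The benefit of widening the word inside the multiplier can be sketched in Python with a single shift-and-add constant multiplication. The constant 181/256 (an 8-bit approximation of cos(π/4)), the shift list, and the function names are assumptions made for illustration; the coefficient encodings of the actual core are not given here.

```python
# Assumed coefficient: 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 181/256 ~ 0.7070
SHIFTS = (1, 3, 4, 6, 8)

def mul_narrow(x):
    """16-bit datapath: every shifted term is truncated before the adds."""
    return sum(x >> s for s in SHIFTS)

def mul_wide(x):
    """32-bit datapath: extend the word, shift and add, truncate once."""
    wide = x << 16                        # 16-bit word extended to 32 bits
    acc = sum(wide >> s for s in SHIFTS)  # no bits are lost for shifts <= 16
    return acc >> 16                      # single truncation back to 16 bits

def max_abs_error(mul):
    """Worst-case error against the exact product over all 16-bit inputs."""
    return max(abs(x * 181 / 256 - mul(x)) for x in range(-32768, 32768))
```

For x = 255 the narrow path returns 176 and the wide path returns 180, against an exact product of about 180.29. With a single final truncation the error stays below one least significant bit, whereas the narrow path can accumulate one truncation error per shifted term.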
SUMMARY AND CONCLUSIONS
In this paper, we have described the design and
implementation of a serial 64-point FFT suitable for wireless
and modern signal/image processing applications. We
described the modular design at the register-transfer level (RTL),
and synthesized and optimized our modules using Design
Vision. We verified our design at several stages, namely at the
RTL level and post-synthesis. We used golden-model test
benches in which MATLAB generated valid
input/output vectors. The input vectors were then applied to
the FFT, and its output was compared with the expected output
vectors. Our processor passed the functionality tests with
more than 64,000 data points.
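The golden-model flow can be sketched in Python as follows. Everything here is a hypothetical stand-in: `golden_dft` plays the role of the MATLAB reference, `dut_model` replaces the RTL simulation with a simple rounding model, and the input range, vector count, and tolerance are assumed values, not the ones used in the actual test benches.

```python
import cmath
import random

def golden_dft(x):
    """Floating-point reference model (stand-in for the MATLAB golden model)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def dut_model(x):
    """Hypothetical device under test: reference output rounded to integers."""
    return [complex(round(v.real), round(v.imag)) for v in golden_dft(x)]

def run_testbench(num_vectors, tol, seed=0):
    """Apply random 64-point vectors and track the worst output deviation."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_vectors):
        x = [complex(rng.randint(-512, 511), rng.randint(-512, 511))
             for _ in range(64)]
        ref = golden_dft(x)
        got = dut_model(x)
        worst = max(worst, max(abs(a - b) for a, b in zip(ref, got)))
    return worst <= tol, worst
```

A call such as `run_testbench(4, 1.0)` returns a pass/fail flag together with the worst deviation observed; in a real flow the `dut_model` call would be replaced by reading back the simulator's output vectors.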
Our FFT chip operates well beyond the target frequency of
20 MHz and occupies only 2.0 mm² in a 0.18 μm process.
Once the serial data is in the FFT unit, only 23 clock cycles
are required to produce the output. Therefore, at the 68 MHz
operating frequency, the FFT can be computed in less than a
microsecond.
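The latency figure follows from dividing the cycle count by the clock frequency. The short Python check below uses the 23-cycle count, the 20 MHz target, and the 68 MHz operating frequency quoted in this paper; the function name is an assumption for illustration.

```python
def fft_latency_ns(cycles, f_clk_hz):
    """Compute latency in nanoseconds for a given cycle count and clock."""
    return cycles / f_clk_hz * 1e9

latency_at_68mhz = fft_latency_ns(23, 68e6)  # roughly 338 ns
latency_at_20mhz = fft_latency_ns(23, 20e6)  # roughly 1150 ns
```

The sub-microsecond figure therefore holds at 68 MHz, while at the 20 MHz target frequency the same 23 cycles would take slightly more than a microsecond.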