DSP Architecture Optimization in MATLAB/Simulink Environment
Acknowledgments
I am sincerely grateful to my advisor Professor Dejan Markovic, without whose
help and support this thesis could not have been written. His constant drive for
perfection has taught me a great deal, and I am truly indebted to him for the
patience he had with me on occasions when I faltered.
This work started out and has progressed based on Dejan’s idea of design-space
exploration via architectural transformations. I wish to thank my group
members Chia-Hsiang Yang and Victoria Wang for the invaluable feedback I
received from them when working on this research project. Chia-Hsiang Yang
designed the sphere decoder which I have used as a design driver example. My
thanks also go out to the new members in my group, Sarah Gibson, Cheng-Cheng
Wang and Vaibhav Karkare for their insightful comments and discussions in the
group meetings. Professor Mani Srivastava and Professor Miodrag Potkonjak
provided helpful reviews of this thesis which helped me refine its contents.
This work could not have progressed without the critical infrastructure support
provided by Synplicity Incorporated and Cadence Design Systems. I am grateful
to the Synplicity team for the tools they provided us and the training sessions
they conducted to help us learn them. I found the SourceLink database provided
by Cadence to be extremely useful for logic and physical synthesis flows.
I would like to thank Rohit for his unwavering support through the course of
this project. He has always had the utmost faith in my abilities, and has never
fallen short of words of encouragement, especially at times when I needed them
most. We have also had some very lively discussions when I was preparing for
the prelim examination. I am indebted to Nitesh Singhal, Abhishek Ghosh and
Bibhu Dutta Sahoo, who were always ready to discuss and clear any doubts I had
before the prelim examination.
Abstract
Architectural optimization has traditionally been a heuristic process involving
multiple iterations before the design converges to the desired specifications. Multiple
architectures are difficult to evaluate if RTL is written repeatedly for each
design. The process becomes tedious if the design fails to meet target specifications
and changes need to be made at the system level. This work aims to automate
the process of architecture selection and provide energy-area-performance
optimal solutions starting from the graphical timed data-flow MATLAB/Simulink
description of an algorithm.
Integrating functional blocks into a system architecture requires the most efficient
use of available resources for maximizing area-efficiency. In essence, this
task is accomplished with scheduling. Improved power efficiency requires that all
pipelines in a design be balanced, necessitating retiming at the micro-architecture
level. Parallelism is employed when building high-throughput or low-energy
systems. These architectural transformations have been implemented in the
MATLAB/Simulink environment using Integer Linear Programming (ILP) models.
This work proposes a modified ILP model which integrates scheduling and retiming,
providing a 33% average reduction in area-delay product compared to
existing ILP models which do not incorporate retiming. The proposed method also
achieves a 20× reduction in worst-case CPU runtime when compared to
existing ILP models which directly incorporate retiming with scheduling.
Optimization of complex structures is supported by hierarchically extending
circuit-level results from the underlying macros. The entire framework allows
comparison of various architectural solutions for a given algorithm in the
energy-area-performance space. The high-level block-diagram based description
in Simulink conveniently maps to FPGA or ASIC. The method is applicable to
dedicated DSP algorithms like Fourier transforms as well as hierarchical structures
with complex macros such as those in MIMO communications.
1.1 Motivation
Integrated circuit design is at an interesting juncture at present with continued
scaling of the underlying technology. The integration complexity has reached several
billion transistors, opening doors for the IC design market to capture a wide
variety of application domains. Cell phones, iPods, palmtops, and biomedical
instruments are only a few manifestations of this ever-increasing trend. However,
we are still faced with several challenges in the design process; one of which is
making a suitable architecture selection for a given algorithm. As systems become
more and more complex and constraints on energy, area and performance
become tighter, it is not possible to single out a particular architecture as being
always optimal. Area is no longer the only optimization metric; the advent of
portable, battery-operated devices has made low power consumption a more desirable
target. In fact, the choice of the optimal architecture is strongly dictated
by system specifications such as throughput, area and power consumption. A
simple FIR filter could have several different realizations depending on the system
it is being integrated with. For example, neural signal processing requires
very slow operation, at sample rates of only tens of kilohertz, in which
case the filter could be time-multiplexed. For a high-speed baseband processing
unit in a wireless LAN system, the same filter would have to be parallelized to
meet throughput requirements.
The aim of this work is to make it feasible for algorithm designers to analyze
the hardware cost associated with their algorithms in the energy-area-performance
space. This feedback helps designers in tuning their algorithms such
that system specifications can be met, and also in finding the optimal architecture
for a given algorithm. The choice of the optimal architecture is strongly dictated
by circuit-level energy-delay sensitivity results [1] of the underlying macros in the
design (Fig. 1.1). The interface between the micro-architectural level and the
circuit-level is not easy to navigate. This is because, if the design fails to meet
the target specifications at the circuit-level, then changes will have to be made at
the system level. This can mean major restructuring at the micro-architectural
level, like opting for a parallel design instead of a time-multiplexed one, which
then necessitates re-writing the RTL. The process becomes very tedious if this
design cycle re-iterates.
The answer to this problem lies in automating the generation of architectural
solutions (high-level models and RTL) for a given algorithm and also hierarchically
extrapolating circuit-level measurements like energy, area and delay for
these architectures from synthesis results of the underlying macros.
A tool which automates this flow will offer designers a convenient medium to
explore the solution space of possible architectures and also analyze the tradeoffs
in the energy-area-performance space. This work proposes such a tool embedded
in MATLAB/Simulink which is a very comprehensive and easy-to-use graphical
platform. SystemC-based modeling was an alternative to Simulink for the implementation
of this tool; however, that approach would involve C++ coding, which would
limit the tool's use to those adept in coding. Moreover, a graphical format
allows easy visualization and understanding of the architectural transformations
Figure 1.1: Algorithm-architecture-circuit-level interaction in the design-space.
as compared to when architectures are described in C code.
The reference architecture (direct-mapped version of the target algorithm)
has to be defined only once in Simulink using pre-existing or user-defined library
blocks. Using data-flow-graph models, this reference architecture is mathematically
modeled by a set of matrices. The matrix-based representation makes
it convenient to apply transformations like scheduling, retiming, pipelining and
parallelism. The transformed architectures are then converted back from matrices
to Simulink models, making it possible for designers to functionally verify
the new design. Modeling complex DSP kernels in a hierarchical manner is also
very convenient with this approach since data-flow-graph models are essentially
independent of the internal complexity of the processing kernel.
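To make the matrix view concrete, here is a minimal sketch (the three-node DFG, node names, and matrix layout are hypothetical, not the exact format used by the tool): the graph is captured by an adjacency matrix plus a parallel register-count matrix, and a Leiserson-Saxe retiming then reduces to an integer update of that matrix.

```python
# Hypothetical 3-node DFG: two multipliers feeding an adder (one FIR tap pair).
# Nodes: 0 = mult_a, 1 = mult_b, 2 = add
nodes = ["mult_a", "mult_b", "add"]

# Adjacency matrix A: A[u][v] = 1 if there is a data edge u -> v.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]

# Register-count matrix W: W[u][v] = number of delays on edge u -> v.
W = [[0, 0, 0],
     [0, 0, 1],   # one pipeline register between mult_b and the adder
     [0, 0, 0]]

# Retiming reduces to an integer update of W: for a retiming vector r,
# each edge u -> v gets w_r = w + r[v] - r[u] registers (Leiserson-Saxe).
r = [0, 0, 1]  # hypothetical retiming: push one register into each adder input
W_retimed = [[(W[u][v] + r[v] - r[u]) if A[u][v] else 0 for v in range(3)]
             for u in range(3)]

# A retiming is legal when every edge keeps a nonnegative register count.
assert all(W_retimed[u][v] >= 0 for u in range(3) for v in range(3))
print(W_retimed)  # -> [[0, 0, 1], [0, 0, 2], [0, 0, 0]]
```

Scheduling, pipelining, and parallelism operate on the same pair of matrices, which is why the representation stays independent of what each node computes internally.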
The optimization discussed in this work is not limited to making architectural
choices alone. Tools like Synplify DSP support automatic RTL generation from
Simulink descriptions. By synthesizing the RTL in backend tools like Cadence
or Synopsys, accurate circuit-level results for the generated architectures can
be obtained. Other refinements that are part of logic synthesis, such as
carry-save arithmetic and gate sizing, can also be integrated into this flow.
Depending on the delay slack available in the design after synthesis, Vdd scaling is
also employed to improve energy efficiency.
The goal of this work is to establish a complete framework for evaluating
architectures based on results generated through actual physical synthesis and
not just heuristics. For example, in the case of time-multiplexed architectures, the
energy overhead associated with control circuitry cannot be estimated accurately
at the high level. Logic synthesis of the final architecture, on the other hand,
gives a clearer picture of the degradation in energy efficiency associated with
time-multiplexing.
1.2 Overview of Previous Work
The problem of exploring architectural solutions in combination with lower level
circuit optimizations has received constant attention in the past few years. Designers
have always known that optimizing only at the circuit-level has marginal
gains, while a combination of architectural and circuit-level refinements produces
much improved results. This concept was clearly demonstrated in the paper by
Gemmeke et al. [2], where a design methodology for optimizing FIR filters was proposed.
The paper started out with a description of various architectural choices
like time-multiplexing and parallelism, and moved on to arithmetic-level improvements
like Booth recoding of the filter coefficients. The last level of optimization
suggested was at the circuit-level which described sizing of the transistors for
minimum power dissipation at the specified throughput. The approach outlined
in this paper targeted the FIR filter design-space; in this work we try to create a
design environment which automates these optimization techniques for a generic
DSP algorithm. The Hyper tool [3], developed at the University of California,
Berkeley, automates architectural transformations to enable efficient design-space
exploration. Given a flow-graph and a set of timing constraints, the tool generates
the most area-efficient solution. In this work, we explore the design-space of
possible architectures to find the solution which is jointly optimal in the
energy-area-performance space.
Mapping of Simulink models onto ASIC was introduced in [4]. The work in
[5] added optimization heuristics and applied them to a complex singular value
decomposition algorithm [6]. Although systematic, architecture tuning was manual,
making the process impractical or time-consuming for complex systems. This
work aims to automate the process in the convenient-to-use MATLAB/Simulink
graphical environment. The user selects tuning parameters which control the degree
of time-multiplexing, retiming, parallelism and pipelining at the Simulink level.
Based on these parameters the optimizer automatically generates the optimized
Simulink model. The core of this optimization process is in the efficient modeling
and transformation of the reference architecture.
Modeling of DSP algorithms as data-flow graphs (DFG) followed by scheduling
or retiming has received considerable attention in literature. Retiming was
first described by C.E. Leiserson and J.B. Saxe in 1983 [7], following which it
has perhaps become one of the most widely used techniques for improving delay
or lowering power consumption. The retiming problem is formalized using
Integer Linear Programming (ILP) models which are solved using branch and
bound techniques or shortest-path algorithms like Bellman-Ford [8]. A formal
description of the ILP model is discussed in the paper by K.N. Lalgudi et al. [9].
Retiming has been fairly well integrated in modern-day CAD tools (Synopsys,
Cadence retiming tools). However, synthesis tools can retime circuits only after
the structural netlist has been created once the technology mapping process is
complete. This gate-level granularity introduces a large number of retiming variables,
making the process very computationally inefficient for larger designs. For
example, a 512-point FFT could not be retimed at the gate-level in a span of 2.5
days when using the RC compiler tool from Cadence. Performing retiming at the
gate-level for multiple architecture solutions can become quite tedious in such
a situation. The solution lies in hierarchically decomposing the top level design
into smaller modules which can be retimed quickly. Extrapolating circuit-level
results from the smaller modules makes it possible to get fairly accurate estimates
of the critical path post-retiming for the top level design. From these estimates a
final design can be chosen based on system constraints. Hence, only the final
design would have to be synthesized. This approach for hierarchical retiming
was adopted in the high-level synthesis tool IRIS [10]. However, the scheduling
approach used in IRIS was not optimal, and it lacked support for pipelining
and parallelism.
Scheduling is the formal approach for time-multiplexing a finite set of operations
onto a core of processing elements. If the design is subject to speed
constraints, the scheduling algorithm will attempt to parallelize [11] the operations
to meet timing constraints. Conversely, if there is a limit on the cost (area
resources), the scheduler will serialize operations to meet resource constraints.
Scheduling thus determines the cost-speed tradeoffs of the design.
The scheduling problem is NP-hard in general. Various heuristic and formal
approaches to the scheduling problem have been covered extensively in literature.
Approaches like ASAP (As Soon As Possible scheduling) [13], ALAP (As Late As
Possible scheduling) [14], list scheduling [15], and MARS (Minnesota Architectural
Scheduling) [36] are quick but sub-optimal ways to generate a schedule. Integer
linear programming models can formalize the process and obtain global optimum
solutions. An integer programming model for synthesizing digital logic at the
register-transfer level (RTL) was formulated in [9]. The model gives detailed specifications
for data-path synthesis such as variable storage, operation precedence,
resource sharing, and control structures. Due to the complexity of the formulation,
this approach cannot be extended to complex structures. An integer programming
approach was also proposed for microcode scheduling in CATHEDRAL-II [16],
which is a synthesis engine for multiprocessor DSP systems. After a customized
data path has been synthesized and the high-level operations are mapped onto
a set of RTL operations, the microcode scheduling is performed. The model
contains data precedence, resource conflict, and controller pipelining constraints.
Since excessive CPU time is required to solve large problems, the model was
replaced by a graph-based scheduling algorithm [17] which used Integer Linear
Programming models. This approach formalized the scheduling problem and proposed
a way to achieve the most area- or throughput-efficient schedule. Retiming
of the original data-flow graph was not integrated in any of the above-mentioned
approaches to scheduling. It is shown later in this thesis (Chapter IV) that
retiming simultaneously with scheduling can result in more area and throughput
efficient schedules.
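For contrast with the ILP formulations discussed above, a heuristic like ASAP is a single topological pass: each operation starts as soon as all of its predecessors have finished. The following is a minimal sketch with a hypothetical four-operation DFG and made-up latencies, not a formulation from this thesis.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DFG: op -> predecessor ops, with per-op latencies in cycles.
preds = {
    "m1": [],            # multiplier 1
    "m2": [],            # multiplier 2
    "a1": ["m1", "m2"],  # adder combining the two products
    "a2": ["a1"],        # accumulator
}
latency = {"m1": 2, "m2": 2, "a1": 1, "a2": 1}

# ASAP: process ops in topological order; each op starts at the latest
# finish time among its predecessors (time 0 if it has none).
start = {}
for op in TopologicalSorter(preds).static_order():
    start[op] = max((start[p] + latency[p] for p in preds[op]), default=0)

print({op: start[op] for op in ["m1", "m2", "a1", "a2"]})
# -> {'m1': 0, 'm2': 0, 'a1': 2, 'a2': 3}
```

ALAP is the mirror image (each operation finishes just before its earliest successor must start), and the gap between the two schedules gives each operation's mobility, which list schedulers use as a priority.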
The benefit of retiming with scheduling to improve resource utilization
was investigated in [19]. An iterative-improvement probabilistic algorithm
similar to simulated annealing was developed in this work which applied transformations
like retiming, associativity and commutativity to improve the area of
the scheduled architecture. However, the issue of improving the throughput of
the scheduled architecture by supporting greater pipeline depth in the processing
elements via retiming was not addressed. Retiming with scheduling was also investigated
in [18] where a scheme to generate all possible schedules for any DFG
was proposed. However, this scheme works only on strongly connected graphs
(a DFG in which every node is part of a loop). It was also not clear in
this work how the most area- or throughput-efficient schedule was to be
extracted from the pool of solutions generated.
Retiming variables are unbounded in the integer space. Their inclusion in the
ILP model makes it practically impossible for the ILP to converge to a solution
for complex structures, when we attempt to minimize the area of the schedule.
In this work a new model has been developed for ILP scheduling which integrates
retiming without directly introducing the retiming variables in the ILP. Retiming
is done post-scheduling using the Bellman-Ford shortest-path algorithm [8], which
has polynomial time-complexity. This ensures that the optimum schedule is found
without increasing CPU runtime excessively. We show that simultaneous retiming
of the original DFG along with scheduling produces better results in terms of area
and throughput when compared to the results in [17].
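The post-scheduling Bellman-Ford step can be sketched as solving a system of difference constraints (a toy example with a hypothetical three-constraint system; the actual model also derives period constraints from the schedule): each retiming inequality r(u) - r(v) <= w becomes an edge in a constraint graph, and shortest-path distances from a virtual source yield a feasible integer retiming, with a negative cycle signaling infeasibility.

```python
# Hypothetical difference constraints from a retiming feasibility check:
# each triple (u, v, w) encodes r[u] - r[v] <= w.
constraints = [
    ("a", "b", 1),  # r[a] - r[b] <= 1
    ("b", "c", 0),  # r[b] - r[c] <= 0
    ("c", "a", 0),  # r[c] - r[a] <= 0
]
nodes = ["a", "b", "c"]

# Constraint graph: r[u] - r[v] <= w becomes an edge v -> u of weight w,
# plus 0-weight edges from a virtual source "s" to every node.
edges = [(v, u, w) for (u, v, w) in constraints] + [("s", n, 0) for n in nodes]

dist = {n: float("inf") for n in nodes}
dist["s"] = 0

# Bellman-Ford: relax all edges |V| - 1 times (the virtual source adds one
# vertex, so len(nodes) rounds suffice), then test for a negative cycle.
for _ in range(len(nodes)):
    for (u, v, w) in edges:
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
feasible = all(dist[u] + w >= dist[v] for (u, v, w) in edges)

# The shortest-path distances form a feasible integer retiming assignment.
r = {n: dist[n] for n in nodes}
print(feasible, r)  # -> True {'a': 0, 'b': 0, 'c': 0}
```

Because Bellman-Ford runs in O(|V||E|) time, this step stays polynomial, which is the property that keeps the unbounded retiming variables out of the ILP itself.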