12-11-2012, 04:15 PM
A Multi-Core Sphere Decoder VLSI Architecture for MIMO Communications
A Multi-Core Sphere Decoder.pdf (Size: 351.36 KB / Downloads: 29)
Abstract
The sphere decoding algorithm finds applications in
multi-input multi-output (MIMO) decoding, because it achieves
near maximum likelihood (ML) detection performance with
significantly reduced computational complexity. Previous work
has focused on implementations based on K-best or depth-first
search, limiting the BER performance or the search speed. This
paper presents a scalable multi-core sphere decoder architecture
that can combine the advantages of the K-best and depth-first
search methods. The proposed architecture demonstrated a 3-5
dB improvement in the BER performance for 16×16 systems
using 16 processing elements (PEs) compared to the architecture
with one PE. An improved search speed of the multi-core
architecture also enables a 10× energy efficiency improvement
over the single core architecture for the same data rate.
INTRODUCTION
MIMO technology is able to increase the spectral efficiency
for high-speed wireless connectivity. Data streams
are transmitted and received through multiple transmit and
receive antennas, creating independently-faded replicas from
the same signal source as well as independent data streams
from multiple sources. By constructively combining these
signals at the receiver end, more reliable data transmission
and higher data rates can be achieved. For MIMO decoding,
ML detection provides the optimum BER performance, but
its exponential computational complexity makes it unattractive
for practical implementation. The sphere decoding algorithm
is able to achieve the ML detection performance with reduced
computational complexity by only examining the possible
solutions within the search radius. Theoretically, the sphere
decoding algorithm has cubic computational complexity with
respect to the number of antennas, which is much lower than
the exhaustive search with exponential complexity [1].
EXISTING SPHERE DECODER ARCHITECTURES
The sphere decoding algorithm can be used to reduce the
exponential problem of finding the ML solution to a polynomial
tree-search problem. The ML decoding is complex,
because in the worst-case it requires solving an integer leastsquares
problem. This is much harder than the standard leastsquares
problem, where the solution can be found by pseudoinverse
[1]. The ML decoding can be greatly simplified by
solving an equivalent problem of finding the shortest path in
a tree topology within a limited radius. In addition to MIMO
decoding, similar concepts apply to other problems such as
multi-user detection for CDMA systems [5].
Depth-First Algorithm
The depth-first algorithm [2][3] starts the search from the
root of the tree, Fig. 2(a), and explores as far as possible
along each branch. Then, it back-traces until either a leaf
node is found or the node is outside the search radius.
The numbers indicate the number of processing cycles. In
hardware implementation, depth-first is realized in a foldinglike
architecture, Fig. 2(b), because only one node is visited
at a time during the tree search process. A candidate list also
enumerates the next search node when the path is beyond
the search range or reaches the leaf node. Ideally, the search
process continues until all possible nodes within the search
radius are examined. The ML solution is guaranteed to be
found if the solution is within the initial search radius and the
whole search space is examined.
C. K-Best Algorithm
The K-best algorithm [4]-[8] approximates a breadth-first
search by keeping only K best branches (with the smallest
partial Euclidean distance) at each level. Figure 3 shows the
algorithm and the basic architecture of K-best. After M cycles,
the path with the smallest Euclidean distance is chosen as the
best estimate. To process K data paths at the same time, parallel
architecture is applied. Since no trace-back is needed, it
can be realized as a multi-stage architecture without feedback
loops, Fig. 3(b). In general, more than one candidate node
is examined for each branch, so sorting circuits are inserted
between stages to keep the K best branches. Strictly speaking,
K-best does not take the advantage of radius-shrinking of the
sphere decoding algorithm. Once the ML solution is lost in the
decoding process, it can never be recovered in the following
stages.
MULTI-CORE ARCHITECTURE
The scaling of CMOS has been enabling the integration
of more devices and higher-frequency operation. This benefit
can be exploited to increase the number of processors and
the number of processing cycles. In light of these advantages,
we propose a multi-core architecture to resolve the limitations
faced by the existing architectures. The proposed architecture
has multiple PEs to speed up the search as well as flexibility
to examine more solutions when more processing cycles
are allowed. It is, therefore, suitable for the next-generation
MIMO systems with large antenna array sizes. The basic idea
of the multi-core is to search multiple nodes within the search
radius to speed up the search of admissible solutions. Unlike
the parallel architecture reported in [10], each PE has the
ability to search the whole search space rather than only the
branches with a common ancestor node.
Low-Power Operation
The operation of the multi-core architecture is flexible to
improve error performance or power consumption according
to the size of the search space and the throughput requirement.
For a large search space resulting from larger antenna arrays
and constellation sizes or a higher throughput requirement,
the system objective is to examine as many nodes as possible.
Therefore, all PEs are operated at a high clock frequency to
speed up the search. Conversely, the multi-core architecture
can reduce power consumption by slowing down the clock and
distributing computation over several PEs for a small search
space. This can be viewed as system-level extension of the
concept of parallelism, which is a well known technique in
digital CMOS [16]. The power consumption of the two-PE
can be 64% lower than that of one-PE by taking advantage
of the supply voltage scaling [16]. More PEs allow a more
aggressive voltage scaling such that the power consumption
can be minimized. Table III lists the performance for different
numbers of PEs. The 16-PE architecture provides a 10×
energy efficiency (∝ Csw ·V 2
DD) improvement over the singlecore
architecture [17].
CONCLUSION
A scalable multi-core VLSI architecture for the sphere
decoding algorithm is presented. We proposed a tree-partition
search scheme based on SE enumeration to partition search
paths across multiple PEs. To coordinate all PEs, the choice
of interconnect network and required data-interleaving was
discussed. The architecture with 16 PEs improves the BER
performance over the single PE architecture by a 3-5 dB. It
could also operate in a low-power mode by slowing down the
operating frequency and lowering the supply voltage for a 10×
improvement in energy efficiency.