Multicore Architectures With Dynamically Reconfigurable Array Processors for Wireless Broadband Technologies

Abstract

Wireless Internet-access technologies have significant
market potential, particularly the Worldwide Interoperability for
Microwave Access (WiMAX) protocol, which can offer data rates
of tens of megabits per second. A significant demand for embedded
high-performance WiMAX solutions is forcing designers to seek
single-chip multicore systems that offer competitive advantages
in terms of all performance metrics, such as speed, power, and
area. Through the provision of a degree of flexibility similar to
that of a DSP and performance and power consumption advantages
approaching that of an application-specific integrated
circuit, emerging dynamically reconfigurable (DR) processors are
proving to be strong candidates for processing cores in future
high-performance multicore-processor systems. This paper presents
several new single-chip multicore architectures for the WiMAX
application based on recently emerging coarse-grained DR processor
cores. A simulation platform is proposed in order to explore
and implement various multicore solutions combining different
memory architectures and task-partitioning schemes. This paper
describes the different architectures, the simulation environment,
and several task-partitioning methods and demonstrates that
speedups of up to 7.3 and 12 times can be achieved by employing
eight and ten DR processor cores for the WiMAX transmitter and
receiver sections, respectively. A comparison with other WiMAX
multicore solutions is given in order to demonstrate that our best
solution delivers a high throughput at relatively low area cost.
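
To put the quoted speedups in perspective, the per-core parallel efficiency (speedup divided by core count) can be worked out directly from the abstract's figures. The following is a minimal sketch: the function name `parallel_efficiency` is an illustrative assumption, and only the speedup and core-count values come from the abstract.

```python
def parallel_efficiency(speedup, cores):
    """Return speedup per core; 1.0 would mean ideal linear scaling."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


# Figures quoted in the abstract: (speedup, number of DR processor cores).
reported = {"transmitter": (7.3, 8), "receiver": (12.0, 10)}

for section, (speedup, cores) in reported.items():
    eff = parallel_efficiency(speedup, cores)
    print(f"{section}: {speedup}x on {cores} cores -> efficiency {eff:.2f}")
```

The transmitter figure corresponds to an efficiency of about 0.91, while the receiver figure corresponds to 1.20. An efficiency above 1 means the gain exceeds what the core count alone would give; the abstract does not break down where that extra gain comes from.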

INTRODUCTION

WiMAX (Worldwide Interoperability for Microwave Access), the
commercial name of the IEEE 802.16 family of standards [1], can
provide symmetric bandwidth of up to tens of megabits per second
over many kilometers.
This gives WiMAX a significant advantage over other
alternatives like Wi-Fi and digital subscriber line. Applications
such as WiMAX demand high performance, strict low power,
and in-field reprogrammability in order to follow the evolving
standard. Traditional single-core architectures fall short of
meeting all these requirements: in the past few years, increasing
the operating frequency (made possible via deeper pipelining) and
exploiting instruction-level parallelism (ILP) have yielded
diminishing returns in processor performance rather than great
gains. It is known that the development of
single-core processors hits three walls: memory wall, ILP wall,
and power wall [2]. Therefore, there is a need for distributing
the processing load of complicated applications over multiple
processors. Meanwhile, continuously shrinking process technology
allows for more transistors to be integrated into one
single chip. As a result, computer architects are able to build
more complicated designs such as multicore processors.

WiMAX Implementations

Today, many WiMAX implementations have emerged based
on different technologies including ASICs, field-programmable
gate arrays (FPGAs), and multicore processors. Wavesat
DM2563 [12] is a high-performance ASIC solution, while
Altera provides FPGA solutions for WiMAX based on Stratix II
FPGAs [13]. In our previous work [14], a multitasked version
of the WiMAX PHY has been mapped onto a single DR architecture
with a real-time operating system (RTOS), μC/OS-II.
As for multicore solutions, Intel’s NetStructure WiMAX
Baseband Card is based on the Intel IXP2350 network processor
which has four integrated programmable microengines and
one Intel XScale core [15]. Both Freescale and picoChip use the
multicore DSP approach. The Freescale MSC8126 multicore DSP [16]
consists of four SC140 extended cores and a Viterbi coprocessor.
PC102 [17] (from picoChip) is a tiled architecture
which consists of 308 16-b heterogeneous processors in
three different variants and various hardware accelerators. Each
processor is connected to a deterministic interconnect associated
with a time-division-multiplexing-based interprocessor
communication protocol. In [18], the PC7218 reference design
employs two PC102 chips for fixed WiMAX PHY processing.
The authors of [19] implemented WiMAX on a Cell processor
where only five SPEs were used with the other three idle. In
[20], WiMAX is ported to a Sandbridge SB3010 processor
which integrates one ARM9 reduced instruction set computer
(RISC) core and four Sandblaster DSP cores.

RELATED WORK

Multicore Architectures in the Embedded Field

In embedded systems, multicore architectures have begun
playing an increasingly important role. Many multicore-based
processors including DSPs and microcontrollers have emerged,
such as Cell Broadband Engine Architecture [9] and Ambric
[10]. A Cell processor is composed of one PowerPC processor
element as a controller and eight synergistic processor elements
(SPE) for most of the computational workload. The Cell processor
aims to greatly accelerate multimedia and vector-processing
applications. The Ambric chip consists of 360 32-b processors
and 360 1-kB SRAM banks. It is composed of a repeated basic
building block which contains eight processors and 13 kB of
SRAM. All processors communicate through first-in, first-out (FIFO)
buffer-like channels which are unidirectional and point to point.
Until now, there have been few multicore-processor projects
based on coarse-grained DR processors. One such example
is the work on developing a multicore processor based on
the Montium tile processor (TP) from Recore Systems [11].
However, to the best of our knowledge, currently, Montium TP
does not support development from high-level languages, and
its homogeneous processing-part array is not efficient in terms
of hardware-resource occupancy.
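
The unidirectional, point-to-point FIFO channels mentioned above for the Ambric chip can be illustrated with a small software model. This is only a sketch of the general idea, not the Ambric hardware protocol: the class name `FifoChannel`, its methods, and the capacity of 2 are all assumptions chosen for illustration.

```python
from collections import deque


class FifoChannel:
    """Unidirectional, point-to-point channel with a bounded FIFO buffer."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._buffer = deque()
        self._capacity = capacity

    def try_send(self, word):
        """Producer side: enqueue a word; return False if the channel is full."""
        if len(self._buffer) >= self._capacity:
            return False  # the producer would have to stall and retry
        self._buffer.append(word)
        return True

    def try_receive(self):
        """Consumer side: dequeue the oldest word, or None if the channel is empty."""
        if not self._buffer:
            return None  # the consumer would have to stall and retry
        return self._buffer.popleft()


# One producer and one consumer share a single channel (hypothetical capacity).
channel = FifoChannel(capacity=2)
accepted = [channel.try_send(w) for w in (10, 20, 30)]  # third send is refused
received = [channel.try_receive() for _ in range(3)]    # third receive finds nothing
print(accepted, received)
```

Because each channel has exactly one writer and one reader, ordering is preserved without any arbitration, and a full or empty buffer simply stalls the corresponding side.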

Custom-Cell Support

The existing DR tool flow provides full support for the inclusion
of both combinatorial and synchronous custom instruction
cells through the simulator libraries. As shown in the custom-cell
generation environment of Fig. 3, the function descriptions
of the custom cells such as MULTBK_REG_FILE and SBUF
cells are written in C++ via template classes provided by
the simulator. A fully automated system generator compiles
the standard simulator libraries together with the custom cell
C++ model and the timing and area information attained by
synthesizing the custom-cell Verilog model. A custom MDF
and a custom simulator are generated to replace the standard
MDF and simulator used in the standard tool flow. Both shared
register files and stream buffers have been synthesized with
Faraday memory compiler Memaker using the UMC 0.18-μm
process technology on which the test DR chip is based. The
generated timing, area, and power information is used in the
DR tool flow for scheduling and simulation purposes.

MAPPING METHODOLOGY

For running multiple tasks on the MRPSIM simulator,
the mapping methodology shown in Fig. 8 is developed.
The methodology extends the work in [8] and incorporates
profiling-driven task partitioning, task transformation, loop-level
partitioning, and memory-architecture-aware data mapping
in order to reduce the overall system execution time. The
methodology also allows the designer to explore the different
implementations on the proposed multicore-architecture platform.
A task-level interface is added to ease multiprocessing
programming. In [8], both the WiMAX transmitter and receiver
were considered together in the partitioning process. However,
in this paper, a more practical approach is followed in which a
full-duplex mode is used and the transmitter and receiver are
partitioned separately. In addition, in this paper,
the profiling-driven partitioning is extended to support heterogeneous
architectures as well as homogeneous architectures.
Therefore, the resource mix for each processing core in the
system is allowed to differ and may be tailored to the particular
tasks that it is intended to execute. Through this method, the
multicore processor model can support both homogeneous and
heterogeneous DR-processor-based multicore architectures.
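
The idea of profiling-driven task partitioning can be sketched with a generic load-balancing heuristic. The code below uses greedy longest-processing-time-first assignment, which is a standard technique swapped in for illustration and not necessarily the method used in the paper. The task names and cycle counts are hypothetical, and the sketch ignores communication cost and pipeline ordering.

```python
import heapq


def partition_tasks(profile, num_cores):
    """Assign tasks to cores, always giving the next-heaviest task to the
    least-loaded core (longest-processing-time-first heuristic).

    profile:   dict mapping task name -> profiled execution cycles
    num_cores: number of processing cores
    Returns (assignment, loads): per-core task lists and per-core cycle totals.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    heap = [(0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_cores)]
    loads = [0] * num_cores
    # Heaviest tasks first; ties are broken by name for determinism.
    for task, cycles in sorted(profile.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core = heapq.heappop(heap)
        assignment[core].append(task)
        loads[core] = load + cycles
        heapq.heappush(heap, (loads[core], core))
    return assignment, loads


# Hypothetical profiling results in cycles (illustrative values only).
profile = {
    "channel_coding": 900,
    "interleaving": 300,
    "mapping": 200,
    "ifft": 1200,
    "cyclic_prefix": 100,
}
assignment, loads = partition_tasks(profile, num_cores=2)
print(assignment, loads)
```

With these made-up numbers the two cores end up with 1400 and 1300 cycles. The slowest core bounds the throughput, which is why a partitioning step aims to minimize the maximum per-core load.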

TASK-DRIVEN ARCHITECTURAL CUSTOMIZATION

After mapping tasks on different processing cores, an architectural
customization can be performed on each core in terms
of customizing the instruction-cell array and the memory architecture,
according to the assigned tasks. First, when the workload
is balanced, an instruction-cell-array customization can remove
redundant instruction cells to optimize area. This customization
has only a trivial impact on the execution time and, hence,
preserves the balance between processing cores. For example,
there are no multiplication
operations required by those functions in the channel-coding
task. Therefore, it is unnecessary to keep costly multiplier cells
in the array for the channel-coding task. In this case, area
optimization aims to maximize area efficiency without breaking
the workload balance or worsening the throughput. Second, an
instruction-cell-array customization can introduce some new
instruction cells to support user-customized operations, such
as the MULTBK_REG_FILE cell used for accessing shared register
files. Table VIII provides the configuration of each core
customized for WiMAX tasks.
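
The two customization steps described above (dropping unused instruction cells and adding user-defined ones) can be sketched as a simple filter over a cell inventory. This is a minimal illustration under stated assumptions: the function `customize_cell_array`, the cell-type names other than MULTBK_REG_FILE, the cell counts, and the per-task requirements are all hypothetical and do not reproduce Table VIII.

```python
def customize_cell_array(base_array, task_ops, assigned_tasks, custom_cells=()):
    """Keep only the instruction cells that the assigned tasks use.

    base_array:     dict cell type -> number of cells in the standard array
    task_ops:       dict task name -> set of cell types the task needs
    assigned_tasks: tasks mapped onto this core
    custom_cells:   extra user-defined cell types to add (one of each)
    """
    needed = set()
    for task in assigned_tasks:
        needed |= task_ops[task]
    # Step 1: remove cell types that no assigned task requires.
    array = {cell: n for cell, n in base_array.items() if cell in needed}
    # Step 2: add custom cells for user-defined operations.
    for cell in custom_cells:
        array[cell] = array.get(cell, 0) + 1
    return array


# Hypothetical standard array and per-task requirements (illustrative values).
base = {"ADD": 4, "MUL": 2, "SHIFT": 2, "LOGIC": 2, "MEM": 2}
ops = {
    "channel_coding": {"ADD", "SHIFT", "LOGIC", "MEM"},  # no multiplications
    "ifft": {"ADD", "MUL", "SHIFT", "MEM"},
}
coding_core = customize_cell_array(
    base, ops, ["channel_coding"], custom_cells=["MULTBK_REG_FILE"]
)
print(coding_core)
```

In this sketch the core running channel coding loses its multiplier cells and gains a shared-register-file access cell, mirroring the example given in the text.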

CONCLUSION

This paper has proposed several multicore processor architectures
based on coarse-grained DR processors, targeting
WiMAX applications. A flexible multicore-processor simulation
platform has been developed in order to evaluate various
multicore solutions that combine different task-partitioning
strategies and memory architectures. This simulation platform
includes a SystemC-based trace-driven multiprocessor
simulator and a profiling-driven mapping methodology. The
simulator provides a diversity of architectural parameters
and performance-targeted information, delivers fast simulation
speeds, and maintains timing accuracy. The utilized methodology
incorporates task partitioning, task transformation, and
memory-architecture-aware data mapping. Three task-partitioning
methods, including task merging, task replication,
and loop-level partitioning, were addressed and used for a
BPSK-based fixed WiMAX application.
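
Of the three partitioning methods named above, loop-level partitioning is the easiest to illustrate: the iterations of one loop are divided into contiguous chunks, one per core. The following is a minimal sketch assuming independent iterations; the function name `split_loop` and the iteration and core counts are hypothetical.

```python
def split_loop(n_iterations, n_cores):
    """Divide iterations 0..n_iterations-1 into contiguous, near-equal chunks.

    Returns one range per core; the first (n_iterations % n_cores) cores
    receive one extra iteration each.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if n_iterations < 0:
        raise ValueError("n_iterations must be non-negative")
    base, extra = divmod(n_iterations, n_cores)
    chunks = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Hypothetical example: 10 loop iterations spread over 4 cores.
for core, chunk in enumerate(split_loop(10, 4)):
    print(f"core {core}: iterations {list(chunk)}")
```

Keeping each chunk contiguous means a core works on one block of data, which tends to simplify the data mapping onto local memories.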