22-06-2013, 03:21 PM
Improving the Performance of Shared Memory Communication in Impulse C
Abstract
With the evolution of field-programmable gate arrays
(FPGAs) to the million-gate scale, high-level languages are
gaining popularity in electronic system design, greatly improving
design and verification efficiency. Impulse C is a high-level
language widely used in software/hardware (SW/HW) codesign
that provides users with various SW/HW communication mechanisms.
However, the communication mechanisms of Impulse C are
designed mainly for versatility, and the resources within the FPGA
chip are not fully utilized. In this letter, we present an improved
implementation of the shared memory communication in Impulse
C that utilizes both ports of the dual-port BRAM. Experimental results
show that the improved implementation greatly improves
the performance of shared memory communication and further
improves the execution efficiency of hardware processes.
INTRODUCTION
With the development of deep submicron technology,
millions of gates can be integrated on a single field-programmable
gate array (FPGA) chip, making the design of large-scale
application systems on FPGAs possible. Flexible
software modules and high-performance hardware modules
are usually combined to implement sophisticated high-performance
embedded systems [1], [2]. Traditionally, FPGA-based
hardware modules are designed either by hardware description
languages, such as very high speed integrated circuit hardware
description language (VHDL) and Verilog HDL, or by
GUI-based approaches in which function units are specified by
functional blocks. These design methodologies have two major
problems. First, system designers are intensively involved in
the design process and proficiency in hardware description
languages is mandatory, so the methodologies cannot scale to
the design of complex application systems that typically utilize
millions of gates. Second, traditional design methodologies
can hardly meet the needs of software/hardware (SW/HW)
codesign and coverification.
THE IMPROVED SHARED MEMORY COMMUNICATION
In this section, we first present the original hardware architecture
of the shared memory communication of Impulse C.
Then we introduce the modifications to the hardware architecture of the
original shared memory communication mechanism.
The implementation of SMCI and its integration into
the CoDeveloper development environment are detailed as well.
A. The Original Hardware Architecture and Its Limitation
The original hardware architecture of the shared memory
communication of Impulse C is illustrated in Fig. 1. Both external
memory and internal BRAM serve as shared memory, and Impulse
C requires that all memory modules be connected to the
on-chip peripheral bus (OPB). If a software process running on the PowerPC core tries
to access the shared memory, it must traverse the processor local
bus (PLB), the OPB, and "Opb_BRAM_if_cntlr" sequentially (the
dashed line in Fig. 1). Hardware processes are custom IP cores
that act as PLB slaves and OPB masters. A hardware process
must use the shared memory access controller "Opb_dma" to
access shared memory via the OPB and "Opb_BRAM_if_cntlr"
(the dotted line in Fig. 1). The "Communication Interface" is
the connection between the PLB and the user logic. Two unidirectional
communication channels that implement the Impulse
C signal mechanism form a duplex channel
that synchronizes software processes and hardware processes.
Hardware processes obtain the base address of the shared memory via
a data stream.
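From the application designer's point of view, the synchronization and access pattern described above can be sketched as follows. This fragment is an illustrative sketch, not code from the paper; the process and channel names (`shared_mem`, `start_sig`, `done_sig`) are hypothetical, and it assumes the standard Impulse C shared-memory and signal API (`co_memory_readblock`, `co_memory_writeblock`, `co_signal_wait`, `co_signal_post`):

```c
/* Illustrative sketch (not from the paper): a hardware process that
 * waits for a signal from the software side, reads a block from the
 * shared memory (routed through Opb_dma / the OPB in the original
 * architecture), and posts a completion signal back. Names are
 * hypothetical; the calls follow the standard Impulse C API. */
#include "co.h"   /* Impulse C application library (vendor toolchain) */

#define BLOCK_BYTES 256

void hw_process(co_memory shared_mem,
                co_signal start_sig,
                co_signal done_sig)
{
    int32 status;
    int32 buf[BLOCK_BYTES / sizeof(int32)];

    co_signal_wait(start_sig, &status);        /* wait for SW process   */
    co_memory_readblock(shared_mem, 0,         /* offset in shared mem  */
                        buf, BLOCK_BYTES);
    /* ... compute on buf ... */
    co_memory_writeblock(shared_mem, 0, buf, BLOCK_BYTES);
    co_signal_post(done_sig, 1);               /* notify SW process     */
}
```

Because the improved controller described in this letter keeps the same `co_memory_*` interface, application code of this shape would not need to change to benefit from it.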
CONCLUSION AND FUTURE WORK
In this letter, we present an improved implementation of the
shared memory communication in Impulse C by utilizing the
dual ports of BRAM. A new shared memory interface controller
is designed and integrated into the CoDeveloper environment
in a way that is transparent to application designers. Experimental
results on the Xilinx PowerPC platform show that the shared
memory access performance and the execution efficiency of
hardware processes are greatly improved, with only a very
small resource overhead.
With the improved communication performance, the results
of HW/SW partitioning may be different (e.g., more components
could be implemented as HW since there is more memory
access bandwidth available). We would like to study the interplay
between communication performance and HW/SW partitioning
in our future work.