30-11-2012, 02:23 PM
An FPGA-Based Framework for Technology-Aware Prototyping of Multicore Embedded Architectures
ABSTRACT
The use of cycle-accurate software simulators as a foundation for exploring all possible full-system hardware–software (hw–sw) configurations no longer appears to be a feasible way to handle the complexity of modern embedded multicore systems. In this letter, a field programmable gate array (FPGA)-based cycle-accurate hardware emulation framework is presented and proposed as a research accelerator for the exploration of complete multicore systems. The framework makes it possible to extract, from the automatically instantiated hardware-emulated system, a set of metrics for assessing performance and evaluating architectural tradeoffs, as well as for estimating the power and area figures of a prospective application-specific integrated circuit (ASIC) implementation of the considered architecture.
INTRODUCTION
The prediction of the performance of modern multicore architectures requires an effective solution to the speed-accuracy tradeoff. Interest has recently shifted from well-established cycle-accurate full-system simulators to field programmable gate array (FPGA)-based hardware emulation platforms, whose trends in integration capability, speed, and price make them candidates for speeding up the exploration of large multicore architectures [3]. Moreover, to account for the variables related to low-level implementation already at the system/architectural level, the concept of "system-level design with technology-awareness" must be introduced. Detailed area, frequency, and power models can be used to back-annotate the architectural assumptions and the experimental results obtained by means of prototyping. This letter presents an FPGA-based framework for the emulation of complex and large multicore architectures that allows easy instantiation of the desired system configuration and automatically generates the hardware description files for FPGA synthesis and implementation. The prototyping results can be duly back-annotated using analytic models included in the framework, to evaluate a prospective application-specific integrated circuit (ASIC) implementation of the system.
RELATED WORK
To date, software cycle-accurate simulation has been the primary tool to allow collaborative hardware and software research [5].
However, for parallel software development, such approaches to simulation do not provide a practical speed-accuracy tradeoff.
A first approach aims at achieving the maximum simulation speed by raising the abstraction level of the described architecture. Simics [14] is one of the best-known full-system functional simulators. It offers the level of accuracy necessary to execute fairly complex binaries on the simulated machine, including operating systems. Cycle-accurate timing simulations can be performed by including custom modules that extend Simics through its set of application programming interfaces (APIs). GEMS [15] is a timing multiprocessor simulator built on top of the Simics library; it targets the simulation of a SPARC-based multiprocessor and its memory hierarchy. Extensions of Simics targeting the simulation of reconfigurable hardware processor extensions have been developed, as reported in [12]. MC-Sim [9] is a multiaccuracy software-based simulator in which the processing cores are simulated with functional accuracy, preserving the highest modularity (through the definition of specific APIs) to enable the possible addition of custom processor or cache models. The on-chip interconnection model included in MC-Sim, instead, supports timing simulation. The MC-Sim work [9] also presents a methodology for the automatic generation of fast (claimed 45x over RTL) C-based simulators for coprocessors from a high-level description. ReSP [4] is a TLM SystemC-based simulation platform that introduces automatically generated Python wrappers, which provide increased flexibility in terms of integration of new components and advanced simulation control capabilities. The Liberty [20] modeling framework emphasizes the reusability of components and the minimization of the specification overhead. The user specifies a structural system description that is automatically translated into a simulator executable.
The SHMPI Topology Builder
The SHMPI topology builder generates the actual RTL cores (processing, interconnection, memories) of the platform in a library-based approach (HDL instantiation), based on the system-level specification file input by the designer. The SHMPI topology builder includes the parsing engine of the Xpipes compiler, a tool developed for the automatic instantiation of application-specific interconnection networks [13]. New functions have been developed, enabling composition of the entire multicore hardware platform (including processing elements and memory hierarchy definition), automatic configuration of the software libraries, and integration with the Xilinx development tools.
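The library-based instantiation idea can be illustrated with a minimal Python sketch. It is not the SHMPI topology builder or the Xpipes parser: the specification syntax, the `LIBRARY` table, the module names (`proc_tile`, `mem_tile`, `noc_switch`), and the port list are all hypothetical, chosen only to show how a system-level description might be mapped onto prebuilt HDL modules.

```python
# Hypothetical sketch: none of these names come from the SHMPI tool itself.
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    kind: str  # "processor", "memory" or "switch"
    name: str  # instance name in the generated netlist


# Assumed library mapping a core kind to an HDL module name.
LIBRARY = {
    "processor": "proc_tile",
    "memory": "mem_tile",
    "switch": "noc_switch",
}


def parse_spec(text):
    """Parse lines of the form '<kind> <name>'; '#' starts a comment."""
    cores = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        kind, name = line.split()
        if kind not in LIBRARY:
            raise ValueError(f"unknown core kind: {kind}")
        cores.append(Core(kind, name))
    return cores


def emit_instances(cores):
    """Return one Verilog-style instantiation line per core."""
    return [f"{LIBRARY[c.kind]} {c.name} (.clk(clk), .rst(rst));" for c in cores]


# Illustrative specification: two processors, one switch, one memory.
SPEC = """
# two processors sharing one switch and one memory
processor p0
processor p1
switch s0
memory m0
"""

if __name__ == "__main__":
    for line in emit_instances(parse_spec(SPEC)):
        print(line)
```

A real flow would also have to emit the interconnect wiring and the software library configuration, which this sketch leaves out.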
SIMULATION SPEED AND ACCURACY ASSESSMENT
A point of interest in using FPGA-based emulators is certainly the speedup achievable over software-based cycle-accurate simulators. All the results related to functional and physical metrics shown in Fig. 3 and Fig. 2 are obtained, for each topology, with a single FPGA emulation. The time needed for application execution and performance data output is 0.8 s. This result is consistent with [10] and [21], where multicore FPGA-based emulators are assessed to be three orders of magnitude faster than software-based simulators, when not accounting for the time spent on the HW implementation flow. When the emulation platform is instead used inside a design space exploration cycle, a factor limiting the mentioned speedup is the time needed to traverse the whole FPGA implementation flow.
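How the implementation flow erodes the speedup can be made concrete with a rough Python sketch. The model is an assumption, not one taken from the letter: it treats "three orders of magnitude" as a factor of exactly 1000, uses the 0.8 s emulation time, and charges a one-off implementation cost of 3600 s, corresponding to the one-hour bound reported for topologies of up to eight processors. The function and parameter names are hypothetical.

```python
def effective_speedup(runs, t_emu=0.8, raw_speedup=1000.0, t_impl=3600.0):
    """Speedup of emulation over simulation for `runs` executions
    that all reuse a single FPGA implementation.

    t_emu       -- emulation time per run in seconds (0.8 s in the text)
    raw_speedup -- assumed simulator/emulator time ratio (1000 is an assumption)
    t_impl      -- one-off FPGA implementation time in seconds (assumed 1 h)
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    t_sim = raw_speedup * t_emu  # assumed software simulation time per run
    return (runs * t_sim) / (t_impl + runs * t_emu)


if __name__ == "__main__":
    for runs in (1, 5, 50, 500):
        print(runs, round(effective_speedup(runs), 2))
```

Under these assumed figures a single run per implementation is slower than simulation, and the emulator only pays off once several runs reuse the same implementation; the effective speedup approaches the raw factor as the number of runs grows.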
In Fig. 4, we provide an overview of how the FPGA implementation effort scales for regular quasi-mesh topologies with an increasing number of processors. The implementation flows have been performed with Xilinx ISE 10.1.3 on a dual-core AMD Opteron (@2.2 GHz) with 6 GB RAM. To shrink the synthesis time, we built a library of reusable presynthesized components with different parameter configurations. For topologies not larger than eight processors, the total FPGA implementation time does not exceed one hour. Thus, we can consider iterative optimization and exploration