09-10-2012, 10:48 AM
Event-Driven Gate-Level Simulation with GP-GPUs
ABSTRACT
Logic simulation is a critical component of the design tool flow
in modern hardware development efforts. It is used widely – from
high-level descriptions down to gate-level ones – to validate several
aspects of the design, particularly functional correctness. Despite
development houses investing vast resources in the simulation task,
particularly at the gate level, it is still far from meeting the performance
demands of validating complex modern designs.
In this work, we propose the first event-driven logic simulator
accelerated by a parallel, general-purpose graphics processor (GP-GPU).
Our simulator leverages a gate-level event-driven design to
exploit the benefits of the low switching activity that is typical of
large hardware designs. We developed novel algorithms for circuit
netlist partitioning, optimized for a highly parallel GP-GPU
host. Moreover, our flow is structured to extract the best simulation
performance from the target hardware platform. We found that
our experimental prototype could handle large, industrial-scale designs
comprising millions of gates and deliver a 13x speedup on
average over current commercial event-driven simulators.
INTRODUCTION
Logic simulation is the validation workhorse of modern digital
designs. It is used to verify designs at the behavioral level, as well
as the structural level, ensuring that a synthesized circuit’s netlist
matches the functionality and timing of the behavioral model. Structural
netlists are particularly cumbersome for simulation because
of their low-level specification and the fine granularity of the structural
definition, which consists of gate primitives in the target technology
library. It is typical for design houses to invest the computational
power of large simulation “farms” to complete as many simulation
cycles as possible before final design tapeout. However, even
with such investment in today’s development efforts, large portions
of a design go unverified. The result is unforeseen bugs that are released
into silicon, which may have drastic impacts, ranging from
silicon respins to market recalls.
The root cause of this situation lies in the vast complexity of
modern designs (several million gates) and the fact that the performance
of commercial logic simulators is inversely proportional to
design size. In addition, the technology in commercial logic simulators
today is fairly mature.
Contributions
In this work, we present the first event-driven GPU-based logic
simulator, which leverages GPUs’ massive parallelism to achieve
large performance speedups over commercial logic simulators. Our
solution builds on a novel macro-gate segmentation algorithm, designed
specifically to benefit from the CUDA architecture. A macro-gate
comprises several interconnected gates of the original netlist.
The generated macro-gates cover the entire circuit's
netlist; they are compiled into a suitable data structure and transferred
to the GPU’s memory. During simulation, those macro-gates
that require simulation because their input values have changed,
are tagged for execution and handed over to the CUDA’s low-level
scheduler. We developed a prototype of our simulator and applied
it to a range of designs.
RELATED WORK
Research on logic simulators bloomed in the 1980s, when the
concepts of circuit netlist compilation, oblivious and event-driven
simulation were first explored [6, 3, 14, 2]. In particular, [2] provides
a comparative analysis of early attempts to parallelize event-driven
simulation by dividing the processing of individual events
across multiple machines with fine granularity. This fine granularity
would generate a high communication overhead and, depending
on the solution, the issue of deadlock avoidance could require specialized
event handling. Parallel logic simulation algorithms were
also proposed for distributed systems [16, 15] and multiprocessors
[12]. In these solutions, individual execution threads would operate
on distinct netlist clusters and communicate in an event-driven
fashion, with a thread being activated if switching activity was observed
at the inputs of its netlist cluster. Both conservative [7, 17,
10] and speculative techniques, such as time warp [5, 4], were proposed
to handle synchronization in these discrete event algorithms.
Today, several commercial simulators building on these concepts
are available: they execute on a single CPU and adopt aggressive
compiled-code optimization techniques to boost their performance.
In addition, specialized hardware solutions (emulation systems)
have also been explored to boost simulation performance. These
systems typically consist of several identical hardware units connected
together, with units optimized for the simulation of small
logic blocks. To emulate a circuit netlist, a “compiler” partitions
the netlist into blocks and then loads each block into separate units
[9, 1, 13]. Modern emulators can deliver 3-4 orders of magnitude
speedup and they can handle very large designs. However, their
cost is prohibitive and the process of successfully mapping
a netlist to an emulator can take up to a few months.
OVERVIEW
Our event-driven CUDA-accelerated simulator first applies a com-
pilation phase, during which it transforms the netlist to leverage
the raw performance of the target architecture. This is followed by
a simulation phase where the compiled netlist is uploaded to the
GPU co-processor and one or more simulations may be executed
with different input testbenches.
The compilation phase is responsible for segmenting a large monolithic
netlist into blocks amenable to simulation by individual execution
units within the GPU. This entails partitioning the netlist
into macro-gates (sets of connected gates of suitable size),
optimizing the logic within each macro-gate, and finally
producing the data structures and the CUDA programs necessary
to carry out the simulation. During simulation, both program
and data reside on the GPU. Testbenches can be implemented using
many different solutions; if they are encoded in a CUDA program
(possibly with associated stimuli data), then the simulation can be
completely offloaded from the host with direct performance benefits.
If the testbench resides on the host, control alternates between
host and GPU to simulate and generate stimuli.
Segmentation into macro-gates
To exploit the parallelism available in the GPU, we must segment
the gate-level netlist into several logic blocks (called macro-gates),
and assign the simulation of each macro-gate to a distinct CUDA
multiprocessor. During simulation, we maintain a sensitivity list of
nets at the inputs of each macro-gate: if any net in a sensitivity list
changes value, then the corresponding macro-gate will be affected
by the change and must be simulated (i.e., activated). Otherwise, the
macro-gate can be skipped during the current cycle.
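The activation check can be sketched as follows. This is an illustrative Python model (the actual simulator runs on CUDA), and the data layout and function name are hypothetical stand-ins:

```python
# Sketch of macro-gate activation via sensitivity lists.
# MacroGates are modeled as dicts; names are illustrative, not the paper's API.

def find_active(macro_gates, changed_nets):
    """Return the macro-gates whose sensitivity list intersects the
    set of nets that changed value in the previous cycle."""
    changed = set(changed_nets)
    active = []
    for mg in macro_gates:
        # mg["sensitivity"]: the nets feeding this macro-gate's inputs
        if changed & set(mg["sensitivity"]):
            active.append(mg["name"])
    return active

gates = [
    {"name": "mg0", "sensitivity": {"a", "b"}},
    {"name": "mg1", "sensitivity": {"c"}},
    {"name": "mg2", "sensitivity": {"d", "e"}},
]
print(find_active(gates, {"b", "e"}))  # only mg0 and mg2 are activated
```

Macro-gates absent from the returned list are simply skipped for the cycle, which is where the low-switching-activity savings come from.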
In determining how to partition the netlist into macro-gates, we
took into consideration several factors: (i) the time required to simulate
a macro-gate should be greater than the overhead of determining
which macro-gates to simulate; (ii) CUDA’s multiprocessors
can only communicate through device memory, thus macro-gates
should not share data. To this end, we occasionally duplicate small
portions of logic, so that each macro-gate can compute the value of
its outputs independent of other concurrent macro-gates. Finally,
(iii) we want to avoid cyclic dependencies between macro-gates, so
as to simulate each macro-gate at most once per cycle.
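The duplication described in point (ii) can be illustrated with a cone-based sketch in Python; the cone extraction below is a simplified stand-in for the paper's segmentation algorithm, and `fanin_cone` and the netlist encoding are assumptions:

```python
# Illustrative cone-based segmentation with logic duplication.
# Each macro-gate is the fan-in cone of one output; a gate that feeds
# two cones is duplicated in both, so macro-gates never share
# intermediate data and can be simulated independently.

def fanin_cone(netlist, out):
    """Collect every gate in the transitive fan-in of `out`."""
    cone, stack = set(), [out]
    while stack:
        g = stack.pop()
        if g in cone or g not in netlist:
            continue  # primary inputs have no driver entry
        cone.add(g)
        stack.extend(netlist[g])  # the inputs of gate g
    return cone

# netlist: gate -> list of its driving nets/gates
netlist = {"n1": ["a", "b"], "n2": ["b", "c"],
           "o1": ["n1", "n2"], "o2": ["n2", "d"]}
cones = {o: fanin_cone(netlist, o) for o in ("o1", "o2")}
# n2 appears in both cones: it is duplicated rather than shared
assert "n2" in cones["o1"] and "n2" in cones["o2"]
```

Duplicating `n2` costs a small amount of extra work but removes any need for inter-multiprocessor communication through device memory during a cycle.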
Macro-gate balancing
Each macro-gate is designed to be simulated in a single CUDA
multiprocessor. Because our lowest-level primitives are basic logic
gates, we designed our CUDA simulation program so that the execution
threads simulate all the gates in the same level, then move
on to the next level, and so on, until an entire macro-gate has been
simulated. Thus, a macro-gate's simulation time is directly proportional
to its number of levels (its gap). However, the segmentation procedure tends to generate
macro-gates with a large base (many gates) and a narrow tip.
Correspondingly, many threads are active in the lower levels,
and just a few in the top levels.
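The level-by-level evaluation scheme can be sketched as follows; the sequential inner loop stands in for the CUDA threads of one multiprocessor, and the gate encoding is a hypothetical simplification:

```python
# Minimal sketch of levelized macro-gate evaluation: all gates on one
# level are evaluated "in parallel" (a loop standing in for CUDA
# threads) before moving to the next level, which acts as a barrier.

OPS = {"AND": lambda a, b: a & b,
       "OR":  lambda a, b: a | b,
       "XOR": lambda a, b: a ^ b}

def simulate_levels(levels, values):
    """levels: list of lists of (out, op, in1, in2) tuples, levelized so
    that every gate's inputs are computed on strictly earlier levels."""
    for level in levels:
        # one CUDA thread per gate on this level
        for out, op, a, b in level:
            values[out] = OPS[op](values[a], values[b])
    return values

levels = [[("n1", "AND", "a", "b"), ("n2", "OR", "b", "c")],
          [("o1", "XOR", "n1", "n2")]]
v = simulate_levels(levels, {"a": 1, "b": 1, "c": 0})
print(v["o1"])  # n1=1, n2=1 -> o1=0
```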
To maximize concurrency throughout the simulation, we optimize
each macro-gate individually with a balancing step, as outlined
in the schematic of Figure 4. This is the last step of the
compilation phase: it exploits the slack available in the levelization
within each macro-gate and restructures macro-gates to have
approximately the same number of logic gates in each level. As a
result, a smaller number of threads will be required to simulate the
base of the macro-gate. Note that it is always possible to “shrink”
the size of the base, at the price of an increased gap.
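A minimal sketch of the balancing idea follows, assuming each gate carries a slack value (how many levels it may be delayed without violating its dependencies). The greedy policy and data layout are illustrative assumptions, not the paper's exact algorithm:

```python
from collections import defaultdict

# Hedged sketch of macro-gate balancing: gates on overfull levels that
# still have slack are delayed to later levels, narrowing the base.

def balance(level_of, slack, width):
    """level_of: gate -> current (earliest) level; slack: gate -> how
    many levels it may be delayed; width: target gates per level."""
    by_level = defaultdict(list)
    for g, lv in level_of.items():
        by_level[lv].append(g)
    max_lv = max(by_level)
    for lv in range(max_lv):
        while len(by_level[lv]) > width:
            movable = [g for g in by_level[lv] if slack[g] > 0]
            if not movable:
                break  # no slack left: the level stays wide
            g = movable[0]
            by_level[lv].remove(g)      # delay g by one level
            by_level[lv + 1].append(g)
            slack[g] -= 1
    return {lv: sorted(g) for lv, g in by_level.items()}

levels = balance({"g1": 0, "g2": 0, "g3": 0, "g4": 1},
                 {"g1": 1, "g2": 0, "g3": 0, "g4": 0}, width=2)
print(levels)  # g1 has slack, so it is delayed: {0: ['g2', 'g3'], 1: ['g1', 'g4']}
```

After balancing, each level needs roughly `width` threads, so a fixed thread allocation stays well utilized from base to tip.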
CONCLUSIONS
In this work, we have presented a novel event-driven simulator
architecture that leverages the high level of parallelism of general-purpose
GPUs. By extracting parallelism in the simulation of gate-level
netlists, we are able to realize a 13x speedup over traditional
sequential simulators, on average.