02-05-2014, 02:50 PM
An Effective GPU Implementation of Breadth-First Search
ABSTRACT
Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
INTRODUCTION
The graphics processing unit (GPU) has become a popular, cost-effective parallel platform in recent years. Although parallel execution on a GPU can easily achieve speedups of tens or hundreds of times over straightforward CPU implementations, accelerating intelligently designed and well-optimized CPU algorithms is much more difficult. Therefore, when researchers report exciting speedups on GPUs, the results should be weighed against the best sequential algorithms rather than against naive CPU code.
OUR GPU SOLUTION
The BFS process propagates through the graph level by level. Within each level there are two types of parallelism: one is to propagate from all the frontier vertices in parallel, and the other is to search every neighbor of a frontier vertex in parallel. Since many EDA problems such as circuit simulation, static timing analysis, and routing are formulated on sparse graphs, we do not expect many neighbors for each frontier vertex. Therefore, our implementation of BFS exploits only the first type of parallelism, i.e., each thread is dedicated to one frontier vertex of the current level.
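The level-synchronous scheme can be sketched on the CPU as follows. This is a minimal Python illustration, assuming a CSR-style adjacency list (`row_ptr`, `col_idx`); the paper's hierarchical queue management and kernel arrangement are not reproduced here, and the function name is our own:

```python
def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous BFS over a CSR graph (simplified sketch).

    Each pass over `frontier` corresponds to one level of the GPU
    scheme: conceptually, one thread is dedicated to each frontier
    vertex and expands that vertex's neighbors into the next queue.
    """
    n = len(row_ptr) - 1
    level = [-1] * n          # -1 marks an unvisited vertex
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:    # on the GPU: one thread per frontier vertex
            for j in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[j]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# Small example graph: edges 0-1, 1-2, 0-3 in CSR form.
row_ptr = [0, 2, 4, 5, 6]
col_idx = [1, 3, 0, 2, 1, 0]
levels = bfs_levels(row_ptr, col_idx, 0)  # -> [0, 1, 2, 1]
```

Because all vertices of one level are expanded before any vertex of the next, this visits each edge once, matching the O(V + E) bound of the sequential algorithm.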
Overview of CUDA on the Nvidia GTX280
The NVIDIA GeForce GTX280 graphics processor is a collection of 30 multiprocessors, each with 8 streaming processors. The 30 multiprocessors share one off-chip global memory. Note that global memory is not cached, so it is very important to achieve memory coalescing. Memory coalescing happens when consecutive threads access consecutive memory locations; in this case, several memory transactions can be coalesced into one transaction. Within each multiprocessor there is a shared memory, common to all 8 streaming processors inside the multiprocessor. The shared memory is on-chip: accessing shared memory takes only about 2 clock cycles, compared with approximately 300 clock cycles for global memory.
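To make the coalescing rule concrete, the transaction count can be modeled by counting how many aligned memory segments a group of threads touches. This is a simplified illustration of the idea, not code from the paper; the helper name and the 128-byte segment size are our assumptions:

```python
def count_transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by a set of
    per-thread byte addresses. In this simplified model, each
    distinct segment costs one global-memory transaction."""
    return len({addr // segment_bytes for addr in addresses})

# 16 threads reading consecutive 4-byte words (thread t reads word t):
# all accesses fall in one 128-byte segment, so they coalesce.
coalesced = count_transactions([4 * t for t in range(16)])   # -> 1

# 16 threads reading with a stride of 32 words: every access lands in
# its own segment, so each thread costs a separate transaction.
strided = count_transactions([4 * 32 * t for t in range(16)])  # -> 16
```

This 16x difference in transaction count is why the queue layout in a GPU BFS must keep the addresses read by consecutive threads adjacent whenever possible.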