02-05-2014, 02:50 PM
An Effective GPU Implementation of Breadth-First Search
ABSTRACT
Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
INTRODUCTION
The graphics processing unit (GPU) has become a popular, cost-effective parallel platform in recent years. Although parallel execution on a GPU can easily achieve speedups of tens or hundreds of times over straightforward CPU implementations, accelerating intelligently designed and well-optimized CPU algorithms is much more difficult. Therefore, when researchers report exciting speedups on GPUs, the results should be weighed against the best sequential algorithms rather than against naive CPU code.
OUR GPU SOLUTION
The BFS process propagates through the graph level by level. Within each level there are two types of parallelism: one is to propagate from all the frontier vertices in parallel, and the other is to search every neighbor of a frontier vertex in parallel. Since many EDA problems such as circuit simulation, static timing analysis, and routing are formulated on sparse graphs, we do not expect many neighbors for each frontier vertex. Therefore, our implementation of BFS exploits only the first type of parallelism, i.e., each thread is dedicated to one frontier vertex of the current level.
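The level-synchronous scheme can be sketched on the CPU as follows. This is a minimal Python illustration, assuming a CSR-style adjacency list (`row_ptr`, `col_idx`); the paper's hierarchical queue management and kernel arrangement are not reproduced here, and the function name is our own:

```python
def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous BFS over a CSR graph (simplified sketch).

    Each pass over `frontier` corresponds to one level of the GPU
    scheme: conceptually, one thread is dedicated to each frontier
    vertex and expands that vertex's neighbors into the next queue.
    """
    n = len(row_ptr) - 1
    level = [-1] * n          # -1 marks an unvisited vertex
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:    # on the GPU: one thread per frontier vertex
            for j in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[j]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# Small example graph: edges 0-1, 1-2, 0-3 in CSR form.
row_ptr = [0, 2, 4, 5, 6]
col_idx = [1, 3, 0, 2, 1, 0]
levels = bfs_levels(row_ptr, col_idx, 0)  # -> [0, 1, 2, 1]
```

Because all vertices of one level are expanded before any vertex of the next, this visits each edge once, matching the O(V + E) bound of the sequential algorithm.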
Overview of CUDA on the Nvidia GTX280
The NVIDIA GeForce GTX280 graphics processor is a collection of 30 multiprocessors, each with 8 streaming processors. The 30 multiprocessors share one off-chip global memory. Note that global memory is not cached, so it is very important to achieve memory coalescing. Memory coalescing happens when consecutive threads access consecutive memory locations; in this case, several memory transactions can be coalesced into one transaction. Within each multiprocessor there is a shared memory, common to all 8 streaming processors inside the multiprocessor. The shared memory is on-chip: accessing shared memory takes only about 2 clock cycles, compared with approximately 300 clock cycles for global memory.
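To make the coalescing rule concrete, the transaction count can be modeled by counting how many aligned memory segments a group of threads touches. This is a simplified illustration of the idea, not code from the paper; the helper name and the 128-byte segment size are our assumptions:

```python
def count_transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by a set of
    per-thread byte addresses. In this simplified model, each
    distinct segment costs one global-memory transaction."""
    return len({addr // segment_bytes for addr in addresses})

# 16 threads reading consecutive 4-byte words (thread t reads word t):
# all accesses fall in one 128-byte segment, so they coalesce.
coalesced = count_transactions([4 * t for t in range(16)])   # -> 1

# 16 threads reading with a stride of 32 words: every access lands in
# its own segment, so each thread costs a separate transaction.
strided = count_transactions([4 * 32 * t for t in range(16)])  # -> 16
```

This 16x difference in transaction count is why the queue layout in a GPU BFS must keep the addresses read by consecutive threads adjacent whenever possible.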