28-04-2014, 12:12 PM
GPUS AND THE FUTURE OF PARALLEL COMPUTING
Although Moore's law has continued to provide smaller semiconductor devices, the effective end of clock-rate scaling and of aggressive uniprocessor performance scaling has pushed mainstream computing to adopt parallel hardware and software. Today's landscape includes various parallel chip architectures with a range of core counts, capability per core, and energy per core. Driven by graphics applications' enormous appetite for both computation and bandwidth, GPUs have emerged as the dominant massively parallel architecture available to the masses. GPUs are characterized by numerous simple yet energy-efficient computational cores, thousands of simultaneously active fine-grained threads, and large off-chip memory bandwidth.
Challenges for parallel-computing chips

Scaling the performance and capabilities of all parallel-processor chips, including GPUs, is challenging. First, as power supply voltage scaling has diminished, future architectures must become inherently more energy efficient. Second, the road map for memory bandwidth improvements is slowing down and falling further behind the computational capabilities available on die. Third, even after 40 years of research, parallel programming is far from a solved problem. Addressing these challenges will require research innovations that depart from the evolutionary path of conventional architectures and programming systems.
Power and energy

Because of leakage constraints, power supply voltage scaling has largely stopped, causing energy per operation to now scale only linearly with process feature size. Meanwhile, the number of devices that can be placed on a chip continues to increase quadratically with decreasing feature size. The result is that all computers, from mobile devices to supercomputers, have become or will become constrained by power and energy rather than area. Because we can place more processors on a chip than we can afford to power and cool, a chip's utility is largely determined by its performance at a particular power level, typically 3 W for a mobile device and 150 W for a desktop or server component.
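The scaling argument above can be sketched numerically. The model below is our illustration only: the 150 W full-chip baseline and the per-step shrink factors are assumptions, not figures from the article.

```python
# Back-of-envelope model of power-constrained scaling (illustrative
# assumptions, not article data): energy per operation falls only
# linearly with feature size while device count grows quadratically,
# so total power at full utilization grows linearly with each shrink,
# and a fixed budget caps the fraction of the chip we can keep active.

def active_fraction(scale, power_budget_w, base_power_w=150.0):
    """Fraction of on-chip devices a fixed power budget can run.

    scale: linear feature-size shrink factor (2 = features halved).
    Devices grow as scale**2; energy/op falls as 1/scale, so
    full-chip power grows as scale**2 * (1/scale) = scale.
    """
    full_chip_power = base_power_w * scale
    return min(1.0, power_budget_w / full_chip_power)

for shrink in (1, 2, 4):
    print(shrink, active_fraction(shrink, power_budget_w=150.0))
# 1 -> 1.0, 2 -> 0.5, 4 -> 0.25: each shrink halves the usable fraction
```

Under these assumptions, two successive halvings of feature size leave only a quarter of the chip affordable to power, which is why performance per watt, not area, becomes the figure of merit.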
Memory bandwidth and energy

The memory bandwidth bottleneck is a well-known computer system challenge that can limit application performance. Although GPUs provide more bandwidth than CPUs, the scaling trends of off-chip bandwidth relative to on-chip computing capability are not promising. Figure 1 shows the historical trends for single-precision performance, double-precision performance, and off-die memory bandwidth (indexed to the right y-axis) for high-end GPUs. Initially, memory bandwidth almost doubled every two years, but over the past few years this trend has slowed significantly. At the same time, the GPU's computational performance continues to grow at about 45 percent per year for single precision, and at a higher recent rate for double precision.
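The divergence between these two growth rates can be made concrete. The 45 percent per year compute growth is from the article; the 15 percent per year bandwidth growth and the baseline magnitudes below are our assumptions for illustration.

```python
# Hedged projection of the flops-to-bytes ratio a GPU application
# must sustain to stay compute-bound. Compute growth (45%/yr) is
# from the article; bandwidth growth (15%/yr) and the baselines
# are illustrative assumptions.

def flops_per_byte(years, flops0=1.0e12, bw0=1.0e11,
                   flops_growth=1.45, bw_growth=1.15):
    """Ratio of peak flops to off-chip bytes/s after `years` years."""
    flops = flops0 * flops_growth ** years
    bw = bw0 * bw_growth ** years
    return flops / bw

print(flops_per_byte(0))           # baseline ratio: 10.0 flops/byte
print(round(flops_per_byte(5), 1)) # roughly triples in five years
```

Even with these mild assumptions, the arithmetic intensity needed to avoid being bandwidth-bound roughly triples in five years, which motivates keeping data on chip rather than relying on DRAM bandwidth to keep pace.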
Echelon: A research GPU architecture

Our Nvidia Research team has embarked on the Echelon project to develop computing architectures that address the energy-efficiency and memory-bandwidth challenges and provide features that facilitate programming of scalable parallel systems. Echelon is a general-purpose fine-grained parallel-computing system that performs well on a range of applications, including traditional and emerging computational graphics as well as data-intensive and high-performance computing. At a 10 nm process technology in 2017, the Echelon project's initial performance target is a peak double-precision throughput of 16 Tflops, a memory bandwidth of 1.6 terabytes/second, and a power budget of less than 150 W.
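These three targets jointly imply an arithmetic-intensity and energy budget. The derivation below is our own arithmetic on the stated figures (16 Tflops double precision, 1.6 TB/s, 150 W):

```python
# Derived figures from the stated Echelon targets; the arithmetic
# is ours, the three input numbers are from the text above.
PEAK_FLOPS = 16e12   # peak double-precision flops/s
MEM_BW = 1.6e12      # off-chip bandwidth, bytes/s
POWER = 150.0        # power budget, watts

flops_per_byte = PEAK_FLOPS / MEM_BW      # 10 flops per off-chip byte
flops_per_double = flops_per_byte * 8     # 80 flops per 8-byte word fetched
pj_per_flop = POWER / PEAK_FLOPS * 1e12   # ~9.4 pJ per flop, all-in

print(flops_per_byte, flops_per_double, round(pj_per_flop, 2))
```

In other words, to hit peak throughput an application must perform about 80 double-precision operations per word fetched from memory, and the whole chip, memory system included, gets under 10 pJ per flop, which frames the energy-efficiency and locality themes of the rest of the design.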
Malleable memory system

Echelon aims to provide energy-efficient data access to applications with a range of characteristics. Applications with hierarchical data reuse, such as dense linear algebra, benefit from a deep caching hierarchy. Applications with a plateau in their working set benefit from a shallow caching hierarchy that can capture the entire plateau on chip. When programmers can identify their working sets, they can obtain high performance and efficiency by placing them into physically mapped on-chip storage (scratch pads, or shared memory in Nvidia terminology). Regardless of an application's access pattern, the key to both energy and bandwidth efficiency is to limit the volume of data transferred between memory hierarchy levels.
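The payoff from placing a working set in on-chip storage can be quantified with the classic blocked matrix-multiply bound. This model is our sketch, not Echelon's design; the scratchpad capacity and matrix size are illustrative assumptions.

```python
# Illustrative model of off-chip traffic for a dense n x n matrix
# multiply, with and without a scratchpad (our sketch; capacities
# are assumptions). Blocking through an on-chip store of M words
# cuts traffic from O(n^3) to O(n^3 / sqrt(M)) words, which is why
# capturing the working set on chip limits data moved between
# memory hierarchy levels.
import math

def dram_traffic_words(n, scratchpad_words=None):
    if scratchpad_words is None:
        # No on-chip reuse captured: stream a row of A and a column
        # of B for every output element.
        return 2 * n ** 3
    # Keep three b x b tiles (A, B, C blocks) resident on chip.
    b = int(math.sqrt(scratchpad_words / 3))
    return 2 * n ** 3 // b   # classic blocked-matmul traffic bound

n = 4096
print(dram_traffic_words(n))                              # unblocked
print(dram_traffic_words(n, scratchpad_words=48 * 1024))  # blocked, 48K words
```

With a 48K-word scratchpad the tile edge is 128, so off-chip traffic drops by a factor of 128, turning a bandwidth-bound kernel into a compute-bound one; the same data fits the energy argument, since a DRAM access costs far more energy than an on-chip one.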