28-04-2014, 12:12 PM
GPUS AND THE FUTURE OF PARALLEL COMPUTING
Although Moore's law has continued to provide smaller semiconductor devices, the effective end of clock-rate scaling and of aggressive uniprocessor performance scaling has pushed mainstream computing to adopt parallel hardware and software. Today's landscape includes various parallel chip architectures with a range of core counts, capability per core, and energy per core. Driven by graphics applications' enormous appetite for both computation and bandwidth, GPUs have emerged as the dominant massively parallel architecture available to the masses. GPUs are characterized by numerous simple yet energy-efficient computational cores, thousands of simultaneously active fine-grained threads, and large off-chip memory bandwidth.
Challenges for parallel-computing chips

Scaling the performance and capabilities of all parallel-processor chips, including GPUs, is challenging. First, as power supply voltage scaling has diminished, future architectures must become inherently more energy efficient. Second, the road map for memory bandwidth improvements is slowing down and falling further behind the computational capabilities available on die. Third, even after 40 years of research, parallel programming is far from a solved problem. Addressing these challenges will require research innovations that depart from the evolutionary path of conventional architectures and programming systems.
Power and energy

Because of leakage constraints, power supply voltage scaling has largely stopped, causing energy per operation to now scale only linearly with process feature size. Meanwhile, the number of devices that can be placed on a chip continues to increase quadratically with decreasing feature size. The result is that all computers, from mobile devices to supercomputers, have become or will become constrained by power and energy rather than area. Because we can place more processors on a chip than we can afford to power and cool, a chip's utility is largely determined by its performance at a particular power level, typically 3 W for a mobile device and 150 W for a desktop or server component.
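The scaling argument above can be sketched numerically. The model below is our illustration only: the 150 W full-chip baseline and the per-step shrink factors are assumptions, not figures from the article.

```python
# Back-of-envelope model of power-constrained scaling (illustrative
# assumptions, not article data): energy per operation falls only
# linearly with feature size while device count grows quadratically,
# so total power at full utilization grows linearly with each shrink,
# and a fixed budget caps the fraction of the chip we can keep active.

def active_fraction(scale, power_budget_w, base_power_w=150.0):
    """Fraction of on-chip devices a fixed power budget can run.

    scale: linear feature-size shrink factor (2 = features halved).
    Devices grow as scale**2; energy/op falls as 1/scale, so
    full-chip power grows as scale**2 * (1/scale) = scale.
    """
    full_chip_power = base_power_w * scale
    return min(1.0, power_budget_w / full_chip_power)

for shrink in (1, 2, 4):
    print(shrink, active_fraction(shrink, power_budget_w=150.0))
# 1 -> 1.0, 2 -> 0.5, 4 -> 0.25: each shrink halves the usable fraction
```

Under these assumptions, two successive halvings of feature size leave only a quarter of the chip affordable to power, which is why performance per watt, not area, becomes the figure of merit.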
Memory bandwidth and energy

The memory bandwidth bottleneck is a well-known computer system challenge that can limit application performance. Although GPUs provide more bandwidth than CPUs, the scaling trends of off-chip bandwidth relative to on-chip computing capability are not promising. Figure 1 shows the historical trends for single-precision performance, double-precision performance, and off-die memory bandwidth (indexed to the right y-axis) for high-end GPUs. Initially, memory bandwidth almost doubled every two years, but over the past few years this trend has slowed significantly. At the same time, the GPU's computational performance continues to grow at about 45 percent per year for single precision, and at a higher recent rate for double precision.
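The divergence between these two growth rates can be made concrete. The 45 percent per year compute growth is from the article; the 15 percent per year bandwidth growth and the baseline magnitudes below are our assumptions for illustration.

```python
# Hedged projection of the flops-to-bytes ratio a GPU application
# must sustain to stay compute-bound. Compute growth (45%/yr) is
# from the article; bandwidth growth (15%/yr) and the baselines
# are illustrative assumptions.

def flops_per_byte(years, flops0=1.0e12, bw0=1.0e11,
                   flops_growth=1.45, bw_growth=1.15):
    """Ratio of peak flops to off-chip bytes/s after `years` years."""
    flops = flops0 * flops_growth ** years
    bw = bw0 * bw_growth ** years
    return flops / bw

print(flops_per_byte(0))           # baseline ratio: 10.0 flops/byte
print(round(flops_per_byte(5), 1)) # roughly triples in five years
```

Even with these mild assumptions, the arithmetic intensity needed to avoid being bandwidth-bound roughly triples in five years, which motivates keeping data on chip rather than relying on DRAM bandwidth to keep pace.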
Echelon: A research GPU architecture

Our Nvidia Research team has embarked on the Echelon project to develop computing architectures that address the energy-efficiency and memory-bandwidth challenges and provide features that facilitate programming of scalable parallel systems. Echelon is a general-purpose fine-grained parallel-computing system that performs well on a range of applications, including traditional and emerging computational graphics as well as data-intensive and high-performance computing. At a 10 nm process technology in 2017, the Echelon project's initial performance target is a peak double-precision throughput of 16 Tflops, a memory bandwidth of 1.6 terabytes/second, and a power budget of less than 150 W.
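These three targets jointly imply an arithmetic-intensity and energy budget. The derivation below is our own arithmetic on the stated figures (16 Tflops double precision, 1.6 TB/s, 150 W):

```python
# Derived figures from the stated Echelon targets; the arithmetic
# is ours, the three input numbers are from the text above.
PEAK_FLOPS = 16e12   # peak double-precision flops/s
MEM_BW = 1.6e12      # off-chip bandwidth, bytes/s
POWER = 150.0        # power budget, watts

flops_per_byte = PEAK_FLOPS / MEM_BW      # 10 flops per off-chip byte
flops_per_double = flops_per_byte * 8     # 80 flops per 8-byte word fetched
pj_per_flop = POWER / PEAK_FLOPS * 1e12   # ~9.4 pJ per flop, all-in

print(flops_per_byte, flops_per_double, round(pj_per_flop, 2))
```

In other words, to hit peak throughput an application must perform about 80 double-precision operations per word fetched from memory, and the whole chip, memory system included, gets under 10 pJ per flop, which frames the energy-efficiency and locality themes of the rest of the design.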
Malleable memory system

Echelon aims to provide energy-efficient data access to applications with a range of characteristics. Applications with hierarchical data reuse, such as dense linear algebra, benefit from a deep caching hierarchy. Applications with a plateau in their working set benefit from a shallow caching hierarchy that can capture the entire plateau on chip. When programmers can identify their working sets, they can obtain high performance and efficiency by placing them into physically mapped on-chip storage (scratch pads, or shared memory in Nvidia terminology). Regardless of an application's access pattern, the key to both energy and bandwidth efficiency is to limit the volume of data transferred between memory hierarchy levels.
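The payoff from placing a working set in on-chip storage can be quantified with the classic blocked matrix-multiply bound. This model is our sketch, not Echelon's design; the scratchpad capacity and matrix size are illustrative assumptions.

```python
# Illustrative model of off-chip traffic for a dense n x n matrix
# multiply, with and without a scratchpad (our sketch; capacities
# are assumptions). Blocking through an on-chip store of M words
# cuts traffic from O(n^3) to O(n^3 / sqrt(M)) words, which is why
# capturing the working set on chip limits data moved between
# memory hierarchy levels.
import math

def dram_traffic_words(n, scratchpad_words=None):
    if scratchpad_words is None:
        # No on-chip reuse captured: stream a row of A and a column
        # of B for every output element.
        return 2 * n ** 3
    # Keep three b x b tiles (A, B, C blocks) resident on chip.
    b = int(math.sqrt(scratchpad_words / 3))
    return 2 * n ** 3 // b   # classic blocked-matmul traffic bound

n = 4096
print(dram_traffic_words(n))                              # unblocked
print(dram_traffic_words(n, scratchpad_words=48 * 1024))  # blocked, 48K words
```

With a 48K-word scratchpad the tile edge is 128, so off-chip traffic drops by a factor of 128, turning a bandwidth-bound kernel into a compute-bound one; the same data fits the energy argument, since a DRAM access costs far more energy than an on-chip one.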