02-08-2013, 12:15 PM
GPU Computing: The Democratization of Parallel Computing
Parallel Computing’s Golden Age
1980s, early '90s: a golden age for parallel computing
Particularly data-parallel computing
Architectures
Connection Machine, MasPar, Cray
True supercomputers: incredibly exotic, powerful, expensive
Algorithms, languages, & programming models
Solved a wide variety of problems
Various parallel algorithmic models developed
Enter the GPU
GPUs are massively multithreaded manycore chips
NVIDIA Tesla products have up to 128 scalar processors
Over 12,000 concurrent threads in flight
Over 470 GFLOPS sustained performance
Users across science & engineering disciplines are achieving 100x or better speedups on GPUs
CS researchers can use GPUs as a research platform for manycore computing: arch, PL, numeric, ...
Enter CUDA
CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model
NVIDIA’s TESLA GPU architecture accelerates CUDA
Expose the computational horsepower of NVIDIA GPUs
Enable general-purpose GPU computing
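As a sketch of those minimal C extensions, a hypothetical element-wise add kernel might look like the following (the kernel name, sizes, and launch configuration are illustrative, not from the slides; `cudaThreadSynchronize` is the synchronization call of this CUDA generation):

```cuda
#include <cuda_runtime.h>

// __global__ marks a function as a kernel that runs on the device.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one element from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    // Allocate device memory, then launch with the <<<blocks, threads>>> syntax.
    cudaMalloc((void **)&a, bytes);
    cudaMalloc((void **)&b, bytes);
    cudaMalloc((void **)&c, bytes);

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaThreadSynchronize();  // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Everything outside the kernel and the launch syntax is ordinary C, which is the sense in which CUDA is a minimal extension of the familiar environment.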
Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
No device or CUDA driver is needed
Each device thread is emulated with a host thread
When running in device emulation mode, one can:
Use host native debug support (breakpoints, inspection, etc.)
Access any device-specific data from host code and vice versa
Call any host function from device code (e.g. printf) and vice versa
Detect deadlock situations caused by improper usage of __syncthreads
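A sketch of what emulation mode permits, assuming the -deviceemu toolchain of this CUDA generation (the kernel and data are made up for illustration): the kernel below calls the host's printf, which only works because each device thread is really a host thread under emulation, not on real hardware of this era.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void debugKernel(int *data)
{
    // Legal only under nvcc -deviceemu: calling a host function
    // (printf) from device code, since this "device" thread is
    // actually a host thread.
    printf("thread %d sees %d\n", threadIdx.x, data[threadIdx.x]);
}

int main(void)
{
    int host[4] = {10, 20, 30, 40};
    int *dev;
    cudaMalloc((void **)&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    debugKernel<<<1, 4>>>(dev);   // build with: nvcc -deviceemu emu.cu
    cudaThreadSynchronize();

    cudaFree(dev);
    return 0;
}
```

The same build is also where a host debugger can set breakpoints inside the kernel body.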
Host Synchronization
All kernel launches are asynchronous
control returns to CPU immediately
kernel executes after all previous CUDA calls have completed
cudaMemcpy is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete
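The three behaviors above can be sketched together in one host program (the kernel and sizes are illustrative assumptions):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    // Kernel launches are asynchronous: both calls return to the CPU
    // immediately; the second kernel runs only after the first completes.
    scale<<<n / 256, 256>>>(d, 2.0f, n);
    scale<<<n / 256, 256>>>(d, 0.5f, n);

    // Blocks the host until all previous CUDA calls complete.
    cudaThreadSynchronize();

    float *h = (float *)malloc(n * sizeof(float));
    // cudaMemcpy is synchronous: the copy starts after all previous CUDA
    // calls have completed and control returns only after it finishes.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    free(h);
    cudaFree(d);
    return 0;
}
```

Because the memcpy already waits for preceding CUDA calls, the explicit cudaThreadSynchronize here is redundant for correctness; it is shown to make the blocking point visible.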