12-12-2012, 11:44 AM
General Purpose Computation on Graphics Processing Units (GPGPU) using CUDA
Introduction
Graphics processing units (GPUs) are specialized processors that were traditionally used to accelerate computer graphics by offloading work from the CPU. Today, GPUs are highly parallel many-core processors that enable general-purpose computation on graphics processing units (GPGPU). GPGPU has been a topic since 2002, but broad interest did not develop until Nvidia released the CUDA platform in 2007 and developers and researchers started to use it for parallel programming. The current high visibility of GPGPU in science and practice, especially in parallel, scientific, and high-performance computing (HPC), is one reason for this paper. Further motivation arises from an interest in computer graphics and parallel computing.
Nvidia’s CUDA architecture was chosen for this paper because it was the dominant platform for GPGPU at the time of writing. Nvidia was the first vendor to bring to market a comprehensive architecture combining a high degree of programmability, performance, and ease of use. However, CUDA is challenged by AMD’s alternative ATI Stream [AMD09a] as well as by two standardization approaches, OpenCL [Kh09b] and DirectX 11 DirectCompute [MS09d]. Nvidia also faces competition in other markets. Intel and AMD both pursue a platform strategy, combining x86 CPUs, graphics, and chipsets, and are trying to push Nvidia out of the chipset market [Wi09]. While Nvidia confirms that it has no intention of building x86 processors [Cr09], GPGPU, HPC, and parallel computing have become a major strategic pathway for the company. This shows in substantial research, marketing, and collaboration efforts, e.g. lectures, tutorials, student scholarships, and partnerships with professors, universities, and software development companies, which have resulted in a large number of scientific publications and parallel applications [NVI09m].
GPGPU Basics
This chapter introduces the basics of computer graphics and graphics hardware that form the background of GPGPU. Some basic terms for computer graphics objects are needed along the way: Geometric primitives are simple atomic geometric objects such as points, lines, triangles, or other polygons. The corner points of these objects are called vertices. Another basic object is the fragment, which is the basis for a pixel. In addition to the color value, a fragment contains other information that is needed before the pixel is drawn, e.g. the position, the depth, or the alpha value (for transparency) [Ha06, Ih09 pp. 9-10].
2.1 Graphics Pipeline
A graphics pipeline (also called rendering pipeline) is a model that describes the different steps performed to render a scene. The pipeline concept can be compared to the CPU instruction pipeline: the individual steps run in parallel, but each has to wait until the slowest step is finished. One simple model of a (fixed-function) graphics pipeline is depicted in Fig. 1.
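The comparison with an instruction pipeline can be made concrete with a small timing model. The sketch below assumes a lock-step pipeline in which all stages hand on their results at the same moment, so the cycle time equals the duration of the slowest stage. The stage names and durations are hypothetical and are not taken from Fig. 1.

```python
# Hypothetical stage durations in milliseconds per work item (illustrative only).
STAGE_MS = {
    "vertex processing": 2,
    "rasterization": 5,
    "fragment processing": 3,
}


def cycle_time(stage_ms):
    """In a lock-step pipeline every stage waits for the slowest one."""
    return max(stage_ms.values())


def pipelined_time(stage_ms, items):
    """Time to push `items` work items through the pipeline.

    The pipeline is filled once (one cycle per stage); after that,
    one item leaves the pipeline per cycle.
    """
    if items <= 0:
        return 0
    return (len(stage_ms) + items - 1) * cycle_time(stage_ms)


def sequential_time(stage_ms, items):
    """Time if each item passes through all stages before the next one starts."""
    return sum(stage_ms.values()) * max(items, 0)


if __name__ == "__main__":
    n = 10
    print("cycle time:", cycle_time(STAGE_MS), "ms")
    print("pipelined: ", pipelined_time(STAGE_MS, n), "ms")
    print("sequential:", sequential_time(STAGE_MS, n), "ms")
```

In this model, speeding up any stage other than the slowest one does not change the throughput, which is why balancing the stages matters.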
Graphics APIs
Graphics APIs provide programmers with a high level of abstraction and simplify the software development process by hiding the complexity and capabilities of graphics hardware and device drivers. The two most important graphics APIs are briefly introduced in the following.
Direct3D is an API for drawing 3D graphics and the most prominent component of DirectX, the comprehensive API collection for multimedia applications on Microsoft platforms (Windows and Xbox). An advantage for programmers using DirectX is its huge market penetration, which enables Microsoft to define minimum hardware specifications for graphics components in collaboration with the graphics vendors. It can be criticized for being proprietary, for its low backward compatibility, and for its short release cycles [BB03 p. 4]. However, the last two points also provide the basis for innovation: Until Direct3D 10, the most interesting development for GPGPU was the introduction of different shader models (cf. Section 2.3). The current version 11 was released in October 2009 and features hardware support for tessellation, which increases the number of polygons by subdividing polygons at runtime within the GPU pipeline, improved multi-threading support (for multi-core CPUs), and DirectCompute, Microsoft’s new approach to GPGPU [Be09a].
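The idea behind tessellation, increasing the polygon count by subdivision, can be illustrated with a few lines of code. The following is only a CPU-side sketch under simple assumptions: it splits each triangle into four at its edge midpoints. It does not show how Direct3D 11 exposes tessellation, and all function names are hypothetical.

```python
def midpoint(p, q):
    """Point halfway between two vertices given as coordinate tuples."""
    return tuple((a + b) / 2 for a, b in zip(p, q))


def subdivide(triangle):
    """Split one triangle into four smaller ones at its edge midpoints."""
    a, b, c = triangle
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [
        (a, ab, ca),    # corner triangle at vertex a
        (ab, b, bc),    # corner triangle at vertex b
        (ca, bc, c),    # corner triangle at vertex c
        (ab, bc, ca),   # centre triangle
    ]


def tessellate(triangles, levels):
    """Apply the subdivision `levels` times to a list of triangles."""
    for _ in range(levels):
        triangles = [small for tri in triangles for small in subdivide(tri)]
    return triangles


if __name__ == "__main__":
    base = [((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))]
    for level in range(4):
        print(level, "levels ->", len(tessellate(base, level)), "triangles")
```

Each level multiplies the triangle count by four, so a coarse mesh can be stored and transferred while the detail is generated where it is needed.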
Graphics Hardware
In modern PCs, GPUs are located either on a dedicated graphics card or on the motherboard as an integrated graphics solution. The latter usually have little or no graphics memory of their own, compete with the CPU for main memory, and sit at the lower end of the price and performance spectrum. However, their computing power is generally sufficient for simple 2D and 3D graphics tasks. Problems arise e.g. with complex 3D video games at high resolutions, CAD software, or GPGPU. High-performance GPUs are typically only available as dedicated graphics cards. The cards are connected to the system via an expansion slot, currently PCI Express (PCIe) v2.0, which uses point-to-point serial links. A link is composed of one to 32 lanes, each lane carrying 500 MB/s. Most contemporary cards are connected via 16 lanes, which allows for a data transfer speed of 8 GB/s (full duplex) [PCI09]. It will later become clear that this is a major bottleneck for GPGPU applications.
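The bandwidth figures above follow from simple arithmetic, which the sketch below reproduces. It uses only the numbers given in the text (500 MB/s per lane, one to 32 lanes) and assumes decimal units (1 GB = 1000 MB) and ideal peak throughput without protocol overhead. The function names and the 256 MB example buffer are hypothetical.

```python
MB_PER_S_PER_LANE = 500  # PCIe v2.0 per-lane rate as stated in the text


def link_bandwidth_gb_per_s(lanes):
    """Peak bandwidth of a PCIe v2.0 link in GB/s (1 GB = 1000 MB)."""
    if not 1 <= lanes <= 32:
        raise ValueError("a link is composed of one to 32 lanes")
    return lanes * MB_PER_S_PER_LANE / 1000


def transfer_seconds(num_bytes, lanes=16):
    """Idealized time to copy `num_bytes` over the link, ignoring overhead."""
    bytes_per_second = link_bandwidth_gb_per_s(lanes) * 1e9
    return num_bytes / bytes_per_second


if __name__ == "__main__":
    print("x16 link:", link_bandwidth_gb_per_s(16), "GB/s")
    # Hypothetical example: copying a 256 MB input buffer to the card.
    print("256 MB over x16:", transfer_seconds(256_000_000) * 1000, "ms")
```

Even under these ideal assumptions, moving a 256 MB buffer takes tens of milliseconds, so a GPGPU program that copies data back and forth for every small computation can spend more time on transfers than on computing.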
Traditional GPGPU
Traditional GPGPU was already possible in 2002. The prerequisites were increasing performance and programmability, the latter realized through graphics shaders and the introduction of more complex and precise data types. Early GPUs operated on eight-bit integers (pixels with 256 colors). Floating-point data types with different degrees of precision were added later [Ha06 ch. 2]. The first GPGPU programs used the graphics APIs directly and were hence written in HLSL, GLSL, or Cg. These programs had to use the computational units on the graphics card in a restrictive way, with a distinct role for each unit: The texture unit served as read-only memory and the framebuffer as write-only memory. The vertex and pixel shaders executed the kernels, and the rasterizer was used for address calculation [Ha06].
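The division of roles described above can be mimicked on the CPU to make the programming model easier to picture. The following is a conceptual sketch only, not shader code: nested tuples stand in for read-only textures, a fresh list of lists stands in for the write-only framebuffer, the double loop plays the part of the rasterizer generating one address per fragment, and a plain function plays the part of the pixel shader kernel. All names are hypothetical.

```python
def run_render_pass(textures, kernel, width, height):
    """Emulate one traditional GPGPU pass on the CPU.

    textures: read-only inputs (tuples of tuples), like bound textures
    kernel:   function called once per fragment, like a pixel shader
    returns:  the framebuffer, written exactly once per fragment
    """
    framebuffer = [[None] * width for _ in range(height)]
    # The "rasterizer": produces the (x, y) address of every fragment.
    for y in range(height):
        for x in range(width):
            # The kernel may read any texture position but can only
            # produce the value for its own output address.
            framebuffer[y][x] = kernel(textures, x, y)
    return framebuffer


def add_kernel(textures, x, y):
    """Element-wise addition of two input textures."""
    a, b = textures
    return a[y][x] + b[y][x]


if __name__ == "__main__":
    tex_a = ((1.0, 2.0), (3.0, 4.0))
    tex_b = ((10.0, 20.0), (30.0, 40.0))
    result = run_render_pass((tex_a, tex_b), add_kernel, width=2, height=2)
    print(result)
```

The sketch also shows the restriction of this model: a kernel cannot write to an arbitrary output location, and the result of one pass can only be reused by feeding the framebuffer back in as a texture for the next pass.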