CUDA-based Parallel Implementation of JPEG Compression
CHAPTER 1
INTRODUCTION
An important part of data processing is image processing. The objective of image compression is to reduce the irrelevance and redundancy of image data so that the data can be stored or transmitted in an efficient form. One of the main problems with storing and processing images is that an image contains a large amount of data and requires a large number of computations. The computational complexity involved in image compression can be handled efficiently, and performance improved, through parallel program designs. Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. NVIDIA Corporation officially released CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Compared with the traditional GPU, the CUDA GPU has a significantly improved architecture: CUDA abandons the separate programmable units of the traditional GPU and adopts a unified processing architecture in hardware. This change leads to more efficient use of distributed computing resources and is more conducive to general-purpose computing. The powerful parallel computing capability of the CUDA GPU can readily improve the processing speed of JPEG image compression.
CHAPTER 2
CUDA – COMPUTE UNIFIED DEVICE ARCHITECTURE
General-purpose computing on GPUs is an increasingly active field. A graphics processing unit (GPU), also occasionally called a visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where large blocks of data are processed in parallel. In a personal computer, a GPU can be present on a video card, on the motherboard, or, in certain CPUs, on the CPU die.
CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. NVIDIA graphics processing units (GPUs) implement the CUDA architecture and programming model. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives (such as OpenACC), and extensions to industry-standard programming languages, including C, C++ and Fortran. C/C++ programmers use 'CUDA C/C++' (C/C++ with CUDA extensions to express parallelism, data locality, and thread cooperation, as well as some restrictions), compiled with "nvcc", NVIDIA's LLVM-based C/C++ compiler,[2] to code algorithms for execution on the GPU. Fortran programmers can use 'CUDA Fortran' (Fortran with CUDA extensions to express parallelism, data locality, and thread cooperation, as well as some restrictions), compiled with the PGI CUDA Fortran compiler from The Portland Group.
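As a brief illustration of these extensions, the following is a minimal sketch of a CUDA C program (the kernel name and launch configuration are illustrative, not taken from the report). The __global__ qualifier marks a function that runs on the GPU, the built-in threadIdx and blockIdx variables identify each thread, and the <<<blocks, threads>>> syntax launches the kernel from host code; such a file would be compiled with nvcc, for example nvcc hello.cu -o hello.

// hello.cu - minimal CUDA C sketch (illustrative example).
// Device-side printf requires a GPU of compute capability 2.0 or later.
#include <cstdio>

__global__ void hello()
{
    // Each launched thread prints its own identifiers.
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main()
{
    hello<<<2, 4>>>();          // launch 2 blocks of 4 threads each (8 threads in total)
    cudaDeviceSynchronize();    // wait until the kernel (and its output) has finished
    return 0;
}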
Using CUDA, the latest Nvidia GPUs become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly. This approach to solving general-purpose problems on GPUs is known as GPGPU. In the computer game industry, in addition to graphics rendering, GPUs are used in game physics calculations (physical effects such as debris, smoke, fire, and fluids). CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more. CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which supersedes the beta released February 14, 2008. CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems. Nvidia states that programs developed for the G8x series will also work without modification on all future Nvidia video cards, due to binary compatibility.
2.1 CUDA Threads
The parallel computing function that runs on the GPU is called a kernel, which, when called, is executed N times in parallel by N different CUDA threads. Each of the threads that executes a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. Threads within a block can cooperate among themselves by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core, much like an L1 cache, __syncthreads() is expected to be lightweight, and all threads of a block are expected to reside on the same processor core. The number of threads per block is therefore restricted by the limited memory resources of a processor core. On current GPUs, a thread block may contain up to 512 threads. However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks. These multiple blocks are organized into a one-dimensional or two-dimensional grid of thread blocks. Each block within the grid can be identified by a one-dimensional or two-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable. Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.
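To make these ideas concrete, the following kernel is a small, hypothetical sketch (not taken from the report) that sums 256 input elements per block. It uses threadIdx, blockIdx and blockDim to compute each thread's indices, stages data in shared memory, and uses __syncthreads() as the barrier described above. It assumes it is launched with 256 threads per block, for example blockSum<<<(n + 255) / 256, 256>>>(in, partialSums, n), and is compiled with nvcc.

// Hypothetical kernel: each block reduces 256 input elements to one partial sum.
#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float cache[BLOCK_SIZE];              // shared among the threads of one block

    int tid = threadIdx.x;                           // index within the block
    int i   = blockIdx.x * blockDim.x + threadIdx.x; // global index within the grid

    cache[tid] = (i < n) ? in[i] : 0.0f;             // each thread loads one element
    __syncthreads();                                 // barrier: all loads must finish first

    // Tree reduction in shared memory; every step needs a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResults[blockIdx.x] = cache[0];         // one partial sum per block
}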
2.2 Host – Device Program Model
The concept of ‘Host’ and ‘Device’ is introduced in the CUDA program model. It makes the CPU the Host, and the GPU a coprocessor Device. There can be one Host and a number of Devices in a single system. CUDA’s programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU. Under this model, Host and Device work together: the CPU is responsible for logic-heavy, serial computation, while the GPU focuses on highly threaded parallel processing tasks. The Host and Device have independent memory spaces. The parallel computing function running on the GPU is called a “kernel”. A kernel is not a complete program by itself; a complete CUDA program consists of serial host code together with one or more parallel kernel steps executed on the device.
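The typical division of labour can be sketched as follows (a hypothetical example; names such as scale and h_data are illustrative). The Host allocates device memory, copies the input to the Device, launches the kernel, copies the result back, and frees the memory, while the Device executes the parallel kernel.

// host_device.cu - hypothetical sketch of the Host-Device program flow.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);            // host (CPU) memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);                // device (GPU) memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);   // serial host code launches the parallel kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    printf("h_data[0] = %f\n", h_data[0]);              // expect 2.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}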
2.3 Advantages of CUDA
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:
• Scattered reads – code can read from arbitrary addresses in memory
• Shared memory – CUDA exposes a fast shared memory region (up to 48 KB per multiprocessor) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups (see the sketch after this list).
• Faster downloads and readbacks to and from the GPU
• Full support for integer and bitwise operations, including integer texture lookups
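The following kernel is a hypothetical sketch of the user-managed-cache idea from the shared-memory point above (names such as stencil1D and the tile size are illustrative). Each block copies its tile of the input, plus a small halo, into shared memory once; every thread then reads its neighbours from the fast on-chip tile instead of issuing repeated global-memory loads. It assumes the array length is a multiple of the block size and would be launched as stencil1D<<<n / 256, 256>>>(in, out, n).

// Hypothetical sketch: shared memory as a user-managed cache for a 1D stencil.
#define RADIUS     3
#define BLOCK_SIZE 256

__global__ void stencil1D(const float *in, float *out, int n)
{
    // Assumes n is a multiple of BLOCK_SIZE, so every thread is active.
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lidx = threadIdx.x + RADIUS;                    // index into the shared tile

    tile[lidx] = in[gidx];                              // each thread stages its own element
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also load the left and right halo elements,
        // clamping at the array boundaries.
        int left  = gidx - RADIUS;
        int right = gidx + BLOCK_SIZE;
        tile[lidx - RADIUS]     = in[left  < 0  ? 0     : left];
        tile[lidx + BLOCK_SIZE] = in[right >= n ? n - 1 : right];
    }
    __syncthreads();                                    // the tile must be complete before use

    float sum = 0.0f;                                   // simple moving-average stencil
    for (int offset = -RADIUS; offset <= RADIUS; ++offset)
        sum += tile[lidx + offset];
    out[gidx] = sum / (2 * RADIUS + 1);
}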
2.4 Limitations of CUDA
• Texture rendering is not supported (CUDA 3.2 and up addresses this by introducing "surface writes" to CUDA arrays, the underlying opaque data structure).
• Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency; this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine (a sketch of this technique follows the list below).
• Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not impact performance significantly, provided that all 32 threads in a group take the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
• Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia
• Valid C/C++ may sometimes be flagged and prevent compilation due to optimization techniques the compiler is required to employ to use limited resources.
• CUDA (with compute capability 1.x) uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.
• CUDA (with compute capability 2.x) allows a subset of C++ class functionality, for example member functions may not be virtual (this restriction will be removed in some future release).
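As a hypothetical illustration of the asynchronous-transfer mitigation mentioned above (names and sizes are illustrative), the following sketch splits the data into two halves, each processed in its own CUDA stream. Pinned host memory and cudaMemcpyAsync allow a copy queued in one stream to overlap with the kernel running in the other stream, hiding part of the bus-transfer cost.

// async_streams.cu - hypothetical sketch of overlapping transfers and compute.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const int chunk = n / 2;                       // process the data in two halves
    size_t bytes = n * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);       // pinned host memory (needed for async copies)
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int s = 0; s < 2; ++s) {
        int offset = s * chunk;
        // Copy in, compute, and copy out are queued in one stream; work queued
        // in the other stream can overlap with these transfers.
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, 2.0f, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                       // wait for both streams to finish

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}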