30-11-2012, 06:17 PM
CUDA PROGRAMMING
Objective
The basic objective is to select algorithms dominated by vector operations, which CUDA handles well, and improve their running time by hand.
The project will demonstrate the speed-up achieved by porting these slow algorithms to a parallel processing platform.
The performance increase will be attributed to the programming techniques used, and the best methodology for implementing parallel algorithms will be investigated.
Progress
Investigation of concept.
Study of papers with a similar premise; further literature review will continue alongside the project.
Installation and setup of CUDA on Windows and Linux platforms.
Execution of a basic program with the device emulator.
Hardware has been shortlisted and will be purchased shortly.
Thread Hierarchy
In the CUDA programming model, a kernel launch creates a grid.
The grid contains a number of blocks, which in turn contain threads.
Threads and blocks have IDs, so each thread can determine which data to work on.
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
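The ID scheme above can be sketched in a small kernel. This is an illustrative example, not from the slides; the kernel name, parameters, and array layout are assumptions, and it shows how a 1D block ID and 1D thread ID combine into a unique global index:

```cuda
// Hypothetical kernel: each thread scales one element of an array.
__global__ void scale(float *data, float factor, int n)
{
    // Combine the block ID and thread ID into a unique global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard: the grid may cover more than n elements
        data[i] = data[i] * factor;
}
```

Because every thread computes a different `i`, each one decides for itself which element to work on, with no coordination needed.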
KERNELS
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute the kernel for a given call is specified using the <<<…>>> execution configuration syntax.
Each thread that executes the kernel is given a unique thread ID, accessible within the kernel through the built-in threadIdx variable.
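A minimal sketch of a kernel definition and launch follows; the vector-add kernel and host-side names are illustrative. It assumes a single block, so the array length must not exceed the per-block thread limit of the device:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: executed N times in parallel, once per thread.
__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;              // built-in thread ID
    C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);     // execution configuration: 1 block, N threads
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[10] = %f\n", hC[10]);   // 10 + 20 = 30
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```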
Memory Hierarchy
CUDA threads may access data from multiple memory spaces during their execution.
Each thread has private local memory.
Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block.
All threads have access to the same global memory.
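The three memory spaces can be seen side by side in one kernel. This is a sketch, not code from the slides; the kernel name and the fixed tile size are assumptions (it requires blockDim.x ≤ 256):

```cuda
// Hypothetical block-sum kernel showing the three memory spaces.
__global__ void sumBlock(const float *in, float *out)
{
    __shared__ float tile[256];       // shared memory: visible to the whole block,
                                      // lives as long as the block does
    int local = threadIdx.x;          // per-thread locals live in registers /
    int i = blockIdx.x * blockDim.x + local;   // private local memory

    tile[local] = in[i];              // "in" and "out" reside in global memory,
    __syncthreads();                  // accessible to all threads; barrier so the
                                      // whole tile is loaded before it is read
    if (local == 0) {
        float s = 0.0f;               // one thread per block reduces the tile
        for (int k = 0; k < blockDim.x; ++k)
            s += tile[k];
        out[blockIdx.x] = s;          // one partial sum per block in global memory
    }
}
```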
Initialization
There is no explicit initialization function for the runtime; it initializes itself the first time a runtime function is called (more precisely, any function other than those from the device and version management sections of the reference manual).
Keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.
During initialization, the runtime creates a CUDA context for each device in the system. This context is the primary context for this device and it is shared among all the host threads of the application. This all happens under the hood and the runtime does not expose the primary context to the application.
cudaDeviceReset() - when a host thread calls this function, it destroys the primary context of the device the host thread currently operates on.
The next runtime function call made by any host thread that has this device as current will create a new primary context for this device.
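One practical consequence is worth sketching: since initialization happens lazily on the first runtime call, the cost of context creation can distort the first measurement. A common idiom (an assumption here, not from the slides) is to issue a harmless runtime call first:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    // cudaFree(0) is a harmless runtime call whose only purpose here is to
    // trigger lazy initialization, so context creation is not charged to
    // the first timed call below.
    cudaFree(0);

    cudaEvent_t start, stop;          // timing now excludes context creation
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // ... record events around kernel launches here ...
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    cudaDeviceReset();                // destroys the primary context; the next
    return 0;                         // runtime call would create a new one
}
```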