Seminar Report: PEPSC, A POWER-EFFICIENT PROCESSOR FOR SCIENTIFIC COMPUTING
ABSTRACT
The rapid advancements in the computational capabilities of the graphics processing unit (GPU), as well as the deployment of general programming models for these devices, have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cost of a high-end laptop computer. While these devices have clearly changed the landscape of computing, there are two central problems that arise. First, GPUs are designed and optimized for graphics applications, resulting in delivered performance that is far below peak for more general scientific and mathematical applications. Second, GPUs are power-hungry devices that often consume 100-300 watts, which restricts the scalability of the solution and requires expensive cooling. To combat these challenges, this paper presents the PEPSC architecture, an architecture customized for the domain of data-parallel scientific applications where power efficiency is the central focus. PEPSC utilizes a combination of a two-dimensional single-instruction multiple-data (SIMD) datapath, an intelligent dynamic prefetching mechanism, and a configurable SIMD control approach to increase execution efficiency over conventional GPUs. A single PEPSC core has a peak performance of 120 GFLOPs while consuming 2 W of power when executing modern scientific applications, which represents an increase in computation efficiency of more than 10X over existing GPUs.
INTRODUCTION
Scientists have traditionally relied on large-scale supercomputers to deliver the computational horsepower to solve their problems. This landscape is rapidly changing, as relatively cheap computer systems that deliver supercomputer-level performance can be assembled from commodity multicore chips available from Intel, AMD, and Nvidia. For example, the Intel Xeon X7560, which uses the Nehalem microarchitecture, has a peak performance of 144 GFLOPs (8 cores, each with a 4-wide SSE unit, running at 2.266 GHz) with a total power dissipation of 130 Watts. The AMD Radeon 6870 graphics processing unit (GPU) can deliver a peak performance of nearly 2 TFLOPs (1120 stream processor cores running at 900 MHz) with a total power dissipation of 256 Watts. For some applications, including medical imaging, electronic design automation, physics simulations, and stock-pricing models, GPUs present a more attractive option in terms of performance, with speedups of up to 300X over conventional x86 processors. However, these speedups are not universal, as they depend heavily on both the nature of the application and the performance optimizations applied by the programmer.
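As a quick sanity check on these peak figures, the arithmetic is simply cores x SIMD lanes x FLOPs per lane per cycle x clock frequency. The sketch below is a minimal illustration, assuming 2 FLOPs per lane per cycle (paired multiply and add issue); this factor is an assumption for illustration, not a vendor-published formula.

#include <cstdio>

// Peak throughput = cores x SIMD lanes x FLOPs per lane per cycle x GHz.
// The factor of 2 FLOPs/lane/cycle assumes each lane can issue a multiply
// and an add every cycle (an illustrative assumption).
static double peak_gflops(int cores, int lanes, int flops_per_cycle, double ghz) {
    return cores * lanes * flops_per_cycle * ghz;
}

int main() {
    std::printf("Xeon X7560:  ~%.0f GFLOPs\n", peak_gflops(8, 4, 2, 2.266));    // ~145
    std::printf("Radeon 6870: ~%.0f GFLOPs\n", peak_gflops(1120, 1, 2, 0.900)); // ~2016
    return 0;
}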
APPLICATION ANALYSIS ON GPUs
GPUs are currently the preferred solution for scientific computing, but they have their own set of inefficiencies. In order to motivate an improved architecture, we first analyze the efficiency of GPUs on various scientific and numerical applications. While the specific domains that these applications belong to vary widely, the set of applications used here, encompassing several classes of the Berkeley "dwarf" taxonomy [?], is representative of non-graphics applications executed on GPUs.
APPLICATION ANALYSIS
Ten benchmarks were analyzed. The source code for these applications is derived from a variety of sources, including the Nvidia CUDA software development kit, the GPGPU-SIM [1] benchmark suite, the Rodinia [2] benchmark suite, the Parboil benchmark suite, and the Nvidia CUDA Zone.
GPU UTILIZATION
We analyze the benchmarks' behavior on GPUs using the GPGPU-SIM simulator.
Figure 2.1 illustrates the performance of each of our benchmarks and the sources of underutilization. "Utilization" here is the percentage of the simulated architecture's theoretical peak performance actually achieved by each benchmark. Idle periods between kernel executions, when data was being transferred between the GPU and CPU, were not considered.
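For concreteness, the metric reduces to a simple ratio. The sketch below uses hypothetical field names; a real GPGPU-SIM run reports the underlying statistics under different names.

// Utilization as used above: achieved throughput as a percentage of the
// simulated GPU's theoretical peak. Field names here are hypothetical.
struct KernelStats {
    double achieved_gflops;  // throughput measured while kernels execute
    double peak_gflops;      // theoretical peak of the simulated GPU
};

// Idle time between kernels (CPU<->GPU transfers) is excluded from
// achieved_gflops, matching the methodology described above.
double utilization_pct(const KernelStats& s) {
    return 100.0 * s.achieved_gflops / s.peak_gflops;
}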
REDUCING MEMORY STALLS
There are a few different alternatives when trying to mitigate problems with off-chip memory latency. Large caches offer a dense, lower-power alternative to register contexts for storing the data required in future iterations of the program kernel. Even though modern GPUs have very large caches, these are often in the form of graphics-specific texture caches, and are not easily used for other applications. Further, many scientific computing benchmarks access data in a streaming manner: values that are loaded are located in contiguous, or fixed-offset, memory locations; computed results are also stored in contiguous locations and are rarely ever reused. This allows for creating a memory system that can easily predict what data is required and when. Some GPUs have small, fast shared memory structures, but they are generally software-managed and, as such, it is difficult to place data in them exactly when it is required.
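To make the streaming pattern concrete, consider a kernel like the one below (an illustrative example, not taken from the benchmark suites above): every address is a fixed offset from the previous iteration's address, so the address stream is predictable many iterations ahead, and stores are never re-read.

// Illustrative streaming kernel: addresses advance by a fixed stride each
// iteration, so a prefetcher can predict them well in advance.
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        // x[i] and y[i] each advance by sizeof(float) per iteration;
        // the result written to y[i] is never read again by this kernel.
        y[i] = a * x[i] + y[i];
    }
}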
Stride Prefetcher:
A conventional stride prefetcher is built around a "prefetch table": a table that stores the miss address of a load instruction, the confidence of prefetching, and the access stride. The program counter (PC) value of the load instruction is used as a unique identifier to index into the prefetch table.
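A minimal software model of this structure is sketched below. The table fields follow the description above; the saturating-confidence policy and the trigger threshold are illustrative assumptions rather than details of any particular design.

#include <cstdint>
#include <unordered_map>

// One prefetch-table entry, indexed by the PC of the load instruction.
struct PrefetchEntry {
    uint64_t last_miss_addr = 0;  // most recent miss address of this load
    int64_t  stride = 0;          // observed access stride
    int      confidence = 0;      // saturating confidence counter (0..3)
};

class StridePrefetcher {
    std::unordered_map<uint64_t, PrefetchEntry> table_;  // keyed by load PC
public:
    // Called on each cache miss; returns an address to prefetch, or 0 if
    // the entry has not yet established a stable stride.
    uint64_t on_miss(uint64_t pc, uint64_t addr) {
        PrefetchEntry& e = table_[pc];
        const int64_t stride = static_cast<int64_t>(addr - e.last_miss_addr);
        if (stride == e.stride) {
            if (e.confidence < 3) ++e.confidence;  // same stride seen again
        } else {
            e.confidence = 0;                      // stride changed: retrain
            e.stride = stride;
        }
        e.last_miss_addr = addr;
        return (e.confidence >= 2) ? addr + e.stride : 0;
    }
};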
Dynamic Degree Prefetcher:
Stride prefetchers often have a notion of degree associated with them, indicating how early data should be prefetched. In cyclic code, it is the difference between the current loop iteration number and the iteration number for which data is being prefetched. A traditional stride prefetcher uses a degree of one for all the entries in the prefetch table. With large loop bodies, degree-one prefetchers perform well, as the time required for prefetching data is hidden by the time taken to execute one iteration of the loop.
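The degree decision itself reduces to covering memory latency with loop-body execution time. The sketch below is an illustrative heuristic for choosing the prefetch distance dynamically, not the exact PEPSC policy.

#include <cstdint>

// Choose how many iterations ahead to prefetch so that data arrives before
// it is needed: degree ~= ceil(memory latency / loop-body execution time),
// with a floor of 1 (the traditional fixed-degree case).
uint64_t prefetch_target(uint64_t addr, int64_t stride,
                         int mem_latency_cycles, int loop_body_cycles) {
    int degree = (mem_latency_cycles + loop_body_cycles - 1) / loop_body_cycles;
    if (degree < 1) degree = 1;
    return addr + static_cast<int64_t>(degree) * stride;
}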