High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops region are counted as HPC computers. The term is most commonly associated with computing used for scientific research or computational science. A related term, high-performance technical computing (HPTC), generally refers to the engineering applications of cluster-based computing (such as computational fluid dynamics and the building and testing of virtual prototypes). Recently, HPC has also come to be applied to business uses of cluster-based supercomputers, such as data warehouses, line-of-business (LOB) applications, and transaction processing.
High-performance computing (HPC) is a term that arose after the term "supercomputing." HPC is sometimes used as a synonym for supercomputing, but in other contexts "supercomputer" refers to a more powerful subset of "high-performance computers," and "supercomputing" becomes a subset of "high-performance computing." In computing, hardware acceleration is the use of hardware to perform some function faster than is possible in software running on the general-purpose CPU. The hardware that performs the acceleration, when in a separate unit from the CPU, is referred to as a hardware accelerator, or often more specifically as a graphics accelerator, floating-point accelerator, etc. Those terms, however, are older and have been replaced with less descriptive terms such as video card or graphics card.
Unlike traditional supercomputers, HPC cluster systems usually do not use custom-designed hardware and are therefore much more affordable, while delivering comparable performance. Accelerator-based high-performance computing (HPC) resources are used by computational scientists and by the astronomy and astrophysics communities.
Computers with more than one processor unit within a compute node accessing a common memory have existed since the early days of the computer era. A few years ago it also became quite common to have two or more processor units in stock computers. Meanwhile, all major chip manufacturers are "gluing" multiple processor units together on a single piece of silicon, so that from the user's and the operating system's perspective a single processor chip behaves like multiple processors. The term "multicore processor" is now widely used to describe these modern processor chips, and the term "processor core" is used to describe a single "logical" processor. This technique is frequently called "chip multiprocessing" (CMP).
Furthermore, some chip manufacturers provide additional logic that allows multiple programs to run simultaneously or quasi-simultaneously on a single processor core by providing multiple sets of status registers to keep multiple instruction streams alive. This technique is called "chip multithreading" (CMT). If instructions from different threads are issued in a temporally interleaved fashion, it is called temporal multithreading; if the processor (core) is able to execute instructions of multiple threads in the same machine cycle, it is called simultaneous multithreading (SMT). A major reason for this approach is that a program running on a processor (core) frequently has to wait for data traveling from the memory chips across the board to the processor chip. During this time another program may well use the same hardware resources to do some work, thus exploiting the hardware potential of the processor chip more efficiently.
1.2 Performance Tuning Tools
What if the program has been parallelized using OpenMP and/or MPI but it still runs too slowly? What if adding resources (increasing the number of OpenMP threads and/or MPI processes) even slows the program down?
Parallelization always introduces synchronization and communication overhead. In addition, there may be leftover code regions that have not been parallelized and can easily become a performance bottleneck. As an example, if 50 percent of the runtime was originally spent in code regions that have been parallelized, the parallel code version can at best run twice as fast, no matter how many processors are employed. This is a consequence of Amdahl's Law.
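Stated as a formula (writing p for the fraction of the runtime that can be parallelized and N for the number of processors), Amdahl's Law bounds the achievable speedup:

\[ S(N) \;=\; \frac{1}{(1-p) + p/N}, \qquad \lim_{N \to \infty} S(N) \;=\; \frac{1}{1-p}. \]

For the example above, p = 0.5, so the speedup can never exceed 1/(1-0.5) = 2, however many processors are added.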
Performance tuning tools can be a big help in detecting performance problems. Typically the program is instrumented by populating it with measuring instruments (which may in turn have a negative impact on performance) to collect information about the program execution. The program is then executed with a reasonable, carefully selected set of input data so that it behaves representatively but still does not take too much time. The instruments may have to be calibrated, adjusted, and parameterized to collect an interesting set of performance data, which is then post-processed and presented to the user to provide deeper insight into the program's behavior and the causes of potential performance bottlenecks.
A major challenge for performance analysis is the amount of data which can easily be collected. Therefore it is important to decide which information is really of interest and how superfluous data can be filtered out.
But the information that these tools can deliver may give valuable insight into the program's behavior and help the programmer improve the parallel performance.
State of Affairs
In the past few years, a new class of HPC systems has emerged. These systems employ unconventional processor architectures, such as IBM's Cell processor and graphics processing units (GPUs), for heavy computations and use conventional central processing units (CPUs) mostly for non-compute-intensive tasks, such as I/O and communication. Prominent examples of such systems include the Los Alamos National Laboratory's Cell-based RoadRunner (ranked second on the November 2009 TOP500 list) and the Chinese National University of Defense Technology's ATI GPU-based Tianhe-1 cluster (ranked fifth on the same list).
Currently, there's only one large GPU-based cluster serving the US computational science community—namely, Lincoln, a TeraGrid resource available at NCSA. This will be augmented in the near future by Keeneland, a Georgia Institute of Technology system funded by the NSF Track 2D HPC acquisition program. On the more exotic front, the Novo-G cluster, based on Altera field-programmable gate arrays (FPGAs), is deployed at the University of Florida's NSF Center for High-Performance Reconfigurable Computing (CHREC). By all indications, this trend toward the use of unconventional processor architectures will continue, especially as new GPUs, such as Nvidia's Fermi, are introduced. The top eight systems on the November 2009 Green500 list of the world's most energy-efficient supercomputers are accelerator-based.
Despite hardware system availability, however, the computational science community is currently split between early adopters of accelerators and skeptics. The skeptics' main concern is that new computing technologies are introduced frequently, and domain scientists simply don't have time to chase after developments that might fade away quickly. In particular, researchers working with mature and large-scale codes are typically reluctant to work on the bleeding edge of computing technologies. From their perspective, the accelerator-based systems' long-term viability is a key question that prevents them from porting codes to these systems. Many such codes have been around much longer than the machines they were originally designed to run on. This continues to be possible because the codes were written using languages (C and Fortran) supported by a range of HPC systems.
With the introduction of application accelerators, new languages and programming models are emerging that eliminate the option to port code between "standard" and "non-standard" architectures. The community fears that these new architectures will result in the creation of many code branches that are not compatible or portable. Mature codes have also been extensively validated and trusted in the community; porting them to newly emerging accelerator architectures will require yet another round of validation. In contrast, early adopters argue that existing HPC resources are insufficient—at least for their applications—and they're willing to rewrite their codes to take advantage of the new systems' capabilities. They're concerned about (but willing to endure) the complexity of porting existing codes or rewriting them from scratch for the new architectures. They're also concerned about (but willing to deal with) the limitations and issues with programming and debugging tools for the accelerators.
Early adopters aren't overly concerned about code portability, because in their view, efforts such as OpenCL and the development of standard libraries (such as Magma, a matrix algebra library for GPU and multicore architectures) will eventually deliver on cross-platform portability. Many early adopters are still porting code kernels to a single accelerator, but a growing number of teams are starting to look beyond simple kernels and single accelerator chips.
Accelerators
3.1 Introduction
For many years, microprocessor single-thread performance increased at rates consistent with Moore's Law for transistors. In the 1970s-1990s the improvement was mostly obtained by increasing clock frequencies. Clock speeds are now improving only slowly, and microprocessor vendors are instead increasing the number of cores per chip to obtain improved performance. This approach does not increase single-thread performance at the rates customers have come to expect. Alternative technologies include:
• General Purpose Graphical Processing Units (GPGPUs)
• Field Programmable Gate Arrays (FPGAs) boards
• ClearSpeed’s floating-point boards
• IBM’s Cell processors
These have the potential to provide single-thread performance orders of magnitude faster than current "industry standard" microprocessors from Intel and AMD. Unfortunately, performance expectations cited by vendors and in the press are frequently unrealistic: theoretical peak rates are very high, but sustainable rates are much lower.
Many customers are also constrained by the energy required to power and cool today's computers. Some accelerator technologies require little power per Gflop/s of performance and are attractive for this reason alone. Other accelerators require much more power than can be provided by systems such as blades. Finally, the software development environment for many of the technologies ranges from cumbersome at best to nearly non-existent at worst. These accelerator devices contain a large number of processing cores as well as internal memory. They are most often used in conjunction with the CPUs of the node to accelerate certain "hot spots" of a computation, typically those requiring a very large number of arithmetic operations.
The HPC space is challenging since it is dominated by applications that use 64-bit floating-point calculations, and these frequently have little data reuse. HPCD personnel are also doing joint work with software tool vendors to help ensure their products work well in the HPC environment. This report gives an overview of accelerator technologies, the HPC application space, and hardware accelerators, along with recommendations on which technologies hold the most promise and speculations on the future of these technologies.
3.2 Accelerator background
Accelerators are computing components containing functional units, together with memory and control systems that can be easily added to computers to speed up portions of applications.
They can also be aggregated into groups for supporting acceleration of larger problem sizes. Each accelerator being investigated has many (but not necessarily all) of the following features.
• A lower clock frequency than CPUs
• Aggregate high performance is achieved through parallelism
• Needs lots of data reuse for good performance
• The fewer the bits, the better the performance
• Integer arithmetic is faster than 32-bit floating-point, which is faster than 64-bit floating-point
• The theoretical peak is difficult to determine
• Software tools are lacking
• Programming requires languages designed for the particular technology
3.3 High Performance Computing Considerations
There are many metrics that can be used to measure the benefit of accelerators. Some important ones to consider are listed below; a small worked example follows the list.
• Price/performance – the more costly the accelerator, the faster it must be to succeed.
• Computational density (want to increase Gflop/s per cubic meter) – accelerators can improve this significantly.
• Power (want to increase Gflop/s per watt) – some technologies require very little power, while others require so much that they cannot be used in low-power systems.
• Cluster-system Mean Time Between Failures (want to increase Gflop/s × MTBF) – if accelerators allow a reduction in node count, the MTBF may improve significantly.
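As an illustration of the power metric, take the ClearSpeed figures quoted later in this report (roughly 96 DP Gflop/s from a board drawing about 25 watts) purely as sample numbers:

\[ \frac{96\ \text{Gflop/s}}{25\ \text{W}} \;\approx\; 3.8\ \text{Gflop/s per watt}, \]

which matches the "approximately 4 double precision gigaflops per watt" claim made for those boards. The other metrics are computed analogously: Gflop/s divided by volume for computational density, and Gflop/s multiplied by MTBF for the reliability-weighted figure.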
Types of Accelerators in use
4.1 GPU
4.1.1 Introduction
A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a specialized microprocessor that offloads and accelerates 3D or 2D graphics rendering from the microprocessor. It is used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card.
A GPU (Graphics Processing Unit) is a processor attached to a graphics card and dedicated to calculating floating-point operations. A graphics accelerator incorporates custom microchips which contain special mathematical operations commonly used in graphics rendering; the efficiency of these microchips therefore determines the effectiveness of the graphics accelerator. They are mainly used for playing 3D games or high-end 3D rendering. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early 2D computer graphics include the BitBLT operation (combining several bitmap patterns using a RasterOp), usually performed in special hardware called a "blitter," and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also support 3D computer graphics and typically include digital video-related functions.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing model: the sequential part of the application runs on the CPU and the computationally intensive part is accelerated by the GPU. From the user's perspective, the application simply runs faster because it uses the high performance of the GPU to boost performance.
The IBM Professional Graphics Controller was one of the very first 2D/3D graphics accelerators available for the IBM PC.
As the processing power of GPUs has increased, so has their demand for electrical power. High-performance GPUs often consume more energy than current CPUs and therefore require a lot of cooling, so they are fine for a workstation but not for systems such as blades that are heavily constrained by cooling. However, floating-point calculations require much less power than graphics calculations, so a GPU running floating-point code might use only half the power of one running pure graphics code. Most GPUs achieve their best performance by operating on four-tuples, each element of which is a 32-bit floating-point number. These four components are packed together into a 128-bit word which is operated on as a group, so it is like a vector of length four, similar to the SSE2 extensions on x86 processors. The ATI R580 has 48 functional units, each of which can process one 4-tuple per cycle, and each of those can perform a MADD instruction. At a frequency of 650 MHz, this results in a rate of 0.65 GHz × 48 functional units × 4 per tuple × 2 flops per MADD = 250 Gflop/s. The recent NVIDIA G80 GPU takes a different approach since it includes 32-bit functional units instead of 128-bit ones. Each of the 128 scalar units runs at 1.35 GHz and can perform a single 32-bit floating-point MADD operation, so its theoretical peak is 1.35 GHz × 128 functional units × 2 flops per MADD = 345.6 Gflop/s. Unfortunately GPUs tend to have a small number of registers, so measured rates are frequently less than 10% of peak. GPUs do have very robust memory systems that are faster (but smaller) than those of CPUs. Maximum memory per GPU is about 1 GB and this memory bandwidth may exceed 40 GB/s.
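The two peak figures above follow the same back-of-the-envelope formula (a rough estimate, not a vendor specification):

\[ \text{peak Gflop/s} \;=\; \text{clock (GHz)} \times \text{number of functional units} \times \text{flops per unit per cycle}, \]

so 0.65 × 48 × (4 × 2) = 249.6 for the R580 and 1.35 × 128 × 2 = 345.6 for the G80. As noted, sustained rates are usually a small fraction of these values.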
Today, parallel GPUs have begun making computational inroads against the CPU, and a subfield of research, dubbed GPU computing or GPGPU (General-Purpose Computing on GPUs), has found its way into fields as diverse as oil exploration, scientific image processing, linear algebra [4], 3D reconstruction, and even stock options pricing. Nvidia's CUDA platform is the most widely adopted programming model for GPU computing, with OpenCL also being offered as an open standard.
4.1.2 GPU Classes
The GPUs of the most powerful class typically interface with the motherboard by means of an expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but their bandwidth is so limited that they are generally used only when a PCIe or AGP slot is not available.
A dedicated GPU is not necessarily removable, nor does it necessarily interface with the motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics cards have RAM that is dedicated to the card's use, not to the fact that most dedicated GPUs are removable. Dedicated GPUs for portable computers are most commonly interfaced through a non-standard and often proprietary slot due to size and weight constraints. Such ports may still be considered PCIe or AGP in terms of their logical host interface, even if they are not physically interchangeable with their counterparts. Technologies such as NVIDIA's SLI and ATI's CrossFire allow multiple dedicated GPUs to work together on a single output.
4.1.2.1 Integrated graphics solutions
Integrated graphics solutions, shared graphics solutions, or integrated graphics processors (IGPs) utilize a portion of a computer's system RAM rather than dedicated graphics memory. Computers with integrated graphics account for 90% of all PC shipments. These solutions are less costly to implement than dedicated graphics solutions, but are less capable. Historically, integrated solutions were often considered unfit to play 3D games or run graphically intensive programs but could run less intensive programs such as Adobe Flash. Examples of such IGPs would be offerings from SiS and VIA circa 2004. However, today's integrated solutions such as AMD's Radeon HD 3200 (AMD 780G chipset) and NVIDIA's GeForce 8200 (nForce 730a chipset) are more than capable of handling 2D graphics from Adobe Flash or low-stress 3D graphics. However, most integrated graphics still struggle with high-end video games. Chips like the NVIDIA GeForce 9400M in Apple's MacBook and MacBook Pro and AMD's Radeon HD 3300 (AMD 790GX) offer improved performance, but still lag behind dedicated graphics cards. Modern desktop motherboards often include an integrated graphics solution and have expansion slots available to add a dedicated graphics card later.
As a GPU is extremely memory intensive, an integrated solution may find itself competing with the CPU for the already relatively slow system RAM, since it has minimal or no dedicated video memory. System RAM bandwidth may range from 2 GB/s to 12.8 GB/s, whereas dedicated GPUs enjoy between 10 GB/s and over 100 GB/s of bandwidth, depending on the model. Older integrated graphics chipsets lacked hardware transform and lighting, but newer ones include it.
4.1.2.2 Hybrid solutions
This newer class of GPUs competes with integrated graphics in the low-end desktop and notebook markets. The most common implementations of this are ATI's HyperMemory and NVIDIA's TurboCache. Hybrid graphics cards are somewhat more expensive than integrated graphics, but much less expensive than dedicated graphics cards. These share memory with the system and have a small dedicated memory cache, to make up for the high latency of the system RAM. Technologies within PCI Express can make this possible. While these solutions are sometimes advertised as having as much as 768MB of RAM, this refers to how much can be shared with the system memory.
4.1.2.3 Stream Processing and General Purpose GPUs (GPGPU)
A newer concept is to use a general-purpose graphics processing unit as a modified form of stream processor. This concept turns the massive floating-point computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power, as opposed to being hard-wired solely to do graphical operations. In certain applications requiring massive vector operations, this can yield several orders of magnitude higher performance than a conventional CPU. The two largest discrete GPU designers, ATI and NVIDIA, are beginning to pursue this approach with an array of applications. Both NVIDIA and ATI have teamed with Stanford University to create a GPU-based client for the Folding@Home distributed computing project, for protein folding calculations. In certain circumstances the GPU calculates forty times faster than the conventional CPUs traditionally used by such applications.
Recently NVIDIA began releasing cards supporting an API extension to the C programming language called CUDA ("Compute Unified Device Architecture"), which allows specified functions from a normal C program to run on the GPU's stream processors. This makes C programs capable of taking advantage of a GPU's ability to operate on large matrices in parallel, while still making use of the CPU when appropriate. CUDA is also the first API to allow CPU-based applications to directly access the resources of a GPU for more general-purpose computing without the limitations of using a graphics API.
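As a minimal sketch of what this looks like in practice (an illustrative example only, not code from this report; the kernel name vec_add and all sizes are made up), a CUDA program marks the function to run on the GPU with __global__ and launches it from ordinary host C code:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: runs on the GPU's stream processors, one thread per element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                 // about one million elements
    size_t bytes = n * sizeof(float);

    // Host (CPU) data
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device (GPU) data and explicit host-to-device copies
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch: the compute-intensive part runs on the GPU ...
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    // ... while the sequential part (setup, I/O, checking) stays on the CPU.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);          // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The <<<blocks, threads>>> launch syntax and the explicit host-to-device copies are the main additions CUDA makes on top of standard C.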
Since 2005 there has been interest in using the performance offered by GPUs for evolutionary computation in general, and for accelerating the fitness evaluation in genetic programming in particular. Most approaches compile linear or tree programs on the host PC and transfer the executable to the GPU to be run. Typically the performance advantage is only obtained by running the single active program simultaneously on many example problems in parallel, using the GPU's SIMD architecture. However, substantial acceleration can also be obtained by not compiling the programs and instead transferring them to the GPU to be interpreted there.
4.1.3 Hardware
There are two dominant producers of high-performance GPU chips: NVIDIA and ATI. ATI was purchased by AMD in November 2006. Until recently, both GPU companies were very secretive about the internals of their processors. However, both are now opening up their architectures to encourage third-party vendors to produce better-performing products. ATI has its Close To Metal (CTM) API. This is claimed to be an Instruction Set Architecture (ISA) for ATI GPUs, so that software vendors can develop code using the CTM instructions instead of writing everything in graphics languages. This will make software development easier and will lead to improved performance. NVIDIA is taking a different approach in that they have announced their CUDA program for their latest-generation GPUs. CUDA started with the C language, added some new extensions, and produced a compiler for the language. Software vendors will write code for CUDA instead of graphics code to achieve improved performance. It remains to be seen which approach is best.
AMD has also announced the Fusion program which will place CPU and GPU cores on a single chip by 2009. An open question is whether the GPU component on the Fusion chips will be performance competitive with ATI’s high power GPUs.
4.1.4 Software
Most GPU programs are written in a shader language such as GLSL, the OpenGL shading language (Linux, Windows), or HLSL (Windows). These languages are very different from C, Fortran, or other common high-level languages usually used by HPC scientists. Hence arose the need to explore other languages that would be more acceptable to HPC users.
The most popular alternatives to shader languages are stream languages – so named because they operate on streams (vectors of arbitrary length) of data. These are well suited to parallelism, and hence to GPUs, since each element in a stream can be operated on by a different functional unit. The first two stream languages for GPUs were BrookGPU and Sh (now named RapidMind). BrookGPU is a language that originated in the Stanford University Graphics Lab to provide a general-purpose language for GPUs. This language contains extensions to C that can be used to operate on the four-tuples with single instructions. This effort is currently in maintenance mode because its creator has left Stanford, so our team is not pursuing it. However, in October 2006 PeakStream announced their successor to BrookGPU. Although they claim their language is really C++ with new classes, it looks like a new language. They have created some 32-bit floating-point mathematical routines and we are in the process of evaluating them. PeakStream also is working closely with AMD/ATI, but not with NVIDIA.
The other language we investigated for programming GPUs is RapidMind. This is an effort that started at the University of Waterloo and led to the founding of the company RapidMind to productize the language and compilers. It is a language embedded in C++ programs that allows GPUs to be programmed without directly writing in a shader language. While this language is rooted in graphics programming, it is also a general-purpose language that can be used for other technical applications. Also, the user does not have to directly define the data passing between the CPU and GPU, as the RapidMind compiler takes care of setting up and handling this communication. Since this language was the only viable GPU language suitable for our market, the authors began a series of technical exchanges with RapidMind personnel. RapidMind has also simplified the syntax of their language to make it easier to use.
4.2 FPGAs
4.2.1 Introduction
Field Programmable Gate Arrays (FPGAs) have a long history in embedded processing and specialized computing. These areas include DSP, ASIC prototyping, medical imaging, and other specialized compute-intensive areas.
An important differentiator between FPGAs and other accelerators is that they are programmable. You can program them for one algorithm and then reprogram them to do a different one. This reprogramming step may take several milliseconds, so to be most effective it needs to be done in anticipation of the next algorithm required. FPGA chips seem primitive compared to standard CPUs, since some things that are basic on standard processors require a lot of effort on FPGAs. For example, CPUs have functional units that perform 64-bit floating-point multiplication, whereas FPGAs have primitive low-bit multiplier units that must be pieced together to perform a 64-bit floating-point multiplication. Also, FPGAs are not designed to hold a large number of data items and instructions, so users have to consider exactly how much code will be sent to the FPGA. Thousands of lines, for example, would exceed the capacity of most FPGAs.
Compared to modern CPUs, FPGAs run at very modest speeds – on the order of 200-600 MHz. This speed is dependent on the overall capability of the device and the complexity of the design being targeted for it. The key to gaining performance from an FPGA lies in the ability to highly pipeline the solution and having multiple pipelines active concurrently.
Running code on FPGAs is cumbersome as it involves some steps that are not necessary for CPUs. Assume an application is written in C/C++. Steps include:
• Profile to identify code to run on FPGA
• Modify code to use an FPGA C language (such as Handel-C, Mitrion-C, etc.)
• Compile this into a hardware description language (VHDL or Verilog)
• Perform FPGA place-and-route and produce an FPGA "bitfile"
• Download bitfile to FPGA
• Compile complete application and run on host processor and FPGA
For example, the latest-generation and largest Xilinx Virtex-5 chip has 192 25×18-bit primitive multipliers. It takes five of these to perform a 64-bit floating-point multiply, and they can run at speeds up to 500 MHz, so an upper limit on double-precision multiplication is ⌊192/5⌋ × 0.5 GHz ≈ 19 Gflop/s. A matrix-matrix multiplication includes multiplications and additions, and the highest claim seen for a complete DGEMM is about 4 Gflop/s, although numbers as high as 8 Gflop/s have been reported for data local to the FPGA. Cray XD-1 results using an FPGA that is about half the size of current FPGAs show DGEMM and double-precision 1-D FFTs performing at less than 2 Gflop/s. Single-precision routines should run several times faster. FPGAs are very good at calculations on small integers and floating-point numbers with a small number of bits. The manager of one university reconfigurable computing site noted: "If FPGAs represent Superman, then double precision calculations are kryptonite."
One HPC discipline that is enamored with FPGAs is astronomy. The current largest very long baseline interferometry system, LOFAR, has at its heart an IBM Blue Gene system with a peak of 34 Tflop/s. Most of the processing on the Blue Gene systems uses 32-bit floating-point calculations. The next-generation system, SKA, to be delivered in the late 2010s, will need processing power in the 10-100 Petaflop/s range. The most time-consuming algorithms do not need 32-bit computations: 4-bit complex data and calculations are sufficient. Therefore many of these astronomers are experimenting with FPGAs, since three FPGA chips can produce performance that exceeds the equivalent of a Tflop/s.

FPGAs belong to a class of products known as Field Programmable Logic Devices (FPLDs). The traditional and dominant type of FPLD is the FPGA. Recently other types of FPLD have emerged, including FPOAs (Field Programmable Object Arrays) and FPMCs (Field Programmable MultiCores).
4.2.2 Hardware
The dominant FPGA chip vendors are Xilinx and Altera. Both companies produce many different types of FPGAs. Some FPGAs are designed to perform integer calculations while others are designed for floating-point calculations. Each type comes in many different sizes, so most HPC users would be interested in the largest (but most expensive) FPGA that is optimized for floating-point calculations. Other chip companies include the startup Velogix (FPGAs) and MathStar (FPOAs).
4.2.3 Software
Once again, the software environment is not what the HPC community is used to. There is a spectrum of FPGA software development tools. At one end is the popular hardware design language Verilog, used by hardware designers. This gives very good performance, but the language is very different from what HPC researchers expect. Some vendors have solutions that are much closer to conventional C++. The conventional wisdom is that the closer to standard C the solution is, the worse the resulting application performs. The reason is that to make the best use of FPGAs, users should define exactly how many bits they would like to use for each variable and calculation. The smaller the number of bits, the less space is required on the die, so more can be contained on a chip, and hence the better the performance. One company used by multiple HPC vendors is Celoxica. Its Handel-C language allows users to define exactly the data size for all variables and calculations. The HPC accelerator team has begun implementing HPC algorithms in Handel-C to gauge its ease of use and performance.
Another language that has potential is Mitrionics' Mitrion-C programming language for FPGAs. There are also other FPGA C language variants such as Impulse C and Dime-C.
4.3 ClearSpeed's Floating-Point Accelerators
ClearSpeed Technology produces a board that is designed to accelerate floating-point calculations. This board plugs into a PCI-X slot, has a clock cycle of 500 MHz, and contains 96 floating-point functional units that can each perform a double precision multiply-add in one cycle. Therefore their board has a theoretical peak of 96 Gflop/s. In late 2006 ClearSpeed previewed boards that are connected to systems by a PCI-e slot. This will help performance get closer to their peak rates.
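The quoted theoretical peak follows directly from these numbers, since a multiply-add counts as two floating-point operations:

\[ 96\ \text{functional units} \times 2\ \text{flops per cycle} \times 0.5\ \text{GHz} \;=\; 96\ \text{Gflop/s}. \]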
ClearSpeed has a beta release of a software development kit that includes a compiler. There are two ways to use the ClearSpeed boards. One is to make a call to a routine from their math library; this library contains an optimized version of the matrix-matrix multiply subprogram DGEMM. The other way to access the boards is to write routines using the ClearSpeed accelerator language Cn. See the "Investigations Finding" for more performance information. The first accelerator-enhanced system to make the TOP500 list is the TSUBAME grid cluster in Tokyo. It is entry 9 on the November 2006 list and derives about a quarter of its performance from ClearSpeed boards and the rest from Opteron processors.
True to its focus on energy efficient HPC, ClearSpeed made power consumption the priority in the product refresh. In general, the three new offerings deliver slightly better raw performance than the previous generation, but much better performance per watt and double the memory capacity. With the products announced today, ClearSpeed is promising approximately 4 double precision (DP) gigaflops per watt. This is significantly higher than what the latest GPU, Cell or FPGA accelerators are able to offer.
The key to the new ClearSpeed products is the 90nm CSX700 processor, which replaces the 130nm CSX600. The CSX700 is a much more powerful chip than its predecessor, with twice the number of processing elements (192), two memory controllers, and an integrated PCIe x16 controller. The new processor delivers 96 DP gigaflops, almost four times that of the CSX600, and sips just 12 watts under maximum load.
The more capable processor allowed the company to replace the dual-CSX600 configuration on the previous generation Advance boards with a single CSX700. ClearSpeed also saved a bit on overall power and cost by ditching the off-chip FPGA that acted as the PCI controller, a function that is now integrated on the chip. Each board has 2 GB of DDR2 memory — again, twice as much as their predecessors. The result is that the new e710 and e720 Advance boards each achieve 96 DP gigaflops and draw just 25 watts of power. By comparison, the new NVIDIA 4-GPU S1070 board due out this August will achieve about 400 DP gigaflops at 700 watts, and AMD says its new FireStream 9250 will deliver over 200 DP gigaflops from 150 watts.
The e710 and e720 are functionally identical; they just have different form factors. The e710 is a low-profile, half-length PCIe board that slides into any standard PCIe-equipped server, while the e720 is a type 2 mezzanine card that fits inside an HP blade. The accelerator talks to the host at 2 GB/s over PCIe x8. No extra power or cooling is required, which makes the boards easy to add to existing setups. No fans mean no moving parts, so there are no mechanical breakdowns to worry about. MSRP for the boards is $3,570, with an expected street price of under $3,000 in volume.
To take reliability to the next level, all memory, both on-board and on-chip, is error checked and corrected (ECC), which means protection from soft errors. The ECC support is a big deal, since the current GPU products for HPC offered by NVIDIA and AMD currently don’t support this; application code that can’t survive soft errors must go elsewhere to compute.
Since the new CATS-700 1U server makes use of the upgraded e710 boards, ClearSpeed has managed to deliver slightly more performance and almost double the energy efficiency compared to the CATS-600 box that was demonstrated at SC07 in Reno. Using 12 of the new Advance boards, the CATS-700 provides 1.152 DP teraflops and 24 GB of memory. A single box uses just 400 watts, which is about the same as a typical low-power x86 1U server. A half-rack of 18 CATS-700 units hooked up to a half-rack of 18 quad-core x86 servers will yield over 22 peak teraflops, more than enough raw performance to earn a spot on the TOP500 list.
ClearSpeed’s main competitors are NVIDIA and AMD, both of whom have introduced 64-bit floating point support in their GPGPU product lines. With all three companies now in the double precision business, each is jockeying for position in a rapidly developing HPC accelerator market. ClearSpeed guessed correctly the latest generation GPUs would deliver only a fraction of their total single precision performance as double precision, which allowed the company to maintain a significant performance/watt advantage for DP math. And since neither GPU computing vendor offers ECC memory, only ClearSpeed can claim error correction and soft error protection.
"It looks like we're going to be in really good position with our 64-bit performance and price-performance and really out in front in terms of performance per watt," said Simon McIntosh-Smith, ClearSpeed's VP of Customer Applications.
ClearSpeed’s strategy is to claim the HPC high ground against its GPU accelerator competition. The company says its emphasis on reliability and its focus on high performance computing gives its product the edge for HPC acceleration. With high double precision energy efficiency, system-wide protection from memory errors, and high MTBF, the company believes it offers a much more practical architecture than GPUs for highly scaled-out systems, especially as the petascale level is reached.
ClearSpeed's ability to provide the most energy-efficient platform for HPC arithmetic is what sets it apart from the pack. Even in the mid-range market, escalating oil prices and global climate concerns are causing everyone to rethink their HPC datacenter power budgets. At this point, the world's energy and climate problems might turn out to be ClearSpeed's best advantage.
4.4 IBM Cell Processors
4.4.1 Introduction
Cell is a microprocessor architecture jointly developed by Sony, Sony Computer Entertainment, IBM, and Toshiba, an alliance known as "STI". Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. It was initially designed for the PlayStation 3. Although the Cell was not on the original list of technologies to evaluate, it has become the mind-share leader among acceleration technologies. The QS22, based on the PowerXCell 8i processor, is used in the IBM Roadrunner supercomputer.
The Cell Broadband Engine (or Cell, as it is more commonly known) is a microprocessor designed to bridge the gap between conventional desktop processors (such as the Athlon 64 and Core 2 families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics processors (GPUs). The longer name indicates its intended use, namely as a component in current and future digital distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as computer entertainment systems for the HDTV era. Additionally, the processor may be suited to digital imaging systems (medical, scientific, etc.) as well as physical simulation (e.g., scientific and structural engineering modeling).