Abstract:
The Dark Silicon Age kicked off with the transition to multicore and will be characterized by a wild chase for seemingly ever-more insane architectural designs. At the heart of this transformation is the Utilization Wall, which states that, with each new process generation, the percentage of transistors that a chip can switch at full frequency is dropping exponentially due to power constraints. This has led to increasingly large fractions of a chip's silicon area that must remain passive, or dark.
Since Dark Silicon is an exponentially-worsening
phenomenon, getting worse at the same rate that
Moore's Law is ostensibly making process technology better, we need to seek out fundamentally new approaches to designing processors for the Dark Silicon Age.
Simply tweaking existing designs is not enough. Our research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores. C-cores are a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10× or more. To examine the utility of c-cores, we are developing GreenDroid, a multicore chip that targets the Android mobile software stack. Our mobile application processor prototype targets a 32-nm process and is composed of hundreds of automatically generated, specialized, patchable c-cores. These cores target key Android hotspots, including the kernel. Our preliminary results suggest that we can attain up to 11× improvement in energy efficiency using a modest amount of silicon.
I. Introduction
Over the last five years, the phenomenon known as Dark Silicon has emerged as the most fundamental factor constraining our ability to exploit the exponentially increasing resources that Moore's Law provides. Dark Silicon refers to the exponentially increasing fraction of a chip's transistors that must remain passive, or "dark," in order to stay within the chip's power budget. Although Dark Silicon has shaped processor design since the cancellation of the Pentium 4, it was not well understood why Dark Silicon was happening, nor how bad the problem would get. In this paper, we begin by identifying the source of the Dark Silicon problem, and we characterize how bad the problem will get. (In short, it will be very bad; exponentially bad, in fact.) We continue the paper by describing
our approach, called Conservation Cores, or c-cores [3, 5], which is a way to take Dark Silicon and use it to make computation much more energy efficient, effectively using Dark Silicon to combat the Utilization Wall. Our approach is to use Dark Silicon to build a large collection of specialized cores, each of which can save 11× more energy for targeted code compared to an energy-efficient general-purpose processor. We demonstrate the Conservation Cores concept by applying the technique to the Android mobile software stack in order to build a mobile application processor that runs applications with a fraction of the energy consumption. We also examine the key scalability properties that allow Conservation Cores to target much broader bodies of code than today's custom-built accelerators.
II. Origins of Dark Silicon
To understand the Dark Silicon phenomenon better, we introduce the concept of the Utilization Wall:
Utilization Wall: With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints.
In this section, we will show three sources of evidence that we've hit the Utilization Wall [3], drawing from 1) CMOS scaling theory, 2) experiments performed in our lab, and 3) observations in the wild.
A. Scaling Theory
Moore Scaling: The most elementary CMOS scaling theory is derived directly from Moore's Law. If we examine two process generations, with feature widths of, say, 65 nm and 32 nm, it is useful to employ a value S, the scaling factor, which is the ratio of the feature widths of the two process generations; in this case, S = 65/32 ≈ 2. For typical process shrinks, S = 1.4×. From elementary scaling theory, we know that the number of available transistors scales as S², or 2× per process generation. Prior to 2005, the number of cores in early multicore processors more or less matched the availability of transistors, growing by 2× per process generation. For instance, the MIT Raw Processor had 16 cores in 180-nm, while the Tilera TILE64 version of the chip had 64 cores in 90-nm, resulting in 4× as many cores for a scaling factor of 2×. More recently, however, the rate has slowed to just S, or 1.4×, for reasons that we shall see shortly.
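As a rough back-of-the-envelope check, the scaling arithmetic above can be written out in a few lines of Python; the node widths and core counts are the ones quoted in the text, and everything else is simple arithmetic:

# Sketch of the elementary Moore scaling arithmetic above (illustrative only).
old_node_nm, new_node_nm = 65.0, 32.0
S = old_node_nm / new_node_nm          # scaling factor, ~2x across this interval
transistor_growth = S ** 2             # available transistors scale as S^2, ~4x
print(S, transistor_growth)            # ~2.0, ~4.1

# Early multicore tracked transistor count: Raw (180 nm, 16 cores) vs.
# TILE64 (90 nm, 64 cores) gives 4x the cores for a scaling factor of 2x.
print(64 / 16)                         # 4.0, i.e., S^2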
Dennardian Scaling: However, the computing capabilities of silicon are not summarized simply by the number of transistors that we can integrate into a chip. To more fully understand the picture, we also need to know how the properties of transistors change as they are scaled down. To understand this better, we turn to Robert Dennard, who, besides being the inventor of DRAM, wrote a seminal 1974 paper which set down the rules of transistor scaling [1]. Dennard's paper says that while transistor count scales by S², the native frequency of those transistors improves by S, resulting in a net S³ improvement in the computational potential of a fixed-area silicon die. Thus, for typical scaling factors of 1.4×, we can expect a factor of 2.8× improvement in compute capabilities per process generation.
However, within this rosy picture lies a potential problem: if transistor energy efficiency does not also scale as S³, we will end up with chips that have exponentially rising energy consumption, because we are switching S³ more transistors per unit time. Fortunately, Dennard outlined a solution to this exponential problem. First, the switching capacitance of transistors drops by a factor of S with scaling, and if we also scale the supply voltage, Vdd, by S, then we reduce the energy consumption by an additional S². As a result, the energy consumption of a transistor transition drops by S³, exactly matching the improvement in transistor transitions per unit time. In short, with standard Vdd scaling, we were able to have our transistors AND switch them at full speed.
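The Dennardian bookkeeping can be made concrete with a small sketch. Treating per-transistor dynamic power as proportional to C·Vdd²·f and applying the classical scaling rules gives the following (an illustration of the reasoning above, not a device model):

# Classical Dennardian bookkeeping per process generation (illustrative).
S = 1.4                          # typical scaling factor

transistors = S ** 2             # S^2 more transistors in the same area
frequency   = S                  # each transistor switches S times faster
capacitance = 1 / S              # switching capacitance drops by S
vdd_squared = 1 / S ** 2         # Vdd scales by 1/S, so Vdd^2 drops by S^2

compute    = transistors * frequency                             # S^3 = ~2.8x
chip_power = transistors * capacitance * vdd_squared * frequency
print(compute, chip_power)       # ~2.8x more compute at ~1.0x (constant) power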
Post-Dennardian Scaling: Starting in 2005, Dennardian scaling started to break down. The root of the problem was that scaling Vdd requires a commensurate reduction in Vt, the threshold voltage of the transistor, in order to maintain transistor performance. Unfortunately, Vt reduction causes leakage to increase exponentially at a rate determined by the process's subthreshold slope, typically 90 mV per decade; e.g., a 10× increase in leakage for every 90 mV reduction in threshold voltage. At this point in time, leakage energy became so large that it could not reasonably be increased any further. As a result, Vt values could not be scaled, and therefore neither could Vdd. The end result is that we have lost Vdd scaling as an effective way to offset the energy cost of the increase in the computing potential of the underlying silicon. As a result, with each process generation, we gain only S = 1.4× improvement in energy efficiency, which means that, under fixed power budgets, our utilization of the silicon will drop by S³/S = S² = 2× per process generation. This is what we mean by the Utilization Wall. Exponentially growing numbers of transistors must be left underclocked or switched off entirely to stay within the power budget, resulting in Dim or Dark Silicon.
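Repeating the same bookkeeping without Vdd scaling shows where the Utilization Wall comes from (again an illustrative sketch, using the same S = 1.4 and the 90 mV/decade subthreshold slope quoted above):

# Post-Dennardian bookkeeping: Vdd (and hence Vt) can no longer be scaled.
S = 1.4

transistors   = S ** 2           # still S^2 more transistors
frequency     = S                # transistors are still S faster
energy_per_op = 1 / S            # only capacitance improves; Vdd^2 stays flat

compute_potential = transistors * frequency        # S^3 = ~2.8x
efficiency_gain   = 1 / energy_per_op              # only S = ~1.4x
print(efficiency_gain / compute_potential)         # ~0.5: utilization halves (S^2)

# Why Vt is stuck: leakage grows ~10x per 90 mV of Vt reduction
# (the 90 mV/decade subthreshold slope quoted above).
delta_vt_mv = 90
print(10 ** (delta_vt_mv / 90))                    # 10x more leakage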
B. Experiments in Our Lab
To confirm the Utilization Wall, we performed a series of experiments in our lab using a TSMC process and a standard Synopsys IC Compiler flow. Using 90-nm and 45-nm technology files, we synthesized two 40 mm² chips filled with ALUs: 32-bit adders sandwiched between two flip-flops. Running the 90-nm chip at the native operating frequency of these ALUs, we found that only 5% of the chip could be run at full frequency within a 3-W power budget typical of mobile devices. In 45-nm, this fraction dropped to 1.8%, a factor of 2.8×. Using ITRS projections, a 32-nm chip would drop to 0.9%. We obtained similar results for desktop-like platforms with 200 mm² of area and an 80-W power budget.
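A back-of-the-envelope version of this experiment can be written down directly. The ALU counts and per-ALU power below are invented placeholders (the real numbers come from the synthesized TSMC designs), but they show how the utilization fraction is computed and why it roughly halves per generation:

# Hypothetical utilization estimate for a chip packed with ALUs.
# The ALU counts and per-ALU power are placeholders, not the measured values.
POWER_BUDGET_W = 3.0                       # mobile power budget from the text

def utilization(num_alus, power_per_alu_mw):
    full_power_w = num_alus * power_per_alu_mw / 1000.0
    return min(1.0, POWER_BUDGET_W / full_power_w)

# Going from 90 nm to 45 nm: ~4x the ALUs fit, each switching event costs
# roughly half the energy, so full-chip power ~doubles and utilization ~halves.
print(utilization(num_alus=20_000, power_per_alu_mw=3.0))   # 0.05  (~5%)
print(utilization(num_alus=80_000, power_per_alu_mw=1.5))   # 0.025 (~2.5%)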
These numbers often seem suspiciously low; after all, 90-nm designs were only just beginning to experience power issues. The explanation is that RAMs typically have 1/10 the utilization per unit area compared to datapath logic. However, the point is not so much what the exact percentage is for any process node, but rather that once the Utilization Wall starts to become a problem, it becomes exponentially worse from then on. This exponential worsening means that the onset of the problem is very quick, and is in part responsible for why industry was taken by surprise by the power problem in 2005.
C. Industrial Designs as Evidence of the Utilization Wall
The Utilization Wall is also written all over the commercial endeavors of many microprocessor companies. One salient example of a trend that reflects the Utilization Wall is the flat frequency curve of processors from 2005 onward. The underlying transistors have in fact gotten much faster, but frequencies have been held flat. Another example is the emergence of Intel's and AMD's turbo-boost features, which allow a single core to run faster if the other cores are not in use. We are also observing an increased fraction of chips dedicated to lower-frequency and lower-activity-factor logic such as L3 cache and so-called uncore logic (i.e., memory controllers and other support logic). The industrial switch to multicore is also a consequence
of the Utilization Wall. Ironically, multicore itself is not a direct solution to the Utilization Wall problem. Originally,
when multicore was proposed as a new direction, it was postulated that the number of cores would double with each process generation, increasing with transistor count.
However, this is in violation of the Utilization Wall, which says that computing capabilities can only increase at the same rate as energy efficiency improves, i.e., at a rate of S.
Looking at Intel 65-W desktop processors, with two cores in 65-nm and four cores in 32-nm, we can compute S (= 2×), the increase in core count (= 2×), and the increase in frequency (roughly constant at ~3 GHz), and see that scaling has occurred consistent with the Utilization Wall, and not with the earlier predictions.
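Put as arithmetic, the 65-nm to 32-nm comparison looks like this (core counts and frequencies are those quoted above; the transistor budget is simply S²):

# Intel 65-W desktop parts, 65 nm vs. 32 nm (numbers quoted in the text).
S = 65 / 32                          # ~2x over this interval
transistor_budget_growth = S ** 2    # ~4x more transistors available
core_growth = 4 / 2                  # cores went from 2 to 4: only 2x
freq_growth = 1.0                    # frequency held roughly constant at ~3 GHz

delivered_compute_growth = core_growth * freq_growth    # 2x, i.e., ~S
print(transistor_budget_growth, delivered_compute_growth)
# Compute grew only as fast as energy efficiency (S), not as fast as the
# transistor budget (S^2) -- exactly what the Utilization Wall predicts.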
One interesting observation is that the Utilization Wall implies a spectrum of other design points that trade off processor frequency against core count, with the extreme end being to increase frequency instead of core count with each process generation.
This would result in, for the previous example, two-core 32-nm processors running at ~6 GHz. Conventional wisdom says that this higher-frequency design would have better uniprocessor performance and be preferable because it applies to all computations, not just parallel
computations. The jury is still out on this. However, for throughput-oriented computations, the higher frequency
design is still worse. The reason is that the cost of a cache miss to DRAM, as measured in ALU ops lost, is lower for lower-clocked multicore chips, so in the face of cache
misses and given sufficient throughput, higher core count is more performant than higher frequency.
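A toy throughput model makes the argument concrete. The miss rate, DRAM latency, and the two configurations below are illustrative assumptions rather than measurements; the point is only that the higher-frequency design loses more ALU ops on every miss:

# Toy throughput model under DRAM misses (all parameters are illustrative).
def throughput_gips(cores, freq_ghz, miss_rate=0.01, dram_latency_ns=100.0):
    # Average time per instruction per core: one cycle plus the expected
    # stall from misses (assumes enough parallel work to keep every core busy).
    time_per_instr_ns = 1.0 / freq_ghz + miss_rate * dram_latency_ns
    return cores / time_per_instr_ns          # instructions per ns = GIPS

print(throughput_gips(cores=4, freq_ghz=3.0))   # ~3.0 GIPS
print(throughput_gips(cores=2, freq_ghz=6.0))   # ~1.7 GIPS
# Each miss costs the 6-GHz part twice as many potential ALU ops, so the
# lower-clocked, higher-core-count part wins on throughput-oriented code.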
III. Conservation Cores
Now that we know that Dark Silicon is an inevitable
and exponentially worsening problem, what do we do with Dark Silicon? Our group has developed a set of techniques that allow us to leverage Dark Silicon to fight the Utilization Wall. For our research, we draw on two insights. First, power is now more expensive than area. Thus, if we can find architectural ways to trade area for power, this is a good architectural trade-off. In fact, area is a resource that is becoming exponentially cheaper with each process generation, while power efficiency is something that requires massive engineering effort and offers diminishing returns with conventional optimization approaches. The second insight is that specialized logic has been shown to be a promising way to improve energy efficiency by 10-1000×. As a result of these insights, we have developed an approach that fills Dark Silicon with specialized energy-saving coprocessors that save energy on commonly executed applications. The idea is that you only turn on the coprocessors as you need them, and execution jumps from core to core according to the needs of the computation. The rest of the cores are power-gated. As a result of the specialized coprocessors, we execute the hotspots of the computation with vastly better energy efficiency. In effect, we are recouping the S² energy efficiency lost by the lack of Vdd scaling by using Dark Silicon to exploit specialization.
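Conceptually, the execution model can be sketched as a runtime that powers on only the core it currently needs; the class and method names below are hypothetical, purely to illustrate the hand-off between the host CPU and the c-cores:

# Hypothetical sketch of hot-region dispatch between a host CPU and c-cores.
# The class and method names are invented for illustration only.
class Chip:
    def __init__(self, host_cpu, ccores):
        self.host_cpu = host_cpu
        self.ccores = ccores              # map: hot-region id -> c-core

    def run(self, region, args):
        ccore = self.ccores.get(region.id)
        if ccore is not None:
            ccore.power_on()              # wake only the c-core we need
            result = ccore.execute(args)
            ccore.power_gate()            # return it to dark silicon
            return result
        return self.host_cpu.execute(region, args)   # cold code runs on the CPU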
A. Related Work: Accelerators
Accelerators are another class of specialized core that has been used to improve energy efficiency and has found
widespread use in smartphones and other systems. Conservation Cores overcome some of the key limitations of accelerators, which include:
• Speedup Fixation: Accelerators fixate on speedup of target code, while energy savings is a secondary goal. In contrast, Conservation Cores target energy savings as their primary goal; performance is a secondary goal. With Dark Silicon, energy efficiency eclipses performance as a concern. Since, as we shall see, attaining speedup versus a processor is a fundamentally harder problem than attaining energy savings, it makes sense to re-prioritize on saving energy.
• Regular Computations: Accelerators generally rely upon exploitation of program structure for improvements in performance and energy efficiency. We refer to the code targeted by accelerators as being regular, i.e., possessing properties that make it relatively amenable to parallelization. These properties include moderate or high levels of parallelism, predictable memory accesses and branch directions, and small numbers of lines of code. Even with this structure, accelerators tend to require human guidance, such as #pragmas, or manual transformation, in order to attain success.
• Parallelization Required: Because the transformations required to generate accelerators [2] generally correspond to the same transformations that parallelizing compilers perform (e.g., pointer analysis and deep code transformations), accelerator generation is seldom automated or scalable, inheriting the very same problems that have inhibited widespread industrial use of parallelizing compilers. Instead, accelerator creation tends to be successful only when multi-man-year efforts are applied, or in cases where the underlying algorithm has been explicitly designed for hardware.
• Static Code Bases: Accelerators tend to target relatively static code bases that do not evolve. In many cases, the evolution of the target code base is intentionally limited through the use of standards (e.g., JPEG) or internal specification documents.
Accelerators are thus limited in their applicability. Amdahl's Law tells us that the benefits attainable by an optimization are limited by the fraction of the workload the optimization targets. Thus, in order to get widespread energy savings, we need to broaden the applicability of coprocessors to all code, including code that changes frequently, code that is irregular, and code that is not parallelizable. This is the goal of Conservation Cores.
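The Amdahl-style bound for energy is easy to write down: if a fraction f of execution runs on c-cores that are k× more energy efficient, the overall savings are 1 / ((1 - f) + f/k). The coverage and efficiency values below are illustrative, not measured results:

# Amdahl-style limit on whole-workload energy savings (illustrative numbers).
def overall_savings(coverage, ccore_efficiency):
    # coverage: fraction of execution covered by c-cores
    # ccore_efficiency: energy improvement (x) on the covered code
    return 1.0 / ((1.0 - coverage) + coverage / ccore_efficiency)

print(overall_savings(0.50, 10.0))   # ~1.8x -- low coverage caps the benefit
print(overall_savings(0.95, 10.0))   # ~6.9x -- broad coverage is what pays off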
B. Conservation Core Architecture
Conservation Cores, or c-cores, are a class of specialized coprocessors that target the reduction of energy across all code, including irregular code. C-cores are always paired with an energy-efficient general-purpose host CPU, and perform all of their memory operations through the same L1 data cache as the host core. Frequently executed hot code regions are implemented using the c-cores, while the cold code regions are executed on the host CPU. Because the data cache is shared, the memory system is coherent between the c-core and host CPU, and, unlike GPUs, no explicit data copies are required to move execution between the c-core and the host CPU.