ABSTRACT
Mobile application processors are soon to replace desktop processors as the focus of innovation in microprocessor technology. Already, these processors have largely caught up to their more power-hungry cousins, supporting out-of-order execution and multicore processing. In the near future, the exponentially worsening problem of dark silicon is going to be the primary force that dictates the evolution of these designs. In recent work, we have argued that the natural evolution of mobile application processors is to use this dark silicon to create hundreds of automatically generated energy-saving cores, called conservation cores, which can reduce energy consumption by an order of magnitude. This article describes GreenDroid, a research prototype that demonstrates the use of such cores to save energy broadly across the hotspots in the Android mobile-phone software stack.
The GreenDroid mobile application processor is a 45-nm multicore research prototype that targets the Android mobile-phone software stack and can execute general-purpose mobile programs with 11 times less energy than today’s most energy-efficient designs, at similar or better performance levels. It does this through the use of a hundred or so automatically generated, highly specialized, energy-reducing cores, called conservation cores.
1.1 Introduction to GreenDroid:
Mobile devices have recently emerged as the most exciting and fast-changing segment of computing platforms. A typical high-end smartphone or tablet contains a panoply of processors, including a mobile application processor for running the Android or iPhone software environments and user applications and games, a graphics processor for rendering on the user’s screen, and a cellular baseband processor for communicating with the cellular networks. In addition to these flexible processors, there are more specialized circuits that implement Wi-Fi,
Bluetooth, and GPS connectivity, as well as accelerator circuits for playing and recording video and sound.
As a larger percentage of cellular network traffic becomes data rather than voice, the capabilities of the mobile application processor that generates this data have become exponentially more important. In recent years, we have seen a corresponding exponential improvement in the capabilities of mobile application processors, so these processors are now approaching a level of sophistication similar to that of desktop machines. In fact, this process parallels the progress made when desktop processors mirrored the development of earlier mainframe computers. As of 2010, mobile application processors have already integrated the most significant processor-architecture innovations of the last 50 years, placing multiple out-of-order, superscalar, pipelined cores on a single die.
As Moore's Law and complementary metal-oxide semiconductor (CMOS) scaling provide improving energy efficiencies and transistor counts, cheaper processors eventually incorporate the features of their older relatives: first pipelined execution, then superscalar execution, then out-of-order execution, and finally multicore. Today, because sales quantities are higher, processor features tend to move from desktop processor designs to mainframe designs rather than in the opposite direction. As mobile application processor sales surpass those of desktops, it is likely that smartphone processors will become the new nexus of advancement in processor design.
1.2 Identified Problem: Utilization Wall
Our research attacks a key technological problem for microprocessor architects, which we call the utilization wall. The utilization wall says that, with each process generation, the percentage of transistors that a chip design can switch at full frequency drops exponentially because of power constraints. A direct consequence of this is dark silicon: large swaths of a chip's silicon area that must remain mostly passive to stay within the chip's power budget. Currently, only about 1 percent of a modest-sized 32-nm mobile chip can switch at full frequency within a 3-W power budget.
With each process generation, dark silicon gets exponentially cheaper, whereas the power budget becomes exponentially more valuable. Our research leverages two key insights. First, it makes sense to find architectural techniques that trade this cheap resource, dark silicon, for the more valuable resource, energy efficiency. Second, specialized logic can attain 10× to 1,000× better energy efficiency than general-purpose processors. Our approach is to fill a chip's dark silicon area with specialized cores to save energy on common applications. These cores are automatically generated from the code base that the processor is intended to run, that is, the Android mobile-phone software stack. The cores feature a focused reconfigurability so that they can remain useful even as the code they target evolves.
The utilization wall dictates that, because of poor CMOS scaling, improvements in processor performance are determined not by improvements in transistor frequency or transistor count, but rather by the degree to which each process shrink reduces the switching energy of the underlying transistors. Because transistor counts are growing much faster than the underlying energy efficiency is improving, a direct consequence is the phenomenon of dark silicon: large swaths of a chip's silicon area that must remain mostly passive in order to stay within the chip's power budget. As we show later in this article, only 1 percent or so of a modest-sized 32-nm mobile chip can switch at full frequency within a 3-W power budget. The dark silicon problem is directly responsible for the desktop processor industry's decision to stop scaling clock frequencies and instead build multicore processors. It will play an equally pivotal role in shaping the future of mobile processors as well.
3 Motivation
Dark silicon has emerged as a defining constraint in modern processor design.
Android is well suited for c-cores: the stack comprises a Linux kernel, a set of application libraries, and a virtual machine called Dalvik.
Because the hot code is well concentrated, targeting all of these components with c-cores attains high coverage of the source base and a significant impact on overall energy usage.
1.3 Purpose And Scope
Over the next 5 to 10 years, the breakdown of conventional silicon scaling and the resulting utilization wall will exponentially increase the amount of dark silicon in both desktop and mobile processors. The GreenDroid prototype demonstrates that c-cores offer a new technique for converting dark silicon into energy savings and increased parallel execution under strict power budgets. We estimate that the prototype will reduce processor energy consumption by 91 percent for the code that c-cores target, yielding substantial overall energy savings.
The GreenDroid processor design effort is steadily marching toward completion: Our tool chain automatically generates placed-and-routed c-core tiles, given the
source code and information about execution properties. Our cycle- and energy-accurate
simulation tools have confirmed the energy savings provided by c-cores. We’re currently working on more detailed full-system Android emulation to improve our workload modeling so that we can finalize the selection of c-cores that will populate GreenDroid’s dark silicon. In parallel with this effort, we’re working on timing closure and physical design.
2.1.1 Understanding the Origins of the Utilization Wall:
Historically, Moore's Law has been the engine driving growth in the underlying capability of computing devices. Although we continue to see exponential improvements in the number of transistors we can pack into a single chip, this is no longer enough to maintain historic growth in processor performance. Dennard's 1974 paper detailed a roadmap for scaling CMOS devices; since the middle of the last decade, that roadmap has broken down. This breakdown has fundamentally changed the way all high-performance digital devices are designed today. One consequence was the industry-wide transition from single-core to multicore processors. The consequences are likely to be even more far-reaching going forward.
In this subsection, we outline a simple argument that shows the difference between historical CMOS scaling and today’s CMOS scaling. The overall consequence is that, although transistors continue to get exponentially more numerous and exponentially faster, overall system performance of current architectures is largely unaffected by these factors. Instead, system performance is driven by the degree to which transistors get more energy efficient with each process generation — approximately at the same rate at which the capacitance of those transistors drops as they shrink.
Each transistor transition imparts an energy cost, and the sum of all of these transitions must stay within the active power budget of the system. This power budget is set either by thermal limitations (e.g., the discomfort of placing a 100 W device next to your face) or by battery limitations (e.g., a 6 Wh battery that must last for 8 hours of active use can only average 750 mW over that period). As we will see shortly, in current systems it is easy to exceed this budget with only a small percentage of the total transistors on a chip.
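The battery arithmetic above is simple enough to check directly. As a small sketch, using the illustrative 6 Wh and 8-hour figures from the text:

```python
# Average power budget implied by a battery capacity and a target runtime,
# using the illustrative figures above: a 6 Wh battery lasting 8 hours.

def average_power_budget_w(battery_wh: float, hours: float) -> float:
    """Average power draw (watts) that drains the battery in exactly `hours`."""
    return battery_wh / hours

budget_w = average_power_budget_w(6.0, 8.0)
print(f"{budget_w * 1000:.0f} mW")  # 750 mW, matching the text
```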
This argument is summarized in Table 2.1.1, which takes as input a scaling factor S describing the ratio between the feature sizes of two processes (e.g., S = 45/32 = 1.4× between 45-nm and 32-nm technology). In "classical" (i.e., pre-2005) scaling, as proposed by Dennard, we are able to scale the threshold voltage and operating voltage together. Currently, we are in a "leakage-limited" regime, where we cannot lower threshold and operating voltages further without exponentially increasing either transistor delay or leakage.
In both regimes, full-chip transistor counts increase by S², the native switching frequency of transistors increases by S, and capacitance decreases by 1/S. However, the two cases differ in operating voltage (Vdd) scaling: with classical scaling, Vdd decreases by 1/S, but with leakage-limited scaling, Vdd stays fixed. When transitioning to the next process generation, the change in power consumption is the product of these terms, with the Vdd term squared.
Thus, currently, the only factor decreasing power consumption as we move to a new process generation is the reduction of capacitance per transistor, at a rate of 1/S, while the other factors increase it by S³. As shown in Table 2.1.1, under classical scaling, using all of the chip area for transistors running at maximum frequency would result in constant power between process generations, and we would retain the ability to utilize all of the chip's resources. Today, doing the same would increase power consumption by S². Since power budgets are constrained in real systems, we must instead reduce utilization of chip resources by 1/S² (i.e., 2× with each process generation). Effectively, a greater and greater fraction of the silicon chip will have to be dark silicon.
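The scaling arithmetic in Table 2.1.1 can be sketched in a few lines. The factor names follow the table; this is a first-order model that deliberately ignores leakage current and wire effects:

```python
# Per-generation scaling of full-chip switching power, following Table 2.1.1.
# S is the linear scaling factor between nodes (e.g., 45/32 ≈ 1.4).
# Switching power ∝ (transistor count) × (frequency) × (capacitance) × Vdd².

def power_scaling(S: float, classical: bool) -> float:
    count = S ** 2                        # transistor count grows as S²
    freq = S                              # native switching frequency grows as S
    cap = 1 / S                           # per-transistor capacitance shrinks as 1/S
    vdd = (1 / S) if classical else 1.0   # Vdd scales only in the classical regime
    return count * freq * cap * vdd ** 2

S = 45 / 32
print(power_scaling(S, classical=True))    # ≈ 1.0: constant power per generation
print(power_scaling(S, classical=False))   # ≈ S² ≈ 2.0: power grows
# Under a fixed power budget, utilization must therefore drop by 1/S² per
# generation in the leakage-limited regime:
print(1 / power_scaling(S, classical=False))
```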
2.2 Existing Methodologies:
2.2.1 CMOS Scaling:
Table 2.1.1 shows how transistor properties change with each process generation, where S is the scaling factor. For instance, when moving from a 45-nm to a 32-nm process generation, S would be 45/32 = 1.4. The "classical scaling" column shows how transistor properties changed before 2005, when it was possible to scale the threshold voltage and the supply voltage together.
The "leakage-limited scaling" column shows how chip properties changed once we could no longer easily lower the threshold or supply voltage without causing exponential increases in either leakage or transistor delay. In both cases, the quantity of transistors increases by a multiplicative factor of S², their native operating frequency increases by S, and their capacitance decreases by 1/S. However, the two cases differ in supply voltage (VDD) scaling: under classical scaling, VDD goes down by 1/S, but in the leakage-limited regime, VDD remains fixed because the threshold voltage (Vt) cannot be scaled. When scaling down to the next process generation, the change in a design's power consumption is the product of all of these terms, with additional squaring for the VDD term.
As Table 2.1.1 shows, although classical scaling resulted in constant power between process generations, power now increases as S². Because our power budget is constant, the utilization of the silicon resources is actually dropping by 1/S², or a factor of 2 with every process generation.
2.3 Conclusion (For Existing System Discussed):
2.3.1 Experimental Verification
To validate these scaling-theory predictions, we performed several experiments targeting current-day fabrication processes. A small datapath, an arithmetic logic unit (ALU) sandwiched between two registers, was replicated across a 40 mm² chip in a 90-nm Taiwan Semiconductor Manufacturing Company (TSMC) process. We found that a 3 W power budget would allow only 5 percent of the chip to run at full speed. In a 45-nm TSMC process, this percentage drops to 1.8 percent, a factor of 2.8×. Applying the International Technology Roadmap for Semiconductors (ITRS) projections for 32 nm suggests utilization would drop to 0.9 percent. These measurements confirm that the trend is upon us, although it has been mitigated slightly by one-off improvements to process technology (e.g., strained silicon).
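These measured utilization figures can be compared against the 1/S² drop the leakage-limited model predicts; a quick sketch, taking the percentages as given above:

```python
# Measured fraction of a chip that can switch at full speed within a 3 W
# budget (the ALU-array experiments above), versus the 1/S² drop that
# leakage-limited scaling predicts between process nodes.

measured = {90: 5.0, 45: 1.8, 32: 0.9}   # percent utilization per node (nm)

def predicted(from_nm: int, to_nm: int) -> float:
    """Utilization at to_nm predicted by scaling the from_nm measurement by 1/S²."""
    S = from_nm / to_nm
    return measured[from_nm] / S ** 2

print(f"45 nm: predicted {predicted(90, 45):.2f}%, measured {measured[45]}%")
print(f"32 nm: predicted {predicted(45, 32):.2f}%, measured {measured[32]}%")
# The measured 90-to-45 nm drop (5.0 / 1.8 ≈ 2.8x) is milder than the modeled
# 4x, consistent with the one-off process improvements noted above.
print(f"measured drop: {measured[90] / measured[45]:.1f}x")
```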
2.3.2 Real World Observations
The real world also provides direct evidence of the utilization wall. Desktop and laptop processor frequencies have increased very slowly for the better part of a decade, and chip core counts have scaled much more slowly than the increase in transistor count. Increasing fractions of the chips are used for cache or low-activity “uncore” logic like memory controllers and chipsets. Recently, Intel and AMD have advertised a “turbo mode” that runs some cores faster if the others are switched off. We can expect similar trends for the future of mobile processors as well.
2.3.3 Designing New Architectures For The Utilization Wall
These observations show that the utilization wall is a fundamental, first-order constraint on processor design. CMOS scaling theory predicts exponential decreases in the amount of non-dark silicon with each process generation. To adapt, we need to create architectures that can leverage many, many transistors without actively switching them all. In the following section, we describe GreenDroid's design and show how c-cores have these exact qualities and can employ otherwise unused dark silicon to mitigate the extreme power constraints that the utilization wall imposes.
Proposed Methodology
3.1 Our Approach:
Our research leverages two key insights. First, it makes sense to find architectural techniques that trade this cheap resource, dark silicon, for the more valuable resource, energy efficiency. Second, specialized logic can attain 10 to 1,000 times better energy efficiency than general-purpose processors. Our approach is to fill the dark silicon area of a chip with specialized cores in order to save energy on common applications. These cores are automatically generated from the codebase the processor is intended to run. In our case, this codebase is the Android mobile-phone software stack, but our approach could be applied to the iPhone OS as well. We believe that incorporating many automatically generated, specialized cores for the express purpose of saving energy is the next evolution in application processors after multicore.
GreenDroid is a 45-nm multicore research prototype that targets the Android mobile-phone software stack and can execute general-purpose mobile programs with 11 times less energy than today's most energy-efficient designs, at similar or better levels of performance. It does this through the use of 100 or so automatically generated, highly specialized, energy-reducing cores, called conservation cores, or c-cores.
Our work is novel relative to earlier work on accelerators and high-level synthesis because it adapts these techniques to work in the context of large systems (like Android) for which parallelism is limited, and it shows that they are a key tool in attacking the utilization wall that CMOS systems face today. Of particular note are our techniques that make the generation of c-cores completely automatic, our patching mechanisms and mechanisms for hiding the existence of the c-cores, and our results on both attainable coverage and potential energy savings. This article continues as follows. First, we explore the factors that lead to the utilization wall. We continue by examining how the architecture of c-cores allows them to take advantage of the utilization wall. Then we examine trends in mobile application processors and show how the Android operating system lends itself to the use of c-cores. Finally, we conclude by examining software depipelining, a key microarchitectural technique that helps save power in c-cores.
3.2 The GreenDroid Architecture:
The GreenDroid architecture uses specialized, energy-efficient processors, called conservation cores, or c-cores, to execute frequently used portions of the application code. Collectively, the c-cores span approximately 95 percent of the execution time of our test Android-based workload.
Figure 3.2 shows the high-level architecture of a GreenDroid system. The system comprises an array of tiles (Figure 3.2a). Each tile uses a standard template (Figure 3.2b) of an energy-efficient in-order processor, a 32-Kbyte banked Level 1 (L1) data cache, and a point-to-point mesh interconnect (on-chip network, or OCN). The OCN is used for memory traffic and synchronization, similar to the Raw scalable tiled architecture. Each tile is unique and is configured with an array of 8 to 15 c-cores, which are tightly coupled to the host CPU via the L1 data cache and a specialized interface, shown in Figure 3.2c. This interface lets the host CPU pass arguments to the c-core, perform context switches, and reconfigure the hardware to adapt to changes in the application code.
To create GreenDroid, we profiled the target workload to determine the execution hot spots: the regions of code where the processor spends most of its time. Using our fully automated toolchain, we transform these hot spots into specialized circuits, which are attached to a nearby host CPU via the shared L1 cache. The cold code (that is, the less frequently executed code) runs on the host CPU, whereas the c-cores handle the hot code. Because the c-cores access data primarily through the shared L1 cache, execution can jump back and forth between a c-core and the CPU as it moves from hot code to cold code and back. The specialized circuits that comprise the c-cores are generated in a stylized way that maintains a correspondence with the original program code. They contain extra logic that allows patching: modification of the c-core's behavior as the code that generated the c-core evolves with new software releases. This logic also lets the CPU inspect the code's interior variables during c-core execution. The c-cores' existence is largely transparent to the programmer; a specialized compiler is responsible for recognizing regions of code that align well with the c-cores and generating CPU code and c-core patches, and a runtime system manages the allocation of c-cores to programs according to availability.
The c-cores average 18× less energy per instruction for the code that is translated into specialized circuits. With such high savings, we must pay attention to Amdahl's-law-style effects, which say that overall system energy savings are negatively impacted by three things: the energy for running cold code on the less-efficient host CPU, the energy spent in the L1 cache, and the energy spent in leakage and clock power. We reduce the first effect by attaining high execution coverage on the c-cores, targeting even regions that individually account for less than 1 percent of total execution. We have attacked the last two through novel memory-system optimizations, power gating, and clock-power-reduction techniques.
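The first of these Amdahl's-law-style effects can be modeled in a few lines. This sketch considers only the hot/cold split (not the cache, leakage, or clock terms), so it is a lower bound on what the cold code alone permits rather than the full-system figure:

```python
# Simplified Amdahl-style energy model: hot code runs on c-cores at 1/18 the
# per-instruction energy of the host CPU; cold code pays full CPU energy.
# Cache, leakage, and clock energy are deliberately left out of this sketch.

def overall_savings(coverage: float, ccore_factor: float) -> float:
    """Whole-system energy-reduction factor for a given hot-code coverage."""
    hot = coverage / ccore_factor   # energy fraction spent in c-cores
    cold = 1.0 - coverage           # energy fraction spent on the host CPU
    return 1.0 / (hot + cold)

# With 95 percent coverage (the GreenDroid target) and 18x per-instruction
# savings, even this simple model approaches an order-of-magnitude reduction:
print(f"{overall_savings(0.95, 18.0):.1f}x")
```

Notice how quickly the cold-code term dominates: this is why the design pursues high coverage even for regions that individually account for under 1 percent of execution.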
3.2 .1 Implementation Details
Each tile's CPU is a full-featured 32-bit, seven-stage, in-order pipeline, and features a single-precision floating-point unit (FPU), a multiplier, a 16-Kbyte instruction cache, a translation lookaside buffer (TLB), and a 32-Kbyte banked L1 data cache. Our frequency target of 1.5 GHz is set by the cache access time and is reasonably aggressive for a 45-nm design. The tiles' L1 data caches collectively provide a large L2 for the system. Cache coherence between cores is provided by lightweight L2 directories residing at the DRAM interfaces (on the side of the array of tiles; not pictured in Figure 3.2), which use the L1 caches of all the cores as a victim cache.
In addition to sharing the data cache, the c-cores optionally share the FPU and multiplier with the CPU, depending on the code’s execution requirements. Collectively, the tiles in the GreenDroid system exceed the system’s power budget. As a result, most of the c-cores and tiles are usually power gated to reduce energy consumption.
3.2 .2 Execution Model
At design time, the toolchain clusters c-cores on the basis of profiling of Android workloads, examining both control flow and data movement between code regions. It places related c-cores on the same or nearby tiles and, in some cases, replicates them. At runtime, an application starts on one of the general-purpose CPUs and, whenever the CPU enters a hot-code region, transfers execution to the appropriate c-core. Execution moves from tile to tile on the basis of the applications that are currently active and the c-cores they use. Coherent caches let data be automatically pulled to where it is needed, but data associated with a given c-core will generally stay in that c-core's L1 cache. We use aggressive power and clock gating to reduce static power dissipation.
3.2.3 Conservation Core Microarchitecture
Historically, logic design techniques in processor architecture have emphasized pipelining to equalize critical path lengths and increase performance by increasing the clock rate. The utilization wall means that this can be a suboptimal approach. Adding registers increases switched
capacitance, increasing per-op energy and delay. Furthermore, these registers increase the capacitance of the clock tree, a problem compounded by the increased switching frequency rising clock rates require.
For computations that have pipeline parallelism, pipelining can increase performance by overlapping the execution of multiple iterations. This improves energy-delay product despite the increase in energy. However, most irregular code does not have pipeline parallelism, and as a result, pipelining is a net loss in terms of both energy and delay. In fact, the ideal case is to have extremely long combinational paths fed by a slow clock. This minimizes both the energy costs and the performance impact of pipeline registers.
The use of long combinational paths carries two challenges. First, different basic blocks have different combinational critical path lengths, which would require the distribution of many different clocks. Second, there is no way to multiplex centralized or area-intensive resources such as memory and FPUs into these combinational paths.
3.3 Android: Greendroid’s Target Workload:
Android is an open-source mobile software stack developed by Google that features a Linux kernel, a set of application libraries, and a virtual machine called Dalvik. User applications, such as those available in the application store, run on top of the Dalvik virtual machine.
We found that Android is well-suited for c-cores for several reasons. First, although many applications are available for download, Android has a core set of commonly used applications, such as a Web browser, an e-mail client, and media players. Typically, hot code is concentrated in the application libraries, the Dalvik virtual machine, and a few locations in the kernel. Because the hot code is well concentrated, targeting all these components with c-cores lets us attain high coverage over the source base and a significant impact on overall energy usage. Although c-cores support patching, which reduces the impact of post-silicon source base modification, we are also aided by smartphones’ short replacement cycle (typically every 2 to 3 years), which lets smartphone chip designers deploy new c-cores to target new applications. The c-cores interface lets Android phone designers remove c-cores from their designs without impacting code compatibility.
In our experiments with Android-based workloads, which included the Web browser, Mail, Maps, Video Player, Pandora, and many other applications, we could cover 95 percent of the Android system using just 43,000 static instructions: about 7 mm² of c-cores in a 45-nm process. Of this 95 percent, approximately 72 percent of the code was library or Dalvik code shared between multiple applications within the workload.
Android’s usage model also reduces the need for the patching support c-cores provide. Since cell phones have a very short replacement cycle (typically two to three years), it is less important that a c-core be able to adapt to new software versions as they emerge. Furthermore, handset manufacturers can be slow to push out new versions. In contrast, desktop machines have an expected lifetime of between five and ten years, and the updates are more frequent.
3.4.1 Patching support
To remain useful as new versions of the Android platform emerge, GreenDroid’s c-cores must adapt. To support this, c-cores include targeted reconfigurability that lets them maintain perfect fidelity to source code, even as the source code changes.
The adaptation mechanisms include redefining compile-time constants in hardware and a general exception mechanism that lets c-cores transfer control back and forth to the general-purpose core during any control-flow transition. Adding this reconfigurability increases the energy and area needs of c-cores, but it significantly extends the span of years over which c-cores can provide energy savings. For the open-source codes we used in our experiments, patchable c-cores remained useful for more than a decade of updates and bug fixes, far longer than the typical mobile phone's lifespan.
3.4.2 Synthesizing c-cores
A GreenDroid processor will contain many different c-cores, each targeting a different portion of the Android system. Designing each c-core by hand isn't tractable, especially because software release cycles can be short. Instead, we've built a C/C++-to-Verilog toolchain that converts arbitrary regions of code into c-core hardware. (See the "Research Related to GreenDroid" sidebar to understand this work's relationship to accelerators and high-level synthesis.)
The toolchain first identifies the key functions and loops in the target workload and extracts them by outlining loops and inlining functions. A compiler parses the resulting C code and generates a static single-assignment (SSA) internal representation of the control-flow graph and data-flow graph. The compiler then generates Verilog code for a control unit and datapath that closely mimic those representations. The compiler also generates function stubs that applications can call in place of the original functions to invoke the hardware.
Finally, the compiler generates a description of the c-core that provides the basis for generating patches that will let the c-core run new versions of the same functions. The close mapping between the compiler’s intermediate representation and the hardware is essential here: small, patchable changes in the source code correspond to small, patchable changes in the hardware.
Because c-cores focus on reducing energy and power consumption rather than exploiting high levels of parallelism, they can profitably target a much wider range of C constructs. Although conventional accelerators struggle to speed up applications with irregular control and limited memory parallelism, c-cores can significantly reduce the energy and power costs of such codes.
3.4.3 Source of Energy Savings
The primary source of energy savings for c-cores can be seen in Figure 3.4.3. A baseline energy-efficient in-order MIPS host CPU consumes 91 pJ/instruction, most of which is spent on the overheads of instruction interpretation, including the I-cache (23%), fetch/decode (19%), the register file (14%), and the datapath (38%). C-cores, on the other hand, eliminate 91% of the energy used by the host CPU, inheriting only the D-cache power (6%) and the portion of the original datapath energy involved in performing the actual operations (3%). Thus, c-cores reduce instruction energy from the costs incurred by an instruction marching down a pipeline to the costs incurred by operators, such as adders, placed nearby and connected by wires. The end result is an average 11× reduction in energy per instruction.
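The 91-percent and 11× figures follow directly from the component percentages above; a quick check, taking the fractions as given:

```python
# Per-instruction energy for the baseline in-order MIPS host CPU (91 pJ),
# broken down by component, and the fraction a c-core inherits.

BASELINE_PJ = 91.0
breakdown = {            # fraction of baseline energy per component
    "I-cache": 0.23,
    "Fetch/Decode": 0.19,
    "Register file": 0.14,
    "Datapath": 0.38,
    "D-cache": 0.06,
}
# A c-core sheds the interpretation overheads, inheriting only the D-cache
# energy plus the useful-work slice of the datapath (3% of baseline).
inherited = breakdown["D-cache"] + 0.03
ccore_pj = BASELINE_PJ * inherited
print(f"c-core energy: {ccore_pj:.2f} pJ/instruction")
print(f"savings: {1 - inherited:.0%}")             # 91% of baseline eliminated
print(f"reduction: {BASELINE_PJ / ccore_pj:.0f}x")
```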