Dynamic Leakage-Energy Management of Secondary Caches
Introduction
As threshold voltages scale downward along with transistor feature sizes, static power dissipation
due to subthreshold leakage current is becoming an increasingly significant fraction of total
power consumption for microprocessors. Total leakage power is estimated to increase fivefold
with each process generation [5], equalling dynamic power within a few generations [7].
At the same time, growing transistor budgets have led to the incorporation of large on-chip
secondary caches on mid-range to high-performance microprocessors, including those used in
laptop and desktop PCs as well as server systems. On-chip secondary caches serve a valuable
purpose in avoiding off-chip memory accesses, which are very costly in terms of both energy and
performance. Because these caches contain large numbers of transistors, of which only a small
fraction switch in any given cycle, leakage may account for as much as 80% of secondary cache
energy consumption [17].
Previous Work
Previous work on cache resizing addresses two issues: circuit techniques for placing portions
of the cache into a low-leakage state, and architectural techniques for determining what fraction
of the cache, if any, should be placed in such a state. This work falls into the latter category. We
include a brief discussion of the former category as background.
Circuit Techniques for Reducing SRAM Leakage
One approach to reducing cache leakage is to fabricate cache transistors with a higher threshold
voltage (Vt) via doping. This technique incurs a performance penalty, in that the increased Vt
causes these transistors to switch more slowly; this penalty can be reduced by using high-Vt transistors
only in the less performance-critical paths of the SRAM cell, an approach known as dual-
Vt [9]. However, higher Vt values reduce, but do not eliminate, leakage current. As supply voltages
cross below 1.0V, transistors with Vt high enough to make leakage energy negligible may not
switch at acceptable speeds [16]. Thus high-Vt fabrication provides at best a partial solution to
leakage energy.
Architectural Techniques for Cache Resizing
Given a circuit-level method for reducing leakage energy dynamically, such as gated Vdd (which
cuts off the supply voltage to idle SRAM cells, eliminating their leakage at the cost of losing the
stored data), a higher-level policy must determine when and how to apply the method. We describe three such
architectural techniques: the DRI cache [17], cache decay [11], and adaptive mode control (AMC)
[19]. The techniques use gated Vdd to disable portions of the cache, adjusting the effective cache
size dynamically to match application requirements. Other cache-resizing techniques (e.g., [2, 4,
18]) focus on optimizing cycle time and/or dynamic power rather than leakage energy.
The DRI cache design [17] is intended to minimize the leakage energy of L1 instruction
caches. The DRI scheme uses a predetermined miss bound for its resizing decisions. If the number
of misses in a given interval is less than the miss bound, a section of the cache is turned off; if
greater, a section is turned on. The cache is resized by adjusting the number of active sets, which
restricts the cache to growing or shrinking by factors of two. In addition, resizing changes the set of
address bits used for indexing, complicating the tag-matching logic. By focusing on instruction
caches, the DRI cache avoids the need to write back dirty data when the cache is shrunk.
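To make the policy concrete, the following is a minimal sketch of an interval-based, miss-bound resizing decision in C. The structure, constants, and names (MIN_SETS, MAX_SETS, dri_resize, and so on) are our own illustration of the idea; the actual DRI cache implements this decision logic in hardware.

/* Illustrative sketch of a DRI-style interval-based resizing policy.
 * All names here are hypothetical; the real DRI cache does this in hardware. */
#include <stdint.h>

#define MIN_SETS 64    /* smallest permitted cache size, in sets */
#define MAX_SETS 1024  /* full cache size, in sets */

static uint32_t active_sets = MAX_SETS;

/* Called once per sense interval with the number of misses observed. */
void dri_resize(uint32_t interval_misses, uint32_t miss_bound)
{
    if (interval_misses < miss_bound && active_sets > MIN_SETS) {
        active_sets >>= 1;   /* shrink: gate Vdd off in the disabled sets */
    } else if (interval_misses > miss_bound && active_sets < MAX_SETS) {
        active_sets <<= 1;   /* grow: re-enable a previously disabled section */
    }
    /* Resizing changes the number of index bits, so the tag-match
     * logic must track the current value of active_sets. */
}

Because active_sets only halves or doubles, the index width changes on every resize, which is exactly the tag-matching complication noted above.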
Calculating Energy Savings from Secondary Cache Resizing
This section presents a brief analysis of the energy savings due to secondary-cache resizing.
First, we derive an equation for estimating net energy savings. The ratio of off-chip access energy
to secondary-cache leakage energy is a key parameter in this equation. We then estimate values
for this ratio using published leakage values combined with measured off-chip energy costs.
Calculating Net Energy Savings
There are two primary sources of energy overhead induced by turning off portions of a secondary
cache. First, the reduction in effective cache size leads to additional off-chip accesses,
each of which consumes dynamic energy. Second, these additional off-chip accesses increase program
runtime. The additional execution cycles could otherwise be spent on a different task or in a
low-power idle mode. Thus the additional runtime potentially incurs dynamic and static energy
overheads across the entire system, including the processor core, L1 caches, memory, and I/O
devices. Additional dynamic energy dissipation in the processor core and elsewhere can be significantly
reduced through power-saving techniques such as clock gating. Static energy consumption
in other components may be significant, but is difficult to estimate within the scope of this paper.
Therefore we focus on only two sources of overhead: dynamic energy due to extra off-chip
accesses, and leakage energy dissipated by the L2 cache itself during the extra execution cycles.
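As a rough illustration of this accounting, the sketch below computes net savings from the disabled fraction of the cache, the extra off-chip accesses, and the extra execution cycles. The function name, parameters, and the exact form of the expression are our assumptions, not necessarily the equation derived in the paper.

/* Illustrative accounting of the two overheads discussed above.
 * Names and the exact expression are assumptions, not the paper's equation. */
double net_energy_savings_nj(
    double frac_off,        /* fraction of the L2 cache turned off          */
    double e_leak_cycle_nj, /* full-cache leakage energy per cycle, in nJ   */
    double base_cycles,     /* execution cycles with the full cache enabled */
    double extra_cycles,    /* additional cycles caused by resizing         */
    double extra_misses,    /* additional off-chip accesses                 */
    double e_offchip_nj)    /* energy per off-chip access, in nJ            */
{
    /* Leakage avoided in the disabled portion over the original runtime. */
    double saved = frac_off * e_leak_cycle_nj * base_cycles;

    /* Overhead 1: dynamic energy of the extra off-chip accesses. */
    double off_chip_overhead = extra_misses * e_offchip_nj;

    /* Overhead 2: leakage of the still-active portion of the L2
     * during the extra execution cycles. */
    double leak_overhead = (1.0 - frac_off) * e_leak_cycle_nj * extra_cycles;

    return saved - off_chip_overhead - leak_overhead;
}

Dividing this expression through by the per-cycle leakage energy makes the off-chip-to-leakage ratio ROL appear explicitly, which is why that ratio is the key parameter in the analysis.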
Estimating the Off-chip to Leakage Energy Ratio, ROL
To find an estimate for the value of ROL, we need to determine both the per-cycle leakage
energy for a cache of a given size and the energy consumed by an off-chip access. We obtain the
former from prior work. We found no published estimates for the latter, so we derived this value
experimentally by measuring the power consumption of an actual system.
Table 2 in [17] indicates that a single SRAM bit in a 0.18 μm process has a leakage energy of
1.74×10⁻⁶ nJ per cycle at a temperature of 110°C. Thus the data portion of a 1 MB cache has a leakage
energy per cycle of roughly 15 nJ. Given variations in process technology, cache sizes, and
cycle times, this value is merely a ballpark estimate, but is adequate for our purposes.
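The 15 nJ figure follows from straightforward arithmetic over the data array, as sketched below; counting only the 8 × 2^20 data bits of a 1 MB cache (ignoring tag and status bits) is our simplifying assumption.

#include <stdio.h>

int main(void)
{
    /* Per-bit leakage energy per cycle at 110 degrees C in a 0.18 um
     * process, from Table 2 of [17]. */
    const double leak_per_bit_nj = 1.74e-6;

    /* Data array of a 1 MB cache: 1 MB * 8 bits per byte
     * (tag and status bits are ignored in this rough estimate). */
    const double data_bits = 1024.0 * 1024.0 * 8.0;

    /* Prints roughly 14.6 nJ per cycle, i.e. about 15 nJ. */
    printf("L2 data-array leakage: %.1f nJ/cycle\n", data_bits * leak_per_bit_nj);
    return 0;
}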
Since the energy cost of an off-chip access has not been widely studied, we designed and performed
our own experiment to measure it. We developed a microbenchmark that executes a simple
loop containing a single memory access to a dynamically selected element of a large array. By
exploiting the differences in size and associativity between the primary and secondary caches on
the machine under test, the microbenchmark guarantees that this access always misses in the primary
cache, and can choose dynamically whether the access also misses in the secondary cache.
Parameters to the microbenchmark select a reference pattern of n L2 misses followed by m hits. This
pattern is repeated until a minimum amount of time has elapsed.
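A rough sketch of such a microbenchmark is shown below; the region sizes, the stride, and all names are our assumptions and would need to be tuned to the L1/L2 capacities and associativities of the actual machine under test.

#include <stddef.h>
#include <time.h>

/* Illustrative sketch of the off-chip-access microbenchmark. Region sizes,
 * stride, and names are assumptions; the real benchmark must be sized to
 * the caches of the machine under test. */
#define STRIDE      64                  /* one cache line per reference  */
#define HIT_REGION  (256 * 1024)        /* larger than L1, fits in L2    */
#define MISS_REGION (64 * 1024 * 1024)  /* far larger than L2            */

static volatile char buf[MISS_REGION];

/* Issue n references that miss in L2 followed by m that hit in L2 (but
 * miss in L1), repeating until min_seconds of CPU time has elapsed. */
void run_pattern(long n, long m, double min_seconds)
{
    size_t miss_idx = 0, hit_idx = 0;
    char sink = 0;
    clock_t start = clock();

    while ((double)(clock() - start) / CLOCKS_PER_SEC < min_seconds) {
        for (long i = 0; i < n; i++) {            /* L2 misses */
            sink ^= buf[miss_idx];
            miss_idx = (miss_idx + STRIDE) % MISS_REGION;
        }
        for (long j = 0; j < m; j++) {            /* L2 hits, L1 misses */
            sink ^= buf[hit_idx];
            hit_idx = (hit_idx + STRIDE) % HIT_REGION;
        }
    }
    (void)sink;   /* keep the compiler from optimizing the loads away */
}

Cycling through a region larger than the L1 but smaller than the L2 guarantees L1 misses that hit in the L2, while cycling through a region larger than the L2 forces off-chip accesses.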