Highly Efficient LRU Implementations for High Associativity Cache Memory
Abstract
High associativity with an LRU replacement policy is an optimal solution for cache design when the miss rate must be reduced. However, as associativity increases, implementing the LRU policy becomes complex. Since many advanced and demanding technologies such as multimedia, multithreading, databases, and low-power devices running on high-performance processors in servers and workstations use higher associativity to improve performance, there is a need for highly efficient LRU hardware implementations. This paper analyzes various implementations of the LRU policy for a cache with high associativity. The implementation problems are explored, design objectives are identified, and several implementations, namely the Square Matrix, Skewed Matrix, Counter, Linked List, Phase, and Systolic Array methods, are compared with each other on the basis of the objectives outlined. These implementations are synthesized to determine the constraints and the effect of increasing associativity on performance. When the associativity is small, reducing the associated logic is important; at higher associativity, conserving space matters more. At higher associativity, the Linked List, Systolic Array, and Skewed Matrix designs are found most suitable for implementation.
INTRODUCTION
Modern processors, commercial systems, high-performance servers, and workstations have highly associative caches for performance improvement [15,16,17]. The complexity of implementing the LRU (Least Recently Used) policy for a highly associative cache increases with associativity [1,2,3,4,10]. The added complexity also increases the delay incurred in selecting the line for replacement. Even a highly associative cache with an LRU policy suffers degraded performance if the implementation is inapt. This paper analyzes and compares various efficient LRU implementations for highly associative caches. These designs are analyzed with respect to their implementation complexity and how fast they can determine the replacement cache line. The various implementations of LRU are simulated and synthesized for comparison. The rest of the paper is organized as follows. Section 2 identifies higher associativity with LRU as the best configuration to reduce the miss ratio. Section 3 discusses the implementation complexity of LRU as associativity increases. Section 4 examines the various implementations, their operation, and their characteristics.
HIGHER ASSOCIATIVITY WITH LRU POLICY
The classical approach to improving cache behavior is to reduce the miss rate. Increasing associativity in the cache reduces conflict misses, thereby reducing the miss rate and improving performance. Studies have shown that conflict misses reduce from 28% to 4% when the associativity changes from 1-way to 8-way [2]. Another result showed the number of cache misses reduced from 30,000 to as low as 5,000 when a highly associative (512-way) cache was used instead of a direct-mapped one [10]. Furthermore, a highly associative cache is more efficient when the miss penalty is large and memory interconnect contention delay is significant and sensitive to the cache miss rate [6]. Increasing associativity with any replacement policy often decreases the miss ratio, but better performance at higher associativity depends on an efficient replacement algorithm [4]. The LRU replacement algorithm, which replaces the least recently used line in the cache, has a miss ratio and performance comparable to the optimal (OPT or MIN) algorithm. Under the LRU policy, the line not referenced for the longest period of time is considered a dead line and is removed from the cache.
IMPLEMENTATION COMPLEXITY
A 2-way set-associative cache with LRU policy can be implemented with a single bit called the access bit. When a line in the set is accessed, its access bit is set, marking it as the most recently used line, and the access bit of the other line is reset to zero, marking it as the least recently used line. If associativity is increased to four, LRU can still be implemented as a counter, where the number of access bits per line is two. Beyond this, implementing the LRU policy becomes complex. As the number of lines in a cache set increases, the storage space needed to maintain the LRU history grows, increasing cache size and cost [1,2,3,10]. The complexity of the logic to implement LRU also increases [4]. Studies reveal that the performance of the LRU policy degrades as associativity increases [14]. Results show that LRU performs close to the OPT replacement algorithm when associativity is low, but at high associativity LRU has a large number of victims to choose from, which decreases performance [13].
DESIGN OF VARIOUS IMPLEMENTATIONS
The information of each access must be logged in a data structure, and this data structure determines the performance of the LRU hardware. Each set in the associative cache has its own LRU hardware for implementing the LRU policy. When a set is referenced, the corresponding hardware is invoked as well, requiring no separate detection. The collection of this hardware for all the sets in the cache is the Global Set, and the hardware for the set currently being referenced is the Working Set. The cache line index in the case of a hit is the index of the line whose tag matches the tag bits of the referenced address; in the case of a miss, it is the index of the line identified for replacement by the LRU hardware. Here we compare six different implementations of the LRU policy for attaining high performance in an N-way set-associative cache: the Square Matrix, Skewed Matrix, Counter, Linked List, Phase, and Systolic Array methods.
Square Matrix Implementation
This scheme uses a simple data structure built from a simple storage element, the D flip-flop. The data structure is a square matrix of these storage elements, of order N for an N-way set-associative cache. The global set contains M replications of this data structure for a cache with M sets. Each of the N rows of the matrix maps to one of the N cache lines of the set and logs the access information of that line. Initially all the bits in the matrix are set to zero (Fig. 1). When the cache set is identified, the corresponding data structure in the LRU hardware becomes the working set. The Square Matrix implementation follows a simple logging scheme: it first sets the row of the accessed line to all ones and then clears the column of the accessed line to zero. The number of ones in each row indicates the order in which the cache lines within the set were accessed. A line whose row holds more 1's was accessed more recently than one whose row holds fewer 1's. The row with the maximum number of 1's corresponds to the most recently used line, and the row that is entirely zero corresponds to the least recently used line.
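Although the paper's design is a hardware matrix of D flip-flops, the logging scheme can be illustrated in software. The following minimal Python sketch is ours, not from the paper; the class and method names are assumptions for illustration.

    # Minimal sketch of the Square Matrix LRU scheme for one set.
    # N is the associativity; the hardware is an N x N array of D flip-flops.

    class SquareMatrixLRU:
        def __init__(self, n):
            self.n = n
            # All bits start at zero (Fig. 1), so every row looks LRU at reset.
            self.matrix = [[0] * n for _ in range(n)]

        def access(self, line):
            # Set the accessed line's row to all ones...
            for col in range(self.n):
                self.matrix[line][col] = 1
            # ...then clear its column, so older rows lose one 1 each.
            for row in range(self.n):
                self.matrix[row][line] = 0

        def lru_line(self):
            # The all-zero row marks the least recently used line.
            for row in range(self.n):
                if not any(self.matrix[row]):
                    return row

    lru = SquareMatrixLRU(4)
    for line in (0, 1, 2, 3, 1):
        lru.access(line)
    print(lru.lru_line())  # 0: accessed longest ago

Note that the row/column update keeps the row counts of 1's distinct, so the access order is always recoverable from the matrix.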
Counter Implementation
Using a register for each cache line to maintain the LRU history can significantly reduce the large space required by the Square Matrix and Skewed Matrix implementations. As N grows, the storage required drops sharply compared with the previous implementations: each set needs N registers of log2(N) bits instead of an N x N bit matrix. There is a one-to-one mapping between the registers used to record the LRU information and the cache lines in a set. The values in the registers indicate the order in which the cache lines within the set have been accessed: a register with a larger value means its cache line was accessed more recently than a line whose register holds a smaller value. The smallest value, zero, indicates the least recently accessed line, and the highest value, N-1, indicates the most recently accessed line. Initially all the registers are set to zero. The register of the cache line being accessed, called the active register, has its value compared with the values of the other registers: every register holding a value greater than the active register's is decremented, and the active register is then set to N-1.
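Assuming the update rule just described (decrement every register greater than the active one, then set the active register to N-1), here is a minimal Python sketch; the names are ours, not from the paper.

    # Minimal sketch of the Counter LRU scheme for one set.

    class CounterLRU:
        def __init__(self, n):
            self.n = n
            self.count = [0] * n  # one log2(N)-bit register per cache line

        def access(self, line):
            old = self.count[line]
            # Decrement every register larger than the active register.
            for i in range(self.n):
                if self.count[i] > old:
                    self.count[i] -= 1
            self.count[line] = self.n - 1  # accessed line becomes MRU

        def lru_line(self):
            # The register holding zero marks the least recently used line.
            return self.count.index(0)

    lru = CounterLRU(4)
    for line in (0, 1, 2, 3, 1):
        lru.access(line)
    print(lru.lru_line())  # 0: accessed longest ago

Once every line has been touched, the registers hold a permutation of 0..N-1 ranking the lines from least to most recently used.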
Linked List Implementation
In the Linked List implementation we always know the index of the line to be replaced, so no search delay is incurred. The implementation uses less space, and its logic determines the LRU line with minimum delay while simultaneously updating the data structure. The cache line indexes are mapped to two lists, Previous and Next. The Next register of a cache line holds the index of the line that was accessed after that cache line, and the Previous register holds the index of the line that was accessed before it. The most recently used cache lines move to the head of the list and the less recently used lines toward the end. An LRU register keeps track of the index at the end of the list, and an MRU register keeps track of the index at the head. An arbitrary ordered list can be chosen to initialize all the registers (Fig. 10), and the line at the end of that list becomes the LRU. As lines are accessed, their order of access is captured in the list. The algorithm for updating the lists handles three cases. If the accessed line is the LRU, the LRU register is made to point to the next of the LRU; if it is the MRU, nothing needs to be done. For any other line, its previous node is made to point to its next node in the Next list, and the Previous list is updated similarly. The accessed line is then made the MRU. Each of the storage elements in Fig. 11 is essentially a log2(N)-bit register that stores its value when the load signal is high. The register is also loaded at reset.
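As a software illustration of the three update cases, here is a minimal Python sketch using Previous/Next index arrays in place of the hardware register file; the names and the initial order are our assumptions (the paper allows any arbitrary initial order, Fig. 10).

    # Minimal sketch of the Linked List LRU scheme for one set.

    class LinkedListLRU:
        def __init__(self, n):
            self.n = n
            # Arbitrary initial order: line 0 is LRU, line n-1 is MRU.
            # next[x] = line accessed after x (more recent neighbor);
            # prev[x] = line accessed before x (less recent neighbor).
            # The out-of-range end values are sentinels, never dereferenced.
            self.next = [i + 1 for i in range(n)]
            self.prev = [i - 1 for i in range(n)]
            self.lru, self.mru = 0, n - 1

        def access(self, line):
            if line == self.mru:
                return  # case 1: already MRU, nothing to do
            if line == self.lru:
                # case 2: LRU register points to the next of the LRU
                self.lru = self.next[line]
            else:
                # case 3: unlink a middle line by bypassing it in both lists
                self.next[self.prev[line]] = self.next[line]
                self.prev[self.next[line]] = self.prev[line]
            # The accessed line is made the MRU at the head of the list.
            self.next[self.mru] = line
            self.prev[line] = self.mru
            self.mru = line

        def lru_line(self):
            return self.lru  # replacement index available without any search

    lru = LinkedListLRU(4)
    for line in (2, 0, 3):
        lru.access(line)
    print(lru.lru_line())  # 1: the only line never accessed

The key property is that lru_line() is a single register read: unlike the matrix and counter schemes, no scan over the set is needed to find the victim.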
CONCLUSIONS
This paper has focused on the implementation of the LRU replacement policy for caches with high associativity. As implementing the LRU policy in hardware for high associativities is difficult, implementation objectives were identified and various implementations, namely the Square Matrix, Skewed Matrix, Counter, Phase, Linked List, and Systolic Array methods, were studied. The results of the different implementations for increasing associativity were analyzed. It is inferred that at higher associativity, conserving the space needed to store each scheme's data is important, but the associated logic cannot be totally neglected. At higher associativity, the Linked List, Systolic Array, and Skewed Matrix designs are the most suitable for implementation, and with increasing associativity these three methods would also involve less delay. Although the implementation size for one set grows rapidly with associativity, the corresponding increase for the cache as a whole is much smaller.