Distributed Packet Buffers for High-Bandwidth Switches and Routers
Abstract
High-speed routers rely on well-designed packet buffers that support multiple queues, provide large capacity, and offer short response times. Some researchers have suggested combined SRAM/DRAM hierarchical buffer architectures to meet these challenges. However, these architectures suffer from either a large SRAM requirement or high time complexity in memory management. In this paper, we present a scalable, efficient, and novel distributed packet buffer architecture. Two fundamental issues need to be addressed to make this architecture feasible: 1) how to minimize the overhead of an individual packet buffer; and 2) how to design scalable packet buffers using independent buffer subsystems. We address these issues by first designing an efficient compact buffer that reduces the SRAM size requirement by (k − 1)/k. Then, we introduce a feasible way of coordinating multiple subsystems with a load-balancing algorithm that maximizes the overall system performance. Both theoretical analysis and experimental results demonstrate that our load-balancing algorithm and the distributed packet buffer architecture can easily scale to meet the buffering needs of high-bandwidth links while satisfying the requirements of scale and support for multiple queues.
INTRODUCTION
The phenomenal growth of the Internet has been fueled by the rapid increase in communication link bandwidth. Internet routers play a crucial role in sustaining this growth: they must switch packets extremely fast to keep up with the growing bandwidth (line rate). This demands sophisticated packet switching and buffering techniques. Packet buffers need to be designed to support large capacity and multiple queues, and to provide short response times.
Router buffer sizing is still an open issue. The traditional rule of thumb for Internet routers states that a router should be capable of buffering RTT × R of data, where RTT is the round-trip time for flows passing through the router and R is the line rate [11]. In [23], the author claimed that the size of buffers in backbone routers can be made very small at the expense of a small loss in throughput. Focusing on the performance of individual TCP flows, the author claimed in [26] that the output/input capacity ratio at a network link largely determines the required buffer size: if the output/input capacity ratio is lower than one, the loss rate follows only a power-law reduction with the buffer size and significant buffering is needed. Given this everlasting controversy, router manufacturers nowadays still seem to favor the use of large buffers. For instance, the Cisco CRS-1 modular service card with a 40 Gbps line rate incorporates a 2 GB packet buffer memory per line card [4].
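To put these numbers in perspective, a rough back-of-the-envelope check of the RTT × R rule is sketched below. The RTT value of 250 ms is only the customary rule-of-thumb figure, assumed here for illustration rather than taken from the paper:

    # Back-of-the-envelope check of the RTT * R buffer-sizing rule.
    # RTT = 250 ms is an assumed rule-of-thumb value, not a figure from the paper.
    RTT = 250e-3                 # round-trip time in seconds (assumed)
    R   = 40e9                   # line rate in bits per second (40 Gbps, as on the CRS-1 card)

    buffer_bits  = RTT * R       # classic rule: buffer RTT * R of data
    buffer_bytes = buffer_bits / 8
    print(f"{buffer_bytes / 1e9:.2f} GB")   # -> 1.25 GB, the same order as the 2 GB CRS-1 buffer

The result is consistent with the buffer provisioning observed on the CRS-1 line card mentioned above.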
BACKGROUND AND RELATED WORK
SRAM and DRAM Technology
Current SRAM and DRAM technologies cannot individually meet the access time and capacity requirements of router buffers. While SRAM is fast enough, with an access time of around 2.5 ns [28], its maximum size is limited by current technology to only a few MB. On the other hand, a DRAM can be built with large capacity, but its typical memory access time (i.e., the random cycle time TRC) is too large, around 40 ns [28]. Over the last decade, DRAM memory access time has decreased by only 10 percent every 18 months [15]. In contrast, since the line rate increases by 100 percent every 18 months [20], DRAM will fall further behind in satisfying the requirements of high-speed buffers.
Given a DRAM family, in order to keep the DRAM modules busy and effectively utilize the bandwidth they provide, we need to transfer a minimum-size chunk of data (also called a block in [27]) per access. The large memory access time of DRAM means that a read/write to any memory address occupies the module for at least TRC time units [27]. According to our investigation, current DRAM chunk sizes range from 64 to 320 bytes. However, given the much higher price and smaller capacity of low-latency DRAM products, high-latency DRAM products such as DDR3 now dominate the market [8], [25], making the typical chunk size 320 bytes.
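The required chunk size follows directly from the requirement that one DRAM access keep pace with the line rate. A minimal sketch of the arithmetic, using the 40 Gbps line rate and the roughly 40 ns TRC cited above (the rest is plain unit conversion):

    # Why DRAM must be accessed in large chunks at high line rates.
    R    = 40e9      # line rate in bits/s (40 Gbps)
    T_RC = 40e-9     # DRAM random cycle time in seconds (~40 ns, from the text)

    min_packet_gap = (40 * 8) / R          # a 40-byte minimum packet arrives every 8 ns
    chunk_bits     = R * T_RC              # data arriving during one DRAM access
    print(min_packet_gap, chunk_bits / 8)  # -> 8e-09 s, 200.0 bytes

    # A single DRAM cannot be accessed once per packet (8 ns << 40 ns), so each access
    # must move a chunk of roughly R * T_RC bytes, consistent with the 64-320 byte range above.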
Buffer Behaviors
When we carefully examine the hierarchical packet buffer architectures using the aforementioned methodology, whether the HSD architecture [27], interleaved DRAMs [9], [12], [16], [17], or parallel DRAMs [10], they all rely on three parameters: k, b, and Q. The required size of SRAM is always O(kbQ).
To understand this phenomenon and study it further, we first examine the buffer behavior of previous hybrid SRAM/DRAM architectures and algorithms, in particular the Nemo and PHSD architectures.
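To get a sense of how quickly an O(kbQ) SRAM budget grows, consider the following illustration; the values of k, b, and Q below are hypothetical, chosen purely to show the scaling:

    # Illustrative scaling of the O(k*b*Q) SRAM requirement.
    # The values of k, b, and Q are hypothetical examples, not figures from the paper.
    k = 4        # number of DRAMs driven in parallel
    b = 320      # chunk size per DRAM in bytes (typical DDR3 chunk, per the text)
    for Q in (128, 1024, 8192):            # number of queues
        sram_bytes = k * b * Q
        print(f"Q={Q:5d}: {sram_bytes / 2**20:.2f} MB of SRAM")
    # 128 queues already need ~0.16 MB; 8,192 queues need ~10 MB, well beyond a few-MB SRAM budget.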
As shown in Fig. 2, the DRAM structure in this extended version of Nemo is implemented as a composition of k DRAMs that together provide a data bus k times as wide as a single DRAM data bus. Given a fixed chunk size of b for a single DRAM, Nemo increases the scale of a batch load by k times, which requires each of the Q queues to maintain up to kb bytes of data. Whenever kb bytes of data have accumulated in a queue, they are written into the k DRAMs through the shared data bus. In this way, the size gap between a cell and a chunk is compensated. One major drawback of Nemo is that the first (kb − 1) bytes of data cannot depart from the queue until the last bit arrives.
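A minimal sketch of the write-side behavior just described, with class and method names that are illustrative rather than taken from the paper: each queue stages arriving cells in SRAM and only flushes to the k DRAMs once a full kb-byte batch is available, which is also why the first kb − 1 bytes cannot depart before the batch completes.

    # Sketch of the kb-byte batching behavior described for the extended Nemo design.
    # All names here are illustrative, not from the paper.
    class BatchedQueue:
        def __init__(self, k, b):
            self.batch_size = k * b      # a full batch spans all k DRAMs, b bytes each
            self.sram = bytearray()      # per-queue staging buffer held in SRAM

        def enqueue(self, cell: bytes):
            """Accumulate a cell; flush to DRAM only once kb bytes are staged."""
            self.sram.extend(cell)
            while len(self.sram) >= self.batch_size:
                batch = self.sram[:self.batch_size]
                self.sram = self.sram[self.batch_size:]
                self.write_to_drams(batch)

        def write_to_drams(self, batch):
            # One b-byte chunk goes to each of the k DRAMs over the shared wide bus.
            pass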
DISTRIBUTED PACKET BUFFER ARCHITECTURE
In our view, all packet buffering techniques proposed so far have adopted a traffic-agnostic approach in designing their buffer management algorithms. We must clarify that even though existing approaches do use Q queues, each queue is treated identically by the buffer management algorithm. No effort is made to exploit the inherent characteristics of the corresponding traffic patterns, such as the arrival rate, burst sizes, or transit-time requirements through the router. However, we believe that a traffic-aware approach to the problem will yield new possibilities for conquering the scalability problem.
In this paper, we investigate a new dimension to the
problem, viz. how to extend the packet buffer architectures
by using independent packet buffer subsystems. The overall
packet buffer now takes the form of a distributed system
composed of several compact packet buffers. We profess
that the only real requirement for a packet buffer is that it
should be able to absorb incoming traffic at a given rate,
and maintain the outgoing traffic at the same rate, while
still supporting the requirements for the different data
streams transiting through the buffer.
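Although the paper's own load-balancing algorithm is presented later, the distributed idea can be illustrated with a deliberately simple dispatcher that spreads arriving packets over several independent buffer subsystems. The least-loaded policy and all names below are assumptions made purely for illustration, not the authors' algorithm:

    # Illustrative dispatcher over independent buffer subsystems.
    # The least-loaded policy and all names are illustrative assumptions only;
    # the paper's actual load-balancing algorithm is described later in the text.
    class DistributedBuffer:
        def __init__(self, num_subsystems):
            self.loads = [0] * num_subsystems    # bytes currently buffered per subsystem

        def dispatch(self, packet_len: int) -> int:
            """Send the packet to the subsystem with the least outstanding data."""
            i = min(range(len(self.loads)), key=self.loads.__getitem__)
            self.loads[i] += packet_len
            return i

        def depart(self, subsystem: int, packet_len: int):
            self.loads[subsystem] -= packet_len

As long as each subsystem can absorb and drain its share of the traffic, the aggregate behaves as a single large buffer that sustains the full line rate.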