17-08-2013, 04:25 PM
Load-Balancing Multipath Switching System with Flow Slice
Load-Balancing Multipath .pdf (Size: 2.25 MB / Downloads: 56)
Abstract
Multipath Switching systems (MPS) are intensely used in state-of-the-art core routers to provide terabit or even petabit
switching capacity. One of the most intractable issues in designing MPS is how to load balance traffic across its multiple paths while
not disturbing intraflow packet order. Previous packet-based solutions either suffer from delay penalties or lead to O(N²)
hardware complexity, and hence do not scale. Flow-based hashing algorithms also perform poorly due to the heavy-tailed flow-size
distribution. In this paper, we develop a novel scheme, namely, Flow Slice (FS) that cuts off each flow into flow slices at every intraflow
interval larger than a slicing threshold and balances the load on a finer granularity. Based on the studies of tens of real Internet traces,
we show that with the slicing threshold set to 1-4 ms, the FS scheme achieves load-balancing performance comparable to the optimal
one. It also limits the probability of out-of-order packets to a negligible level (10⁻⁶) on three popular MPSes, at the cost of little hardware
complexity and an internal speedup of up to two. These results are proven by theoretical analyses and validated through trace-driven
prototype simulations.
INTRODUCTION
Multipath Switching systems (MPS) play a pivotal role
in building state-of-the-art high-performance core
routers. A well-known paradigm is the deployment of Benes
multistage switches in Cisco CRS-1 [1]. Other examples
include the Vitesse switch chip family [3] implementing the
Parallel Packet Switch (PPS), and the Load-Balanced Birkhoff-von
Neumann (LBvN) switches [9], [10]. In general, an MPS
is built by aggregating several lower-speed switches and,
therefore, exhibits multiple internal data paths.
One major open issue in MPS is the load-balancing
problem, defined as how to distribute the incoming traffic A(t)
across its k internal switching paths so as to meet at least
three objectives simultaneously: 1) uniform load distribution across the paths, 2) preservation of intraflow packet order, and 3) low hardware and timing complexity.
RELATED WORK
A basic timestamp-based resequencer was proposed by
Turner [30] to deal with the cell out-of-order issue that arises when
cells within a virtual circuit are allowed to take different
paths in ATM switching systems. At each scheduling step, the
resequencer runs a smart contention resolution
mechanism to select the oldest cell and then compares its
age with a predefined threshold equal to the system's delay
upper bound. If the selected cell exceeds this age threshold,
it can be sent without disturbing cell order, since all
previously arrived cells have already left the system. Henrion
improved the timestamp-based resequencer by introducing
time-wheel-like hardware [17], which does not need to
compare cells' timestamps at the output. It maintains an array
of D pointer lists, where D is larger than the system's delay
upper bound measured in timeslots. Each pointer in a list
stores the location of a cell waiting for resequencing. At
timeslot t, the pointer list at slot t modulo D is linked to
the tail of the output cell list and removed from the array.
After that, pointers to the cells arriving at timeslot t are linked
together and stored in the now-empty slot (t modulo D) of the
array. This approach delays every cell by a fixed D timeslots
with O(1) complexity and strictly guarantees cell order.
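For readers who prefer code to prose, the following Python sketch models the time-wheel behavior described above (class and method names are ours; we also assume each cell is filed under its entry timestamp, which is one natural reading of the fixed-delay guarantee): the slot t modulo D is drained at every timeslot, and a cell filed under entry timeslot s therefore leaves exactly D timeslots after it entered the switch.

```python
from collections import deque

class TimeWheelResequencer:
    """Minimal model of a time-wheel resequencer: an array of D pointer
    lists. A cell that entered the switch at timeslot s is filed into slot
    s mod D and released at timeslot s + D, so every cell experiences the
    same fixed delay D and cells leave in their original entry order."""

    def __init__(self, D):
        # D must exceed the system's internal delay upper bound (in timeslots)
        self.D = D
        self.wheel = [deque() for _ in range(D)]

    def file(self, cell, entry_ts):
        """Called when a (possibly reordered) cell reaches the output side."""
        self.wheel[entry_ts % self.D].append(cell)

    def drain(self, t):
        """Called once per timeslot t: empty the slot whose delay expired."""
        slot = t % self.D
        released, self.wheel[slot] = list(self.wheel[slot]), deque()
        return released

# toy run: cell B overtakes cell A inside the switch, yet they depart in order
rsq, order = TimeWheelResequencer(D=4), []
for t in range(8):
    order += rsq.drain(t)
    if t == 2:
        rsq.file("cell-B", entry_ts=1)   # entered at timeslot 1, arrives first
    if t == 3:
        rsq.file("cell-A", entry_ts=0)   # entered at timeslot 0, arrives later
print(order)                             # ['cell-A', 'cell-B'], released at t = 4 and 5
```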
FLOW SLICE
In this section, we study flow-slice properties using real
traces listed in Table 1 [2]. The traces are collected at
separate locations from the network core to edge/access
points; hence, they represent a broad range of Internet
traffic behaviors. Each data set records consecutive packet
headers (TCP/IP) during an observation period of 1-10 min.
The original 5-tuple in each header is remapped to a new
address space using a prefix-preserving mapping, so the traces
still retain their original flow-level characteristics.
The timestamp at each packet header has a
resolution of at least 1 microsecond.
Definition
Flow Slice
A flow slice is a sequence of packets in a flow, where every
intraflow interval between two consecutive packets is
smaller than or equal to a slicing threshold.
Flow slices can be seen as miniflows created by cutting
off every intraflow interval larger than the slicing threshold. In Fig. 2, we
depict the Cumulative Distribution Functions (C.D.F) of
intraflow intervals in our traces. Most of the traces have
more than 50 percent of their intervals larger than 1 ms, and
more than 40 percent are larger than 4 ms.
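To make the definition concrete, the short sketch below (our own illustration, not code from the paper) cuts a single flow's packet arrival times into flow slices: a new slice starts whenever the gap to the previous packet exceeds the slicing threshold.

```python
def slice_flow(timestamps, threshold):
    """Split one flow's packet arrival times (seconds, ascending) into
    flow slices: a new slice starts whenever the intraflow interval
    between two consecutive packets exceeds the slicing threshold."""
    slices = []
    for ts in timestamps:
        if slices and ts - slices[-1][-1] <= threshold:
            slices[-1].append(ts)     # gap <= threshold: same flow slice
        else:
            slices.append([ts])       # gap > threshold (or first packet): new slice
    return slices

# a flow with a 6 ms silent gap is cut into two slices at a 4 ms threshold
print(slice_flow([0.000, 0.001, 0.003, 0.009, 0.010], threshold=0.004))
# -> [[0.0, 0.001, 0.003], [0.009, 0.01]]
```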
Load-Balancing Scheme for MPS
Fig. 8 illustrates the proposed FS scheme and compares it to
the packet-based and flow-based load-balancing schemes.
We describe the FS scheme in two steps: 1) an introduction
of the mechanism to detect and maintain flow-slice context,
and 2) an algorithm to map new flow slices into individual
switching paths.
First, we implement a hash table to maintain flow-slice
information. Every table entry is expected to record the
context of one active flow slice and is composed of two fields
apart from the table index: Latest Arrival Timestamp (T_L),
which holds the exact time of the latest packet arrival of this
flow slice; and Flow-slice Destination Path (P_D), indicating the
switching path taken by all the packets in this flow slice.
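The per-packet logic can be summarized by the following sketch (the field names T_L and P_D follow the text; the table indexing by a CRC of the flow identifier and the choice of the least-loaded path for each new flow slice are our assumptions, not necessarily the exact policy of the paper): if the gap since T_L exceeds the slicing threshold, a new flow slice begins and may safely be mapped to a fresh path; otherwise the packet follows the path already recorded in P_D.

```python
import time
import zlib

class FlowSliceBalancer:
    """Sketch of the flow-slice table: each entry keeps a Latest Arrival
    Timestamp (T_L) and a Flow-slice Destination Path (P_D). The hash
    function and the least-loaded path choice are our assumptions."""

    def __init__(self, num_paths, table_size, threshold):
        self.threshold = threshold                    # slicing threshold (seconds)
        self.loads = [0] * num_paths                  # bytes sent per path
        self.t_l = [float("-inf")] * table_size       # T_L; -inf so the first packet opens a slice
        self.p_d = [0] * table_size                   # P_D per entry
        self.size = table_size

    def dispatch(self, flow_id, pkt_len, now=None):
        """Return the switching path for this packet."""
        now = time.time() if now is None else now
        idx = zlib.crc32(flow_id.encode()) % self.size
        if now - self.t_l[idx] > self.threshold:
            # gap exceeds the threshold: a new flow slice starts and may be
            # remapped without reordering risk; here we pick the least-loaded path
            self.p_d[idx] = self.loads.index(min(self.loads))
        self.t_l[idx] = now                           # refresh T_L on every packet
        self.loads[self.p_d[idx]] += pkt_len
        return self.p_d[idx]

# usage: two back-to-back packets stay on one path; after a long gap the
# flow is re-sliced and may move to another path
lb = FlowSliceBalancer(num_paths=4, table_size=1 << 16, threshold=0.004)
print(lb.dispatch("10.0.0.1:1234->10.0.0.2:80/tcp", 1500, now=0.000))   # path 0
print(lb.dispatch("10.0.0.1:1234->10.0.0.2:80/tcp", 1500, now=0.001))   # path 0 (same slice)
print(lb.dispatch("10.0.0.1:1234->10.0.0.2:80/tcp", 1500, now=0.010))   # path 1 (new slice)
```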
Multistage Multiplane Clos Switches
We consider the Multistage Multiplane Clos-network-
based switch by Chao et al. [13] depicted in Fig. 12 (called
M2 Clos in brief). It is constructed of five stages of switch
modules with top-level architecture similar to a PPS with
nk external input/output ports. The first and last stages of
M2 Clos are composed of nk input demultiplexers and
output multiplexers, respectively, having similar internal
structures as those in PPS. Stages 2-4 of M2 Clos are
constructed from p parallel nk × nk switching planes; however,
each plane is no longer formed by a basic OQ switch,
but by a three-stage Clos Network to support large port
count. Inside each Clos Network, the first stage (stage 2 of
M2 Clos) is composed of k identical Input Modules (IMs).
Each IM is an n × m packet switch, with each output link
connected to a Central Module (CM).
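To keep the stages straight, the sketch below tallies the modules implied by these parameters; the CM and OM counts follow the standard three-stage Clos construction and are our inference, not a quotation from [13].

```python
def m2clos_inventory(n, m, k, p):
    """Module inventory of an M2 Clos switch with nk external ports and
    p parallel nk x nk Clos planes (standard Clos structure assumed:
    each plane has k IMs of size n x m, m CMs of size k x k, k OMs of size m x n)."""
    return {
        "external ports":        n * k,
        "input demultiplexers":  n * k,   # stage 1
        "output multiplexers":   n * k,   # stage 5
        "switching planes":      p,       # stages 2-4
        "IMs per plane (n x m)": k,
        "CMs per plane (k x k)": m,
        "OMs per plane (m x n)": k,
    }

# the prototype configuration used later in the evaluation: m = n = k = 8, p = 7
print(m2clos_inventory(n=8, m=8, k=8, p=7))
```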
PERFORMANCE EVALUATION
Prototype Configuration
We build prototypes of all three MPSes through software
modeling. Our PPS prototype has the same architecture as
the model given in Fig. 10a, except that there is no modeling
of the queuing inside the multiplexers, since the multiplexers
do not reorder packets. In our prototype, each packet
departs from the PPS once it leaves the switching plane.
(There is one exception for PPS with VIQ resequencer: the
multiplexer is simulated since it plays a critical role in this
case.) The PPS prototype has 32 external ports and eight
switching planes. Each port works at a 40 Gbps line rate and
each switching plane operates at 5 Gbps with no speedup.
The OQ at the central switching plane and the LB-FIFO at
the demultiplexer each have a buffer size of 10 MB. Our
LBvN prototype has 32 external ports, each working at
40 Gbps; our M2 Clos prototype has 64 external ports
(m = n = k = 8), each working at 35 Gbps, and seven
switching planes (p and m are relatively prime), each
operating at 5 Gbps. The other specifications of LBvN and
M2 Clos are the same as those of PPS.
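As a quick arithmetic check of these figures (our own bookkeeping, not from the paper), the aggregate plane capacity seen by one external port matches its line rate in both the PPS and M2 Clos prototypes, i.e., neither configuration carries internal speedup:

```python
# per-port internal capacity across the parallel planes vs. the external line rate
prototypes = {
    "PPS":     {"line_rate_gbps": 40, "planes": 8, "plane_rate_gbps": 5},
    "M2 Clos": {"line_rate_gbps": 35, "planes": 7, "plane_rate_gbps": 5},
}
for name, cfg in prototypes.items():
    internal = cfg["planes"] * cfg["plane_rate_gbps"]
    ratio = internal / cfg["line_rate_gbps"]
    print(f"{name}: {internal} Gbps across planes vs {cfg['line_rate_gbps']} Gbps "
          f"line rate -> speedup {ratio:.2f}")
```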
CONCLUSION, DISCUSSION AND FUTURE WORK
We propose a novel load-balancing scheme, namely, Flow
Slice, based on the fact that the intraflow packet interval is
often (in 40-50 percent of cases) larger than the delay upper
bound of the MPS. Due to three positive properties of the flow slice,
our scheme achieves good load-balancing uniformity with
little hardware overhead and O(1) timing complexity. By
calculating delay bounds at three popular MPSes, we show
that when the slicing threshold is set to the smallest
admissible value of 1-4 ms, the FS scheme can achieve
optimal performance while keeping the intraflow packet
out-of-order probability negligible (below 10⁻⁶), given an
internal speedup of up to two. Our results are also validated
through trace-driven prototype simulations under highly
bursty traffic patterns.