05-03-2013, 02:41 PM
Handling Multiple Failures in IP Networks through Localized On-Demand Link State Routing
Handling Multiple Failures.pdf (Size: 380.2 KB / Downloads: 46)
Abstract
It has been observed that transient failures are fairly
common in IP backbone networks and there have been several
proposals based on local rerouting to provide high network
availability despite failures. While most of these proposals are
effective in handling single failures, they either cause loops or
drop packets in the case of multiple independent failures. To
ensure forwarding continuity even with multiple failures, we
propose Localized On-demand Link State (LOLS) routing. Under
LOLS, each packet carries a blacklist, which is a minimal set
of failed links encountered along its path, and the next hop is
determined by excluding the blacklisted links. We show that the
blacklist can be reset when the packet makes forward progress
towards the destination and hence can be encoded in a few bits.
Furthermore, blacklist-based forwarding entries at a router can
be precomputed for a given set of failures requiring protection.
While the LOLS approach is generic, this paper describes how it
can be applied to ensure forwarding to all reachable destinations
in case of any two link or node failures. Our evaluation of
this failure scenario based on various real network topologies
reveals that LOLS needs 6 bits in the worst case to convey the
blacklist information. We argue that this overhead is acceptable
considering that LOLS routing deviates from the optimal path
by a small stretch only while routing around failures.
Index Terms—Fast reroute, failure resilience, local rerouting.
INTRODUCTION
THE Internet is increasingly being used for missioncritical
applications and it is expected to be always
available. Unfortunately, service disruptions happen even in
well-managed networks due to link and node failures. There
have been some studies [1]–[3] on frequency, duration, and
type of failures in an IP backbone network. [2] reported that
failures are fairly common and most of them are transient:
46% last less than a minute and 86% last less than ten
minutes. To support emerging time-sensitive applications in
today’s Internet, these networks need to survive failures with
minimal service disruption. For example, a disruption time of
longer than 50 ms is considered intolerable for mission-critical
applications [4]. Therefore, providing uninterrupted service
availability despite transient failures is a major challenge for
service providers.
RELATED WORK
Numerous approaches have been proposed in the past to
make networks more resilient to failures [17]. We categorize
them into schemes that protect against single or correlated
failures and those that can deal with multiple independent
failures. Also, some of them attempt to reduce the routing
overhead by controlling the range/frequency of link state
updates. Another group of schemes construct a set of diverse
paths for forwarding, even across domains.We briefly describe
a few schemes belonging to each of these categories in the
following. Finally, we elaborate on the similarities between
forwarding around failures by LOLS, and forwarding around
voids by geographic routing schemes like GPSR [19].
Single or Correlated Failures: The Not-via approach [20]
locally reroutes a packet around a known failure by encapsulating
the packet to an address that implicitly identifies the failed
network component to be avoided. Another scheme, known as
Multiple Routing Configurations (MRC), proposed in [14] separates
all failures into multiple routing configurations and lets
the packet carry the configuration information upon detecting
a failure, so that the downstream routers can select the path
consistently. Failure Inferencing based Fast Rerouting (FIFR)
infers failures based on a packet’s flight (the inbound interface
on which they are received), precomputes interface-specific
forwarding tables, and triggers local rerouting upon detecting
an adjacent failure. The above and other similar schemes [21]–
[26] offer resilience against single or correlated failures but are
not designed to recover from multiple unrelated failures. They
do, however, help make LOLS practical — blacklist size can
be reduced with interface-face specific forwarding, and not-via
addresses can be used to encode blacklist information.
LOCALIZED ON-DEMAND LINK STATE ROUTING
In this section, we start with an illustration to present the
intuition behind the localized on-demand link state routing
approach. We then formally describe the forwarding operation
at each router along the path traversed by the packet.
Intuition and Illustration
Consider the topology shown in Fig. 1. Suppose each link’s
state is advertised to be up with the labelled cost, but currently
the dashed links are down with cost ∞. Further assume that
the nodes adjacent to a dashed link are aware of its current
failed state (we use state and cost interchangeably as the cost
of a link reflects its state). We refer to the dashed links whose
current cost is worse than their advertised cost as degraded. If
a link’s current cost is worse than advertised, local rerouting
around that link can cause loops. On the other hand, if a link’s
current state is better than advertised, forwarding over that link
is loop-free. Hence, we focus only on degraded links.
Optimality of LOLS
Under both LOLS and FCP, a packet takes the usual shortest
path until it encounters a failure and then gets rerouted
along the alternate path. Consequently, in the presence of
failures, these schemes may forward packets along longer
paths compared to the optimal paths computed based on
the global link state updates. Furthermore, with LOLS, it is
possible that after the packet’s blacklist gets reset, a failure
downstream towards the destination can cause a packet to
backtrack, gather a larger blacklist, and reach the destination
via a longer detour. FCP is not expected to have this problem
as it carries failure information all the way to the destination.
Hence, it is important to investigate whether the gains of
LOLS over FCP in terms of failure propagation and blacklist
size come at the expense of suboptimal paths. Additionally,
we study whether local rerouting of packets with LOLS causes
excessive overloading of some links in the network.
CONCLUSIONS AND FUTURE WORK
In this paper, we presented LOLS, a localized on-demand
link state routing for handling multiple failures in IP backbone
networks. The core idea behind LOLS is to have packets carry
a blacklist of degraded links encountered along the path that
are to be avoided in order to ensure loop-free forwarding. The
key feature of LOLS is that a packet’s blacklist is reset as soon
as it makes forward progress towards the destination, limiting
the propagation of failure information to just a few hops. We
have proved that LOLS guarantees loop-free forwarding to
reachable destinations regardless of the number of failures in
the network. We have evaluated the overhead due to LOLS
using several large real topologies and shown that it scales
better than the recently proposed scheme FCP which has
similar failure resilience objectives. We have also presented
a practical version of LOLS for protecting against predefined
failures, and shown that it needs only a modest number of
header bits or not-via addresses for handling any two link/node
failures. Our plan is to implement a prototype of LOLS using
Mininet system [41] to demonstrate its deployability.