Abstract—Networks are getting larger and more complex,
yet administrators rely on rudimentary tools such as ping and
traceroute to debug problems. We propose an automated and
systematic approach for testing and debugging networks called
“Automatic Test Packet Generation” (ATPG). ATPG reads router
configurations and generates a device-independent model. The
model is used to generate a minimum set of test packets to (minimally)
exercise every link in the network or (maximally) exercise
every rule in the network. Test packets are sent periodically, and
detected failures trigger a separate mechanism to localize the
fault. ATPG can detect both functional (e.g., incorrect firewall
rule) and performance problems (e.g., congested queue). ATPG
complements but goes beyond earlier work in static checking
(which cannot detect liveness or performance faults) or fault
localization (which only localize faults given liveness results). We
describe our prototype ATPG implementation and results on two
real-world data sets: Stanford University’s backbone network and
Internet2. We find that a small number of test packets suffices to
test all rules in these networks: For example, 4000 packets can
cover all rules in Stanford backbone network, while 54 are enough
to cover all links. Sending 4000 test packets 10 times per second
consumes less than 1% of link capacity. ATPG code and the data
sets are publicly available.
INTRODUCTION
“Only strong trees stand the test of a storm.”
—Chinese idiom
IT IS notoriously hard to debug networks. Every day,
network engineers wrestle with router misconfigurations,
fiber cuts, faulty interfaces, mislabeled cables, software bugs,
intermittent links, and a myriad other reasons that cause networks
to misbehave or fail completely. Network engineers
hunt down bugs using the most rudimentary tools (e.g., ping,
traceroute, SNMP, and tcpdump) and track down root causes
using a combination of accrued wisdom and intuition. Debugging
networks is only becoming harder as networks are getting
bigger (modern data centers may contain 10 000 switches, a
campus network may serve 50 000 users, a 100-Gb/s long-haul link may carry 100 000 flows) and are getting more complicated
(with over 6000 RFCs, router software is based on millions of
lines of source code, and network chips often contain billions
of gates). It is small wonder that network engineers have been
labeled “masters of complexity” [32]. Consider two examples.
Example 1: Suppose a router with a faulty line card starts
dropping packets silently. Alice, who administers 100 routers,
receives a ticket from several unhappy users complaining about
connectivity. First, Alice examines each router to see if the
configuration was changed recently and concludes that the
configuration was untouched. Next, Alice uses her knowledge
of the topology to triangulate the faulty device with ping and
traceroute. Finally, she calls a colleague to replace the line
card.
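To make the triangulation step concrete, the sketch below (a Python illustration with an invented topology, not code from the paper) intersects the forwarding paths of failing probes and excludes devices seen on passing ones, which is essentially what Alice does by hand.

```python
# Illustrative sketch: localize a suspect device by intersecting the paths of
# probes that failed and excluding devices on paths that still work.
# The router names and paths below are invented for the example.

def triangulate(failed_paths, passed_paths):
    """Return devices that lie on every failing path but on no passing path."""
    suspects = set(failed_paths[0])
    for path in failed_paths[1:]:
        suspects &= set(path)        # must appear on every failing probe's path
    for path in passed_paths:
        suspects -= set(path)        # cannot appear on a working path
    return suspects

failed = [["r1", "r4", "r7"], ["r2", "r4", "r7"]]    # probes that timed out
passed = [["r3", "r5", "r7"]]                        # probes that succeeded
print(triangulate(failed, passed))                   # {'r4'}: the router with the faulty card
```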
Example 2: Suppose that video traffic is mapped to a specific
queue in a router, but packets are dropped because the token
bucket rate is too low. It is not at all clear how Alice can track
down such a performance fault using ping and traceroute.
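The sketch below illustrates why this fault is invisible to ping and traceroute: a token bucket whose configured rate is below the video rate drops a large fraction of packets while the path itself stays reachable. The rates, burst size, and packet sizes are invented for illustration.

```python
# Illustrative token bucket (rates and sizes invented): a rate configured below
# the video flow's rate silently drops packets even though the path is "up".

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # token refill in bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes

    def offer(self, pkt_bytes, dt):
        """Advance time by dt seconds, then try to forward one packet."""
        self.tokens = min(self.capacity, self.tokens + self.rate * dt)
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True                     # forwarded
        return False                        # dropped

# Video offered at ~2 Mb/s (1500-byte packets every 6 ms); bucket misconfigured to 0.5 Mb/s.
bucket = TokenBucket(rate_bps=500_000, burst_bytes=15_000)
forwarded = sum(bucket.offer(1_500, dt=0.006) for _ in range(1_000))
print(f"forwarded {forwarded} of 1000 packets")   # only roughly a quarter get through
```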
Troubleshooting a network is difficult for three reasons. First,
the forwarding state is distributed across multiple routers and
firewalls and is defined by their forwarding tables, filter rules,
and other configuration parameters. Second, the forwarding
state is hard to observe because it typically requires manually
logging into every box in the network. Third, there are
many different programs, protocols, and humans updating the
forwarding state simultaneously. When Alice uses ping and
traceroute, she is using a crude lens to examine the current
forwarding state for clues to track down the failure.
Fig. 1 is a simplified view of network state. At the bottom of
the figure is the forwarding state used to forward each packet,
consisting of the L2 and L3 forwarding information base (FIB),
access control lists, etc. The forwarding state is written by
the control plane (that can be local or remote as in the SDN
model [32]) and should correctly implement the network administrator’s
policy. Examples of the policy include: “Security group X is isolated from security group Y,” “Use OSPF for
routing,” and “Video traffic should receive at least 1 Mb/s.”
We can think of the controller compiling the policy (A) into
device-specific configuration files (B), which in turn determine
the forwarding behavior of each packet (C). To ensure the network
behaves as designed, all three steps should remain consistent
at all times, i.e., A = B = C. In addition, the topology,
shown to the bottom right in the figure, should also satisfy a set
of liveness properties L. Minimally, L requires that sufficient
links and nodes are working; if the control plane specifies that a
laptop can access a server, the desired outcome can fail if links
fail. L can also specify performance guarantees that detect flaky
links.
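As a toy illustration of the A = B = C relationship (not the paper's checker), the snippet below models a single bandwidth policy at all three layers and flags the mismatch from Example 2; the rule contents are invented.

```python
# Toy model of Fig. 1's layers for one policy statement; all values are invented.
policy_A = {"video": {"min_rate_mbps": 1}}                  # operator intent (A)
config_B = {"video": {"queue": 3, "rate_mbps": 1}}          # compiled device config (B)
fib_C    = {"video": {"queue": 3, "rate_mbps": 0.2}}        # state in the data plane (C)

def consistent(a, b, c):
    """Check A = B = C for the bandwidth policy."""
    a_eq_b = b["video"]["rate_mbps"] >= a["video"]["min_rate_mbps"]
    b_eq_c = c["video"] == b["video"]
    return a_eq_b and b_eq_c

print(consistent(policy_A, config_B, fib_C))   # False: B != C, as in Example 2
```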
Recently, researchers have proposed tools to check that A = B,
enforcing consistency between policy and the configuration
[7], [16], [25], [31]. While these approaches can find
(or prevent) software logic errors in the control plane, they are
not designed to identify liveness failures caused by failed links
and routers, bugs caused by faulty router hardware or software,
or performance problems caused by network congestion. Such
failures require checking for L and whether B = C. Alice's
first problem was with L (link not working), and her second
problem was with B = C (low-level token bucket state not
reflecting policy for video bandwidth).
In fact, we learned from a survey of 61 network operators (see
Table I in Section II) that the two most common causes of network
failure are hardware failures and software bugs, and that
problems manifest themselves both as reachability failures and
throughput/latency degradation. Our goal is to automatically detect
these types of failures.
The main contribution of this paper is what we call an Automatic
Test Packet Generation (ATPG) framework that automatically
generates a minimal set of packets to test the liveness of
the underlying topology and the congruence between data plane
state and configuration specifications. The tool can also automatically
generate packets to test performance assertions such
as packet latency. In Example 1, instead of Alice manually deciding
which ping packets to send, the tool does so periodically
on her behalf. In Example 2, the tool determines that it must send
packets with certain headers to “exercise” the video queue, and
then determines that these packets are being dropped.
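A minimal sketch of that periodic testing loop is shown below; send_probe and localize are hypothetical placeholders standing in for the probing and fault-localization machinery, not ATPG APIs.

```python
import time

# Minimal sketch of a periodic test loop (send_probe and localize are placeholders).
def run_tests(test_packets, send_probe, localize, interval_s=0.1):
    """Each round, send every test packet and hand any failures to localization."""
    while True:
        failed = [pkt for pkt in test_packets if not send_probe(pkt)]
        if failed:
            suspects = localize(failed)      # e.g. intersect the failed packets' paths
            print(f"{len(failed)} probes failed; suspect devices: {suspects}")
        time.sleep(interval_s)               # e.g. 10 rounds per second
```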
ATPG detects and diagnoses errors by independently and exhaustively
testing all forwarding entries, firewall rules, and any
packet processing rules in the network. In ATPG, test packets
are generated algorithmically from the device configuration files
and FIBs, with the minimum number of packets required for
complete coverage. Test packets are fed into the network so
that every rule is exercised directly from the data plane. Since
ATPG treats links just like normal forwarding rules, its full coverage
guarantees testing of every link in the network. It can also
be specialized to generate a minimal set of packets that merely
test every link for network liveness. At least in this basic form,
we feel that ATPG or some similar technique is fundamental
to networks: Instead of reacting to failures, many network operators
such as Internet2 [14] proactively check the health of
their network using pings between all pairs of sources. However,
all-pairs ping does not guarantee testing of all links and
has been found to be unscalable for large networks such as
PlanetLab.
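To sketch how a near-minimal covering set might be chosen once the rules each candidate packet exercises are known, the snippet below applies a standard greedy set-cover heuristic; the coverage data is invented, and this is an illustration rather than the paper's exact selection algorithm.

```python
# Greedy set-cover sketch: pick a small set of test packets that together
# exercise every rule (links are treated like rules). Coverage data is invented.

def select_test_packets(candidates):
    """candidates: packet id -> set of rule ids that packet would exercise."""
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Greedily take the packet covering the most still-uncovered rules.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break                            # remaining rules cannot be exercised
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

coverage = {
    "pkt_a": {"fib1", "acl2", "link_s1_s2"},
    "pkt_b": {"fib3", "link_s2_s3"},
    "pkt_c": {"fib1", "fib3"},               # redundant given pkt_a and pkt_b
}
print(select_test_packets(coverage))          # ['pkt_a', 'pkt_b']
```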