03-05-2011, 09:23 AM
Abstract
There is significant interest in the data mining and network management communities about the need to improve existing
techniques for clustering multivariate network traffic flow records so that we can quickly infer underlying traffic patterns. In this paper,
we investigate the use of clustering techniques to identify interesting traffic patterns from network traffic data in an efficient manner. We
develop a framework to deal with mixed type attributes including numerical, categorical, and hierarchical attributes for a one-pass
hierarchical clustering algorithm. We demonstrate the improved accuracy and efficiency of our approach in comparison to previous
work on clustering network traffic.
1 INTRODUCTION
THERE is a growing need for efficient algorithms to detect
important trends and anomalies in network traffic
data. For example, network managers need to understand
user behavior in order to plan network capacity. As
network capacities increase, traffic analysis tools face the
problem of scalability due to high packet arrival rates and
limited memory. In this paper, we present a hierarchical
clustering scheme for identifying significant traffic flow
patterns. In particular, we present a novel way of
exploiting the hierarchical structure of traffic attributes
such as IP addresses, in combination with categorical and
numerical attributes. This scheme addresses the above
problems in previous approaches [2], [3], [4], [21], [22] of
network traffic analysis as it is a one-pass fixed memory
clustering algorithm. We demonstrate the advantages of
our clustering algorithm in terms of improved accuracy
and significantly reduced computation time in comparison
to an earlier approach, on a standard benchmark data set.
A key challenge in clustering multidimensional network
traffic data is the need to deal with various types of
attributes: numerical attributes with real values, categorical
attributes with unranked nominal values, and attributes
with a hierarchical structure. For example, byte counts are
numerical, protocols are categorical, and IP addresses have
a hierarchical structure. A key issue for these schemes is
how to represent a distance function that incorporates
hierarchical attributes to help find meaningful clusters. We
have proposed an approach to clustering network traffic
data that exploits the hierarchical structure present in the
data. In network traffic, a hierarchical relation between two
IP addresses can reflect traffic flow to or from a common
subnetwork. The hierarchical representation of such attributes
thus gives more meaning to a general cluster, which
can reflect a trend in traffic flows. We propose a common
framework to incorporate such hierarchical attributes in the
distance function of our clustering algorithm.
Another key challenge in identifying network traffic
patterns is computational efficiency for high bandwidth
networks and limited computational resources. The second
contribution of this paper is the use of a hierarchical
clustering technique to address the problems suffered by
existing algorithms in terms of their need to make multiple
passes through the data set. Our algorithm presents an
efficient way of generating cluster reports since it uses just
one-pass over the data and builds the entire cluster tree in
memory without resorting to secondary memory accesses.
In order to keep the size of the reports small, we present a
number of summarization techniques over the cluster tree.
In the next section, we briefly summarize existing
research on identifying trends in network traffic. In
Section 3, we define numerical, categorical, and hierarchical
attributes and present a framework for combining their
corresponding distance functions. In particular, in Section
3.3, we propose a set of requirements on a similarity
measure for hierarchical attributes. Based on these requirements,
we survey existing hierarchical similarity measures
and propose three new measures. In Section 4, we present
the details of our hierarchical clustering and summarization
algorithm called Echidna. We demonstrate the effectiveness
of our approach using an empirical evaluation where we
compare against two existing network traffic clustering
techniques in Section 5.
2 RELATED WORK
The problem of monitoring and characterizing network
traffic arises in the context of a variety of network management
functions. Traffic monitoring is used in configuration
management for tasks such as estimating the traffic demands
between different points in the network. In performance
management, traffic monitoring can be used to determine
whether the measured traffic levels exceed the allocated
network capacity, thus causing congestion or delays. When a
fault occurs in the network, traffic monitoring is used in fault
management to help locate the source of the fault, based on
changes in the traffic levels through the surrounding
network elements. In accounting management, traffic monitoring
is needed to measure the network usage by each
customer, so that costs can be charged accordingly. Finally,
network traffic monitoring can be used in security management
to identify unusual traffic flows, which may be caused
by a denial-of-service attack or other forms of misuse.
In order to support these network management functions,
a variety of network monitoring techniques have been
developed to characterize traffic. We categorize these
techniques into traffic matrix measurement, traffic dynamics
measurement, traffic volume measurement, and
traffic mixture measurement.
Traffic matrix. The aim of traffic matrix measurement is
to estimate the volume of traffic between origin and
destination points in the network for capacity planning.
General approaches to this problem include network
tomography and direct measurement. Network tomography
[47] aims to indirectly infer end-to-end traffic demands
based on traffic measurements on each link in the network,
for example, using SNMP link byte counts [5]. This is an
underconstrained problem, and numerous approaches have
been proposed [9], [20], [33] to provide additional prior
information about where traffic is likely to be headed. In
contrast, direct measurement maintains a digest of traffic
flows at each origin point [25]. These digests are then
merged at a central point to find the end point of each flow.
Traffic volume. The aim of traffic volume measurement is
to determine the total traffic sent or received in a network.
Of particular interest is the problem of measuring network
usage of customers. This involves aggregating the total byte
or packet count for each source IP address. This type of
measurement has become important for accounting management
as Internet Service Providers have moved from
time-based accounting to usage-based accounting of customer
charges [7]. Traffic volume measurement is also used in
performance management and security management to
identify heavy users of the network, who may be causing
congestion in the network. For example, Roh and Yoo [6]
propose measuring the ratio of packet count to byte count
as a measure to identify abnormal flows. There are several
existing tools for traffic volume analysis [17]. Some tools
[13] graphically depict the variation of traffic volume, for
example, flow scan [15]. Other tools provide “top K reports”
of heaviest usage, such as cflowd [15] and flow tools [18].
These tools provide visual clues of changes in user behavior
at a very high level, for example, by providing a graphical
report of IP addresses that are sending the most traffic. A
problem with this approach to reporting is that it tells us
nothing about sources that send only a small volume of
traffic. If these small flows are combined, then they may
form a large proportion of the overall traffic. Consequently,
these trends may be overlooked unless we can identify
patterns among traffic flows. Moreover, graphical tools
generally cannot cope well with visualizing traffic with high
dimensions and fail to generalize any underlying patterns.
Thus, there is a need for monitoring techniques that can
aggregate traffic by attributes other than IP address alone.
Traffic dynamics. The aim of monitoring traffic dynamics
is to measure the temporal variation in Internet traffic, for
example, to determine the size of buffers, or the extent
to which links need to be overdimensioned [38]. Since
traditional Poisson models for traffic arrivals fail to account
for the burstiness of Internet traffic [46], there has been
considerable interest in empirical models based on traffic
measurements [19]. In performance management, monitoring
traffic dynamics is used to test the stability of the
network [48], [36]. The types of traffic metrics of interest
include packet delay, packet loss, and the available
bandwidth of bottleneck links [45]. In contrast with the
problem of measuring traffic dynamics, our focus is on the
challenge of monitoring the volume and mixture of flows
within a given sample of network traffic.
Traffic mixture. As mentioned before, when traffic volume
data is aggregated over time, it can reveal important features
of network usage for performance and security management.
Bradford et al. [34] studied aggregated traffic volume
and showed that signal analysis on data aggregation at
certain levels of network traffic helps distinguish among
four broad classes of network anomalies, namely, outages,
flash crowds, attacks, and measurement failures. Kim et al.
[42] suggest a similar technique for traffic anomaly detection
based on analyzing correlations of destination IP addresses
in outgoing traffic. This address correlation data is modeled
using a discrete wavelet transformation to detect anomalies.
Estan et al. [8] address the problem of finding patterns in
network traffic by proposing a frequent item set mining
algorithm. Their tool, called AutoFocus [14], describes the
traffic mix on a network link by using textual reports and
time series plots. It also produces concise reports that can
show general trends in the data. In Section 2.1, we describe a
frequent item set mining of network data in more detail.
Cormode et al. [11], [12] have argued that building an exact
multidimensional lattice is prohibitively expensive and offer
approximate count solutions for a data stream environment.
Kim et al. [28] use the combination of rule-based flow header
detection and a traffic aggregation algorithm. Chhabra et al.
[35] propose a randomized algorithm that is similar to the
technique by Estan et al. [8], which aggregates flows with
similar field values to yield signatures of network traffic. A
major issue with these techniques is that they are computationally
intensive and, hence, do not scale well when
analyzing large volumes of traffic. Some of the other works
which deal with minimizing the effect of large data sets
includes the use of sampling [21], [32], flow histogram
analysis [1], and sketches [31]. Although our work shares a
common philosophy of traffic aggregation using traffic
headers with the techniques mentioned above, our approach
differs in the efficiency and expressiveness of our clustering
technique.
Download full report
http://www.cs.mu.oz.au/~abdun/TR01112005.pdf