Automatic Network Protocol Analysis

Abstract

Protocol reverse engineering is the process of extracting application-level specifications for network protocols. Such specifications are very helpful in a number of security-related contexts. For example, they are needed by intrusion detection systems to perform deep packet inspection, and they allow the implementation of black-box fuzzing tools. Unfortunately, manual reverse engineering is a time-consuming and tedious task. To address this problem, researchers have recently proposed systems that help to automate the process. These systems operate by analyzing traces of network traffic. However, there is limited information available at the network level, and thus, the accuracy of the results is limited.

In this paper, we present a novel approach to automatic protocol reverse engineering. Our approach works by dynamically monitoring the execution of the application, analyzing how the program is processing the protocol messages that it receives. This is motivated by the insight that an application encodes the complete protocol and represents the authoritative specification of the inputs that it can accept. In a first step, we extract information about the fields of individual messages. Then, we aggregate this information to determine a more general specification of the message format, which can include optional or alternative fields, and repetitions. We have applied our techniques to a number of real-world protocols and server applications. Our results demonstrate that we are able to extract the format specification for different types of messages. Using these specifications, we then automatically generate appropriate parser code.

Introduction

Protocol reverse engineering is the process of extracting application-level protocol specifications. The detailed knowledge of such protocol specifications is invaluable for addressing a number of security problems. For example, it allows the automated generation of protocol fuzzers [24] that perform black-box testing of server programs that accept network input. In addition, protocol specifications are often required for intrusion detection systems [26] that implement deep packet inspection capabilities. These systems typically parse the network stream into segments with application-level semantics, and apply detection rules only to certain parts of the traffic. Generic protocol analyzers such as binpac [25] and GAPA [2] also require protocol grammars as input. Moreover, possessing protocol information helps to identify and understand applications that may communicate over non-standard ports or application data that is encapsulated in other protocols [15, 20]. Finally, knowledge about the differences in the way that certain server applications implement a standard protocol can help a security analyst to perform server fingerprinting [5], or guide testing and security auditing efforts [3].

System design

Automatic protocol reverse engineering is a complex and difficult problem. In the following section, we introduce the problem domain and discuss the specific problems that our techniques address. Then, we provide a high-level overview of the workings of our system.

Problem scope

In [10], the authors introduce a terminology for common protocol idioms that allow a general discussion of the problem of protocol reverse engineering. In particular, the authors observe that most application protocols have a notion of an application session, which allows two hosts to accomplish a specific task. An application session consists of a series of individual messages. These messages can have different types. Each message type is defined by a certain message format specification. A message format specifies a number of fields, for example, length fields, cookies, keywords, or endpoint addresses (such as IP addresses and ports). The structure of the whole application session is determined by the protocol state machine, which specifies the order in which messages of different types can be sent.

Using that terminology, we observe that automatic protocol reverse engineering can target different levels. In the simplest case, the analysis only examines a single message. Here, the goal of the reverse engineering process is to identify the different fields that appear in that message. A more general approach considers a set of messages of a particular type. An analysis process at this level would produce a message format specification that can include optional fields or alternative structures for parts of the message. Finally, in the most general case, the analysis process operates on complete application sessions. In this case, it is not sufficient to extract only the message format specifications; the protocol state machine must also be identified. Moreover, before individual message formats can be extracted, it is necessary to distinguish between messages of different types.
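
The terminology above (message types, message formats made of fields, and a protocol state machine that orders the messages of a session) can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names `MessageFormat` and `ProtocolStateMachine`, the `accepts` method, and the two-message LOGIN/FETCH protocol are all invented for illustration and do not come from the paper or from [10].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MessageFormat:
    """Format specification for one message type (hypothetical model)."""
    name: str
    fields: List[str]  # e.g. length fields, cookies, keywords, addresses


@dataclass
class ProtocolStateMachine:
    """Specifies the order in which messages of different types may be sent."""
    start: str
    # Maps (current state, message type) to the next state.
    transitions: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def accepts(self, session: List[str]) -> bool:
        """Return True if the sequence of message types is a valid session."""
        state = self.start
        for message_type in session:
            key = (state, message_type)
            if key not in self.transitions:
                return False
            state = self.transitions[key]
        return True


# Made-up protocol: a client must send LOGIN before it may send FETCH.
formats = {
    "LOGIN": MessageFormat("LOGIN", ["keyword", "cookie"]),
    "FETCH": MessageFormat("FETCH", ["keyword", "length", "payload"]),
}
machine = ProtocolStateMachine(
    start="init",
    transitions={
        ("init", "LOGIN"): "ready",
        ("ready", "FETCH"): "ready",
    },
)

valid_session = machine.accepts(["LOGIN", "FETCH", "FETCH"])
invalid_session = machine.accepts(["FETCH"])
```

In this model, single-message analysis would recover one instance of a `fields` list, analysis of a set of messages would recover a `MessageFormat`, and session-level analysis would additionally have to recover the `transitions` table.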

Analysis of multiple messages

When analyzing a single protocol message, our system breaks up the byte sequence that makes up this message into a number of fields. As mentioned previously, these fields can be nested, and thus, are stored in a hierarchical (tree) structure. The root node of the tree is the complete message. Both length field and delimiter analyses are used to identify parts of the message as scope fields, delimited fields, length fields, or target fields. Input bytes that cannot be attributed to any such field are treated as individual byte fields or, if they are in a delimiter scope and end at a delimiter, as arbitrary-length token fields. We refer to fields that contain other, embedded fields as complex fields. Fields that cannot be divided further are called basic fields. In the tree hierarchy, complex fields are internal nodes, while basic fields are leaf nodes.
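
The tree structure described above can be sketched as follows. The `Field` class, its attribute names, the example message, and the hand-assigned field kinds are assumptions made for illustration; the tree here is built by hand, not produced by the length field and delimiter analyses of the paper. The sketch only shows the distinction between complex fields (internal nodes) and basic fields (leaves), and that the basic fields together cover every byte of the message.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Field:
    """One node of the field tree; byte range is [start, end)."""
    kind: str
    start: int
    end: int
    children: List["Field"] = field(default_factory=list)

    @property
    def is_complex(self) -> bool:
        # Complex fields contain embedded fields; basic fields do not.
        return bool(self.children)

    def leaves(self) -> Iterator["Field"]:
        """Yield the basic fields below this node, in message order."""
        if not self.children:
            yield self
            return
        for child in self.children:
            yield from child.leaves()


# Made-up message: a 1-byte length, a 5-byte target, then "a,b" after "|".
message = b"\x05hello|a,b"

# Hand-built tree (hypothetical field assignment, for illustration only).
root = Field("message", 0, len(message), [
    Field("length", 0, 1),
    Field("target", 1, 6),
    Field("byte", 6, 7),
    Field("scope", 7, 10, [
        Field("token", 7, 8),
        Field("byte", 8, 9),
        Field("token", 9, 10),
    ]),
])

basic_kinds = [f.kind for f in root.leaves()]
# Concatenating the basic fields reproduces the original byte sequence.
reassembled = b"".join(message[f.start:f.end] for f in root.leaves())
```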

It is possible, and common, that different message instances of the same type do not contain the same fields in the same order. For example, in an HTTP GET request, the client can send multiple header lines with different keywords. Moreover, these headers can appear in an almost arbitrary order. Another example is a DNS query where the requested domain name is split into a variable number of parts, depending on the number of dots in the name. By analyzing only a single message, there is no way for the system to determine whether a protocol requires the format to be exactly as seen, or whether there is some flexibility in the way fields can be arranged. To address this question, and to deliver a general and precise message format specification, information from multiple messages must be combined.
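
To make the need for multiple messages concrete, the sketch below compares the fields seen in several messages of the same type and labels each one as mandatory, optional, or repeatable. It is a deliberately simplified stand-in and not the aggregation technique of the paper: it ignores field order and nesting, and the function name `summarize_fields` and the observed header lists are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def summarize_fields(messages: List[List[str]]) -> Dict[str, str]:
    """Classify each field name seen across messages of one type.

    A field is "repeatable" if some message contains it more than once,
    "optional" if some message omits it, and "mandatory" otherwise.
    """
    counts_per_message = [Counter(m) for m in messages]
    all_names = set().union(*counts_per_message)
    summary = {}
    for name in sorted(all_names):
        occurrences = [counts[name] for counts in counts_per_message]
        if max(occurrences) > 1:
            summary[name] = "repeatable"
        elif min(occurrences) == 0:
            summary[name] = "optional"
        else:
            summary[name] = "mandatory"
    return summary


# Made-up observations of three HTTP GET requests (field keywords only).
observed = [
    ["request-line", "Host", "User-Agent"],
    ["request-line", "Accept", "Host"],
    ["request-line", "Host", "Via", "Via"],
]

summary = summarize_fields(observed)
single = summarize_fields(observed[:1])
```

With only the first message, every field looks mandatory, which is exactly the ambiguity described above; the optional and repeatable fields only become visible once the other two messages are taken into account.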