30-10-2010, 12:17 PM
Dispatcher: Botnet Infiltration Using Automatic Protocol Reverse Engineering
Final Year Seminar Report
Submitted by
Arjun S R
Department of Computer Science and Engineering,
College of Engineering,
Trivandrum,
2010
Dispatcher Botnet Infilltration Using Automatic Protocol Reverse Engineering.pdf (Size: 591.5 KB / Downloads: 105)
CONTENTS
Contents
1 Introduction 2
2 Overview & Problem Denition 5
2.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Message Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Field Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Problem denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.6 Obtaining an execution trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Handling obfuscation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Field Semantics Inference 8
3.1 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Extracting Message Format of Sent Packets 11
4.1 Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1.1 Loop analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1.2 Callstack Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.3 Buer Liveness Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Buer Deconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.1 Program locations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.2 Dependency Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.3 Extracting the buer structure . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Field Attribute Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3.1 Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.2 Length elds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.3 Delimiters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.4 Variable-length elds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3.5 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Handling Encrypted Messages 18
5.1 Identifying encoding functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 Identifying the buers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Evaluation 20
6.1 Evaluation on MegaD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.1.1 MegaD C&C Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.1.2 Message format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3
6.1.3 Attribute detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.1.4 Field semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.1.5 Rewriting a MegaD dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2 Evaluation on Open Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2.1 Message format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2.2 Errors on leaf elds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2.3 Attribute detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2.4 Field semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Detecting Encoding Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7 Conclusion 26
Bibliography 28
List of Tables
1 Field semantics identied by Dispatcher for both received and sent messages.
Stored data represents data received over the network and written to the lesystem
or the Windows registry, as opposed to data read from those sources . . . . . . . 9
2 Field attributes used in the message eld tree . . . . . . . . . . . . . . . . . . . . 16
3 Comparison of the message eld tree for sent messages extracted by Dispatcher
and Wireshark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Evaluation of the detection of encoding functions . . . . . . . . . . . . . . . . . . 25
List of Figures
1 Message eld tree for the MegaD Host-Information message . . . . . . . . . . . . 6
2 Buer deconstruction for the MegaD message in Figure1. Each box is a memory
buer starting at address Bx with the byte length in brackets. Note the similarity
with the upsidedown version of Figure1 . . . . . . . . . . . . . . . . . . . . . . . 11
3 Dependency chain for B2 in Figure2. The start address of B2 is A. . . . . . . . . 14
Abstract
Automatic protocol reverse-engineering is important for many security applications, including the
analysis and defense against botnets. Understanding the command-and-control (C&C) protocol
used by a botnet is crucial for anticipating its repertoire of nefarious activity and to enable active
botnet inltration. Frequently, security analysts need to rewrite messages sent and received by
a bot in order to contain malicious activity and to provide the botmaster with an illusion of
successful and unhampered operation. To enable such rewriting, we need detailed information
about the intent and structure of the messages in both directions of the communication despite
the fact that we generally only have access to the implementation of one endpoint, namely
the bot binary. Current techniques cannot enable such rewriting. In this paper, we propose
techniques to extract the format of protocol messages sent by an application that implements a
protocol specication, and to infer the eld semantics for messages both sent and received by the
application. Our techniques enable applications such as rewriting the C&C messages for active
botnet inltration. We implement our techniques into Dispatcher, a tool to extract the message
format and eld semantics of both received and sent messages. We use Dispatcher to analyze
MegaD, a prevalent spam botnet employing a hitherto undocumented C&C protocol, and show
that the protocol information extracted by Dispatcher can be used to rewrite the C&C messages.
1
1 INTRODUCTION
1 Introduction
Automatic protocol reverse-engineering techniques enable extracting the protocol specication
of unknown or undocumented application-level protocols. A detailed protocol specication can
enhance many security applications such as fuzzing, application ngerprinting, deep packet in-
spection, or signature-based ltering.
One important application for automatic protocol reverse engineering is the analysis and
inltration of botnets. Botnets, large networks of infected computers under control of an at-
tacker, are one of the dominant threats in the Internet today. They enable a wide variety of
abusive or fraudulent activities, such as spamming, phishing, click-fraud, and distributed denial-
of-service (DDoS) attacks. At the heart of a botnet is its command-and-control (C&C) protocol,
which enables a bot to locate rendezvous points in the network and provides the botmaster with
a means to coordinate malicious activity in the bot population. Automatic protocol reverse-
engineering can be used for understanding the C&C protocol used by a botnet, revealing a
wealth of information about the capabilities of its bots and the overall intent of the botnet.
In addition to understanding its C&C protocol, an analyst may also be interested in
interacting actively with the botnet. Previous work analyzed the economics of the Storm botnet
by rewriting the commands sent to the bots. Other times, an analyst may want to rewrite
messages sent upstream by the bots, such as when a sites containment policy requires the analyst
to make bots lie about their capabilities and achievements. For example, the analyst may want
to rewrite a capability report sent by the bot to make the botmaster believe that the bot can
send email even if all the outgoing SMTP connections by the bot are actually blocked, or that
the bot is connected to the Internet using a high-speed LAN when in reality it is tunnelling
trac through a low-throughput connection.
To successfully rewrite a C&C message, an analyst rst needs to understand the goal of
the message, its eld structure, and the location of elds carrying relevant information to rewrite.
While older botnets build their C&C protocol on top of IRC, many newer botnets use customized
or proprietary protocols.
Analyzing such C&C protocols is challenging. Manual protocol reverse-engineering of
such protocols is time-consuming and error-prone. Furthermore, previous automatic protocol
reverse engineering techniques have limitations that prevent them from enabling rewriting of
such protocols. Techniques that use network trac as input are easily hampered by obfuscation
or encryption. Techniques that rely on observing how a communication end point (client or
server) processes a received input present two major limitations. First, given a program they
can only extract information about one side of the dialog, i.e., the received messages. To obtain
complete understanding of the protocol, they require access to the implementation of both sides of
the dialog. Unfortunately, when studying a botnet analysts often have access only to the bot side
of the communication. This is true also for other (benign) applications such as instant-messaging
2
1 INTRODUCTION
solutions where the clients are freely available but the servers are not. Second, current binary-
based techniques do not address extracting the semantic information from the protocol messages.
Semantic information is fundamental for understanding the intent of a message, and therefore
to identify what parts of a dialog to rewrite. Semantic information is needed for both text-based
and binary-based protocols. Although for text-based protocols an analyst can sometimes infer
such information from the content, this is often not so. For example, an ASCII-encoded integer
in a text-based protocol can represent among others a length, a port number, a sleep timer, or
a check sum value.
In this paper we present novel techniques to extract the message format for messages sent
by an application, which enable extracting the protocol message format from just one side of
the communication. New techniques are needed because current techniques for extracting the
message format of received messages rely on tainting the network input and monitoring how the
tainted data is used by the program. However, most data in sent messages does not come from
tainted network input. Instead, we use the following intuition: programs store elds in memory
buffers and construct the messages to be sent by combining those buffers together. Thus, the
structure of the buffer holding the message to be sent represents the inverse of the structure
of that message. We also present novel techniques to infer the eld semantics in messages sent
and received by an application. Our type-inference-based techniques leverage the rich semantic
information that is already available in the program by monitoring how data in the received
messages is used at places where the semantics are known, and how the sent messages are built
from data with known semantics. In addition, we propose extensions to a recently proposed
technique to identify the buers holding the unencrypted received message. Our extensions
generalize the technique to support applications where there is no single boundary between
decryption and protocol processing, and to identify the buers holding the unencrypted sent
message.
We implement our techniques into Dispatcher, a tool to extract the message format and
eld semantics of both received and sent messages. We use Dispatcher to analyze the C&C pro-
tocol used by MegaD, one of the most prevalent spam botnets in use today. To the best of our
knowledge, MegaDs proprietary, encrypted, binary C&C protocol has not been previously docu-
mented and thus presents an ideal test case for our system. We show that the C&C information
extracted by Dispatcher can be used to rewrite the MegaD C&C messages. In addition, we use
four open protocols (HTTP, FTP, ICQ, and DNS) to compare the message format automatically
extracted by Dispatcher with the one extracted by Wireshark, a state-of-the-art protocol parser
that contains manually written protocol grammars.
In summary, our contributions are the following:
We propose novel techniques to extract the format of the protocol messages sent by an ap-
plication. Previous work could only extract the format of received messages. Our techniques
enable extracting the complete protocol format even when only one endpoints implemen-
3
1 INTRODUCTION
tation of the protocol is available.
We present techniques to infer the eld semantics for messages sent and received by an ap-
plication. Our type-inference based techniques leverage the wealth of semantic information
available in the program.
We design and develop Dispatcher, a tool that implements our techniques and automatically
extracts from one endpoints implementation the message format and associated semantics
for both sides of a protocol. We use Dispatcher to analyze MegaD, a prevalent spam botnet
that uses an encrypted binary C&C protocol previously not publicly documented.
We show that the protocol information that Dispatcher extracts can be used to rewrite
MegaD C&C messages, thereby enabling active botnet inltration.
4
2 OVERVIEW & PROBLEM DEFINITION
2 Overview & Problem Denition
In this section we dene the problems addressed in the paper and give an overview of our approach
2.1 Scope
The goal of automatic protocol reverse-engineering is to extract the protocol format, which cap-
tures the structure of all messages that comprise the protocol, and the protocol state machine,
which captures the sequences of messages that represent valid sessions of the protocol. Extracting
the protocol format usually comprises two steps. First, given a set of input protocol messages,
extract the message format of each message. Second, given the set of message formats, identify
optional, repetitive and alternative elds, and infer the protocol format, which encompasses the
multiple message types that comprise the protocol. Dierent representations for the protocol
format are possible, e.g., as a regular expression or a BNF grammar.
This paper deals only with the rst step of the protocol format extraction, extracting the
message format for a given message, which is a pre-requisite for extracting both the protocol
format and the protocol state-machine.
2.2 Message Format
The message format is captured in the message eld tree, a tree in which each node represents a
eld in the message1. A child node represents a subeld of its parent, and thus corresponds to
a subrange of the parent eld in the message. The root node represents the complete message,
the internal nodes represent hierarchical elds and the leaf nodes represent the smallest semantic
units in the message. Each node contains an attribute list, where each attribute captures proper-
ties about the eld such as the eld range (start and end positions in the given message), the eld
length (xed or variable), as well as inter-eld dependencies (such as length elds or checksums).
Figure 1 shows the message eld tree for a C&C message used by MegaD to communicate back to
the C&C server information about the bots host. The root node represents the message, which
is 58 bytes long. There are two hierarchical elds: the payload, which is the encrypted part of
the message, and the host information, which contains leaf elds representing host data such as
the CPU identier and the IP address. The attributes capture that the Msg length eld is the
length of the payload and the Length eld is the length of the Host info eld.
2.3 Field Semantics
One important property of a eld is its semantics, i.e, the type of data that the eld contains.
Typical eld semantics are lengths, timestamps, checksums, hostnames, and lenames. Inferring
5
2.4 Problem denition 2 OVERVIEW & PROBLEM DEFINITION
Figure 1: Message eld tree for the MegaD Host-Information message
the eld semantics is fundamental to understand what a message does and to identify interesting
parts of a dialog to rewrite. Field semantics are captured in the message eld tree as an attribute
for each eld and can be used to label the elds. For example, in Figure1 the semantics inference
states that range [48:51] contains an IP address and range [6:13] contains some data previously
received over the network. We use this information to label the corresponding elds BotID and
IP addr.
2.4 Problem denition
In this paper we address two problems:
1. Extracting the message eld tree for the messages sent by the application,and
2. Inferring eld semantics, that is, annotating the nodes in the message eld tree, for both
received and sent messages, with a eld semantics attribute.
2.5 Approach
The output buer contains the message about to be sent at the time that the function that sends
data over the network is invoked. As a special case, for encrypted protocols the output buer
contains the unencrypted data at the time the encryption routine is invoked. To extract the
format of sent messages we use the following intuition: programs store elds in memory buers
and construct the messages to be sent by combining those buers together. Thus, the structure
of the output buer represents the inverse of the structure of the sent message. We propose
buer deconstruction, a technique to build the message eld tree of a sent message by analyzing
6
2.6 Obtaining an execution trace 2 OVERVIEW & PROBLEM DEFINITION
how the output buer is constructed from other memory buers in the program. We present
our message format extraction techniques for sent messages in Section 4 and our handling of
encrypted protocols in Section 5.
To infer the eld semantics, we use type-inference-based techniques that leverage the
observation that many functions and instructions used by programs contain known semantic
information that can be leveraged for eld semantics inference. When a eld in the received
message is used to derive the arguments of those functions or instructions (i.e., semantic sinks),
we can infer its semantics. When the output of those functions or instructions (i.e., semantic
sources) are used to derive some eld in the output buer, we can infer its semantics.
We have developed Dispatcher, a tool that enables analyzing both sides of the communi-
cation of an unknown protocol, even when an analyst has access only to the application imple-
menting one side of the dialog. Dispatcher integrates previously proposed techniques to extract
the message format of received messages, as well as our novel techniques to extract the message
format of sent messages, and to infer eld semantics in both received and sent messages. We
show that the information extracted by Dispatcher enables rewriting MegaDs C&C messages.
2.6 Obtaining an execution trace
The input to our message format extraction and eld semantics inference techniques is execution
traces taken by monitoring the program while it is involved in a network dialog using the un-
known protocol. To monitor the program we use a custom analysis environment that implements
dynamic taint tracking and produces instruction-level execution traces containing all instructions
executed, the content of the operands, and the associated taint information. To analyze the pro-
tocol used by malware samples (e.g., the C&C protocol of a botnet) safely, we need to execute
them in a specialized analysis network with custom containment policies.
An execution trace contains the processing of multiple messages sent and received by the
program during the network dialog. We split the execution trace into per-message traces by
monitoring the programs use of networking functions that read or write data from sockets. We
split the execution trace into two traces every time that the program makes a successful call to
write data to a socket (e.g., send) and every time that the program makes a successful call to
read data from a socket (e.g., recv), except when the argument dening the maximum number
of bytes to read is tainted. In this case, the read data is considered part of the previous message
and the trace is not split. This handles the case of a program reading a eld conveying the length
of the message payload and using this value to read the payload itself.
7
2.7 Handling obfuscation 3 FIELD SEMANTICS INFERENCE
2.7 Handling obfuscation
Our dynamic analysis approach is resilient to obfuscation techniques designed to thwart static
analysis such as binary packing and inlining unnecessary instructions. However, a premise of our
approach is that we can observe a samples processing of the received data in our analysis envi-
ronment (based on a system emulator). Thus, similar to all dynamic approaches, our approach
can be evaded using techniques that detect virtualized or emulated environments. Also, while
our techniques work well on MegaD, we expect malware to adapt. Thus, we have designed our
techniques to target fundamental properties so that they are as resilient as possible to obfusca-
tion. Nevertheless, the techniques proposed in this paper are not specic to malware analysis
and can be used to analyze any unknown or undocumented protocols.
3 Field Semantics Inference
In this section we present our technique to identify the eld semantics of both received and sent
messages.
The intuition behind our type-inference-based techniques is that many functions and in-
structions used by programs contain rich semantic information. We can leverage this information
to infer eld semantics by monitoring if received network data is used at a point where the se-
mantics are known, or if data to be sent to the network has been derived from data with known
semantics. Such inference is very general and can be used to identify a broad spectrum of eld
semantics including timestamps, lenames, hostnames, ports, IP addresses, and many others.
The semantic information of those functions and instructions is publicly available in their pro-
totypes, which describe their goal as well as the semantics of its inputs and outputs. Function
prototypes can be found, for example, at the Microsoft Developer Network or the standard C
library documentation. For instructions, one can refer to the system manufacturers manuals.
3.1 Techniques
For received messages, Dispatcher uses taint propagation to monitor if a sequence of bytes from
the received message is used in the arguments of some selected function calls and instructions,
for which the system has been provided with the functions prototype. The sequence of bytes
in the received message can then be associated with the semantics of the arguments as dened
in the prototype. For example, when a program calls the connect function Dispatcher uses the
functions prototype to check if any of the arguments on the stack is tainted. The functions
prototype tells us that the rst argument is the socket descriptor, the second one is an address
structure that contains the IP address and port of the host to connect to, and the third one is
the length of the address structure. If the memory locations that correspond to the IP address to
8
3.1 Techniques 3 FIELD SEMANTICS INFERENCE
connect to in the address structure are tainted from four bytes in the input, then Dispatcher can
infer that those four bytes in the input message (identied by the oset in the taint information)
form a eld that contains an IP address to connect to. Similarly, if the memory locations that
correspond to the port to connect to have been derived from two bytes in the input message, it
can identify the position of the port eld in the input message.
For sent messages, Dispatcher taints the output of selected functions and instructions using
a unique source identier and oset pair. For each tainted sequence of bytes in the output buer,
Dispatcher identies from which taint source the sequence of bytes was derived. The semantics
of the taint source (return values) are given by the functions or instructions prototype, and can
be associated to the sequence of bytes. For example, if a program uses the rdtsc x86 instruction,
we can leverage the knowledge that it takes no input and returns a 64-bit output representing
the current value of the processors time-stamp counter, which is placed in registers EDX:EAX
[4]. Thus, at the time of execution when the program uses rdtsc, Dispatcher taints the EDX and
EAX registers with a unique source identier and oset pair. This pair uniquely labels the taint
source to be from rdtsc, and the osets identify each byte in the rdtsc stream (osets 0 through
7 for the rst use).
A special case of this technique is cookie inference. A cookie represents data from a received
message that propagates unchanged to the output buer (e.g., session identiers). Thus, a cookie
is simultaneously identied in the received and sent messages.
Field Semantics Received Sent
Cookies Yes Yes
IP addresses Yes Yes
Error Codes No Yes
File data No Yes
File Information No Yes
Filenames Yes Yes
Hash/Checksums Yes Yes
Hostnames Yes Yes
Host Information No Yes
Keyboard input No Yes
Keywords Yes Yes
Length Yes Yes
Table 1: Field semantics identied by Dispatcher for both received and sent messages. Stored
data represents data received over the network and written to the lesystem or the Windows
registry, as opposed to data read from those sources
9
3.2 Implementation 3 FIELD SEMANTICS INFERENCE
3.2 Implementation
To identify eld semantics Dispatcher uses an input set of function and instruction prototypes.
By default, Dispatcher includes over one hundred functions and a few instructions for which
we have already added the prototypes by searching online repositories. To identify new eld
semantics and their corresponding functions, we examine the external functions called by the
program in the execution trace. Table1 shows the eld semantics that Dispatcher can infer from
received and sent messages using the predened functions.
10
4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
4 Extracting Message Format of Sent Packets
The message eld tree captures the hierarchical eld structure of the message as well as the eld
properties encoded in attributes. To extract the message eld tree of a sent message we rst
reverse-engineer the structure of the output message and output a message eld tree with no
eld attributes. Then, we use specic techniques to identify the eld attributes, such as how
to identify the eld boundary (xed-length, delimiter, length eld) and the keywords present in
each eld.
Figure 2: Buer deconstruction for the MegaD message in Figure1. Each box is a memory buer
starting at address Bx with the byte length in brackets. Note the similarity with the upsidedown
version of Figure1
A eld is a sequence of consecutive bytes in a message with some meaning. A memory
buer is a sequence of consecutive bytes in memory that stores data with some meaning. To
reverse engineer the structure of the output message we cannot use current techniques to extract
the message format of received messages because they rely on tainting the network input and
monitoring how the tainted data is used by the program. Most data in sent messages does
not come from the tainted network input. Instead, we use the following intuition: programs
store elds in memory buers and construct the messages to be sent by combining those buers
together. Thus, the structure of the output buer represents the inverse of the message eld tree
of the sent message. We propose buer deconstruction, a technique to build the message eld tree
of a sent message by analyzing how the output buer is constructed from other memory buers
in the program. Figure2 shows the deconstruction of the output buer holding the message in
Figure1. Note the similarity between Figure1 and the upside-down version of Figure2.
Extracting the message format of sent messages is a three-step process. In the preparation
step, Dispatcher makes a forward pass over the execution trace to extract information about the
loops that were executed, the liveness of buers in the stack, and the call stack information at
each point in the execution trace. It also builds an index of the execution trace to enable random
11
4.1 Preparation 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
access to any instruction. We present the preparation in Section 4.1. The core of the message
format extraction is the buer deconstruction step, which is a recursive process in which one
memory buer is deconstructed at a time by extracting the sequence of memory buers that
comprise it. The process is started with the output buer and recurses until there are no more
buers to deconstruct. Dispatcher implements buer deconstruction as a backward pass over an
execution trace. Since the structure of the output buer is the inverse of the message eld tree
for the sent message, every memory buer that forms the output buer (and, recursively, the
memory buers that form them) corresponds to a eld in the message eld tree. For example,
deconstructing the output buer in Figure2 returns a sequence of two buers, a 2-byte buer
starting at oset zero in the output buer (B1) and a 56-byte buer starting at oset 2 in the
output buer (B2). Correspondingly a eld with range [0:1] and another one with range [2:57]
are added to the no-attribute message eld tree. Thus, buer deconstruction builds the no-
attribute message eld tree as it recurses into the output buer structure. We present buer
deconstruction in Section 4.2. Finally, eld attribute inference identies length elds, delimiters,
keywords, arrays and variable-length elds and adds the information into attributes for the
corresponding elds in the message eld tree. We present eld attribute inference in Section 4.3.
4.1 Preparation
During preparation, Dispatcher makes a forward pass over the execution trace collecting infor-
mation needed by the buer deconstruction as well as the attribute inference.
4.1.1 Loop analysis
During the forward pass, Dispatcher extracts information about each loop present in the ex-
ecution trace. To identify the loops in the execution trace, Dispatcher supports two dierent
detection methods: static and dynamic. The static method extracts the addresses of the loop
head and exit conditions statically from the binary before the forward pass starts, and uses that
information during the forward pass to identify the points where any of those loops appears in
the trace. The dynamic method does not require any static processing and extracts the loops
directly during the forward pass by monitoring instructions that appear multiple times in the
same function. Both methods are complementary. While using static information is more precise
at identifying the loop exit conditions, it also requires analyzing all the modules (executable plus
dynamically link libraries) used by the application, may miss loops that contain indirection, and
cannot be applied if the unpacked binary is not available, such as in the case of MegaD. On
the other hand, the dynamic method is less accurate at identifying the loop exit conditions, but
requires no setup and can be used on any of our samples including MegaD.
12
4.2 Buer Deconstruction 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
4.1.2 Callstack Analysis
During the forward pass, Dispatcher replicates the function stack of the program by monitoring
the function calls and returns. Given an instruction number in the execution trace, the callstack
analysis returns the innermost function that contained that instruction at that point of the
execution.
4.1.3 Buer Liveness Analysis
During the execution trace capture, Dispatcher monitors the heap allocation and free functions
used by the program. For each heap allocation it provides the instruction number in the trace,
the buer start and the size of the buer. For each heap deallocation, it species the instruc-
tion number in the trace, and the start address of the buer being freed. During the forward
pass, Dispatcher monitors the stack pointer at the function entry and return points, extracting
information about which memory locations in the stack are freed when the function returns.
This information is used by Dispatcher to determine whether two dierent writes to the same
memory address correspond to the same memory buer, since memory locations in the stack
(and occasionally in the heap) may be reused for dierent buers.
4.2 Buer Deconstruction
Buer deconstruction is a recursive process. In each iteration it deconstructs a given memory
buer into the sequence of other memory buers that comprise it. The process starts with the
output buer and recurses until there are no more buers to deconstruct. It has two parts.
First, for each byte in the given buer we build a dependency chain. Then, using the dependency
chains and the information collected in the preparation step, we extract the structure of the
given buer. The input to each buer deconstruction iteration is a buer dened by its start
address in memory, its length, and the instruction number in the trace where the buer was last
written. The start address and length of the output buer are obtained from the arguments of
the function that sends the data over the network (or the encryption function). The instruction
number to start the analysis is the instruction number for the rst instruction in the send (or
encrypt) function. In the remainder of this section we introduce what locations and dependency
chains are and present how they are used to deconstruct the output buer.
4.2.1 Program locations
We dene a program location to be a onebyte-long storage unit in the programs state. We
consider four types of locations: memory locations, register locations, immediate locations, and
constant locations, and focus on the address of those locations, rather than on its content. Each
13
4.2 Buer Deconstruction 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
Figure 3: Dependency chain for B2 in Figure2. The start address of B2 is A.
memory byte is a memory location indexed by its address. Each byte in a register is a register
location, for example, there are 4 locations in EAX:EAX(0) or AL, EAX(1) or AH, EAX(2), and
EAX(3). An immediate location corresponds to a byte from an immediate in the code section
of some module, indexed by the oset of the byte with respect to the beginning of the module.
Constant locations represent the output of some instructions that have constant output. For
example, one common instruction is to XOR one register against itself (e.g., xor %eax, %eax),
which clears the register. Dispatcher recognizes a number of such instructions and makes each
byte of its output a constant location.
4.2.2 Dependency Chains
A dependency chain for a program location is the sequence of write operations that produced
the value of the location at a certain point in the program. A write operation comprises the
instruction number at which the write occurred, the destination location (i..e, the location that
was written), the source location (i.e., the location that was read), and the oset of the written
location with respect to the beginning of the output buer. Figure3 shows the dependency chains
for the B2 buer (the one that holds the encrypted payload) in Figure2. In the gure, each box
represents a write operation, and each sequence of vertical boxes represents the dependency chain
for one location in the buer.
The dependency chain is computed in a backward pass starting at the given instruction
number. We stop building the dependency chain at the rst write operation for which the source
location is:
1. an immediate location,
2. a constant location,
14
4.2 Buer Deconstruction 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
3. a memory location,
4. an unknown location
If the source location is part of an immediate or part of the output from some constant
output instruction, then there are no more dependencies and the chain is complete. This is
the case for the rst four bytes of B2 in Figure3. The reason to stop at a source memory
location is that we want to understand how a memory buffer has been constructed from other
memory buffers. After extracting the structure of the given buffer, Dispatcher recurses on the
buffers that form it. For example, in Figure3 the dependency chains for locations Mem(A+4)
through Mem(A+11) contains only one write operation because the source location is another
memory location. Dispatcher will then create a new dependency chain for buffer Mem(B) through
Mem(B+7). When building the dependency chains, Dispatcher only handles a small subset of x86
instructions which simply move data around, without modifying it. This subset includes move
instructions (mov,movs), move with zero-extend instructions (movz), push and pop instructions,
string stores (stos), and instructions that are used to convert data from network to host order
and vice versa such as exchange instructions (xchg), swap instructions (bswap), or right shifts
that shift entire bytes (e.g.,shr $0x8,% eax). When a write operation is performed by any other
instruction, the source is considered unknown and the dependency chain stops. Often, it is
enough to stop the dependency chain at such instructions, because the program is at that point
performing some operation on the eld (e.g., an arithmetic operation) as opposed to just moving
the content around. Since programs operate on leaf elds, not on hierarchical fields, then at that
point of the chain we have already recursed up to the corresponding leaf field in the message
field tree. For example, in Figure3 the dependency chains for the last two bytes stop at the same
add instruction. Thus, both source locations are unknown. Note that those locations correspond
to the length eld in Figure1. The fact that the program is increasing the length value indicates
that the dependency chain has already reached a leaf eld.
4.2.3 Extracting the buer structure
We call the source location of the last element in the dependency chain of a buffer location its
source. We say that two source locations belong to the same source buffer if they are contiguous
memory locations (in either ascending or descending order) and the liveness information states
that none of those locations has been freed between their corresponding write operations. If the
source locations are not in memory (e.g., register, immediate, constant or unknown location),
they belong to the same buer if they were written by the same instruction (i.e, same instruction
number).
To extract the structure for the given buffer Dispatcher iterates on the buer locations
from the buffer start to the buffer end. For each buffer location, Dispatcher checks whether
the source of the current buffer location belongs to the same source buffer as the source of the
15
4.3 Field Attribute Inference 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
previous buffer location. If they do not, then it has found a boundary in the structure of the
buffer. The structure of the given buffer is output as a sequence of ranges that form it, where
each range states whether it corresponds to a source memory buffer.
For example, in Figure3 the source locations for Mem(A+4) and Mem(A+5) are contigu-
ous locations Mem(B) and Mem(B+1) but the source locations for Mem(A+11) and Mem(A+12)
are not contiguous. Thus, Dispatcher marks location Mem(A+12) as the beginning of a new
range. Dispatcher nds 6 ranges in B2. The rst four are shown in Figure3 and marked with
arrows at the top of the gure. Since only the third range originates from another memory
buffer, that is the only buffer that Dispatcher will recurse on to reconstruct. The last two ranges
correspond to the Host Info eld and the padding in Figure1 and are not shown in Figure3.
Once the buffer structure has been extracted, Dispatcher uses the correspondence between
buffers and elds in the analyzed message to add one eld to the message eld tree per range in
the buffer structure using the osets relative to the output buffer. In Figure3 it adds four new
elds that correspond to the Version, Type, Bot ID, and Length in Figure1.
4.3 Field Attribute Inference
The message eld tree built during the buffer deconstruction step represents the hierarchical
structure of the output message, but does not contain information about inter-eld relationships
such as if a eld represents the length of another target eld. Such additional information is
captured by the eld attributes in the message eld tree.
Attribute Value
Field Range Start oset and length in message
Field Boundary Fixed,Length and Delimiter
Field Semantics A value from Table1
Field Keywords List of keywords in eld
Table 2: Field attributes used in the message eld tree
Table2 presents the eld attributes that we identify in this paper. The eld range captures
the position of the eld in the message. The eld boundary captures how an application deter-
mines where the eld ends. Fields can be xed-length (Fixed), variable-length using a length
eld (Length), or variable-length using a delimiter (Delimiter). The eld semantics are the values
in Table1. The eld keywords attribute contains a list of all the protocol constants that appear
in the eld and their position.
The eld attributes in Table2 are similar to the ones that previous work extracts for
received messages [18, 49]. However, these techniques do not work on sent messages because they
rely on monitoring how the data received over the network is processed, when for sent messages we
can only observe how the sent messages are built. Our techniques share common intuitions with
16
4.3 Field Attribute Inference 4 EXTRACTING MESSAGE FORMAT OF SENT PACKETS
previous techniques: both try to capture the fundamental properties of the dierent protocol
elements. In fact, some attribute values are more dicult to extract for sent messages than
for received messages. For example, many elds that a protocol specication would dene as
variable length may encode some xed-length data in a specic implementation. For example
the Server header is variable-length based on the HTTP specication. However, a given HTTP
server implementation may have hard-coded the Server string in the binary, making the eld
xed-length for this implementation. Leveraging the availability of multiple implementations of
the same protocol could help in such cases. We plan to study this in future work.
4.3.1 Keywords
Keywords are constants that appear in network messages. To identify constants in the output
buer, Dispatcher taints the memory region that contains the module (and DLLs shipped with
the main binary) with a specic taint source, eectively tainting both immediates in the code
section as well as data stored in the data section. Locations in the output buer tainted from
this source are considered keywords.
4.3.2 Length elds
Dispatcher uses three dierent techniques to identify length elds in sent messages. The intuition
behind the techniques is that length elds can be computed either by incrementing a counter as
the program iterates on the eld, or by subtracting pointers to the beginning and the end of the
buer. The intuition behind the rst two techniques is that those arithmetic operations translate
into an unknown source at the end of the dependency chains for the buer locations corresponding
to the length eld. When a dependency chain ends in an unknown source, Dispatcher checks
whether the instruction that performs the write is inside a known function that computes the
length of a string (e.g., strlen) or is a subtraction of pointers to the beginning and end of the
buer. The third technique tries to identify counter increments that do not correspond to well-
known string length functions. For each buer it uses the loop information to identify if most
writes to the buer belong to the same loop. If they do, then it uses the techniques in to extract
the loop induction variables. For each induction variable it computes the dependency chain and
checks whether it intersects the dependency chains from any output buer locations that precede
the locations written in the loop (since a length eld always has to precede its target eld). Any
intersecting location is part of the length eld for the eld processed in the loop.
4.3.3 Delimiters
Delimiters are constants used by protocols to mark the boundary of variable-length elds. Thus,
it is dicult to dierentiate a delimiter from any another constant in the output message. To
17
5 HANDLING ENCRYPTED MESSAGES
identify delimiters, Dispatcher looks for constants that appear multiple times in the same message
or appear at the end of multiple messages in the same session (three appearances are required).
Constants can be identied by checking the osets of the taint information for keyword identi-
cation. If the delimiters come from the data section, they can also be identied by checking
whether the source address of all instances of the constant comes from the same buer.
4.3.4 Variable-length elds
Dispatcher marks elds that precede a delimiter, and target elds for previously identied length
elds as variable-length elds. It also marks as variable-length elds derived from semantic
sources that are known to have variable length such as le data. All others are marked as
xed-length.
4.3.5 Arrays
The intuition behind identifying arrays of records is that they are written in loops, one record at
a time. Dispatcher uses the loop information extracted during preparation to identify loops that
write multiple consecutive elds. Then, it adds to the message eld tree one Array eld with the
range being the combined range of all the consecutive elds written in the loop, and one Record
eld per range of bytes written in each iteration of the loop.
5 Handling Encrypted Messages
Similar to previous work, our protocol reverse engineering techniques work on unencrypted data.
Thus, when reverse-engineering encrypted protocols we need to address two problems. First, for
received messages, we need to identify the buers holding the unencrypted data at the point that
the decryption has nished since buers may only hold the decrypted data for a brief period of
time. Second, for sent messages, we need to identify the buers holding the unencrypted data
at the point that the encryption is about to begin. Once the buers holding the unencrypted
data have been identied, protocol reverse engineering techniques can be applied on them, rather
than on the messages received or about to be sent on the wire.
Recent work has looked at the problem of reverse-engineering the format of received en-
crypted messages [39, 48]. Since the application needs to decrypt the data before using it, those
approaches monitor the applications processing of the encrypted message and attempt to locate
the buers that contain the decrypted data at the point that the decryption has nished. Those
approaches do not address the problem of nding the buers holding the unencrypted data before
it is encrypted, which is also required in our case.
18
5.1 Identifying encoding functions 5 HANDLING ENCRYPTED MESSAGES
In this work we present two extensions to the technique presented in ReFormat [48].
First, ReFormat can only handle applications where there exists a single boundary between
decryption and normal protocol processing. However, multiple such boundaries may exist. As
shown in Figure 1 MegaD messages comprise two bytes with the message length, followed by
the encrypted payload. After checking the message length, a MegaD bot will decrypt 8 bytes
from the encrypted payload and process them, then move to the next 8 bytes and process them,
and so on. In addition, some messages in MegaD also use compression and the decryption
and decompression operations are interleaved. Thus, there is no single program point where
all data in a message is available unencrypted and uncompressed. Consequently, we extend the
technique to identify every instance of encryption, hashing, compression, and obfuscation, which
we generally term encoding functions. Second, ReFormat was not designed to identify the buers
holding the unencoded (unencrypted) data before encoding (encryption). Thus, we extend the
technique to also cover this case. We present the generalized technique next.
5.1 Identifying encoding functions
To identify every instance of an encoding function we have simplied the process in ReFormat
by removing the cumulative metric, the use of tainted data, and the concept of leaf functions.
The extended technique applies the intuition in ReFormat that the decryption process contains
an inordinate number of arithmetic and bitwise operations to encoding functions. It works as
follows. Dispatcher makes a forward pass over the input execution trace replicating the callstack
of the application by monitoring the call and return instructions. For each function it computes
the ratio between the number of arithmetic and bitwise operations over the total number of
instructions in the function. The ratio includes only the functions own instructions. It does
not include instructions belonging to any invoked subfunctions. The ratio is computed for each
appearance of the function in the trace. Any function that executes a minimum number of
instructions and has a ratio larger than a pre-dened threshold is
agged by Dispatcher as an
instance of a encoding function. In our experiments, the threshold is set to 0.55 and the minimum
number of instructions is 20. Our evaluation results in Section 6.3 show that the generalized
technique identies all instances of the decryption and encryption functions in our MegaD traces
and that the false positive rate of the technique is 0.002%.
5.2 Identifying the buers
To identify the buers holding the unencrypted data before encryption we compute the read set
for the encryption routine, the set of locations read inside the encryption routine before being
written. The read set for the encryption routine includes the buers holding the unencrypted
data, the encryption key, and any hard-coded tables used by the routine. We can dierentiate
the buers holding the unencrypted data because their content varies between multiple instances
19
6 EVALUATION
of the same function. To identify the buers holding the unencrypted data after decryption we
compute the write set for the decryption routine, the set of locations written inside the decryption
routine and read later in the trace.
6 Evaluation
In this section we evaluate our techniques on the MegaD C&C protocol, as well as a number of
open protocols.
6.1 Evaluation on MegaD
MegaD uses a proprietary, encrypted, binary protocol previously not understood. Our MegaD
evaluation has two parts. We rst describe the information obtained by Dispatcher on the C&C
protocol used by MegaD, and then show how the information extracted by Dispatcher can be
used to rewrite a C&C dialog.
6.1.1 MegaD C&C Protocol
The MegaD C&C protocol uses port 443 over TCP for transport, employing a proprietary en-
cryption algorithm instead of the SSL routines for HTTPS commonly used on that port. Our
network traces show our MegaD bot communicating with three entities: the C&C server that
the bot periodically probes for new commands; the SMTP test server, an SMTP server whose
hostname is provided by the C&C server and to which the bot connects to test for spam sending
capabilities; and the spam server, whose IP address and listening port are sent by the C&C
server to the bot so that the bot can download all spam-related information such as the spam
template or the email addresses to spam. Communication with the C&C and spam servers uses
the encrypted C&C protocol, while communication with the SMTP test server uses unencrypted
SMTP. The communication model is pull-based. The bot periodically probes the botmaster by
sending a request message. The botmaster replies with two messages: one with authentication
information, and the other one with a command. The bot performs the requested action and
sends a response with its results.
6.1.2 Message format
Our MegaD C&C traces contain 15 dierent messages (7 received and 8 sent by the bot). Using
Dispatcher, we have extracted the message eld tree for messages on both directions, as well
as the associated eld semantics. All 15 messages follow the structure shown in Figure1 with a
2-byte message length followed by an encrypted payload. The payload, once decrypted,contains
20
6.1 Evaluation on MegaD 6 EVALUATION
a 2-byte eld that we term version as it is always a keyword of value 0x100 or 0x1, followed by
a 2-byte message type eld. The structure of the remaining payload depends on the message
type. To summarize the protocol format we have used the output of Dispatcher to write a
BinPac grammar that comprises all 15 messages. Field semantics are added as comments to the
grammar.
To the best of our knowledge, we are the rst to document the C&C protocol employed by
MegaD. Thus, we lack ground truth to evaluate our grammar. To verify the grammars accuracy,
we use another execution trace that contains a dierent instance of one of the analyzed dialogs.
We dump the content of all unencrypted messages and try to parse the messages using our
grammar. For this, we employed a stand-alone version of the BinPac parser included in Bro.
Using our grammar, the parser successfully parses all MegaD C&C messages in the new dialog.
In addition, the parser throws an error when given messages that do not follow the MegaD
grammar.
6.1.3 Attribute detection
The 15 MegaD messages contain no delimiters or arrays. They contain two variable-length elds
that use length elds to mark their boundaries: the compressed spam-related information (i.e.,
template and addresses) received from the spam server, and the host information eld in Figure1.
Both the length elds and variable-length elds are correctly detected by Dispatcher. The only
attributes that Dispatcher misses are the message length elds on sent messages because they
are computed using complex pointer arithmetic that Dispatcher cannot reason about.
6.1.4 Field semantics
Dispatcher identies 11 dierent eld semantics over the 15 messages: IP addresses, ports,
hostnames, length, sleep timers, error codes, keywords, cookies, stored data, padding and host
information. There are only two elds in the MegaD grammar for which Dispatcher does not
identify their semantics. Both of them happen in received messages: one of them is the message
type, which we identify by looking for elds that are compared against multiple constants in the
execution and for which the message format varies depending on its value. The other corresponds
to an integer whose value is checked by the program but apparently not used further. Note that
we identify some elds in sent messages as keywords because they come from immediates and
constants in the data section. We cannot identify exactly what they represent because we do not
see how they are used by the C&C server.
21
6.2 Evaluation on Open Protocols 6 EVALUATION
6.1.5 Rewriting a MegaD dialog
To show how our grammar enables live rewriting, we run a live MegaD bot inside our analysis
environment, which is located in a network that lters all outgoing SMTP connections for con-
tainment purposes. In a rst dialog, the C&C server sends the command to the bot ordering to
test for spam capability using a given Spam test server. The analysis network blocks the SMTP
connection causing the bot to send an error message back to the C&C server, to communicate
that it cannot send spam. No more spam-related messages are received by the bot. Then, we
start a new dialog where at the time the bot calls the encrypt function to encrypt the error
message, we stop the execution, rewrite the encryption buer with the message that indicates
success, and let the execution continue. After the rewriting the bot keeps receiving the spam-
related messages, including the spam template and the addresses to spam, despite the fact that it
cannot send any spam messages. Note that simply replaying the message that indicates success
from a previous dialog into the new dialog does not work because the success message includes
a cookie value that the C&C selects and may change between dialogs.
6.2 Evaluation on Open Protocols
In this section we evaluate our techniques on four open protocols: HTTP , DNS, FTP, and
ICQ. To this end, we compare the output of Dispatcher with that of Wireshark 1.0.5 [12] when
processing 12 messages belonging to those protocols. For each protocol we select a representative
application that implements the protocol: Apache-2.2.1 for HTTP, Bind-9.6.0 for DNS, Filezilla-
0.9.31 for FTP, and Pidgin-2.5.5 for ICQ. Note that regardless of the application being a client
(Pidgin) or a server (Bind, Apache, Filezilla), for this part of the evaluation we focus on sent
messages.
6.2.1 Message format
Wireshark is a network protocol analyzer containing manually written grammars (called dissec-
tors) for a large variety of network protocols. Although Wireshark is a mature and widely-used
tool, its dissectors have been manually generated and therefore are not completely error-free. To
compare the accuracy of the message format automatically extracted by Dispatcher to the man-
ually written ones included in Wireshark, we analyze the message eld tree output by both tools
and manually compare them to the protocol specication. Thus, we can classify any dierences
between the output of both tools to be due to errors in Dispatcher, Wireshark, or both.
We denote the set of leaf elds and the set of hierarchical elds in the message eld tree
output by Wireshark as jLWjand jHWj, respectively. jLDjand jHDjare the corresponding sets for
Dispatcher. Table3 shows the evaluation results. For each protocol and message it shows the
number of leaf elds and hierarchical elds in the message eld tree output by both tools as
22
6.2 Evaluation on Open Protocols 6 EVALUATION
Table 3: Comparison of the message eld tree for sent messages extracted by Dispatcher and Wireshark
Wireshark Dispatcher Errors
Protocol Message Type jLW j jHW j jLD j jHD j jE(LW) j jE(LD) j jE(HW) j jE(HD) j
HTTP GET reply 11 1 22 0 11 1 0 1
POST reply 11 1 22 0 11 1 0 1
DNS A reply 27 4 28 0 1 0 0 4
FTP Welcome0 2 1 3 1 1 0 0 0
Welcome1 2 1 3 1 1 0 0 0
Welcome2 2 1 3 1 1 0 0 0
USER reply 2 1 3 1 1 1 0 0
PASS reply 2 1 2 0 1 1 0 1
SYST reply 2 1 2 0 1 1 0 1
ICQ New Connection 5 0 5 0 0 0 0 0
AIM sign-on 11 3 15 3 5 0 0 0
AIM log-on 46 15 46 15 0 0 0 0
Total 123 30 154 22 34 5 0 8
23
6.2 Evaluation on Open Protocols 6 EVALUATION
well as the result of the manual classication of its errors. Here, jLWjand jE(LD)jrepresent the
number of errors on leaf elds in the message eld tree output by Wireshark and Dispatcher
respectively. Similarly, jE(HW)jand jE(HD)jrepresent the number of errors on hierarchical elds.
The results show that Dispatcher outperforms Wireshark when identifying leaf elds.
This surprising result is due to the inconsistencies between the dierent dissectors in Wireshark
when identifying delimiters. Some dissectors do not add delimiter elds to the message eld
tree, some concatenate them to the variable-length eld for which they mark the boundary,
while others treat them as separate elds. After checking the protocol specications, we believe
that delimiters should be treated as their own elds in all dissectors. The results also show
that Wireshark outperforms Dispatcher when identifying hierarchical elds. This is due to the
program not using loops to write the arrays because the number of elements in the array is known
or is small enough that the compiler has unrolled the loop.
Overall, Dispatcher outperformed Wireshark for the given messages. Note that we do not
claim that Dispatcher is generally more accurate than Wireshark since we are only evaluating
a limited number of protocols and messages. However, the results show that the accuracy of
the message format automatically extracted by Dispatcher can rival that of Wireshark, without
requiring a manually generated grammar.
6.2.2 Errors on leaf elds
Here we detail the errors on leaf elds that we have assigned to Dispatcher. The error in the
HTTP GET reply message is in the Status-Line. The HTTP/1.1 specication states that its
format is: Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF, but both
Dispatcher and Wireshark consider the Status-Code, the delimiter, and the Reason-Phrase to
belong to the same eld. The FTP specication states that a reply message comprises a com-
pletion code followed by a text string. The error in the FTP USER reply message is due to
the fact that the server echoes back the username to the client and Dispatcher identies the
username being echoed back as an additional cookie eld. The other FTP replies have the same
type of error: the response code is merged with the text string because the program keeps the
whole message (except the delimiter) in a single buer in the data section. As mentioned ear-
lier the errors on hierarchical elds are due to the program being analyzed not using loops to
write the arrays. For example, the four errors in the DNS reply correspond to the Queries,
Answers, Authoritative, and Additional sections in the message, which Bind processes separately
and therefore Dispatcher cannot identify as an array.
These errors highlight the fact that the message eld tree extracted by Dispatcher is
limited to the quality of the protocol implementation in the binary, and may dier from the
specication.
24
6.3 Detecting Encoding Function 6 EVALUATION
No: of traces No: of functions True Positive False Positive False Positive Rate
20 3,569,773(22,379) 4,874 87(9) 0.002%
Table 4: Evaluation of the detection of encoding functions
6.2.3 Attribute detection
The 12 messages contain 14 length elds, 43 delimiters, 57 variable-length elds, and 3 arrays.
Dispatcher misses 8 length elds because their value is hard-coded in the program. Thus, their
target variable-length elds are considered xedlength. Out of the 43 delimiters Dispatcher
only misses one, which corresponds to a null byte marking the end of a cookie string that was
considered part of the string. Dispatcher correctly identies all other variable-length elds. Out
of 3 arrays, Dispatcher misses one formed by the Queries, Answers, Authoritative, and Additional
sections in the DNS reply, which Bind processes separately and therefore cannot be identied by
Dispatcher.
6.2.4 Field semantics
Dispatcher correctly identies all semantic information in the sent messages, except the 3 pointers
in the DNS reply, used by the DNS compression method, which are computed using pointer
arithmetic that Dispatcher cannot reason about.
6.3 Detecting Encoding Function
To evaluate the detection of encoding functions presented in Section 5 we perform the following
experiment. We obtain 20 execution traces from multiple programs that handle network data.
Five of these traces process encrypted and compressed functions, four of them are from MegaD
sessions and the other one is from Apache while handling an HTTPS session. MegaD uses its own
encryption algorithm and the zlib library for compression and Apache uses SSL with AES and
SHA-18. The remaining 15 execution traces are from a variety of programs including browsers
(Internet Explorer 7, Safari 3.1, and Google Chrome 1.0), network servers (Bind, Atphttpd), and
services embedded in Windows (RPC, MSSQL).
Dispatcher correctly identies all encoding functions in the execution traces of MegaD and
Apache-SSL. In the MegaD traces, all instances of three unique encoding functions are identied:
the decryption routine, the encryption routine, and a key generation routine that generates the
encryption and decryption keys from a seed value in the binary before calling the encryption
or decryption routines. In addition, in the traces that process messages with compressed data,
Dispatcher
ags a fourth function that corresponds to the in
ate function in the zlib library,
which is statically linked into the