Full Version: Privacy-Preserving Detection of Sensitive Data Exposure


Abstract— Statistics from security firms, research institutions
and government organizations show that the number of data-leak
instances has grown rapidly in recent years. Among various
data-leak cases, human mistakes are one of the main causes
of data loss. Solutions exist that detect inadvertent sensitive
data leaks caused by human mistakes and provide alerts
for organizations. A common approach is to screen content
in storage and transmission for exposed sensitive information.
Such an approach usually requires the detection operation to
be conducted in secrecy. However, this secrecy requirement is
challenging to satisfy in practice, as detection servers may be
compromised or outsourced. In this paper, we present a privacy-preserving
data-leak detection (DLD) solution in which
a special set of sensitive data digests is used in detection.
The advantage of our method is that it enables the data owner to
safely delegate the detection operation to a semi-honest provider
without revealing the sensitive data to the provider. We describe
how Internet service providers can offer their customers DLD as
an add-on service with strong privacy guarantees. The evaluation
results show that our method can support accurate detection
with a very small number of false alarms under various data-leak
scenarios.


INTRODUCTION
ACCORDING to a report from Risk Based
Security (RBS) [2], the number of leaked sensitive
data records has increased dramatically during the last
few years, i.e., from 412 million in 2012 to 822 million
in 2013. Deliberately planned attacks, inadvertent leaks
(e.g., forwarding confidential emails to unclassified email
accounts), and human mistakes (e.g., assigning the wrong
privilege) lead to most of the data-leak incidents [3].
Detecting and preventing data leaks requires a set of
complementary solutions, which may include data-leak
detection [4], [5], data confinement [6]–[8], stealthy malware
detection [9], [10], and policy enforcement [11].
Network data-leak detection (DLD) typically performs deep
packet inspection (DPI) and searches for occurrences of sensitive data patterns. DPI is a technique that analyzes
the payloads of TCP/IP packets to inspect application-layer
data, e.g., HTTP headers and content. Alerts are triggered when the
amount of sensitive data found in traffic passes a threshold.
The detection system can be deployed on a router or integrated
into existing network intrusion detection systems (NIDS).
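As a rough illustration of the threshold-based inspection described above, the following sketch (with hypothetical names; real DLD systems match digests of possibly transformed content, not raw substrings) counts sensitive bytes found in a payload and raises an alert once a threshold is passed:

```python
def inspect_payload(payload: bytes, sensitive_patterns: list, threshold: int) -> bool:
    """Return True (alert) if the total number of matched sensitive bytes
    in the payload reaches the threshold."""
    matched = 0
    for pattern in sensitive_patterns:
        # Count every occurrence of this pattern and weight it by its length.
        matched += payload.count(pattern) * len(pattern)
    return matched >= threshold

# Example: a payload leaking an SSN-like string twice (22 matched bytes).
payload = b"GET /submit?ssn=123-45-6789&copy=123-45-6789 HTTP/1.1"
patterns = [b"123-45-6789"]
print(inspect_payload(payload, patterns, threshold=20))  # True
```

A deployment would run this check per flow rather than per packet, since sensitive data can straddle packet boundaries.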
Straightforward realizations of data-leak detection require
the plaintext sensitive data. However, this requirement is
undesirable, as it may threaten the confidentiality of the
sensitive information. If a detection system is compromised,
then it may expose the plaintext sensitive data (in memory).
In addition, the data owner may need to outsource the
data-leak detection to providers, but may be unwilling to reveal
the plaintext sensitive data to them. Therefore, one needs
new data-leak detection solutions that allow the providers
to scan content for leaks without learning the sensitive
information.
In this paper, we propose a data-leak detection solution
that can be outsourced and deployed in a semi-honest
detection environment. We design, implement, and
evaluate our fuzzy fingerprint technique that enhances data
privacy during data-leak detection operations. Our approach
is based on a fast and practical one-way computation on the
sensitive data (SSN records, classified documents, sensitive
emails, etc.). It enables the data owner to securely delegate the
content-inspection task to DLD providers without exposing the
sensitive data. Using our detection method, the DLD provider,
who is modeled as an honest-but-curious (aka semi-honest)
adversary, can only gain limited knowledge about the sensitive
data from either the released digests, or the content
being inspected. Using our techniques, an Internet service
provider (ISP) can perform detection on its customers’ traffic
securely and provide data-leak detection as an add-on service
for its customers. In another scenario, individuals can mark
their own sensitive data and ask the administrator of their local
network to detect data leaks for them.
In our detection procedure, the data owner computes a
special set of digests or fingerprints from the sensitive data
and then discloses only a small number of them to the
DLD provider. The DLD provider computes fingerprints from
network traffic and identifies potential leaks in them.
To prevent the DLD provider from gathering exact knowledge
about the sensitive data, the collection of potential
leaks is composed of real leaks and noise. It is the data
owner who post-processes the potential leaks sent back by
the DLD provider and determines whether there is any real
data leak.
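The division of labor above can be sketched as follows (a simplified, non-fuzzy illustration: in the actual protocol the released digests would be fuzzified and only partially disclosed, so the provider's candidate set would contain deliberate noise):

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    # One-way digest of a data chunk (SHA-256 used here as a stand-in).
    return hashlib.sha256(chunk).hexdigest()

# Data owner: compute digests of the sensitive data and release them.
sensitive = [b"alice-ssn-123-45-6789", b"project-x-design-doc"]
owner_digests = {fingerprint(c) for c in sensitive}
released = owner_digests  # real protocol: a fuzzified, sampled subset

# DLD provider: digest observed traffic chunks, report candidate leaks.
traffic = [b"alice-ssn-123-45-6789", b"harmless newsletter"]
candidates = {fingerprint(c) for c in traffic} & released

# Data owner: post-process the candidates to confirm true leaks
# (with fuzzy digests, some candidates would be noise by design).
true_leaks = candidates & owner_digests
print(len(true_leaks))  # 1
```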

In this paper, we present details of our solution and provide
extensive experimental evidence and theoretical analyses to
demonstrate the feasibility and effectiveness of our approach.
Our contributions are summarized as follows.
1) We describe a privacy-preserving data-leak detection
model for preventing inadvertent data leaks in network
traffic. Our model supports the delegation of detection
operations, so ISPs can provide data-leak detection as an
add-on service to their customers.
2) We design, implement, and evaluate an efficient
technique, fuzzy fingerprint, for privacy-preserving
data-leak detection. Fuzzy fingerprints are special sensitive
data digests prepared by the data owner for release
to the DLD provider.
3) We implement our detection system and perform extensive
experimental evaluation on the 2.6 GB Enron dataset,
Internet surfing traffic of 20 users, and 5 simulated
real-world data-leak scenarios to measure its privacy
guarantee, detection rate, and efficiency. Our results indicate
high accuracy achieved by our underlying scheme
with very low false positive rate. Our results also show
that the detection accuracy does not degrade much
when only partial (sampled) sensitive-data digests are
used. In addition, we give an empirical analysis of our
fuzzification as well as of the fairness of fingerprint
partial disclosure.
II. MODEL AND OVERVIEW
We abstract the privacy-preserving data-leak detection
problem with a threat model, a security goal and a privacy
goal. First, we describe the two most important players in
our abstract model: the organization (i.e., data owner) and the
data-leak detection (DLD) provider.
• Organization owns the sensitive data and authorizes the
DLD provider to inspect the network traffic from the
organizational networks for anomalies, namely inadvertent
data leaks. However, the organization does not want
to directly reveal the sensitive data to the provider.
• DLD provider inspects the network traffic for potential
data leaks. The inspection can be performed offline without
causing any real-time delay in routing the packets.
However, the DLD provider may attempt to gain knowledge
about the sensitive data.
We describe the security and privacy goals in
Section II-A and Section II-B.
A. Security Goal and Threat Model
We categorize three causes for sensitive data appearing in
the outbound traffic of an organization, including legitimate
data use by employees.
• Case I Inadvertent data leak: The sensitive data
is accidentally leaked in the outbound traffic by a
legitimate user. This paper focuses on detecting this
type of accidental data leak over supervised network
channels. Inadvertent data leaks may be due to human
errors, such as forgetting to use encryption or carelessly forwarding
an internal email and attachments to outsiders, or to application flaws (such as those described in [12]).
A supervised network channel could be an unencrypted
channel or an encrypted channel where the content in it
can be extracted and checked by an authority. Such a
channel is widely used for advanced NIDS where MITM
(man-in-the-middle) SSL sessions are established instead
of normal SSL sessions [13].
• Case II Malicious data leak: A rogue insider or a piece of
stealthy software may steal sensitive personal or organizational
data from a host. Because the malicious adversary
can use strong private encryption, steganography or covert
channels to evade content-based traffic inspection, this
type of leak is out of the scope of our network-based
solution. Host-based defenses (such as detecting the
infection onset [14]) need to be deployed instead.
• Case III Legitimate and intended data transfer: The
sensitive data is sent by a legitimate user intended for
legitimate purposes. In this paper, we assume that the data
owner is aware of legitimate data transfers and permits
such transfers. So the data owner can tell whether a piece
of sensitive data in the network traffic is a leak using
legitimate data transfer policies.
The security goal in this paper is to detect Case I leaks, that
is, inadvertent data leaks over supervised network channels.
In other words, we aim to discover sensitive data appearance in
network traffic over supervised network channels. We assume
that: i) plaintext data in supervised network channels can
be extracted for inspection; ii) the data owner is aware of
legitimate data transfers (Case III); and iii) whenever sensitive
data is found in network traffic, the data owner can decide
whether or not it is a data leak. Network-based security
approaches are ineffective against data leaks caused by malware
or rogue insiders as in Case II, because the intruder may
use strong encryption when transmitting the data, and both the
encryption algorithm and the key could be unknown to the
DLD provider.
B. Privacy Goal and Threat Model
To prevent the DLD provider from gaining knowledge of
sensitive data during the detection process, we need to set up a
privacy goal that is complementary to the security goal above.
We model the DLD provider as a semi-honest adversary,
who follows our protocol to carry out the operations, but
may attempt to gain knowledge about the sensitive data of
the data owner. Our privacy goal is defined as follows. The
DLD provider is given digests of sensitive data from the data
owner and the content of network traffic to be examined. The
DLD provider should not find out the exact value of a piece of
sensitive data with a probability greater than 1/K, where K is an
integer representing the number of all possible sensitive-data
candidates that can be inferred by the DLD provider.
We present a privacy-preserving DLD model with a new
fuzzy fingerprint mechanism to improve the data protection
against a semi-honest DLD provider. We generate digests of
sensitive data through a one-way function, and then hide
the sensitive values among other non-sensitive values via
fuzzification. The privacy guarantee of such an approach
is quantified by K, the number of candidates among which the
true sensitive value is hidden from the DLD provider.



C. Overview of Privacy-Enhancing DLD
Our privacy-preserving data-leak detection method supports
practical data-leak detection as a service and minimizes the
knowledge that a DLD provider may gain during the process.
Fig. 1 lists the six operations executed by the data owner and
the DLD provider in our protocol. They include PREPROCESS,
run by the data owner to prepare the digests of sensitive
data; RELEASE, for the data owner to send the digests to the
DLD provider; MONITOR and DETECT, for the DLD provider
to collect outgoing traffic of the organization, compute digests
of traffic content, and identify potential leaks; REPORT, for
the DLD provider to return data-leak alerts to the data owner,
where there may be false positives (i.e., false alarms); and
POSTPROCESS, for the data owner to pinpoint true data-leak
instances. Details are presented in the next section.
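The six operations can be arranged into a minimal two-party skeleton (illustrative only; the fuzzification and sampling performed in RELEASE are omitted here, so this REPORT contains no deliberate noise):

```python
import hashlib

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

class DataOwner:
    def __init__(self, sensitive):
        self.sensitive = sensitive
        self.digests = set()
    def preprocess(self):                    # PREPROCESS: build digests
        self.digests = {digest(c) for c in self.sensitive}
    def release(self):                       # RELEASE (real protocol: fuzzified subset)
        return set(self.digests)
    def postprocess(self, alerts):           # POSTPROCESS: keep only true leaks
        return alerts & self.digests

class DLDProvider:
    def __init__(self, released):
        self.released = released
        self.observed = set()
    def monitor(self, traffic):              # MONITOR: digest outgoing traffic
        self.observed |= {digest(c) for c in traffic}
    def detect(self):                        # DETECT: match against released digests
        return self.observed & self.released
    def report(self):                        # REPORT: alerts (may hold false alarms)
        return self.detect()

owner = DataOwner([b"classified memo"])
owner.preprocess()
provider = DLDProvider(owner.release())
provider.monitor([b"classified memo", b"cat pictures"])
leaks = owner.postprocess(provider.report())
print(len(leaks))  # 1
```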
The protocol is based on strategically computing data
similarity, specifically the quantitative similarity between the
sensitive information and the observed network traffic. High
similarity indicates potential data leak. For data-leak detection,
the ability to tolerate a certain degree of data transformation
in traffic is important. We refer to this property as noise
tolerance. Our key idea for fast and noise-tolerant comparison
is the design and use of a set of local features that are
representatives of local data patterns, e.g., when byte b2
appears in the sensitive data, it is usually surrounded by bytes
b1 and b3 forming a local pattern b1, b2, b3. Local features
preserve data patterns even when modifications (insertion,
deletion, and substitution) are made to parts of the data. For
example, if a byte b4 is inserted after b3, the local pattern
b1, b2, b3 is retained though the global pattern (e.g., a hash
of the entire document) is destroyed. To achieve the privacy
goal, the data owner generates a special type of digests,
which we call fuzzy fingerprints. Intuitively, the purpose of fuzzy fingerprints is to hide the true sensitive data in a crowd.
This prevents the DLD provider from learning its exact value.
We describe the technical details next.
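The noise tolerance of local features can be checked directly: using 3-byte shingles as a stand-in for the local patterns b1, b2, b3 above, a single inserted byte destroys only the few shingles that span the insertion point, so set similarity stays high even though a whole-document hash would change completely.

```python
def shingles(data: bytes, n: int = 3) -> set:
    """All n-byte local features ("shingles") of a byte string."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

original = b"confidential report"
modified = b"confidential Xreport"  # one byte inserted

s1, s2 = shingles(original), shingles(modified)
# Only the shingles crossing the insertion point differ;
# the rest of the local patterns are preserved.
jaccard = len(s1 & s2) / len(s1 | s2)
print(jaccard)  # 0.75
```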
III. FUZZY FINGERPRINT METHOD AND PROTOCOL
We describe technical details of our fuzzy fingerprint mechanism
in this section.
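Before the details, the fuzzification idea can be illustrated with a toy construction (my own stand-in, not the paper's exact scheme): mask the low-order bits of each fingerprint before release, so that K = 2^m exact fingerprints collide on the released value and are indistinguishable to the provider.

```python
import hashlib

FUZZY_BITS = 8  # m masked low-order bits; K = 2**FUZZY_BITS = 256 candidates

def exact_fp(chunk: bytes, bits: int = 32) -> int:
    """Truncated hash as a stand-in for the fingerprint function."""
    return int.from_bytes(hashlib.sha256(chunk).digest()[:bits // 8], "big")

def fuzzify(fp: int) -> int:
    """Release only the high-order bits; all fingerprints
    sharing them map to the same fuzzy fingerprint."""
    return fp >> FUZZY_BITS

# The provider matches on fuzzy fingerprints, so any of K = 256 exact
# fingerprints look alike to it; only the data owner, who keeps the
# exact fingerprint, can confirm a true leak in post-processing.
secret = b"123-45-6789"
released = fuzzify(exact_fp(secret))

observed = exact_fp(b"123-45-6789")            # seen in traffic
is_candidate = fuzzify(observed) == released   # provider-side match
is_real_leak = observed == exact_fp(secret)    # owner-side confirmation
print(is_candidate, is_real_leak)  # True True
```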