Data Leakage Detection



INTRODUCTION

In the course of doing business, sometimes sensitive data
must be handed over to supposedly trusted third parties.
For example, a hospital may give patient records to
researchers who will devise new treatments. Similarly, a
company may have partnerships with other companies that
require sharing customer data. Another enterprise may
outsource its data processing, so data must be given to
various other companies. We call the owner of the data the
distributor and the supposedly trusted third parties the
agents. Our goal is to detect when the distributor’s sensitive
data have been leaked by agents, and if possible to identify
the agent that leaked the data.

We consider applications where the original sensitive
data cannot be perturbed. Perturbation is a very useful
technique where the data are modified and made “less
sensitive” before being handed to agents. For example, one
can add random noise to certain attributes, or one can
replace exact values by ranges [18]. However, in some cases,
it is important not to alter the original distributor’s data. For
example, if an outsourcer is doing our payroll, he must have
the exact salary and customer bank account numbers. If
medical researchers will be treating patients (as opposed to
simply computing statistics), they may need accurate data
for the patients.
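
To illustrate the perturbation techniques that this work rules out, here is a minimal Python sketch (not from the paper; the attribute names, noise scale, and bin width are hypothetical) that adds random noise to one attribute and replaces another exact value with a range:

import random

def perturb_record(record, noise_attrs, range_attrs,
                   noise_scale=0.05, bin_width=10000):
    # Illustrative only: noise some attributes, bin others into ranges.
    out = dict(record)
    for attr in noise_attrs:
        # Multiply by a random factor in [1 - noise_scale, 1 + noise_scale].
        out[attr] = record[attr] * (1 + random.uniform(-noise_scale, noise_scale))
    for attr in range_attrs:
        # Replace the exact value with the enclosing range.
        lo = (record[attr] // bin_width) * bin_width
        out[attr] = (lo, lo + bin_width)
    return out

# Example: a salary is noised, a balance is replaced by a range.
print(perturb_record({"salary": 52000, "balance": 13750},
                     noise_attrs=["salary"], range_attrs=["balance"]))

Payroll processing is exactly the case where neither transformation is acceptable, which motivates the watermark-free detection approach studied here.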

Guilty Agents

Suppose that after giving objects to agents, the distributor
discovers that a set S ⊆ T has leaked. This means that some
third party, called the target, has been caught in possession
of S. For example, this target may be displaying S on its
website, or perhaps as part of a legal discovery process, the
target turned over S to the distributor.

Since the agents U1, ..., Un have some of the data, it is
reasonable to suspect them of leaking the data. However, the
agents can argue that they are innocent, and that the S data
were obtained by the target through other means. For
example, say that one of the objects in S represents a
customer X. Perhaps X is also a customer of some other
company, and that company provided the data to the target.
Or perhaps X can be reconstructed from various publicly
available sources on the web.

Our goal is to estimate the likelihood that the leaked data
came from the agents as opposed to other sources.
Intuitively, the more data in S, the harder it is for the
agents to argue they did not leak anything. Similarly, the
“rarer” the objects, the harder it is to argue that the target
obtained them through other means. Not only do we want
to estimate the likelihood the agents leaked data, but we
would also like to find out if one of them, in particular, was
more likely to be the leaker. For instance, if one of the
S objects was only given to agent U1, while the other objects
were given to all agents, we may suspect U1 more. The
model we present next captures this intuition.

We say an agent Ui is guilty if it contributes one or
more objects to the target. We denote the event that agent Ui
is guilty by Gi and the event that agent Ui is guilty for a
given leaked set S by Gi|S. Our next step is to estimate
Pr{Gi|S}, i.e., the probability that agent Ui is guilty given
evidence S.


AGENT GUILT MODEL

To compute this Pr{Gi|S}, we need an estimate for the
probability that values in S can be “guessed” by the target.
For instance, say that some of the objects in S are e-mails of
individuals. We can conduct an experiment and ask a
person with approximately the expertise and resources of
the target to find the e-mail of, say, 100 individuals. If this
person can find, say, 90 e-mails, then we can reasonably
guess that the probability of finding one e-mail is 0.9. On
the other hand, if the objects in question are bank account
numbers, the person may only discover, say, 20, leading to
an estimate of 0.2. We call this estimate pt, the probability
that object t can be guessed by the target.
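
The estimate pt is simply the empirical success rate of the guessing experiment. A minimal Python sketch of the two worked examples above (90 of 100 e-mails found, 20 of 100 account numbers):

def estimate_pt(found, asked):
    # pt = fraction of objects of this type the experimenter could guess.
    return found / asked

p_email   = estimate_pt(90, 100)   # 0.9 for e-mail addresses
p_account = estimate_pt(20, 100)   # 0.2 for bank account numbers
print(p_email, p_account)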

GUILT MODEL ANALYSIS

In order to see how our model parameters interact and to
check if the interactions match our intuition, in this
section, we study two simple scenarios. In each scenario,
we have a target that has obtained all the distributor’s
objects, i.e., T = S.

Impact of Probability

In our first scenario, T contains 16 objects: all of them are
given to agent U1 and only eight are given to a second
agent U2. We calculate the probabilities Pr{G1|S} and
Pr{G2|S} for p in the range [0, 1] and we present the results
in Fig. 1a. The dashed line shows Pr{G1|S} and the solid
line shows Pr{G2|S}.

As p approaches 0, it becomes more and more unlikely
that the target guessed all 16 values. Each agent has enough
of the leaked data that its individual guilt approaches 1.
However, as p increases in value, the probability that U2 is
guilty decreases significantly: all of U2’s eight objects were
also given to U1, so it gets harder to blame U2 for the leak.
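
This excerpt does not reproduce the closed-form guilt probability, so the Python sketch below assumes the independence-based model this analysis suggests: each leaked object t is guessed by the target with probability p, and otherwise was supplied by one of the agents holding t, each equally likely; objects are treated as independent. Under that assumption, Pr{Gi|S} = 1 − ∏ (1 − (1 − p)/|Vt|), the product taken over objects t in S held by Ui, where Vt is the set of agents holding t.

def guilt_probability(agent, leaked, holdings, p):
    # Pr{Gi|S} under the assumed model: each leaked object t is guessed
    # with probability p; otherwise it came from one of its holders,
    # each equally likely. Objects are treated independently.
    prob_innocent = 1.0
    for t in leaked:
        holders = [a for a, objs in holdings.items() if t in objs]
        if agent in holders:
            # Probability that this agent did NOT supply object t.
            prob_innocent *= 1.0 - (1.0 - p) / len(holders)
    return 1.0 - prob_innocent

# Scenario of Fig. 1a: 16 objects; U1 holds all of them, U2 holds eight.
objects = list(range(16))
holdings = {"U1": set(objects), "U2": set(objects[:8])}
for p in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(p, guilt_probability("U1", objects, holdings, p),
             guilt_probability("U2", objects, holdings, p))

Running this reproduces the qualitative shape described above: both guilt probabilities are near 1 when p is small, while Pr{G2|S} falls off much faster than Pr{G1|S} as p grows, since every object U2 holds is also held by U1.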