27-10-2012, 05:30 PM
DATA LEAKAGE DETECTION
ABSTRACT
Modern business activities rely on extensive email exchange. Email leakages have become widespread, and the
severe damage caused by such leakages is a disturbing problem for organizations. We study the
following problem: a data distributor has given sensitive data to a set of supposedly trusted agents (third
parties). If data distributed to the third parties is later found in a public or private domain, identifying the
guilty party is a nontrivial task for the distributor. Traditionally, such leakage is handled by watermarking,
which requires modifying the data; if the watermarked copy is found at an unauthorized site, the distributor
can claim ownership. To overcome the disadvantages of watermarking [2], data allocation strategies are
used instead to improve the probability of identifying guilty third parties. The distributor must assess the
likelihood that the leaked data came from one or more agents, as opposed to having been independently
gathered by other means. In this project, we implement and analyze a guilt model that detects guilty agents
using allocation strategies, without modifying the original data. A guilty agent is one who leaks a portion of
the distributed data. We propose data allocation strategies that improve the probability of identifying
leakages. In some cases we can also inject "realistic but fake" data records to further improve our chances
of detecting leakage and identifying the guilty party. The algorithms implemented using fake objects
improve the distributor's chance of detecting guilty agents. It is observed that minimizing the sum objective
increases the chance of detecting guilty agents. We also developed a framework for generating fake objects.
INTRODUCTION
Demanding market conditions encourage many companies to outsource certain business processes
(e.g. marketing, human resources) and associated activities to a third party. This model is referred to as
Business Process Outsourcing (BPO), and it allows companies to focus on their core competency by
subcontracting other activities to specialists, resulting in reduced operational costs and increased
productivity. Security and business assurance are essential for BPO. In most cases, the service
providers need access to a company's intellectual property and other confidential information to carry
out their services. For example, a human resources BPO vendor may need access to employee
databases containing sensitive information (e.g. social security numbers), a patenting law firm to some
research results, a marketing service vendor to customers' contact information, and a payment
service provider to customers' credit card or bank account numbers.
The main security problem in BPO is that the service provider may not be fully trusted or may not be
securely administered. Business agreements for BPO try to regulate how the data will be handled by
service providers, but it is almost impossible to truly enforce or verify such policies across different
administrative domains. Due to their digital nature, relational databases are easy to duplicate, and in
many cases a service provider may have financial incentives to redistribute commercially valuable
data, or may simply fail to handle it properly. Hence, we need powerful techniques that can detect and
deter such dishonest behavior.
PROBLEM DEFINITION
Suppose a distributor owns a set T = {t1, …, tm} of valuable data objects. The distributor wants to share
some of the objects with a set of agents U1, U2, …, Un, but does not wish the objects to be leaked to other
third parties. An agent Ui receives a subset of objects Ri ⊆ T, determined either by a
sample request or an explicit request:
Sample request, Ri = SAMPLE(T, mi): any subset of mi records from T may be given to Ui.
Explicit request, Ri = EXPLICIT(T, condi): agent Ui receives all the objects in T that satisfy condi.
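The two request types can be sketched in Python as follows (a minimal illustration; the function names and the toy data set are ours, not from the paper):

```python
import random

def sample_request(T, m, seed=None):
    """Ri = SAMPLE(T, m): any subset of m records from T."""
    rng = random.Random(seed)
    # random.sample needs a sequence, so sort the set first.
    return set(rng.sample(sorted(T), m))

def explicit_request(T, cond):
    """Ri = EXPLICIT(T, cond): all objects of T satisfying cond."""
    return {t for t in T if cond(t)}

# Toy data set of six objects.
T = {"t1", "t2", "t3", "t4", "t5", "t6"}

R1 = sample_request(T, 3, seed=42)                      # sample request
R2 = explicit_request(T, lambda t: t in ("t2", "t4"))   # explicit request

print(len(R1))      # 3
print(sorted(R2))   # ['t2', 't4']
```

Note that a sample request leaves the distributor free to choose *which* mi records to hand out, which is exactly the freedom the allocation strategies later exploit.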
The objects in T could be of any type and size; e.g., they could be tuples in a relation, or relations in a
database. After giving objects to the agents, the distributor discovers that a set S ⊆ T has leaked. This
means that some third party, called the target, has been caught in possession of S. For example, the
target may be displaying S on its web site, or, perhaps as part of a legal discovery process, the target
turned S over to the distributor. Since the agents U1, U2, …, Un have some of the data, it is reasonable
to suspect them of leaking it. However, the agents can argue that they are innocent and that the target
obtained S through other means.
Agent Guilt Model
Suppose an agent Ui is guilty if it contributes one or more objects to the target. The event that agent
Ui is guilty for a given leaked set S is denoted by Gi | S. The next step is to estimate Pr {Gi | S }, i.e.,
the probability that agent Gi is guilty given evidence S.
To compute the Pr {Gi | S}, estimate the probability that values in S can be “guessed” by the target.
For instance, say some of the objects in t are emails of individuals. Conduct an experiment and ask a
person to find the email of say 100 individuals, the person may only discover say 20, leading to an
estimate of 0.2. Call this estimate as Pt, the probability that object t can be guessed by the target.
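The arithmetic of that experiment is just a success ratio; as a one-line estimator (the helper name is ours):

```python
def guess_probability(discovered, attempted):
    """Estimate p_t: the fraction of objects a target could
    discover on its own, from a guessing experiment."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return discovered / attempted

# The experiment in the text: 20 of 100 email addresses found.
p_t = guess_probability(20, 100)
print(p_t)  # 0.2
```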
We make two assumptions regarding the relationships among the various leakage events.
Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′.
The term provenance in this assumption refers to the source of a value t that appears in the
leaked set. The source can be any of the agents who have t in their sets, or the target itself.
Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways.
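The excerpt ends before stating the resulting estimator. For illustration only, under these two assumptions a natural computation is: for each leaked object that agent Ui received, the object was either guessed by the target (probability p) or leaked by one of the agents holding it; assuming every holder is an equally likely source (our simplifying assumption, not stated in the excerpt), the guilt probability multiplies the per-object "innocence" factors and subtracts from one:

```python
def guilt_probability(S, R, i, p):
    """Pr{G_i | S}: probability agent i leaked at least one object of S.

    S : set of leaked objects
    R : dict mapping agent id -> set of objects given to that agent
    i : id of the agent under suspicion
    p : probability the target guessed an object on its own

    Sketch under two assumptions: (1) objects leak independently,
    (2) each object was either guessed (prob. p) or leaked by exactly
    one of the agents holding it, all holders equally likely.
    """
    prob_innocent = 1.0
    for t in S & R[i]:
        holders = sum(1 for objs in R.values() if t in objs)
        # Chance that agent i was NOT the source of this object.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent

# Two agents; both hold t2, only agent 1 holds t1.
R = {1: {"t1", "t2"}, 2: {"t2", "t3"}}
S = {"t1", "t2"}
print(round(guilt_probability(S, R, 1, p=0.2), 3))  # 0.88
```

Note how the object held only by agent 1 dominates its guilt score: with p = 0.2, the sole holder of a leaked object is very likely its source, while a widely shared object contributes much weaker evidence.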