29-03-2012, 05:00 PM
DATA LEAKAGE DETECTION
30.pdf (Size: 562.83 KB / Downloads: 140)
Introduction
In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted
third parties. For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require sharing
customer data. Another enterprise may outsource its data processing, so data must be given to various
other companies. We call the owner of the data the distributor and the supposedly trusted third parties
the agents. Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if
possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed. Perturbation is a
very useful technique where the data is modified and made “less sensitive” before being handed to
agents. For example, one can add random noise to certain attributes, or one can replace exact values by
ranges [18]. However, in some cases it is important not to alter the original distributor’s data. For example,
if an outsourcer is doing our payroll, he must have the exact salary and customer identification
numbers. If medical researchers will be treating patients (as opposed to simply computing statistics),
they may need accurate data for the patients.
Problem Setup and Notation
2.1 Entities and Agents
A distributor owns a set T = {t1, . . . , tm} of valuable data objects. The distributor wants to share some
of the objects with a set of agents U1, U2, . . . , Un, but does not wish the objects to be leaked to other third parties.
The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a
database.
An agent Ui receives a subset of objects Ri ⊆ T, determined either by a sample request or an explicit
request:
Sample request Ri = SAMPLE(T, mi): Any subset of mi records from T can be given to Ui.
Explicit request Ri = EXPLICIT(T, condi): Agent Ui receives all the T objects that satisfy condi.
Example. Say T contains customer records for a given company A. Company A hires a marketing
agency U1 to do an on-line survey of customers. Since any customers will do for the survey, U1 requests
a sample of 1000 customer records. At the same time, company A subcontracts with agent U2 to handle
billing for all California customers. Thus, U2 receives all T records that satisfy the condition “state is
California.”
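The two request types in the example can be sketched in Python as follows. This is an illustrative sketch, not part of the paper's model; the function names and the toy record layout are hypothetical.

```python
import random

def sample_request(T, m, seed=None):
    """SAMPLE(T, m): any subset of m records from T (here, a random one)."""
    rng = random.Random(seed)
    return rng.sample(T, m)

def explicit_request(T, cond):
    """EXPLICIT(T, cond): every record of T that satisfies cond."""
    return [t for t in T if cond(t)]

# Hypothetical customer table for company A: half the customers are in California.
T = [{"id": i, "state": "California" if i % 2 == 0 else "Nevada"}
     for i in range(2000)]

# U1's sample request for 1000 customer records, U2's explicit request
# for all California customers.
R1 = sample_request(T, 1000, seed=0)
R2 = explicit_request(T, lambda t: t["state"] == "California")
```

Note that any subset of size mi satisfies a sample request; drawing it at random is just one convenient choice.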
Although we do not discuss it here, our model can be easily extended to requests for a sample of
objects that satisfy a condition (e.g., an agent wants any 100 California customer records). Also note that
we do not concern ourselves with the randomness of a sample. (We assume that if a random sample
is required, there are enough T records so that the to-be-presented object selection schemes can pick
random records from T.)
2.2 Guilty Agents
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This
means that some third party, called the target, has been caught in possession of S. For example, this
target may be displaying S on its web site, or perhaps as part of a legal discovery process, the target
turned over S to the distributor.
Since the agents U1, . . . , Un have some of the data, it is reasonable to suspect them of leaking the data.
However, the agents can argue that they are innocent, and that the S data was obtained by the target
through other means. For example, say one of the objects in S represents a customer X. Perhaps X is
also a customer of some other company, and that company provided the data to the target. Or perhaps
X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not
leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them
through other means. Not only do we want to estimate the likelihood that the agents leaked data, but we
would also like to find out if one of them in particular was more likely to be the leaker. For instance,
if one of the S objects was only given to agent U1, while the other objects were given to all agents, we
may suspect U1 more. The model we present next captures this intuition.
We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event
that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the
probability that agent Ui is guilty given evidence S.
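The intuition that an object given to only one agent points more strongly at that agent can be illustrated with a small sketch. The function name and data are hypothetical; this simply lists, for each leaked object, which agents could have contributed it (an agent can only leak objects in S ∩ Ri).

```python
def candidate_leakers(S, R):
    """Map each leaked object t in S to the agents whose set Ri contains t.

    S: set of leaked objects; R: dict mapping agent name -> set of objects
    that agent received. Agents not given t cannot have contributed it.
    """
    return {t: sorted(a for a, Ra in R.items() if t in Ra) for t in S}

# Object "x" was given only to U1; object "y" was given to all agents.
R = {"U1": {"x", "y"}, "U2": {"y"}, "U3": {"y"}}
S = {"x", "y"}
who = candidate_leakers(S, R)
# who["x"] contains only U1, so U1 is the sole candidate for "x",
# which is why we would suspect U1 more.
```

This only narrows the candidates; the guilt model below weighs in the chance that the target guessed the objects on its own.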
3 Agent Guilt Model
To compute this Pr{Gi|S}, we need an estimate for the probability that values in S can be “guessed”
by the target. For instance, say some of the objects in S are emails of individuals. We can conduct an
experiment and ask a person with approximately the expertise and resources of the target to find the
email of say 100 individuals. If this person can find say 90 emails, then we can reasonably guess that
the probability of finding one email is 0.9. On the other hand, if the objects in question are bank account
numbers, the person may only discover say 20, leading to an estimate of 0.2. We call this estimate pt,
the probability that object t can be guessed by the target.
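The experiment described above reduces to a simple frequency estimate. A minimal sketch (function name hypothetical), using the two figures from the text:

```python
def estimate_guess_probability(found, asked):
    """Estimate p_t, the probability the target can guess an object on its
    own, from an experiment where a person with roughly the target's
    expertise and resources was asked to find `asked` objects and
    located `found` of them."""
    if asked <= 0:
        raise ValueError("need at least one trial")
    if not 0 <= found <= asked:
        raise ValueError("found must be between 0 and asked")
    return found / asked

# Emails: 90 of 100 found -> p_t = 0.9.
# Bank account numbers: 20 of 100 found -> p_t = 0.2.
p_email = estimate_guess_probability(90, 100)
p_account = estimate_guess_probability(20, 100)
```

In practice one estimate per object class (emails, account numbers, etc.) would be used for every object t of that class.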