29-03-2012, 05:00 PM
DATA LEAKAGE DETECTION
30.pdf (Size: 562.83 KB / Downloads: 140)
Introduction
In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted
third parties. For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require sharing
customer data. Another enterprise may outsource its data processing, so data must be given to various
other companies. We call the owner of the data the distributor and the supposedly trusted third parties
the agents. Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if
possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed. Perturbation is a
very useful technique where the data is modified and made “less sensitive” before being handed to
agents. For example, one can add random noise to certain attributes, or one can replace exact values by
ranges [18]. However, in some cases it is important not to alter the original distributor’s data. For example,
if an outsourcer is doing our payroll, he must have the exact salary and customer identification
numbers. If medical researchers will be treating patients (as opposed to simply computing statistics),
they may need accurate data for the patients.
Problem Setup and Notation
2.1 Entities and Agents
A distributor owns a set T = {t1, . . . , tm} of valuable data objects. The distributor wants to share some
of the objects with a set of agents U1, U2, . . . , Un, but does not wish the objects to be leaked to other third parties.
The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a
database.
An agent Ui receives a subset of objects Ri ⊆ T, determined either by a sample request or an explicit
request:
Sample request Ri = SAMPLE(T, mi): Any subset of mi records from T can be given to Ui.
Explicit request Ri = EXPLICIT(T, condi): Agent Ui receives all the T objects that satisfy condi.
Example. Say T contains customer records for a given company A. Company A hires a marketing
agency U1 to do an on-line survey of customers. Since any customers will do for the survey, U1 requests
a sample of 1000 customer records. At the same time, company A subcontracts with agent U2 to handle
billing for all California customers. Thus, U2 receives all T records that satisfy the condition “state is
California.”
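The two request types in the example can be sketched in Python as follows. This is an illustrative sketch, not part of the paper's model; the function names and the toy record layout are hypothetical.

```python
import random

def sample_request(T, m, seed=None):
    """SAMPLE(T, m): any subset of m records from T (here, a random one)."""
    rng = random.Random(seed)
    return rng.sample(T, m)

def explicit_request(T, cond):
    """EXPLICIT(T, cond): every record of T that satisfies cond."""
    return [t for t in T if cond(t)]

# Hypothetical customer table for company A: half the customers are in California.
T = [{"id": i, "state": "California" if i % 2 == 0 else "Nevada"}
     for i in range(2000)]

# U1's sample request for 1000 customer records, U2's explicit request
# for all California customers.
R1 = sample_request(T, 1000, seed=0)
R2 = explicit_request(T, lambda t: t["state"] == "California")
```

Note that any subset of size mi satisfies a sample request; drawing it at random is just one convenient choice.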
Although we do not discuss it here, our model can be easily extended to requests for a sample of
objects that satisfy a condition (e.g., an agent wants any 100 California customer records). Also note that
we do not concern ourselves with the randomness of a sample. (We assume that if a random sample
is required, there are enough T records so that the to-be-presented object selection schemes can pick
random records from T.)
2.2 Guilty Agents
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This
means that some third party, called the target, has been caught in possession of S. For example, this
target may be displaying S on its web site, or perhaps as part of a legal discovery process, the target
turned over S to the distributor.
Since the agents U1, . . . , Un have some of the data, it is reasonable to suspect them of leaking the data.
However, the agents can argue that they are innocent, and that the S data was obtained by the target
through other means. For example, say one of the objects in S represents a customer X. Perhaps X is
also a customer of some other company, and that company provided the data to the target. Or perhaps
X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not
leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them
through other means. Not only do we want to estimate the likelihood that the agents leaked data, but we
would also like to find out if one of them in particular was more likely to be the leaker. For instance,
if one of the S objects was only given to agent U1, while the other objects were given to all agents, we
may suspect U1 more. The model we present next captures this intuition.
We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event
that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the
probability that agent Ui is guilty given evidence S.
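The intuition that an object given to only one agent points more strongly at that agent can be illustrated with a small sketch. The function name and data are hypothetical; this simply lists, for each leaked object, which agents could have contributed it (an agent can only leak objects in S ∩ Ri).

```python
def candidate_leakers(S, R):
    """Map each leaked object t in S to the agents whose set Ri contains t.

    S: set of leaked objects; R: dict mapping agent name -> set of objects
    that agent received. Agents not given t cannot have contributed it.
    """
    return {t: sorted(a for a, Ra in R.items() if t in Ra) for t in S}

# Object "x" was given only to U1; object "y" was given to all agents.
R = {"U1": {"x", "y"}, "U2": {"y"}, "U3": {"y"}}
S = {"x", "y"}
who = candidate_leakers(S, R)
# who["x"] contains only U1, so U1 is the sole candidate for "x",
# which is why we would suspect U1 more.
```

This only narrows the candidates; the guilt model below weighs in the chance that the target guessed the objects on its own.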
3 Agent Guilt Model
To compute this Pr{Gi|S}, we need an estimate for the probability that values in S can be “guessed”
by the target. For instance, say some of the objects in S are emails of individuals. We can conduct an
experiment and ask a person with approximately the expertise and resources of the target to find the
email of say 100 individuals. If this person can find say 90 emails, then we can reasonably guess that
the probability of finding one email is 0.9. On the other hand, if the objects in question are bank account
numbers, the person may only discover say 20, leading to an estimate of 0.2. We call this estimate pt,
the probability that object t can be guessed by the target.
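The experiment described above reduces to a simple frequency estimate. A minimal sketch (function name hypothetical), using the two figures from the text:

```python
def estimate_guess_probability(found, asked):
    """Estimate p_t, the probability the target can guess an object on its
    own, from an experiment where a person with roughly the target's
    expertise and resources was asked to find `asked` objects and
    located `found` of them."""
    if asked <= 0:
        raise ValueError("need at least one trial")
    if not 0 <= found <= asked:
        raise ValueError("found must be between 0 and asked")
    return found / asked

# Emails: 90 of 100 found -> p_t = 0.9.
# Bank account numbers: 20 of 100 found -> p_t = 0.2.
p_email = estimate_guess_probability(90, 100)
p_account = estimate_guess_probability(20, 100)
```

In practice one estimate per object class (emails, account numbers, etc.) would be used for every object t of that class.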