13-06-2012, 12:11 PM
DATA LEAKAGE DETECTION
AIM
The aim is to detect when the distributor’s sensitive data has been leaked by agents and, if possible, to identify the agent that leaked it.
ABSTRACT
A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Data allocation strategies (across the agents) that improve the probability of identifying leakages have been proposed. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject “realistic but fake” data records to further improve the chances of detecting leakage and identifying the guilty party.
OBJECTIVE
A data distributor has given sensitive data to a set of supposedly trusted agents (third parties).
Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop).
The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.
Data allocation strategies (across the agents) that improve the probability of identifying leakages are devised.
These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject “realistic but fake” data records to further improve the chances of detecting leakage and identifying the guilty party.
Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.
EXISTING SYSTEM
Perturbation
Applications where the original sensitive data cannot be perturbed have been considered. Perturbation is a very useful technique in which the data is modified and made “less sensitive” before being handed to agents. For example, one can add random noise to certain attributes, or replace exact values by ranges. However, in some cases it is important not to alter the original distributor’s data. For example, if an outsourcer is doing the payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data about the patients.
Watermarking
Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.
Disadvantages
* Perturbation modifies the data and makes it “less sensitive” before it is handed to agents, so it cannot be used in applications where the original sensitive data must be released unaltered.
* Watermarking embeds a unique code in each distributed copy, so that the leaker can be identified if a copy is later discovered in the hands of an unauthorized party; but it, too, involves some modification of the original data.
* Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.
PROPOSED SYSTEM
A model for assessing the “guilt” of agents has been developed. An algorithm for distributing objects to agents, in a way that improves the chances of identifying a leaker, has been proposed. The option of adding “fake” objects to the distributed set has also been considered. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.
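The fake-object idea can be sketched in code. The sketch below is illustrative only: the class and method names (`FakeObjectInjector`, `allocate`, `traceLeak`) are our own and not part of any fixed API, and the fake record is a simple placeholder string rather than a truly realistic-looking record. The key point it shows is that each fake object is given to exactly one agent, so a leaked set containing a fake object points back to that agent.

```java
import java.util.*;

// Illustrative sketch: inject agent-specific fake records into each
// agent's allocation and remember which agent received which fake.
public class FakeObjectInjector {
    private final Map<String, String> fakeToAgent = new HashMap<>();

    // Give the agent its requested records plus one fake record.
    // A real system would make the fake indistinguishable from genuine data;
    // here it is just a labeled placeholder.
    public List<String> allocate(String agentId, List<String> requested) {
        List<String> given = new ArrayList<>(requested);
        String fake = "FAKE-" + agentId + "-" + fakeToAgent.size();
        fakeToAgent.put(fake, agentId);
        given.add(fake);
        return given;
    }

    // If a leaked set contains a fake record, return the agent it was given to.
    public Optional<String> traceLeak(Collection<String> leaked) {
        for (String obj : leaked) {
            if (fakeToAgent.containsKey(obj)) {
                return Optional.of(fakeToAgent.get(obj));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        FakeObjectInjector injector = new FakeObjectInjector();
        List<String> a1 = injector.allocate("A1", Arrays.asList("rec1", "rec2"));
        injector.allocate("A2", Arrays.asList("rec2", "rec3"));
        // Suppose the leaked set contains A1's fake object:
        System.out.println(
            injector.traceLeak(Arrays.asList("rec2", a1.get(2))).orElse("unknown"));
        // prints "A1"
    }
}
```

Note that, unlike a per-record watermark, nothing in the genuine records is modified; only the composition of each agent’s set changes.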
Advantages
After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place.
At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.
A model for assessing the “guilt” of agents is developed.
Algorithms are also presented for distributing objects to agents in a way that improves the chances of identifying a leaker.
The option of adding “fake” objects to the distributed set is considered. Such objects do not correspond to real entities but appear realistic to the agents.
If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.
Algorithm Steps
Step: 1 Distributor gets a request from an agent
The distributor gives the requested data to the agent.
Step: 2 Distributor creates fake object and allocates it to the agent
The distributor can create one fake object (B = 1) and both agents can receive one fake object (b1 = b2 = 1). If the distributor is able to create more fake objects, he could further improve the objective.
Step: 3 Check the number of agents who have already received data
The distributor checks the number of agents who have already received data.
Step: 4 Check for remaining agents
The distributor chooses the remaining agents to send the data to. The distributor can increase the number of possible allocations by adding fake objects.
Step: 5 Select fake objects again to allocate to the remaining agents
The distributor chooses a random fake object to allocate to the remaining agents.
Step: 6 Estimate the guilt probability for each agent
To compute this probability, we need an estimate of the probability that values can be “guessed” by the target.
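Step 6 can be sketched as follows. This follows the commonly used guilt model for this problem: each leaked object t is assumed to have been guessed by the target independently with probability p; otherwise it came from one of the |V_t| agents that received it, each equally likely. Agent i is then guilty with probability Pr{G_i | S} = 1 − ∏ over t in S ∩ R_i of (1 − (1 − p)/|V_t|). The method and variable names below are illustrative, not a fixed API.

```java
import java.util.*;

// Illustrative sketch of the guilt estimate for a single agent.
public class GuiltEstimator {
    /**
     * @param leaked       S, the set of leaked objects
     * @param agentObjects R_i, the objects given to agent i
     * @param holders      for each object t, the number of agents |V_t| holding it
     * @param p            probability the target guessed an object on its own
     * @return Pr{G_i | S}: probability that agent i leaked at least one object
     */
    public static double guiltProbability(Set<String> leaked,
                                          Set<String> agentObjects,
                                          Map<String, Integer> holders,
                                          double p) {
        double probNotGuilty = 1.0;
        for (String t : leaked) {
            if (agentObjects.contains(t)) {
                // t came from an agent (prob 1-p) and, among the |V_t| agents
                // holding t, agent i is the source with probability 1/|V_t|.
                probNotGuilty *= 1.0 - (1.0 - p) / holders.get(t);
            }
        }
        return 1.0 - probNotGuilty;
    }

    public static void main(String[] args) {
        Set<String> leaked = new HashSet<>(Arrays.asList("t1", "t2"));
        Set<String> r1 = new HashSet<>(Arrays.asList("t1", "t2"));
        Map<String, Integer> holders = new HashMap<>();
        holders.put("t1", 1); // only agent 1 holds t1
        holders.put("t2", 2); // agents 1 and 2 both hold t2
        System.out.println(guiltProbability(leaked, r1, holders, 0.5));
        // prints 0.625  (1 - (1 - 0.5/1) * (1 - 0.5/2))
    }
}
```

As the example suggests, objects held by fewer agents contribute more evidence: the uniquely held t1 raises agent 1’s guilt estimate more than the shared t2 does.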
SYSTEM ARCHITECTURE
Methodology
The project is developed in the following stages
* Analysis – analysis of the customer need
* Design – design of the desired solution
* Development – (technical) development of the solution
* Implementation – deployment of the developed solution in the organization
* Evaluation – evaluation of the implemented solution
Waterfall development
The Waterfall model is a sequential development approach, in which development is seen as flowing steadily downwards (like a waterfall) through the phases of requirements analysis, design, implementation, testing (validation), integration, and maintenance.
The basic principles are:
* The project is divided into sequential phases, with some overlap and splashback acceptable between phases.
* Emphasis is on planning, time schedules, target dates, budgets and implementation of an entire system at one time.
* Tight control is maintained over the life of the project via extensive written documentation, formal reviews, and approval/signoff by the user and information technology management occurring at the end of most phases before beginning the next phase.
Integrated development environment
An integrated development environment (IDE), also known as an integrated design environment or integrated debugging environment, is a software application that provides comprehensive facilities to computer programmers for software development. An IDE normally consists of:
* a source code editor,
* a compiler and/or interpreter,
* build automation tools, and
* a debugger (usually).
IDEs are designed to maximize programmer productivity by providing tight-knit components with similar user interfaces. Typically an IDE is dedicated to a specific programming language, so as to provide a feature set which most closely matches the programming paradigms of the language.
The IDE used in our project work is NetBeans.