11-08-2014, 01:16 PM
Privacy Preserving and Data mining
PRIYANKA BALLUNDAGI.doc (Size: 52.5 KB / Downloads: 13)
INTRODUCTION
Data mining and knowledge discovery in databases are two new research areas that investigate the automatic extraction of previously unknown patterns from large amounts of data. Recent advances in data collection, data dissemination and related technologies have opened a new era of research in which existing data mining algorithms should be reconsidered from a different point of view, that of privacy preservation. It is well documented that the unchecked explosion of new information through the Internet and other media has reached a point where threats against privacy occur daily and deserve serious attention.
Privacy preserving data mining is a novel research direction in data mining and statistical databases, in which data mining algorithms are analyzed for the side effects they have on data privacy. The main consideration in privacy preserving data mining is twofold. First, sensitive raw data such as identifiers, names and addresses should be modified or trimmed out of the original database, so that the recipient of the data cannot compromise another person's privacy. Second, sensitive knowledge that can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy. The main objective in privacy preserving data mining is to develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process.
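The first consideration above (trimming identifying fields before the data is released) can be sketched as follows. This is only a minimal illustration: the field names and the sample records are hypothetical and do not come from the report, and removing direct identifiers alone does not guarantee privacy, since combinations of the remaining attributes may still single out a person.

```python
# Hypothetical field names; the report does not specify a schema.
IDENTIFYING_FIELDS = {"name", "address", "national_id"}

def strip_identifiers(records, identifying_fields=IDENTIFYING_FIELDS):
    """Return copies of the records with directly identifying fields removed."""
    return [
        {key: value for key, value in record.items() if key not in identifying_fields}
        for record in records
    ]

# Made-up sample data, for illustration only.
records = [
    {"name": "A. Smith", "address": "1 Main St", "age": 34, "diagnosis": "flu"},
    {"name": "B. Jones", "address": "2 High St", "age": 51, "diagnosis": "asthma"},
]
sanitized = strip_identifiers(records)
```

The original records are left untouched, so the data owner keeps the full table while only the sanitized copy is handed to the recipient.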
Heuristic-Based Techniques
Several heuristic techniques have been developed for data mining tasks such as classification, association rule discovery and clustering, based on the premise that selective data modification or sanitization is an NP-hard problem; for this reason, heuristics can be used to address the complexity issues.
Reconstruction-Based Techniques
A number of recently proposed techniques address the issue of privacy preservation by perturbing the data and reconstructing the distributions at an aggregate level in order to perform the mining. Some of these techniques are listed and classified. In one of them, a decision tree classifier is built from training data in which the values of individual records have been perturbed. While it is not possible to accurately estimate the original values in individual data records, the authors propose a reconstruction procedure to accurately estimate the distribution of the original data values. By using the reconstructed distributions, they are able to build classifiers.
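The perturb-then-reconstruct idea can be illustrated with a small sketch. The report does not give the reconstruction procedure itself, so the code below is an assumption: it uses uniform additive noise and an iterative Bayesian re-weighting over a fixed set of bins, and all names (perturb, reconstruct_distribution, the bin values and the proportions) are hypothetical.

```python
import random

def perturb(values, noise_width, rng):
    """Add independent uniform noise from [-noise_width, noise_width] to each value."""
    return [v + rng.uniform(-noise_width, noise_width) for v in values]

def reconstruct_distribution(perturbed, bin_centers, noise_width, iterations=100):
    """Estimate the probability of each bin of the original values.

    Starts from a uniform guess and repeatedly re-weights each bin by how
    well it explains the perturbed observations under the known noise model.
    """
    def noise_density(d):
        return 1.0 / (2 * noise_width) if abs(d) <= noise_width else 0.0

    n_bins = len(bin_centers)
    estimate = [1.0 / n_bins] * n_bins
    for _ in range(iterations):
        updated = [0.0] * n_bins
        for w in perturbed:
            weights = [noise_density(w - c) * p for c, p in zip(bin_centers, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue  # observation not explainable by any bin; skip it
            for k, weight in enumerate(weights):
                updated[k] += weight / total
        norm = sum(updated)
        estimate = [u / norm for u in updated]
    return estimate

rng = random.Random(0)
bin_centers = [20.0, 40.0, 60.0]   # hypothetical attribute values (e.g. ages)
true_probs = [0.2, 0.5, 0.3]       # hypothetical original distribution
original = rng.choices(bin_centers, weights=true_probs, k=2000)
perturbed = perturb(original, noise_width=25.0, rng=rng)
estimate = reconstruct_distribution(perturbed, bin_centers, noise_width=25.0)
```

No individual value in perturbed reveals its original, yet estimate should approximate the proportions in true_probs, which is the aggregate information a classifier needs.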
Three Naive Bayes Approaches for Discrimination-Free Classification
The topic of discrimination-aware classification was first introduced in Kamiran and Calders and in Calders et al., and is motivated by the observation that training data often contains unwanted dependencies between the attributes. Given a labeled dataset and a sensitive attribute, e.g., ethnicity, the goal of our research is to learn a classifier for predicting the class label that does not discriminate with respect to the sensitive attribute; e.g., for every ethnic group the probability of being in the positive class should be roughly the same. We call such constraints independency constraints. The paper is about different techniques for learning and adapting Bayesian classifiers to make them discrimination-aware.
Throughout the paper we will assume that a labeled dataset D is given, with a binary class attribute C which takes values {−, +} and one binary sensitive attribute S which takes values {S−, S+} and has an unwanted correlation with the class attribute. The goal is to learn a classifier on this data that optimizes predictive accuracy subject to the condition that its predictions are non-discriminatory. Discrimination in this paper is measured by the discrimination score, which is defined as the difference P(C = + | S+) − P(C = + | S−). We will concentrate on naive Bayes classifiers. We assume that the sensitive attribute is available for training as well as for prediction. Our contributions in the system are as follows:
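As an illustration of the discrimination score defined above, the following sketch estimates P(C = + | S+) − P(C = + | S−) from a list of labeled records. The record layout, the key names "S" and "C", the plain ASCII "-" used for the negative values, and the sample counts are assumptions made for this example only.

```python
def discrimination_score(records, sensitive_key="S", class_key="C"):
    """Estimate P(C = + | S+) - P(C = + | S-) from labeled records."""
    def positive_rate(group):
        members = [r for r in records if r[sensitive_key] == group]
        if not members:
            raise ValueError(f"no records with {sensitive_key} = {group}")
        return sum(r[class_key] == "+" for r in members) / len(members)

    return positive_rate("S+") - positive_rate("S-")

# Made-up data: 6 of 10 S+ records are positive, 3 of 10 S- records are positive.
data = (
    [{"S": "S+", "C": "+"}] * 6 + [{"S": "S+", "C": "-"}] * 4
    + [{"S": "S-", "C": "+"}] * 3 + [{"S": "S-", "C": "-"}] * 7
)
score = discrimination_score(data)  # 0.6 - 0.3
```

A score of zero means both groups receive the positive class at the same rate; the same function can be applied to a classifier's predictions instead of the training labels to check whether the learned model is discrimination-free.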
Discrimination Prevention in Data Mining for Intrusion and Crime Detection
Discrimination can be viewed as the act of unfairly treating people on the basis of their belonging to a specific group. For instance, individuals may be discriminated against because of their race, ideology, gender, etc. In economics and the social sciences, discrimination has been studied for over half a century. There are several decision-making tasks which lend themselves to discrimination, e.g. loan granting and staff selection. In recent decades, anti-discrimination laws have been adopted by many democratic governments. Some examples are the US Equal Pay Act, the UK Sex Discrimination Act, the UK Race Relations Act and the EU Directive 2000/43/EC on Anti-discrimination.
Surprisingly, discrimination discovery in information processing did not receive much attention until 2008, even though information systems are widely used in decision making. Indeed, decision models are created from real data in order to facilitate decisions in a variety of environments, such as medicine
Discrimination Aware Decision Tree Learning
In this paper we consider the case where we plan to use data mining for decision making, but we suspect that our available historical data contains discrimination. Applying the traditional classification techniques on this data will produce biased models. Due to anti-discriminatory laws or simply due to ethical concerns, the straightforward use of classification techniques is not acceptable. The solution is to develop new techniques which we call discrimination-aware: we want to learn a classification model from the potentially biased historical data such that it generates accurate predictions for future decision making, yet does not discriminate with respect to a given discriminatory attribute. The concept of discrimination-aware classification can be illustrated with the following example:
DCUBE: Discrimination Discovery in Databases
Civil rights laws worldwide prohibit discrimination on the basis of race, color, religion, nationality, sex, marital status, age and pregnancy in a number of settings, including: credit and insurance; sale, rental, and financing of housing; personnel selection and wages; access to public accommodations, education, nursing homes, adoptions, and health care. A general principle is to consider group under-representation as a quantitative measure of the qualitative requirement that people in a group are treated "less favorably" than others, or such that "a higher proportion of people without the attribute comply or are able to comply" with a qualifying criterion. With the advent of automatic decision support systems, such as credit scoring systems, the ease of data collection opens several challenges to data analysts in the fight against discrimination. Discrimination discovery in databases consists of the actual discovery of discriminatory situations and practices hidden in a large amount of historical decision records. The process of data analysis must then be supported by tools that implement legally grounded measures and reasoning.
Existing System
In sociology, discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category. It involves denying to members of one group opportunities that are available to other groups. There is a list of antidiscrimination acts, which are laws designed to prevent discrimination on the basis of a number of attributes in various settings. For example, the European Union implements the principle of equal treatment between men and women in the access to and supply of goods and services in matters of employment and occupation. Although there are some laws against discrimination, all of them are reactive, not proactive. Technology can add proactivity to legislation by contributing discrimination discovery and prevention techniques.
Services in the information society allow for automatic and routine collection of large amounts of data. Those data are often used to train association/classification rules with a view to making automated decisions, such as loan granting/denial, insurance premium computation and personnel selection. At first sight, automating decisions may give a sense of fairness: classification rules are not guided by personal preferences. However, on closer inspection, one realizes that classification rules are actually learned by the system from the training data. If the training data are inherently biased for or against a particular community, the learned model may show discriminatory, prejudiced behavior. In other words, the system may infer that just being foreign is a legitimate reason for loan denial. Discovering such potential biases and eliminating them from the training data without harming their decision-making utility is therefore highly desirable. One must prevent data mining from itself becoming a source of discrimination.
Drawbacks of the Existing System
Automated data collection and data mining techniques such as classification rule mining are used to make automated decisions. Discrimination is divided into two types: direct and indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones. Discrimination discovery and prevention are used to meet anti-discrimination requirements. Direct and indirect discrimination prevention can be applied individually or both at the same time. The data values are cleaned to obtain direct and/or indirect discriminatory decision rules. Data transformation techniques are applied to prepare the data values for discrimination prevention. Rule protection and rule generalization algorithms, together with direct and indirect discrimination prevention algorithms, are used to protect against discrimination.
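The distinction between direct and indirect discrimination can be sketched as a check on the premise of a decision rule. This is an illustrative assumption rather than the rule protection or rule generalization algorithm mentioned above: the function name, the proxy_threshold value, the attribute names and the sample records are all hypothetical.

```python
def classify_rule(premise, sensitive_items, records, proxy_threshold=0.8):
    """Label a rule premise as 'direct', 'indirect' or 'none'.

    premise: set of (attribute, value) pairs on the left-hand side of a rule.
    sensitive_items: set of (attribute, value) pairs considered sensitive.
    proxy_threshold: assumed share above which a non-sensitive item is
        treated as strongly correlated with a sensitive one.
    """
    if premise & sensitive_items:
        return "direct"
    for attr, value in premise:
        matching = [r for r in records if r.get(attr) == value]
        if not matching:
            continue
        for s_attr, s_value in sensitive_items:
            share = sum(r.get(s_attr) == s_value for r in matching) / len(matching)
            if share >= proxy_threshold:
                return "indirect"
    return "none"

# Made-up records: most applicants from zip 10451 are foreign.
records = (
    [{"zip": "10451", "foreign": "yes", "job": "clerk"}] * 9
    + [{"zip": "10451", "foreign": "no", "job": "clerk"}] * 1
    + [{"zip": "20002", "foreign": "no", "job": "clerk"}] * 8
    + [{"zip": "20002", "foreign": "yes", "job": "driver"}] * 2
)
sensitive = {("foreign", "yes")}
direct = classify_rule({("foreign", "yes")}, sensitive, records)
indirect = classify_rule({("zip", "10451")}, sensitive, records)
neither = classify_rule({("job", "clerk")}, sensitive, records)
```

Here a rule that denies loans based on the zip code never mentions the sensitive attribute, yet it is flagged as indirect because the zip code acts as a proxy for it in this data.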
SYSTEM IMPLEMENTATION
The discrimination prevention model is integrated with the differential privacy scheme to achieve high privacy. Dynamic policy selection based discrimination prevention is adopted to generalize the system for all regions. The data transformation technique is improved to increase the utility rate. The discrimination removal process is improved with rule hiding techniques. The discrimination prevention system is designed to protect the decisions that are derived from the rule mining process.
The system is enhanced to improve the data utility rate and the privacy preservation rate. A policy selection model is used to perform dynamic policy based discrimination prevention tasks. The system is divided into five major modules: data cleaning, privacy preservation, rule mining, rule hiding and discrimination prevention.
The data cleaning module prepares the data for the mining process. The privacy preservation module protects sensitive attributes. Frequent pattern mining operations are performed in the rule mining module. Sensitive rules are protected by the rule hiding process. The discrimination prevention module performs the direct and indirect discrimination prevention process.
CONCLUSION
The privacy preserving data mining model is designed to protect against discrimination involving sensitive and non-sensitive attributes. The entire system is planned in two phases. The first phase is allocated to the study and analysis of the existing system. The domain knowledge is collected and analyzed at the introductory level. A wide literature survey is conducted to analyze the techniques and concepts that were proposed earlier. The literature survey covers discrimination detection in intrusions, protection against indirect discrimination, classification with discrimination prevention and classification with discrimination removal. All the merits and demerits are analyzed. The existing system and its problems are extracted from the literature survey. The design of the proposed system is prepared to solve the problems in the existing system. Module level development procedures are also finalized in the first phase.
The second phase of the project work is planned with development and implementation activities. The system development, testing, implementation and documentation tasks are scheduled in the second phase. A performance analysis with the existing system is also planned in the second phase.