04-09-2014, 10:45 AM
A Study of Data Perturbation Techniques For Privacy Preserving Data Mining
Data Perturbation.doc (Size: 702 KB / Downloads: 35)
ABSTRACT
In recent years , the data mining techniques have met a serious challenge due to the increased concerning and worries of the privacy , that is, protecting the privacy of the critical and sensitive data. Data perturbation is a popular technique for privacy preserving data mining [13]. The major challenge of data perturbation is balancing privacy protection and data quality, which normally considered as a pair of contradictive factors. Geometric data perturbation technique is a combination of Rotation , Translation and Noise addition perturbation technique. It is especially useful for data owners to publish data while preserving privacy –sensitive information. Typical examples include publishing micro data for research purpose or outsourcing the data to the third party that provides data mining services. In this paper we try to explore the latest trends in Geometric data perturbation technique.
INTRODUCTION
Huge volumes of detailed personal data are regularly collected and analyzed by applications using data mining. Such data include shopping habits, criminal records, medical history, credit records, among others. On the one hand, such data is an important asset to business organizations and governments both to decision making processes and to provide social benifits, such as medical research, crime reduction, national security, etc.[2] The threat to privacy becomes real since data mining techniques are able to derive highly sensitive knowledge from un classified data that is not even known to database holders. Worse is the privacy invasion occasioned by secondary usage of data when individuals are
unaware of “behind the scenes” use of data mining techniques [3]
The challenging problem that: how can we protect against the abuse of the knowledge discovered from secondary usage of data and meet the needs of organizations and governments to support decision making or even to promote social benifits? We claim that a solution for such a problem requires two vital techniques: anonymity to remove identifiers (e.g. names, social insurance numbers, addresses, etc.) in the first phase of privacy protection, and data transformation to protect some sensitive attributes (e.g. salary, age, etc.) since the released data, after removing identifiers, may contain other information that can be linked with other datasets to re-identify individuals or entities [4].
Need For Privacy in Data Mining
Information is today probably the most important and demanded resource. We live in an internetworked society that relies on the dissemination and sharing of information in the private as well as in the public and governmentalsectors. Governmental, public, and private institutions are increasingly required to make their data electronically available[5][6].To protect the privacy of the respondents (individuals, organizations, associations, business establishments, and so on).Although apparently anonymous, the de-identified data may contain other data, such as race, birth date, sex, and ZIP code, which uniquely or almost uniquely pertain to specific respondents (i.e., entities to which data refer) and make them stand out from others[7].By linking these identifying characteristics to publicly available databases associating these characteristics to the respondent's identity, the data recipients can determine to which respondent each piece of released data belongs, or restrict their uncertainty to a specific subset of individuals
DATA PERTURBATION
The methods based on the data-perturbation approach fall into two main categories, which we call the probability distribution category and the fixed data perturbation category [8]. The probability distribution category considers the database to be a sample from a given population that has a given probability distribution. In this case, the security control method replaces the original data by another sample from the same distribution or by the distribution itself. In the fixed data perturbation category, the values of the attributes in the database, which are to be used for computing statistics, are perturbed once and for all. The fixed data perturbation methods have been developed exclusively for either numerical data or categorical data [9].
Within the probability distribution category, two methods can be identified. The first is called “data swap- ping” or “multidimensional transformation” In this method, the original database is replaced with a randomly generated database having approximately the same probability distribution as the original database [10]. As long as a new entity is added or a current entity is deleted, the relationship between this entity and the rest of the database has to be taken into consideration when computing a new perturbation. There is a need for a one-to-one mapping between the original database and the perturbed database. The precision resulting from this method may be considered unacceptable since the method may in some cases have an error of up to 50%. The second is called probability distribution method. The method consists of three steps: (1) Identify the underlying density function of the attribute values and estimate the parameters of this function. (2) Generate a sample series of data from the estimated density function of the confidential
DIFFERENT METHODS OF DATA PERTURBATION
Noise Additive Perturbation
The typical additive perturbation technique[13] is column-based additive randomization. This type of techniques relies on the facts that 1) Data owners may not want to equally protect all values in a record, thus a column-based value distortion can be applied to perturb some sensitive columns. 2) Data classification models to be used do not necessarily require the individual records, but only the column value distributions with the assumption of independent columns. The basic method is to disguise the original values by injecting certain amount of additive random noise, while the specific information, such as the column distribution, can still be effectively reconstructed from the perturbed data.
Condensation-based Perturbation
The condensation approach is a typical multi-dimensional perturbation technique, which aims at preserving the covariance matrix for multiple columns. Thus, some geometric properties such as the shape of decision boundary are well preserved. Different from the randomization approach, it perturbs multiple columns as a whole to generate the entire “perturbed dataset”. As the perturbed dataset preserves the covariance matrix, many existing data mining algorithms can be applied directly to the perturbed dataset without requiring any change or new development of algorithms.
Random Projection Perturbation
Random projection perturbation (Liu, Kargupta and Ryan, 2006) refers to the technique of projecting a set of data points from the original multidimensional space to another randomly chosen space. Let Pk×d be a random projection matrix, where
P’s rows are orthonormal [14
Geometric data perturbation
Def: Geometric data perturbation consists of a sequence of random geometric transformations, including multiplicative transformation ®, translation transformation (Ψ), and distance perturbation ∆.
G(X) = RX + Ψ + ∆ [15]
The data is assumed to be a matrix Apxq, where each of the p rows is an observation, Oi, and each observation contains values for each of the q attributes, Ai. The matrix may contain categorical and numerical attributes. However, our Geometric Data Transformation Methods rely on d numerical attributes, such that d <= q. Thus, the p x d matrix, which is subject to transformation, can be thought of as a vector subspace V in the Euclidean space such that each vector vi€ V is the form vi = (a1; :::; ad),1 <=i<= d, where ai is one instance of Ai, ai€ R, and R is the set of real numbers. The vector subspace V must be transformed before releasing the data for clustering analysis in order to preserve privacy of individual data records. To transform V into a distorted vector subspace V’, we need to add or even multiply a constant noise term e to each element vi of V [15].
Conclusions
We present a random geometric perturbation approach to privacy preserving data classification. Random geometric perturbation, G(X) = RX + Ψ + ∆, includes the linear combination of the three components: rotation perturbation, translation perturbation, and distance perturbation. Geometric perturbation can preserve the important geometric properties, thus most data mining models that search for geometric class boundaries are well preserved with the perturbed data. [18]
Geometric perturbation perturbs multiple columns in one transformation, which introduces new challenges in evaluating the privacy guarantee for multi-dimensional perturbation. We propose a multi-column privacy evaluation model and design unified privacy metric to address these problems. [18]