03-12-2012, 05:42 PM
Privacy Preserving Data Mining
Difference between security and privacy
Data security, according to the common definition, is the “confidentiality, integrity and availability” of data.
Privacy, on the other hand, is the appropriate use of information.
Data Mining
Data mining is a recently emerging field connecting the three worlds of databases, artificial intelligence and statistics.
The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if “meaningful information” or “knowledge” cannot be extracted from it.
Data mining, otherwise known as knowledge discovery, attempts to answer this need.
Privacy Preserving Data Mining
Privacy preserving data mining has become increasingly popular because it allows privacy-sensitive data to be shared for analysis purposes. The need arises because people have become increasingly unwilling to share their data, frequently resulting in individuals either refusing to share it or providing incorrect data.
In recent years, privacy preserving data mining has been studied extensively, because of the wide proliferation of sensitive information on the internet.
The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms to leverage this information.
Method of anonymization
When releasing microdata for research purposes, one needs to limit disclosure risks to an acceptable level while maximizing data utility.
To limit disclosure risk, Samarati and Sweeney introduced the k-anonymity privacy requirement, which requires each record in an anonymized table to be indistinguishable from at least k-1 other records within the dataset with respect to a set of quasi-identifier attributes.
To achieve the k-anonymity requirement, they used both generalization and suppression for data anonymization.
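The following is a minimal sketch of generalization and suppression, not the original authors' algorithm. It uses the common formulation in which every quasi-identifier combination must be shared by at least k records. The table layout (age, ZIP code, disease), the sample rows, and the helper names `generalize_age`, `generalize_zip`, `anonymize` and `is_k_anonymous` are all assumptions made for illustration.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Generalization: keep the first `keep` digits and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def anonymize(records, k):
    """Generalize the quasi-identifiers, then suppress groups smaller than k."""
    generalized = [
        (generalize_age(age), generalize_zip(zip_code), disease)
        for age, zip_code, disease in records
    ]
    group_sizes = Counter((age, zip_code) for age, zip_code, _ in generalized)
    # Suppression: drop every record whose group is still too small.
    return [r for r in generalized if group_sizes[(r[0], r[1])] >= k]

def is_k_anonymous(table, k):
    """True if every quasi-identifier combination occurs at least k times."""
    group_sizes = Counter((age, zip_code) for age, zip_code, _ in table)
    return all(size >= k for size in group_sizes.values())

# Hypothetical microdata: (age, ZIP code, sensitive attribute).
records = [
    (31, "47677", "flu"),
    (35, "47602", "cancer"),
    (38, "47678", "flu"),
    (42, "47905", "asthma"),
    (47, "47909", "flu"),
    (63, "47302", "cancer"),  # alone in its group after generalization
]

released = anonymize(records, k=2)
for row in released:
    print(row)
print("2-anonymous:", is_k_anonymous(released, 2))
```

Here age and ZIP code play the role of quasi-identifiers and disease is the sensitive attribute. The last record is suppressed because no other record shares its generalized age range and ZIP prefix.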
ANONYMIZATION TECHNIQUE
Merits:
This method protects respondents' identities while releasing truthful information.
Demerits:
While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure. Two attacks exploit this: the homogeneity attack, where all records in a group share the same sensitive value, and the background knowledge attack, where the attacker uses outside knowledge to rule out candidate values. The limitations of the k-anonymity model stem from two assumptions. First, it assumes that the owner of a database can determine which attributes are or are not available in external tables, which may be very hard in practice.
Second, the k-anonymity model assumes a certain method of attack, while in real scenarios there is no reason why the attacker should not try other methods.
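To make the homogeneity attack concrete, the sketch below scans a table that is already 3-anonymous and reports every group whose records all carry the same sensitive value. The table contents and the function name `homogeneous_groups` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def homogeneous_groups(table):
    """Return {quasi-identifier group: sensitive value} for groups with no diversity.

    Each row is (quasi-identifier..., sensitive value); the last field is sensitive.
    """
    values_per_group = defaultdict(set)
    for *quasi, sensitive in table:
        values_per_group[tuple(quasi)].add(sensitive)
    return {
        group: next(iter(values))
        for group, values in values_per_group.items()
        if len(values) == 1
    }

# Hypothetical 3-anonymous release: (age range, ZIP prefix, disease).
table = [
    ("30-39", "476**", "flu"),
    ("30-39", "476**", "cancer"),
    ("30-39", "476**", "flu"),
    ("40-49", "479**", "cancer"),
    ("40-49", "479**", "cancer"),
    ("40-49", "479**", "cancer"),
]

leaks = homogeneous_groups(table)
print(leaks)
```

Anyone known to be in their forties and living in a 479** ZIP code is revealed to have cancer, even though no single record can be tied to that person.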
Perturbation approach
The perturbation approach works under the requirement that the data service is not allowed to learn or recover precise records. This restriction naturally leads to some challenges. Since the method reconstructs only distributions and not the original data values, new algorithms need to be developed which use these reconstructed distributions to perform mining of the underlying data.
This means that for each individual data mining problem, such as classification, clustering, or association rule mining, a new distribution-based data mining algorithm needs to be developed.
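As a rough illustration of working with distributions rather than records, the sketch below adds independent Gaussian noise to each value and then recovers only the mean and variance of the original data from the released values. This is a simplified stand-in for full distribution reconstruction, assuming additive zero-mean noise with a publicly known standard deviation; the names `perturb` and `estimate_original_stats` and the synthetic age data are assumptions.

```python
import random
import statistics

def perturb(values, noise_sd, rng):
    """Release each value with independent zero-mean Gaussian noise added."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]

def estimate_original_stats(perturbed, noise_sd):
    """Estimate mean and variance of the original data from perturbed values.

    With independent zero-mean noise, mean(Z) = mean(X) and
    var(Z) = var(X) + noise_sd**2, so the noise variance is subtracted.
    """
    mean = statistics.fmean(perturbed)
    variance = max(statistics.pvariance(perturbed) - noise_sd ** 2, 0.0)
    return mean, variance

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.gauss(40.0, 10.0) for _ in range(20000)]  # synthetic private data
noise_sd = 15.0  # assumed to be public knowledge

released = perturb(ages, noise_sd, rng)
est_mean, est_var = estimate_original_stats(released, noise_sd)

print("true mean / estimated mean:", statistics.fmean(ages), est_mean)
print("true variance / estimated variance:", statistics.pvariance(ages), est_var)
```

Individual released values differ substantially from the originals, while the aggregate statistics stay close. A mining algorithm built on this release would have to work from such aggregate estimates, which is the challenge described above.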
Condensation approach
The condensation approach constructs constrained clusters in the data set and then generates pseudo-data from the statistics of these clusters. The technique is called condensation because it uses condensed statistics of the clusters to generate pseudo-data.
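A simplified sketch of the idea follows. It greedily forms groups of at least k nearby records, keeps only each group's per-attribute mean and standard deviation, and samples pseudo-records from those statistics. This is an assumption-laden simplification: a fuller method would also retain correlations between attributes, and the greedy grouping, the Gaussian sampling, and the names `condense` and `generate_pseudo_data` are choices made here for illustration.

```python
import random
import statistics

def condense(records, k):
    """Form groups of at least k nearby records and keep only group statistics."""
    remaining = list(records)
    groups = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Sort the rest by squared Euclidean distance to the seed record.
        remaining.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(r, seed)))
        groups.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    if not groups:
        raise ValueError("need at least k records")
    groups[-1].extend(remaining)  # leftovers join the last group

    stats = []
    for members in groups:
        columns = list(zip(*members))
        stats.append({
            "count": len(members),
            "mean": [statistics.fmean(c) for c in columns],
            "stdev": [statistics.pstdev(c) for c in columns],
        })
    return stats

def generate_pseudo_data(stats, rng):
    """Sample as many pseudo-records per group as the group originally held."""
    pseudo = []
    for group in stats:
        for _ in range(group["count"]):
            pseudo.append(tuple(
                rng.gauss(m, s) for m, s in zip(group["mean"], group["stdev"])
            ))
    return pseudo

# Hypothetical records: (age, income).
records = [
    (25, 30000), (27, 32000), (26, 31000),
    (52, 80000), (55, 85000), (50, 79000), (40, 50000),
]

stats = condense(records, k=3)
pseudo = generate_pseudo_data(stats, random.Random(1))
print([g["count"] for g in stats])
print(pseudo[0])
```

Only the condensed statistics are needed to produce the pseudo-data, so the original records never have to be released. Because the output has the same format as the input, existing mining algorithms can run on it unchanged.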
Distributed Privacy Preserving Data Mining
The key goal in most distributed methods for privacy-preserving data mining (PPDM) is to allow computation of useful aggregate statistics over the entire data set without compromising the privacy of the individual data sets held by the different participants. Thus, the participants may wish to collaborate in obtaining aggregate results, but may not fully trust each other in terms of the distribution of their own data sets.
For this purpose, the data sets may be either horizontally partitioned or vertically partitioned. In horizontally partitioned data sets, the individual records are spread out across multiple entities, each of which has the same set of attributes. In vertical partitioning, the individual entities may have different attributes (or views) of the same set of records.
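One common building block for the horizontally partitioned case is a secure sum, sketched below with additive secret sharing. The text above does not name a specific protocol, so this choice is an assumption, as are the honest-but-curious participants, the public `MODULUS`, the function names, and the hospital example. All parties are simulated in a single process.

```python
import random

MODULUS = 2 ** 61 - 1  # assumed public modulus, larger than any possible total

def make_shares(secret, n_parties, rng):
    """Split `secret` into n random shares that add up to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def secure_sum(local_values, rng):
    """Compute the total of all local values without revealing any single one."""
    n = len(local_values)
    # shares_sent[i][j] is the share that party i sends to party j.
    shares_sent = [make_shares(value, n, rng) for value in local_values]
    # Each party j adds up the shares it received and publishes only that sum.
    partial_sums = [
        sum(shares_sent[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS

# Seeded generator for repeatability; real use would need a
# cryptographically secure source such as the `secrets` module.
rng = random.Random(42)

# Hypothetical: three hospitals each hold a private count of matching patients.
hospital_counts = [120, 45, 310]
total = secure_sum(hospital_counts, rng)
print(total)
```

Each published partial sum looks random on its own, yet together they add up to the true total, so the participants learn the aggregate without seeing one another's counts.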