16-07-2014, 02:53 PM
A New Approach for Privacy Preserving Data Publishing
A New Approach for Privacy.docx (Size: 1.11 MB / Downloads: 10)
INTRODUCTION
In privacy preserving publishing of microdata, microdata contains records each of which contains individual information about an entity such as a person, an organization, a household. Several microdata anonymisation techniques have been proposed. The most popular techniques are:
• Generalization
• Bucketization
In both approach attributes are partitioned into three catagories:
Some attributes are identifiers that can uniquely identify individual such as SSN or name.
Some attributes are quasi-identifiers , which the adversary may already known, by using these identifiers we can potentially identify an individual e.g., Birthdate, Gender , Zipcode.
Some attributes are sensitive attributes which are unknown to the adversary such as Salary, Disease.
IN both generalization and bucketization ,first removes identifiers from the data and then partition the tuples into buckets.In the second step generalization transforms the QI-values in each bucket into “less specific but semantically consistent” values so that tuples in the same bucket cannot be distinguished by their QI values. In bucketization , one separates the SAs from the QIs by randomly arranging the SA values in each bucket.
Motivation of Slicing
It has been shown that generalization for k-anonymity loses certain amount of information ,especially for high dimensional data, for the following reasons.
1. Generalization for k-anonymity suffers from the curse of dimensionality. In order for generalization to be effective ,records in the same bucket must be close to each other so that record would not lose too much information.
2. In order to perform data analysis or data mining tasks on the generalized table, the data analyst has to make assumption that every value in the generalized set equally possible. This significantly reduces the data utility of the generalized data.
3. Because each attribute is generalized separately , correlation between different attributes are lost. This is an inherent problem of generalization that prevents effective analysis of attribute correlations.
While bucketization has better data utility than generalization. It has following limitations.
1. Bucketization does not prevent membership disclosure because bucketization publishes their QI values in their original form, so it is possible for the adversary to find out whether an individual has a record in the published data or not.
2. Bucketization requires a clear separation between QIs and SAs. In many data sets it is unclear that which are QIs and which are SAs.
3. By separating the sensitive attributes from the QI attributes ,bucketization breaks the attribute correlations between the QIs and SAs.
So we introduce a new data anonymization technique called slicing. Slicing partition the data horizontally and vertically. Vertical partition is done by grouping attributes into columns based on the correlations among the attributes. Each column contains a subset of the attribute that are highly correlated . Horizontal partition is done by grouping tuples into buckets, within
The basic idea of slicing is to break the connection cross columns, but to preserve the association within each column. This reduces the dimensionality of the data and preserves better utility than generalization and bucketization because it groups highly correlated attributes together, and preserves the correlation between such attributes. Slicing protects privacy because it breaks the associations between uncorrelated attributes, which are infrequent and thus identifying.
Formalization of Slicing
Let T be the microdata table to be published. T contains d attributes: A={A1,A2,……..Ad} and their attribute domains are {D[A1], D[A2],……….D[Ad]}. A tuple t<-T cab be represented as t=(t[A1],t[A2],…….t[Ad]) where t[Ai] is the Ai value of t.
Tuple partition and buckets
A tuple partition consists of several subsets of T, such that each tuple belongs to exactly one subset. Each subset of tuples is called a bucket. Especially let there be b buckets B1,B2,…..Bb then Bi1 and Bi2 should not be empty
Column Generalization
Given a microdata table T and a column Ci={Ai1,Ai2,………Aij} where Ai1,Ai2,….Aij are attributes, a column generalization for Ci is defined as a set of nonoverlapping j-dimensional regions that completely covers D[Ai1]xD[Ai2]x…..xD[Aij].
Comparison with generalization
There are several recording types for generalization. The most common is local recording, it preserves more information. In loca recording , first groups tuples into buckets and then for each bucket , one replaces all values of one attribute with a generalized value. Such a recording is local because the same attribute value may be generalized differently when they appear in different buckets.
Slicing works better than local recording approach. Rather than using a generalized value to replace more specific attribute values, one uses the multi-set of exact values in each bucket. Table 1d is the result of using multi-sets of exact values rather than the generalized values. For example Age attribute of the first bucket, we use multi-set of exact values {22,22,33,52} rather than [22-52]. The multi-set of exact values provides more information than generalized interval.
Multi-set based generalization is equivalent to a one-attribute-per-column slicing because both approaches preserve the exact values in each attribute. For example Table 1e is equivalent to Table 1d. By comparing sliced Table 1e with 1f, we observed that ,ono-attribute-per-column slicing preserves distributional information, it does not preserve attribute correlation ,because each attribute is in its own column. In slicing, one groups correlated attributes together in one column and preserve their correlation. Example correlation between Age and Sex and correlation between Zipcode and Disease is preserved in Table 1f.
Comparison with Bucketization
Bucketization is a special case of slicing ,where there are exactly two columns: one column contains only SA and other contains all the QIs. The advantage of slicing over bucketization are as follows: First, by partitioning attribute into more than two columns, slicing can be used to prevent membership disclosure, bucketization does not prevent membership disclosure.
Second, unlike bucketization ,which requires a clear separation between QIs and SAs, slicing can be used without such separation. For data sets such as microdata , one often cannot clearly separate QIs from SAs because there is no single external public database that one can used to determine which attributes the adversary already known. For such data slicing can be used
Attribute Partitioning
Our algorithm partitions attributes so that highly correlated attributes are in the same column. This is good for both utility and privacy. In terms of data utility, grouping highly correlated attributes preserves the correlation among those attributes.in terms of privacy, the association of uncorrelated attributes presents higher identification risks than the association of highly correlated attributes because the association of uncorrelated attribute values is much less frequent and thus identifiable. Therefore it is better to break the associations between uncorrelated attributes
MEMBERSHIP DISCLOSURE PROTECTION
In this section, we analyze how slicing can provide membership disclosure protection.
Bucketization. Bucketization releases each tuple’s combination of QI values in their original form and most individuals can be uniquely identified using the QI values, the adversary can determine the membership of an individual in the original data by examining whether the individual’s combination of QI values occurs in the released data.
Slicing. Slicing offers protection against membership disclosure because QI attributes are partitioned into different columns and correlations among different columns within each bucket are broken. Consider the sliced table in Table 1f. The table has two columns. The first bucket is resulted from four tuples; we call them the original tuples. The bucket matches all together 4^2=16 tuples, 4 original tuples and 12 that do not appear in the original table. We call these 12 tuples fake tuples. Given any tuple, if it has no matching bucket in the sliced table, then we know for sure that the tuple is not in the original table. However, even if a tuple has one or more matching bucket, one cannot tell whether the tuple is in the original table, because it could be a fake tuple.
We propose two quantitative measures for the degree of membership protection offered by slicing. The first is the fake-original ratio (FOR), which is defined as the number of
fake tuples divided by the number of original tuples. Intuitively, the larger the FOR, the more membership protection is provided. A sliced bucket of size k can potentially match kc tuples, including k original tuples k^c- k fake tuples; hence, the FOR is k^(c-1). When one
has chosen a minimal threshold for the FOR, one can choose k and c appropriately to satisfy the threshold.
The second measure is to consider the number of matching buckets for original tuples and that for fake tuples. If they are similar enough, membership information is protected because the adversary cannot distinguish original tuples from fake tuples.
CONCLUSION AND FUTURE WORK
Slicing is a new approach for privacy preserving data publishing. Slicing overcomes the
limitations of generalization and bucketization and preserves better utility while protecting against privacy threats. We use slicing to prevent attribute disclosure and membership disclosure. Our experiments show that slicing preserves better data utility than generalization and is more effective than bucketization in workloads involving the sensitive attribute.
This work motivates several directions for future research. we consider slicing where each attribute is in exactly one column. An extension is the notion of overlapping slicing, which duplicates an attribute in more than one columns. This releases more attribute correlations. For
example, in Table 1f, one could choose to include the Disease attribute also in the first column. That is, the two columns are fAge; Sex; Diseaseg and fZipcode; Diseaseg. This could provide better data utility, but the privacy implications need to be carefully studied and understood. It is interesting to study the trade-off between privacy and utility.
Slicing is a promising technique for handling high-dimensional data. By partitioning attributes into columns, we protect privacy by breaking the association of uncorrelated attributes and preserve data utility by preserving the association between highly correlated attributes. For example, slicing can be used for anonymizing transaction databases.