26-08-2014, 02:36 PM
Slicing: A New Approach for Privacy Preserving Data Publishing Project Report
Abstract
Several anonymization techniques, such as generalization and bucketization, have been designed for privacy-preserving
microdata publishing. Recent work has shown that generalization loses a considerable amount of information, especially for
high-dimensional data. Bucketization, on the other hand, does not prevent membership disclosure and does not apply to data that lack
a clear separation between quasi-identifying attributes and sensitive attributes. In this paper, we present a novel technique called
slicing, which partitions the data both horizontally and vertically. We show that slicing preserves better data utility than generalization
and can be used for membership disclosure protection. Another important advantage of slicing is that it can handle high-dimensional
data. We show how slicing can be used for attribute disclosure protection and develop an efficient algorithm for computing the sliced
data that obey the ℓ-diversity requirement. Our workload experiments confirm that slicing preserves better utility than generalization
and is more effective than bucketization in workloads involving the sensitive attribute. Our experiments also demonstrate that slicing
can be used to prevent membership disclosure.
INTRODUCTION
PRIVACY-PRESERVING publishing of microdata has been
studied extensively in recent years. Microdata consists of
records, each of which contains information about an
individual entity, such as a person, a household, or an
organization. Several microdata anonymization techniques
have been proposed. The most popular ones are generalization
[28], [30] for k-anonymity [30] and bucketization
[34], [26], [17] for ℓ-diversity [25]. In both approaches,
attributes are partitioned into three categories: 1) some
attributes are identifiers that can uniquely identify an
individual, such as Name or Social Security Number; 2) some
attributes are Quasi Identifiers (QI), which the adversary may
already know (possibly from other publicly available
databases) and which, when taken together, can potentially
identify an individual, e.g., Birthdate, Sex, and Zipcode;
3) some attributes are Sensitive Attributes (SAs), which are
unknown to the adversary and are considered sensitive,
such as Disease and Salary.
In both generalization and bucketization, one first
removes identifiers from the data and then partitions tuples
into buckets. The two techniques differ in the next step.
Generalization transforms the QI-values in each bucket into
“less specific but semantically consistent” values so that
tuples in the same bucket cannot be distinguished by their
QI values. In bucketization, one separates the SAs from the
QIs by randomly permuting the SA values in each bucket.
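The two steps above can be sketched in code. The toy table, attribute names, and bucket assignment below are invented for illustration; they are not the paper's tables.

```python
import random

# Toy microdata: (Age, Zipcode) are QIs, Disease is the SA.
tuples = [
    (22, "47906", "dyspepsia"),
    (22, "47906", "flu"),
    (33, "47905", "flu"),
    (52, "47905", "bronchitis"),
]
buckets = [tuples[:2], tuples[2:]]  # assume tuples are already partitioned

def generalize(bucket):
    """Generalization: replace QI values with bucket-wide, less specific
    values so tuples in the same bucket cannot be distinguished."""
    ages = [t[0] for t in bucket]
    age_range = f"[{min(ages)}-{max(ages)}]"
    return [(age_range, t[1][:3] + "**", t[2]) for t in bucket]

def bucketize(bucket, rng):
    """Bucketization: keep exact QI values but randomly permute the SA
    column within the bucket, severing the QI-SA link."""
    sas = [t[2] for t in bucket]
    rng.shuffle(sas)
    return [(t[0], t[1], sa) for t, sa in zip(bucket, sas)]

rng = random.Random(0)
generalized = [generalize(b) for b in buckets]
bucketized = [bucketize(b, rng) for b in buckets]
```

Note that bucketization leaves the multiset of SA values per bucket unchanged; only the association between a specific tuple and its SA value is hidden.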
Contributions & Organization
In this paper, we present a novel technique called slicing for
privacy-preserving data publishing. Our contributions
include the following.
First, we introduce slicing as a new technique for privacy
preserving data publishing. Slicing has several advantages
when compared with generalization and bucketization. It
preserves better data utility than generalization. It preserves
more attribute correlations with the SAs than bucketization.
It can also handle high-dimensional data and data without a
clear separation of QIs and SAs.
Second, we show that slicing can be effectively used for
preventing attribute disclosure, based on the privacy
requirement of ℓ-diversity. We introduce a notion called ℓ-
diverse slicing, which ensures that the adversary cannot
learn the sensitive value of any individual with a probability
greater than 1/ℓ.
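Under the probabilistic reading of ℓ-diversity used here, a bucket is safe when no single sensitive value makes up more than a 1/ℓ fraction of it. A minimal sketch of that check (function name and sample values are invented):

```python
from collections import Counter

def satisfies_l_diversity(sa_values, l):
    """True if no single sensitive value accounts for more than a 1/l
    fraction of the bucket, so an adversary who pins a tuple to this
    bucket infers its SA value with probability at most 1/l."""
    counts = Counter(sa_values)
    return max(counts.values()) <= len(sa_values) / l

bucket_ok = ["flu", "flu", "bronchitis", "dyspepsia"]  # flu: 2/4 <= 1/2
bucket_bad = ["flu", "flu", "flu", "dyspepsia"]        # flu: 3/4 > 1/2
```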
Comparison with Generalization
There are several types of recodings for generalization. The
recoding that preserves the most information is local
recoding [36]. In local recoding, one first groups tuples into
buckets and then for each bucket, one replaces all values of
one attribute with a generalized value. Such a recoding is
local because the same attribute value may be generalized
differently when it appears in different buckets.
We now show that slicing preserves more information
than such a local recoding approach, assuming that the
same tuple partition is used. We achieve this by showing
that slicing is better than the following enhancement of the
local recoding approach. Rather than using a generalized
value to replace more specific attribute values, one uses
the multiset of exact values in each bucket. For example,
Table 1b is a generalized table, and Table 1d is the result
of using multisets of exact values rather than generalized
values. For the Age attribute of the first bucket, we use the
multiset of exact values {22, 22, 33, 52} rather than the
generalized interval [22-52]. The multiset of exact values
provides more information about the distribution of values
in each attribute than the generalized interval. Therefore,
using multisets of exact values preserves more information
than generalization.
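The Age example from the text can be made concrete. Using the first bucket's values, the interval only bounds the values, while the multiset keeps the full empirical distribution (e.g. that 22 occurs twice):

```python
from collections import Counter

# Age values of the first bucket, from the example in the text.
ages = [22, 22, 33, 52]

# Local-recoding generalization publishes one interval per bucket.
generalized = f"[{min(ages)}-{max(ages)}]"

# The enhanced recoding publishes the multiset of exact values instead,
# preserving the distribution of values within the bucket.
multiset = Counter(ages)
```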
Privacy Threats
When publishing microdata, there are three types of
privacy disclosure threats. The first type is membership
disclosure. When the data set to be published is selected
from a large population and the selection criteria are
sensitive (e.g., only diabetes patients are selected), one
needs to prevent adversaries from learning whether one’s
record is included in the published data set.
The second type is identity disclosure, which occurs when
an individual is linked to a particular record in the released
table. In some situations, one wants to protect against
identity disclosure when the adversary is uncertain of
membership. In this case, protection against membership
disclosure helps protect against identity disclosure. In other
situations, some adversary may already know that an
individual’s record is in the published data set, in which
case, membership disclosure protection either does not
apply or is insufficient.
ℓ-Diverse Slicing
In the above example, tuple t1 has only one matching
bucket. In general, a tuple t can have multiple matching
buckets. We now extend the above analysis to the general
case and introduce the notion of ℓ-diverse slicing.
Consider an adversary who knows all the QI values of t
and attempts to infer t’s sensitive value from the sliced
table. She or he first needs to determine which buckets t
may reside in, i.e., the set of matching buckets of t. Tuple t
can be in any one of its matching buckets. Let p(t, B) be the
probability that t is in bucket B (the procedure for
computing p(t, B) will be described later in this section).
For example, in the above example, p(t1, B1) = 1 and
p(t1, B2) = 0.
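A sketch of finding matching buckets follows. The bucket contents are invented, and the probability computation below makes a deliberate simplification (uniform over matching buckets) where the paper derives p(t, B) more precisely:

```python
# Each sliced bucket is a list of columns; each column is the set of
# value combinations published for that column's attributes.
# These tables are illustrative, not the paper's exact examples.
B1 = [{(22, "M"), (22, "F")}, {("47906", "dyspepsia"), ("47906", "flu")}]
B2 = [{(33, "F"), (52, "M")}, {("47905", "flu"), ("47905", "bronchitis")}]
buckets = [B1, B2]

def matching_buckets(t_cols, buckets):
    """Buckets in which every column contains t's value combination."""
    return [B for B in buckets
            if all(v in col for v, col in zip(t_cols, B))]

def p_t_in(t_cols, B, buckets):
    """Simplifying assumption: t is equally likely to be in each of its
    matching buckets (the paper computes p(t, B) more carefully)."""
    m = matching_buckets(t_cols, buckets)
    return 1.0 / len(m) if B in m else 0.0

t1 = [(22, "M"), ("47906", "flu")]  # t1's per-column value combinations
```

With these tables, t1 matches only B1, mirroring p(t1, B1) = 1 and p(t1, B2) = 0 from the text.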
SLICING ALGORITHMS
We now present an efficient slicing algorithm to achieve
ℓ-diverse slicing. Given a microdata table T and two
parameters c and ℓ, the algorithm computes the sliced table
that consists of c columns and satisfies the privacy
requirement of ℓ-diversity.
Our algorithm consists of three phases: attribute partitioning,
column generalization, and tuple partitioning. We now
describe the three phases.
Attribute Partitioning
Our algorithm partitions attributes so that highly correlated
attributes are in the same column. This is good for both
utility and privacy. In terms of data utility, grouping highly
correlated attributes preserves the correlations among those
attributes. In terms of privacy, the association of uncorrelated
attributes presents higher identification risks than the
association of highly correlated attributes because the
association of uncorrelated attribute values is much less
frequent and thus more identifiable. Therefore, it is better to
break the associations between uncorrelated attributes, in
order to protect privacy.
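One concrete way to measure how correlated two categorical attributes are is the chi-square-based mean-square contingency coefficient, which lies in [0, 1]. The sketch below computes it for invented columns and picks the most correlated pair to place in the same column; the data and greedy pairing are illustrative, not the paper's full partitioning procedure:

```python
from collections import Counter
from itertools import combinations

def mean_square_contingency(col_a, col_b):
    """Mean-square contingency coefficient (in [0, 1]) between two
    categorical columns: chi-square statistic normalized by
    n * (min(domain sizes) - 1)."""
    n = len(col_a)
    ca, cb = Counter(col_a), Counter(col_b)
    cab = Counter(zip(col_a, col_b))
    chi2 = 0.0
    for a, na in ca.items():
        for b, nb in cb.items():
            expected = na * nb / n
            observed = cab.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / (n * (min(len(ca), len(cb)) - 1))

# Hypothetical columns: Sex and Zip move together, Age does not.
cols = {
    "Sex": ["M", "M", "F", "F"],
    "Zip": ["47906", "47906", "47905", "47905"],
    "Age": ["22", "33", "22", "52"],
}
scores = {pair: mean_square_contingency(cols[pair[0]], cols[pair[1]])
          for pair in combinations(cols, 2)}
best_pair = max(scores, key=scores.get)  # grouped into the same column
```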
Column Generalization
In the second phase, tuples are generalized to satisfy some
minimal frequency requirement. We want to point out that
column generalization is not an indispensable phase in our
algorithm. As shown by Xiao and Tao [34], bucketization
provides the same level of privacy protection as generalization,
with respect to attribute disclosure.
Although column generalization is not a required phase,
it can be useful in several aspects. First, column generalization
may be required for identity/membership disclosure
protection. If a column value is unique in a column
(i.e., the column value appears only once in the column), a
tuple with this unique column value can only have one
matching bucket.
THE NETFLIX PRIZE DATA SET
We emphasized the applicability of slicing on high-dimensional
transactional databases. In this section, we experimentally
evaluate the performance of slicing on the Netflix
Prize data set, which contains 100,480,507 ratings of
17,770 movies contributed by 480,189 Netflix subscribers.
Each rating has the following format: (userID, movieID,
rating, date), where rating is an integer in {0, 1, 2, 3, 4, 5}
with 0 being the lowest rating and 5 being the highest
rating.
To study the impact of the number of movies and the
number of users on the performance, we choose a subset of
Netflix Prize data as the training data and vary the number
of movies and the number of users. Specifically, we choose
the first nMovies movies, and from all users that have rated
at least one of the nMovies movies, we randomly choose a
fraction fUsers of users. We evaluate the performance of our
approach on this subset of Netflix Prize data.
Methodology. We use the standard SVD-based prediction
method. As in the Netflix Prize, prediction accuracy is
measured as the root-mean-square error (RMSE).
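The RMSE measure itself is straightforward; a minimal implementation (the sample predictions are made up, and the SVD predictor itself is not shown):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error, the Netflix Prize accuracy measure:
    sqrt of the mean squared difference between predicted and
    actual ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

predicted = [3.8, 2.9, 4.1]  # hypothetical SVD outputs
actual = [4, 3, 4]
error = rmse(predicted, actual)
```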