21-12-2012, 05:51 PM
Slicing: A New Approach to Privacy Preserving Data Publishing
ABSTRACT
Several anonymization techniques, such as generalization
and bucketization, have been designed for privacy preserving
microdata publishing. Recent work has shown that generalization
loses a considerable amount of information, especially
for high-dimensional data. Bucketization, on the other hand,
does not prevent membership disclosure and does not apply
to data that do not have a clear separation between quasi-
identifying attributes and sensitive attributes.
In this paper, we present a novel technique called slicing,
which partitions the data both horizontally and vertically.
We show that slicing preserves better data utility than gen-
eralization and can be used for membership disclosure pro-
tection. Another important advantage of slicing is that it
can handle high-dimensional data. We show how slicing can
be used for attribute disclosure protection and develop an ef-
ficient algorithm for computing the sliced data that obey the
ℓ-diversity requirement. Our workload experiments confirm
that slicing preserves better utility than generalization and
is more effective than bucketization in workloads involving
the sensitive attribute. Our experiments also demonstrate
that slicing can be used to prevent membership disclosure.
INTRODUCTION
Privacy-preserving publishing of microdata has been stud-
ied extensively in recent years. Microdata contains records
each of which contains information about an individual en-
tity, such as a person, a household, or an organization.
Several microdata anonymization techniques have been pro-
posed. The most popular ones are generalization [29, 31]
for k-anonymity [31] and bucketization [35, 25, 16] for ℓ-
diversity [23]. In both approaches, attributes are partitioned
into three categories: (1) some attributes are identifiers that
can uniquely identify an individual, such as Name or Social
Security Number; (2) some attributes are Quasi-Identifiers
(QI), which the adversary may already know (possibly from
other publicly-available databases) and which, when taken
together, can potentially identify an individual; and (3) some
attributes are Sensitive Attributes (SAs), which are unknown
to the adversary and are considered sensitive.
Motivation of Slicing
It has been shown [1, 15, 35] that generalization for k-
anonymity loses a considerable amount of information, espe-
cially for high-dimensional data. This is due to the following
three reasons. First, generalization for k-anonymity suffers
from the curse of dimensionality. In order for generalization
to be effective, records in the same bucket must be close to
each other so that generalizing the records would not lose too
much information. However, in high-dimensional data, most
data points have similar distances to each other, forcing a
great amount of generalization to satisfy k-anonymity even
for relatively small k’s. Second, in order to perform data
analysis or data mining tasks on the generalized table, the
data analyst has to make the uniform distribution assump-
tion that every value in a generalized interval/set is equally
possible, as no other distribution assumption can be justi-
fied. This significantly reduces the data utility of the gen-
eralized data. Third, because each attribute is generalized
separately, correlations between different attributes are lost.
In order to study attribute correlations on the generalized
table, the data analyst has to assume that every possible
combination of attribute values is equally likely. This is
an inherent problem of generalization that prevents effective
analysis of attribute correlations.
Contributions & Organization
In this paper, we present a novel technique called slicing
for privacy-preserving data publishing. Our contributions
include the following.
First, we introduce slicing as a new technique for privacy
preserving data publishing. Slicing has several advantages
when compared with generalization and bucketization. It
preserves better data utility than generalization. It pre-
serves more attribute correlations with the SAs than bucke-
tization. It can also handle high-dimensional data and data
without a clear separation of QIs and SAs.
Second, we show that slicing can be effectively used for
preventing attribute disclosure, based on the privacy re-
quirement of ℓ-diversity. We introduce a notion called ℓ-
diverse slicing, which ensures that the adversary cannot
learn the sensitive value of any individual with a probability
greater than 1/ℓ.
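The 1/ℓ bound can be illustrated with a short sketch. This is not the paper's algorithm; it is a simplified frequency-based check on hypothetical bucket data, which tests whether the most frequent sensitive value in every bucket stays within a 1/ℓ fraction of the bucket:

```python
from collections import Counter

def satisfies_l_diversity(buckets, ell):
    """Simplified frequency-based l-diversity check: in every bucket,
    the most common sensitive value may account for at most a 1/ell
    fraction of the tuples, keeping an adversary's inference
    probability at or below 1/ell."""
    for sensitive_values in buckets:
        counts = Counter(sensitive_values)
        if max(counts.values()) > len(sensitive_values) / ell:
            return False
    return True

# Two hypothetical buckets of sensitive values (e.g., diseases).
buckets = [["flu", "cancer", "flu", "asthma"],
           ["cancer", "flu", "asthma", "asthma"]]
print(satisfies_l_diversity(buckets, 2))  # True: no value exceeds half a bucket
print(satisfies_l_diversity(buckets, 4))  # False: "flu" exceeds 1/4 of bucket 1
```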
Third, we develop an efficient algorithm for computing
the sliced table that satisfies ℓ-diversity. Our algorithm par-
titions attributes into columns, applies column generaliza-
tion, and partitions tuples into buckets. Attributes that are
highly-correlated are in the same column; this preserves the
correlations between such attributes.
Privacy Threats
When publishing microdata, there are three types of pri-
vacy disclosure threats. The first type is membership disclo-
sure. When the dataset to be published is selected from a
large population and the selection criteria are sensitive (e.g.,
only diabetes patients are selected), one needs to prevent ad-
versaries from learning whether one’s record is included in
the published dataset.
The second type is identity disclosure, which occurs when
an individual is linked to a particular record in the released
table. In some situations, one wants to protect against iden-
tity disclosure when the adversary is uncertain of member-
ship. In this case, protection against membership disclo-
sure helps protect against identity disclosure. In other sit-
uations, some adversary may already know that an indi-
vidual’s record is in the published dataset, in which case,
membership disclosure protection either does not apply or
is insufficient.
The third type is attribute disclosure, which occurs when
new information about some individuals is revealed, i.e., the
released data makes it possible to infer the attributes of an
individual more accurately than would be possible before

the release. Similar to the case of identity disclosure, we
need to consider adversaries who already know the mem-
bership information. Identity disclosure leads to attribute
disclosure. Once there is identity disclosure, an individual
is re-identified and the corresponding sensitive value is re-
vealed. Attribute disclosure can occur with or without iden-
tity disclosure, e.g., when the sensitive values of all matching
tuples are the same.
SLICING ALGORITHMS
We now present an efficient slicing algorithm to achieve
ℓ-diverse slicing. Given a microdata table T and two param-
eters c and ℓ, the algorithm computes the sliced table that
consists of c columns and satisfies the privacy requirement
of ℓ-diversity.
Our algorithm consists of three phases: attribute parti-
tioning, column generalization, and tuple partitioning. We
now describe the three phases.
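To make the phases concrete, here is a toy end-to-end sketch. It is not the paper's implementation: the column partition is assumed rather than computed, column generalization is omitted, and the data, bucket size, and random seed are all illustrative:

```python
import random

random.seed(7)

# Toy microdata: each tuple is (Age, Sex, Zipcode, Disease).
tuples = [
    (22, "M", 47906, "dyspepsia"),
    (22, "F", 47906, "flu"),
    (33, "F", 47905, "flu"),
    (52, "F", 47905, "bronchitis"),
    (54, "M", 47302, "flu"),
    (60, "M", 47302, "dyspepsia"),
]

# Phase 1 (assumed here, not computed): attributes partitioned into
# the columns {Age, Sex} and {Zipcode, Disease}.
columns = [(0, 1), (2, 3)]

# Phase 3: tuples partitioned into buckets of two.
buckets = [tuples[i:i + 2] for i in range(0, len(tuples), 2)]

# Within each bucket, the values of each column are permuted
# independently, breaking cross-column linkage while keeping
# within-column correlations intact.
sliced = []
for bucket in buckets:
    parts = []
    for col in columns:
        vals = [tuple(t[i] for i in col) for t in bucket]
        random.shuffle(vals)
        parts.append(vals)
    sliced.append([sum(p, ()) for p in zip(*parts)])

for bucket in sliced:
    print(bucket)
```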
Attribute Partitioning
Our algorithm partitions attributes so that highly-
correlated attributes are in the same column. This is good
for both utility and privacy. In terms of data utility, group-
ing highly-correlated attributes preserves the correlations
among those attributes. In terms of privacy, the association
of uncorrelated attributes presents higher identification risks
than the association of highly-correlated attributes because
the association of uncorrelated attribute values is much less
frequent and thus more identifiable. Therefore, it is better
to break the associations between uncorrelated attributes,
in order to protect privacy.
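A minimal sketch of correlation-driven partitioning follows. It assumes numeric attributes and plain Pearson correlation in place of the paper's own association measure; the threshold and greedy merging are illustrative choices, not the paper's algorithm:

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two numeric attributes; a measure
    suited to categorical data would slot in here instead."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def partition_attributes(data, threshold=0.8):
    """Greedy sketch: merge two attributes into the same column
    whenever their absolute correlation exceeds the threshold."""
    n_attrs = len(data[0])
    columns = [{i} for i in range(n_attrs)]
    for a, b in combinations(range(n_attrs), 2):
        xs = [row[a] for row in data]
        ys = [row[b] for row in data]
        if abs(pearson(xs, ys)) >= threshold:
            ca = next(c for c in columns if a in c)
            cb = next(c for c in columns if b in c)
            if ca is not cb:
                ca |= cb
                columns.remove(cb)
    return columns

# Attributes 0 and 1 are strongly correlated; attribute 2 is not.
data = [(1, 2, 9), (2, 4, 1), (3, 6, 5), (4, 8, 2)]
print(partition_attributes(data))  # [{0, 1}, {2}]
```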
RELATED WORK
Two popular anonymization techniques are generalization
and bucketization. Generalization [29, 31, 30] replaces a
value with a “less-specific but semantically consistent” value.
Three types of encoding schemes have been proposed for
generalization: global recoding, regional recoding, and local
recoding. Global recoding has the property that multiple
occurrences of the same value are always replaced by the
same generalized value. Regional recoding [17], also called
multi-dimensional recoding (as in the Mondrian algorithm),
partitions the domain space into non-intersecting regions;
data points in the same region are represented by the region
they are in. Local recoding does not have the above con-
straints and allows different occurrences of the same value
to be generalized differently.
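The global-recoding property can be illustrated with a small sketch (a hypothetical Age attribute and 10-year intervals, not an example from the paper): every occurrence of the same value maps to the same generalized interval, regardless of which record it appears in:

```python
def global_recode_age(age):
    """Global recoding: the mapping from value to generalized
    interval is fixed once for the whole table."""
    low = (age // 10) * 10
    return f"[{low}-{low + 9}]"

ages = [22, 22, 33, 52, 54, 60]
print([global_recode_age(a) for a in ages])
# ['[20-29]', '[20-29]', '[30-39]', '[50-59]', '[50-59]', '[60-69]']
```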
Bucketization [35, 25, 16] first partitions tuples in the
table into buckets and then separates the quasi-identifiers
from the sensitive attribute by randomly permuting the sen-
sitive attribute values in each bucket. The anonymized
data consists of a set of buckets with permuted sensitive
attribute values. In particular, bucketization has been used
for anonymizing high-dimensional data [12]. Please refer to
Section 2.2 and Section 2.3 for a detailed comparison of slic-
ing with generalization and bucketization, respectively.
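A minimal bucketization sketch, assuming toy QI/SA data and a fixed bucket size (the values and seed are illustrative, not from the paper): the quasi-identifiers are published unchanged, while sensitive values are permuted within each bucket:

```python
import random

random.seed(3)

# (Age, Sex, Zipcode) are the quasi-identifiers; Disease is the SA.
records = [
    ((22, "M", 47906), "dyspepsia"),
    ((22, "F", 47906), "flu"),
    ((33, "F", 47905), "flu"),
    ((52, "F", 47905), "bronchitis"),
]

# Partition tuples into buckets, then permute the sensitive values
# within each bucket so that no QI tuple is linked to one SA value.
bucket_size = 2
published = []
for i in range(0, len(records), bucket_size):
    bucket = records[i:i + bucket_size]
    sas = [sa for _, sa in bucket]
    random.shuffle(sas)
    published.append([(qi, sa) for (qi, _), sa in zip(bucket, sas)])

for bucket in published:
    print(bucket)
```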