18-04-2013, 02:49 PM
Anonymous Publication of Sensitive Transactional Data
Anonymous Publication.pdf (Size: 2.58 MB / Downloads: 27)
Abstract
Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is to
enforce privacy-preserving paradigms, such as k-anonymity and ‘-diversity, while minimizing the information loss incurred in the
anonymizing process (i.e., maximize data utility). Existing techniques work well for fixed-schema data, with low dimensionality.
Nevertheless, certain applications require privacy-preserving publishing of transactional data (or basket data), which involve hundreds
or even thousands of dimensions, rendering existing methods unusable. We propose two categories of novel anonymization methods
for sparse high-dimensional data. The first category is based on approximate nearest-neighbor (NN) search in high-dimensional
spaces, which is efficiently performed through locality-sensitive hashing (LSH). In the second category, we propose two data
transformations that capture the correlation in the underlying data: 1) reduction to a band matrix and 2) Gray encoding-based sorting.
These representations facilitate the formation of anonymized groups with low information loss, through an efficient linear-time heuristic.
We show experimentally, using real-life data sets, that all our methods clearly outperform existing state of the art. Among the proposed
techniques, NN-search yields superior data utility compared to the band matrix transformation, but incurs higher computational
overhead. The data transformation based on Gray code sorting performs best in terms of both data utility and execution time.
INTRODUCTION
ORGANIZATIONS, such as hospitals, publish detailed data
(microdata) about individuals (e.g., medical records) for
research or statistical purposes. However, sensitive personal
information may be disclosed in this process, due to the
existence in the data of quasi-identifying attributes (QID),
such as age, zip code, etc. An attacker can join the QID with
external information, such as voting registration lists, to
reidentify individual records.
Existing privacy-preserving techniques focus on anonymizing
personal data, which have a fixed schema with a
small number of dimensions. Through generalization or
suppression, existing methods prevent attackers from
reidentifying individual records.
Motivation
Consider the example in Fig. 1a, which shows the contents
of five purchase transactions (the customer name is not
disclosed, we include it just for ease of presentation). The
sensitive products (items), which are considered to be a
privacy breach if associated with a certain individual, are
shown shaded. The rest of the items, which are nonsensitive,
can be used by an attacker to reidentify individual
transactions, similarly to a quasi-identifier, with the
distinctive characteristic that the number of potentially
identifying items is very large in practice (hence, the QID
has very high dimensionality). Consider the transaction of
Claire, who bought a pregnancy test. An attacker (Eve) may
easily learn about some of the items purchased by Claire on
a certain day, possibly from a conversation with her, or
from knowing some of her personal preferences. For
instance, Claire may treat her guests, including Eve, with
fresh cream and strawberries, and Eve can, therefore, infer
that Claire must have purchased these items recently.
Contributions
We propose two categories of anonymization techniques
that combine the advantages of both generalization and
permutation, and also address the difficult challenge of
high dimensionality: 1) methods that rely on efficient
nearest-neighbor (NN) search in high-dimensional spaces,
using Locality-Sensitive Hashing (LSH) [7], and 2) methods
that employ data reorganization. Techniques in the latter
category consist of two stages: in the first stage, the data
representation is changed (without altering the contents)
such that transactions with similar QID are placed in
proximity to each other. In the second stage, transactions
are assembled into anonymized groups.
RELATED WORK
Privacy-preserving data publishing has received considerable
attention in recent years, especially in the context of
relational data. The work in [10] employs random perturbation
to prevent reidentification of records, by adding noise
to the data. In [11], it is shown that an attacker could filter
the random noise, and hence, breach data privacy, unless
the noise is correlated with the data. However, randomly
perturbed data are not “truthful” [12] in the sense that it
contains records which do not exist in the original data.
Furthermore, random perturbation may expose privacy of
outliers when an attacker has access to external knowledge.
Privacy Requirements
Two main privacy-preserving paradigms have been
established for relational data: k-anonymity [13], which
prevents identification of individual records in the data,
and ‘-diversity [2], which prevents the association of an
individual record with a sensitive attribute value. Observe
that it is the association of individuals with sensitive
information that ultimately threats privacy. Similarly, in
the case of transactional data, the privacy threat is defined
as the association of an individual transaction to a
sensitive item. Nevertheless, the privacy preservation of
transactional data is different from its relational database
counterpart, where all records have the same number
(usually one) of sensitive attributes. In our case, some
transactions may not contain any sensitive item, hence are
completely innocuous, while others may have multiple
sensitive items.
Discussion of NN-Based Anonymization
Methods
The NN-based techniques introduced in this section create
anonymized groups by searching for the nearest neighbors
of sensitive transactions. NN search has the advantage of
maximizing the quality of each individual group. On the
other hand, no global objective is considered, in the sense
that the quality of groups other than the current group is
completely disregarded. Since the NN methods are localized
in nature, their performance depends on the order of
processing-sensitive transactions, and poor results may be
obtained if a particularly unfavorable ordering is chosen. In
the next section, we present another category of anonymization
methods based on global optimization.
CONCLUSIONS
In this paper, the problem of anonymizing sparse, highdimensional
transactional data is solved through methods
based on 1) local NN-search and 2) global data reorganization.
All proposed approaches outperform the existing state
of the art, which does not handle well high data dimensionality.
LSH-based anonymization outperforms the RCM
method in terms of data utility, but incurs slightly higher
computational overhead, whereas Gray code sorting is
superior to all other methods with respect to both data
utility and anonymization overhead.