06-02-2013, 03:31 PM
Privacy-Preserving OLAP: An Information-Theoretic Approach
Abstract
We address issues related to the protection of private information in Online Analytical Processing (OLAP) systems, where a
major privacy concern is the adversarial inference of private information from OLAP query answers. Most previous work on privacy-preserving
OLAP focuses on a single aggregate function and/or addresses only exact disclosure, which eliminates from consideration
an important class of privacy breaches where partial information, but not exact values, of private data is disclosed (i.e., partial
disclosure). We address privacy protection against both exact and partial disclosure in OLAP systems with mixed aggregate functions.
In particular, we propose an information-theoretic inference control approach that supports a combination of common aggregate
functions (e.g., COUNT, SUM, MIN, MAX, and MEDIAN) and guarantees that the level of privacy disclosure does not exceed thresholds
predetermined by the data owners. We demonstrate that our approach is efficient and can be implemented in existing OLAP systems
with little modification. It also satisfies the simulatable auditing model and leaks no private information through query rejections.
Through performance analysis, we show that compared with previous approaches, our approach provides more effective privacy
protection while maintaining a higher level of query-answer availability.
INTRODUCTION
Online analytical processing (OLAP) is one of the most
popular decision support and knowledge discovery
techniques in business-intelligence systems [20]. However, it
is a challenge to enable OLAP on private data without
violating the data owners’ privacy. Traditional studies on
database security provide access control and data sanitization
(e.g., removal of personally identifiable information) tools
for the relational back end of OLAP systems [19], [21], [25],
[31]. Nonetheless, since an essential feature of OLAP is to
compute the multidimensional aggregates of stored data, a
major privacy concern in OLAP is the adversarial inference
of (individual) private data points from the aggregate
information. This inference problem cannot be fully addressed
by access control and data sanitization techniques.
BACKGROUND
OLAP System Model
We consider an OLAP system where the data warehouse
server stores data in an n-dimensional data cube, in order to
support aggregate queries on an attribute of interest over
selected data [20]. We refer to this attribute of interest as the
measure attribute. Besides the measure attribute, there are n
dimension attributes, each of which is represented by a
dimension of the data cube. Each (base) cell of the data cube
is the value of the measure attribute for the corresponding
value combination of dimension attributes. For example,
Table 1 shows a two-dimensional data cube with a measure
attribute of sales and two dimension attributes of product
and time. Each cell of the data cube in Table 1 is the sales
amount (i.e., measure attribute) of a product (e.g., Book) in a
month (e.g., April). As in real cases, some cells in the data
cube can be missing or not applicable (i.e., N/A in Table 1).
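This cube-and-query model can be sketched in a few lines of Python (a hypothetical example; the products, months, and sales figures are illustrative and not taken from Table 1):

```python
import numpy as np

# Hypothetical 2-D data cube: measure attribute = sales,
# dimension attributes = product (rows) and month (columns).
# np.nan marks a missing / not-applicable (N/A) cell.
cube = np.array([
    [120.0,  95.0, np.nan],   # Book: Jan, Feb, Mar
    [ 40.0, np.nan, 60.0],    # Pen:  Jan, Feb, Mar
])

def aggregate(cube, rows, cols, func=np.nansum):
    """Apply an aggregate function to the selected region of the
    cube, skipping missing (NaN) cells."""
    return func(cube[np.ix_(rows, cols)])

book_total = aggregate(cube, rows=[0], cols=[0, 1, 2])            # SUM over Book = 215.0
jan_max = aggregate(cube, rows=[0, 1], cols=[0], func=np.nanmax)  # MAX over Jan = 120.0
```

Note the NaN-aware aggregates (`np.nansum`, `np.nanmax`), which mirror the model's treatment of missing cells.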
Privacy Requirements
Due to privacy concerns, the owner of a data cube may not
want a user to have access to all the information stored in it.
Privacy requirements on a data cube may be defined over
the cells in the data cube, and/or the data tuples in the
relational back end of it. In this paper, we choose a cell as the
granularity for specification of privacy requirements due to
our focus on the processing of OLAP queries. Nonetheless,
as evidenced by our privacy measure in Section 3, privacy
requirements on tuples can also be transformed to requirements
on data cube cells. In our model, a user does not have
the right to access all cells in the data cube. We refer to all
cells that a user cannot access as sensitive cells for the user.
A user should not be capable of compromising the value
of a sensitive cell from the query answers it receives. There
are two measures for the compromise of a sensitive cell:
exact and partial disclosure. While exact disclosure occurs
iff the exact value of a cell is known to a user, partial
disclosure occurs whenever a user has a significant change
between its prior and posterior belief on the value of a cell.
We shall introduce the formal definitions for exact and
partial disclosure in Section 3.3.
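The distinction can be made concrete with a toy posterior computation (a sketch under an assumed uniform prior; the cells and values are invented, and the formal definitions are those of Section 3.3):

```python
from fractions import Fraction
from itertools import product

# Toy model: two cells x1, x2, each uniform on {0, 1, 2} a priori
# (values chosen purely for illustration).
VALUES = [0, 1, 2]

def posterior_x1(sum_answer):
    """Posterior distribution of x1 after the user learns
    SUM(x1, x2) = sum_answer, by enumerating consistent pairs."""
    consistent = [(a, b) for a, b in product(VALUES, VALUES)
                  if a + b == sum_answer]
    return {v: Fraction(sum(1 for a, _ in consistent if a == v),
                        len(consistent))
            for v in VALUES}

# Partial disclosure: SUM = 1 rules out x1 = 2, so the prior 1/3
# belief on each value shifts to 1/2 on each remaining value.
partial = posterior_x1(1)

# Exact disclosure: SUM = 4 is only consistent with x1 = x2 = 2,
# so the posterior puts probability 1 on x1 = 2.
exact = posterior_x1(4)
```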
External Knowledge
In practice, besides receiving aggregate query answers, a
user may also learn certain information about a sensitive
cell from sources outside the OLAP system. We refer to
such information as the external knowledge of the user. There
are various types of external knowledge which may lead to
the disclosure of sensitive cells. For example, if a user
knows that the maximum employee salary of a company
belongs to Alice, its CEO, then the salary of Alice can be
inferred from just one aggregate query on the maximum
salary in the data cube. Even a MAX query on the salary of a
subset of the employees can be privacy-divulging, as the
answer can be used by an adversary as a lower-bound
estimate on Alice’s salary.
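This MAX-query leak can be sketched directly (names and salary figures are invented for illustration):

```python
# External knowledge: the user knows Alice, the CEO, has the
# maximum salary in the company.
salaries = {"Alice": 250_000, "Bob": 90_000, "Carol": 110_000}

def answer_max(employee_subset):
    """Server-side MAX aggregate over a subset of employees."""
    return max(salaries[name] for name in employee_subset)

# A MAX over a subset that excludes Alice still tightens the
# adversary's lower-bound estimate of her salary:
lower_bound = answer_max(["Bob", "Carol"])   # 110_000 <= Alice's salary

# A MAX over all employees discloses her salary exactly:
leaked = answer_max(salaries)                # 250_000
```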
Performance Measures
An inference control approach should be measured by its
ability to protect privacy and answer OLAP queries. We
define the measures for privacy disclosure and query
availability, respectively, as follows:
Privacy Disclosure Measure
We define a continuous measure for privacy disclosure in
order to capture both exact and partial disclosure of private
information. For each cell x, we assume that the prior
distribution of x is public information to both the data
warehouse server and the users. This is a reasonable
assumption in practice because many attributes, such as
age and salary, have underlying probability distributions
that both the data warehouse server and the users can learn
from external knowledge [23]. We restrict our discussion in
this paper to discrete distributions of x—we assume a simple
discretization of continuous data to resemble discrete values.
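One way such a continuous measure can be realized (a sketch only; the paper's own measure is defined in Section 3) is to compare the public prior on a cell with the posterior induced by a query answer, for example via the Kullback-Leibler divergence:

```python
import math

def kl_divergence_bits(posterior, prior):
    """KL divergence D(posterior || prior) in bits, for discrete
    distributions given as aligned probability lists."""
    return sum(p * math.log2(p / q)
               for p, q in zip(posterior, prior) if p > 0)

# Uniform prior over four possible (discretized) values of a cell;
# a query answer eliminates two of them.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.0, 0.5, 0.5, 0.0]

disclosure = kl_divergence_bits(posterior, prior)   # 1.0 bit
```

Exact disclosure is the extreme case where the posterior concentrates on a single value; partial disclosure registers as any positive divergence.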
INFERENCE CONTROL ALGORITHM
In this section, we present the detailed algorithm of our
information-theoretic inference control approach. In order to
help readers better understand the methodology of our
proposed approach, we first show the algorithm on a simple
case of a two-dimensional data cube in which each cell follows
a uniform distribution on {0, 1} and has a constant ℓ(x). After
that, we present a generic algorithm for n-dimensional data
cubes with cells of arbitrary distribution and heterogeneous
ℓ(x). At the end of this section, we shall show that both
algorithms satisfy the simulatable auditing model [23] and
disclose no private information through query rejections.
Algorithm A: A Simple Two-Dimensional Case
We first consider a simple case with a two-dimensional d1 × d2
data cube, in which each sensitive cell x is uniformly
distributed on {0, 1}, and has a constant owner-specified
threshold ℓ(x) = ℓ ∈ [0, 1]. Note that the assumptions of
uniform distribution and constant ℓ(x) (for all cells) are
certainly not realistic models of real-world data. Nonetheless,
we would like to remark that Algorithm A is not intended for
practical use. Instead, we use it as a simple demonstration of the
power of our information-theoretic approach, and a
foundation for the introduction of Algorithm B, which is a
generic and practical algorithm for n-dimensional data
cubes with arbitrary distribution.
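To convey the flavor of such a threshold check (a simplified sketch, not the paper's actual Algorithm A): for cells uniform on {0, 1}, a SUM query over k cells answered with s gives each queried cell the posterior P(x = 1 | SUM = s) = s/k, and a server can decide from the possible answers alone whether every posterior stays within the threshold ℓ of the 1/2 prior:

```python
from fractions import Fraction

def safe_to_answer(k, threshold):
    """Decide whether a SUM query over k binary uniform cells may be
    answered. The check ranges over all POSSIBLE answers s, not the
    actual data, so a rejection itself leaks no private information
    (the simulatable-auditing property)."""
    prior = Fraction(1, 2)
    return all(abs(Fraction(s, k) - prior) <= threshold
               for s in range(k + 1))

safe_to_answer(4, Fraction(1, 2))   # True: every posterior within 1/2 of prior
safe_to_answer(4, Fraction(1, 4))   # False: s = 0 or s = 4 pins cells exactly
```

Because the decision depends only on k and ℓ, any user could simulate it, which is exactly why rejections are uninformative.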