04-03-2013, 09:40 AM
Online Analytical Processing (OLAP)
Online Analytical.docx (Size: 1.02 MB / Downloads: 29)
INTRODUCTION
Online analytical processing (OLAP) is one of the most popular decision support and knowledge discovery techniques in business-intelligence systems However, it is a challenge to enable OLAP on private data without violating the data owners’ privacy. Traditional studies on database security provide access control and data sanitiza-tion (e.g., removal of personal-identifiable information) tools for the relational back end of OLAP systems Nonetheless, since an essential feature of OLAP is to compute the multidimensional aggregates of stored data, a major privacy concern in OLAP is the adversarial inference of (individual) private data points from the aggregate information. This inference problem cannot be fully ad-dressed by access control and data sanitization techniques.
In this paper, we address privacy protection against adversarial inference in an OLAP system consisting of a data warehouse server and a number of users. The data ware-house server holds private data, and is supposed to answer OLAP queries issued by users on the multidimensional aggregates of private data. A user may not have the right to access all individual data points in the data warehouse, but might be allowed to issue OLAP queries on the aggregates of data for which it has no right to access. For example, in a hospital system, the accounting department (as a user) can access each patient’s financial data, but not the patients’medical records (an actual HIPAA requirement). None-theless, the accounting department may query aggregate information related to the medical records, such as the total expense for patients with Alzheimer’s disease.
OLAP System Model
We consider an OLAP system where the data warehouse server stores data in an n-dimensional data cube, in order to support aggregate queries on an attribute-of-interest over selected data We refer to such attribute-of-interest as the measure attribute. Besides the measure attribute, there are n dimension attributes, each of which is represented by a dimension of the data cube. Each (base) cell of the data cube is the value of the measure attribute for the corresponding value combination of dimension attributes. For example, Table 1 shows a two-dimensional data cube with a measure attribute of sales and two dimension attributes of product and time. Each cell of the data cube in Table 1 is the sales amount (i.e., measure attribute) of a product (e.g., Book) in a month (e.g., April). As in real cases, some cells in the data cube can be missing or not applicable (i.e., N/A in Table 1).
Privacy Requirements
Due to privacy concerns, the owner of a data cube may not want a user to have access to all the information stored in it. Privacy requirements on a data cube may be defined over the cells in the data cube, and/or the data tuples in the relational back end of it. In this paper, we choose a cell as the granularity for specification of privacy requirements due to our focus on the processing of OLAP queries. Nonetheless, as evidenced by our privacy measure in Section 3, privacy requirements on tuples can also be transformed to require-ments on data cube cells. In our model, a user does not have the right to access all cells in the data cube. We refer to all cells that a user cannot access as sensitive cells for the user.
A user should not be capable of compromising the value of a sensitive cell from the query answers it receives. There are two measures for the compromise of a sensitive cell: exact and partial disclosure. While exact disclosure occurs iff the exact value of cell is known to a user, partial disclosure occurs whenever a user has a significant change between its prior and posterior belief on the value of a cell. We shall introduce the formal definitions for exact and partial disclosure in Section 3.3.
External Knowledge
In practice, besides receiving aggregate query answers, a user may also learn certain information about a sensitive cell from sources outside the OLAP system. We refer to such information as the external knowledge of the user. There are various types of external knowledge which may lead to the disclosure of sensitive cells. For example, if a user knows that the maximum employee salary of a company belongs to Alice, its CEO, then the salary of Alice can be inferred from just one aggregate query on the maximum salary in the data cube. Even a MAX query on the salary of a subset of the employees can be privacy-divulging, as the answer can be used by an adversary as a lower-bound estimate on Alice’s salary. It is difficult, if not possible, to have a universal model that captures all types of external
System Settings
Let there be one data warehouse server and a number of users C1; . . . ; CU in the system. For each user Ck (k 2 =1; U&), let Gk be the set of cells in the data cube that Ck has the right to access. We refer to Gk as the set of preknown cells of Ck. Since the data warehouse server needs to properly enforce access control on the sensitive cells, we assume that it knows Gk for k 2 =1; U& as preknowledge.
To model the processing of OLAP queries in many real systems, we follow an online setting of the inference control problem which assumes that each user issues multiple OLAP queries sequentially, and the data ware-house server has no knowledge of the upcoming queries. In Section 8.2, we shall discuss an extension of our results to cases where a user can submit multiple queries in a batch. We shall show that while all results in the paper still apply to these cases, the simultaneous submission of multiple queries may allow us to further optimize the performance of inference control.
Models of Privacy Intrusion Attacks and Defensive Countermeasures
A user Ck can be an adversary intending to compromise the private information about sensitive cells x (x Gk) that it has no right to access. In order to do so, Ck may launch an inference attack by inferring private information about x from the preknown cells in Gk as well as the historic query answers in Qk. Throughout this paper, we make a worst-case assumption that each (malicious) user is computation-ally unbounded in that its inference attack may compromise the maximum possible private information about x (x Gk) from Qk and Gk irrespective of the computational cost. In the majority of the paper, we follow a common assumption in the literature that the users do not share Gk and Qk with each other. Nevertheless, we shall extend our approach to address the threats from colluding users in Section 8.1.
OUR NEW APPROACH
In this section, we present the basic idea of our information-theoretic approach for inference control. We first introduce the basic workflow of our approach. Then, we define two categories of aggregate functions, namely MIN-like and SUM-like functions, based on their information-theoretic properties. We use examples to illustrate our basic ideas of dealing with these two types of functions, respectively. The detailed algorithms of our approach will be presented in Section 5.
INFERENCE CONTROL ALGORITHM
In this section, we present the detailed algorithm of our information-theoretic inference control approach. In order to help readers to better understand the methodology of our proposed approach, we first show the algorithm on a simple case of two-dimensional data cube in which each cell follows a uniform distribution on f0; 1g and has a constant l(x). After that, we present a generic algorithm for n-dimensional data cubes with cells of arbitrary distribution and heterogeneous l(x). At the end of this section, we shall show that both algorithms satisfy the simulatable auditing model and disclose no private information through query rejections.
Algorithm A: A Simple Two-Dimensional Case
We first consider a simple case with a two-dimensional d1* d2 data cube, in which each sensitive cell x is uniformly distributed on f0; 1g, and has a constant owner-specified threshold l(x)= l € [0; 1]. Note that the assumptions of uniform distribution and constant l(x) (for all cells) are certainly not realistic models of real-world data. None-theless, we would like to remark that the introduction of Algorithm A is not intended to promote the practical usage of it. Instead, we use it as a simple demonstration of the power of our information-theoretic approach, and a foundation for the introduction of Algorithm B, which is a generic and practical algorithm for n-dimensional data cubes with arbitrary distribution.
Simulatability of Algorithms A and B
We now show that both Algorithms A and B satisfy the simulatable auditing model Note that either algorithm makes the answer/reject decision for a received query q based on q, the query history Qk, the preknown set Gk, the prior distribution of the cells, and nothing else. Neither algorithm uses the values of sensitive cells or the answer to q as input. As such, all inputs to Algorithms A and B are public information available to the users. Thus, the answer/ reject decisions made by the data warehouse server are fully simulatable by the users. Due to the definition of simula-table auditing the auditor (i.e., the data warehouse server) is simulatable with Algorithms A and B.
Processing of Nonskeleton Queries
For the ease of understanding, our previous examples and discussions have been focused on skeleton queries. Although skeleton queries are the main focus of an OLAP system, a user may also issue nonskeleton queries, such as range queries, each of which covers more than one value (but fewer than all values) of an attribute. Since the cells covered by a nonskeleton query can always be separated into multiple nonoverlapping skeleton queries, an easy method to process a nonskeleton query with distributive aggregate function (e.g., MIN, MAX, and SUM) is to transform the nonskeleton query into a sequence of skeleton ones, and to answer the nonskeleton query iff all (trans-formed) skeleton ones can be answered.
Query Availability Analysis
In this section, we analyze the performance of Algorithms A and B on Objective O2 in terms of the query availability level. Since our algorithms deal with MIN-like and SUM-like functions separately, we consider the following three cases, respectively: 1) all queries are MIN-like, 2) all queries are SUM-like, and 3) a combination of MIN-like and SUM-like queries.
SIMULATION RESULTS
In this section, we present simulation results on the performance of Algorithms A and B, respectively, and compare the performance of our approach with previous
ones proposed for inference control in OLAP. As shown in the previous approaches in achieve their best performance in two-dimensional data cubes. Thus, we compare the performance of our approach with the previous ones in the two-dimensional case addressed by Algorithm A.
In particular, for Algorithm A, we conduct the simulation on a 1;000*1;000 data cube, each cell of which is chosen uniformly at random from {0; 1}. We consider various numbers of preknown cells ranging from 0 to 100 percent of all cells, and assume that the preknown cells are randomly distributed in the data cube.
Online Analytical.docx (Size: 1.02 MB / Downloads: 29)
INTRODUCTION
Online analytical processing (OLAP) is one of the most popular decision support and knowledge discovery techniques in business-intelligence systems However, it is a challenge to enable OLAP on private data without violating the data owners’ privacy. Traditional studies on database security provide access control and data sanitiza-tion (e.g., removal of personal-identifiable information) tools for the relational back end of OLAP systems Nonetheless, since an essential feature of OLAP is to compute the multidimensional aggregates of stored data, a major privacy concern in OLAP is the adversarial inference of (individual) private data points from the aggregate information. This inference problem cannot be fully ad-dressed by access control and data sanitization techniques.
In this paper, we address privacy protection against adversarial inference in an OLAP system consisting of a data warehouse server and a number of users. The data ware-house server holds private data, and is supposed to answer OLAP queries issued by users on the multidimensional aggregates of private data. A user may not have the right to access all individual data points in the data warehouse, but might be allowed to issue OLAP queries on the aggregates of data for which it has no right to access. For example, in a hospital system, the accounting department (as a user) can access each patient’s financial data, but not the patients’medical records (an actual HIPAA requirement). None-theless, the accounting department may query aggregate information related to the medical records, such as the total expense for patients with Alzheimer’s disease.
OLAP System Model
We consider an OLAP system where the data warehouse server stores data in an n-dimensional data cube, in order to support aggregate queries on an attribute-of-interest over selected data We refer to such attribute-of-interest as the measure attribute. Besides the measure attribute, there are n dimension attributes, each of which is represented by a dimension of the data cube. Each (base) cell of the data cube is the value of the measure attribute for the corresponding value combination of dimension attributes. For example, Table 1 shows a two-dimensional data cube with a measure attribute of sales and two dimension attributes of product and time. Each cell of the data cube in Table 1 is the sales amount (i.e., measure attribute) of a product (e.g., Book) in a month (e.g., April). As in real cases, some cells in the data cube can be missing or not applicable (i.e., N/A in Table 1).
Privacy Requirements
Due to privacy concerns, the owner of a data cube may not want a user to have access to all the information stored in it. Privacy requirements on a data cube may be defined over the cells in the data cube, and/or the data tuples in the relational back end of it. In this paper, we choose a cell as the granularity for specification of privacy requirements due to our focus on the processing of OLAP queries. Nonetheless, as evidenced by our privacy measure in Section 3, privacy requirements on tuples can also be transformed to require-ments on data cube cells. In our model, a user does not have the right to access all cells in the data cube. We refer to all cells that a user cannot access as sensitive cells for the user.
A user should not be capable of compromising the value of a sensitive cell from the query answers it receives. There are two measures for the compromise of a sensitive cell: exact and partial disclosure. While exact disclosure occurs iff the exact value of cell is known to a user, partial disclosure occurs whenever a user has a significant change between its prior and posterior belief on the value of a cell. We shall introduce the formal definitions for exact and partial disclosure in Section 3.3.
External Knowledge
In practice, besides receiving aggregate query answers, a user may also learn certain information about a sensitive cell from sources outside the OLAP system. We refer to such information as the external knowledge of the user. There are various types of external knowledge which may lead to the disclosure of sensitive cells. For example, if a user knows that the maximum employee salary of a company belongs to Alice, its CEO, then the salary of Alice can be inferred from just one aggregate query on the maximum salary in the data cube. Even a MAX query on the salary of a subset of the employees can be privacy-divulging, as the answer can be used by an adversary as a lower-bound estimate on Alice’s salary. It is difficult, if not possible, to have a universal model that captures all types of external
System Settings
Let there be one data warehouse server and a number of users C1; . . . ; CU in the system. For each user Ck (k 2 =1; U&), let Gk be the set of cells in the data cube that Ck has the right to access. We refer to Gk as the set of preknown cells of Ck. Since the data warehouse server needs to properly enforce access control on the sensitive cells, we assume that it knows Gk for k 2 =1; U& as preknowledge.
To model the processing of OLAP queries in many real systems, we follow an online setting of the inference control problem which assumes that each user issues multiple OLAP queries sequentially, and the data ware-house server has no knowledge of the upcoming queries. In Section 8.2, we shall discuss an extension of our results to cases where a user can submit multiple queries in a batch. We shall show that while all results in the paper still apply to these cases, the simultaneous submission of multiple queries may allow us to further optimize the performance of inference control.
Models of Privacy Intrusion Attacks and Defensive Countermeasures
A user Ck can be an adversary intending to compromise the private information about sensitive cells x (x Gk) that it has no right to access. In order to do so, Ck may launch an inference attack by inferring private information about x from the preknown cells in Gk as well as the historic query answers in Qk. Throughout this paper, we make a worst-case assumption that each (malicious) user is computation-ally unbounded in that its inference attack may compromise the maximum possible private information about x (x Gk) from Qk and Gk irrespective of the computational cost. In the majority of the paper, we follow a common assumption in the literature that the users do not share Gk and Qk with each other. Nevertheless, we shall extend our approach to address the threats from colluding users in Section 8.1.
OUR NEW APPROACH
In this section, we present the basic idea of our information-theoretic approach for inference control. We first introduce the basic workflow of our approach. Then, we define two categories of aggregate functions, namely MIN-like and SUM-like functions, based on their information-theoretic properties. We use examples to illustrate our basic ideas of dealing with these two types of functions, respectively. The detailed algorithms of our approach will be presented in Section 5.
INFERENCE CONTROL ALGORITHM
In this section, we present the detailed algorithm of our information-theoretic inference control approach. In order to help readers to better understand the methodology of our proposed approach, we first show the algorithm on a simple case of two-dimensional data cube in which each cell follows a uniform distribution on f0; 1g and has a constant l(x). After that, we present a generic algorithm for n-dimensional data cubes with cells of arbitrary distribution and heterogeneous l(x). At the end of this section, we shall show that both algorithms satisfy the simulatable auditing model and disclose no private information through query rejections.
Algorithm A: A Simple Two-Dimensional Case
We first consider a simple case with a two-dimensional d1* d2 data cube, in which each sensitive cell x is uniformly distributed on f0; 1g, and has a constant owner-specified threshold l(x)= l € [0; 1]. Note that the assumptions of uniform distribution and constant l(x) (for all cells) are certainly not realistic models of real-world data. None-theless, we would like to remark that the introduction of Algorithm A is not intended to promote the practical usage of it. Instead, we use it as a simple demonstration of the power of our information-theoretic approach, and a foundation for the introduction of Algorithm B, which is a generic and practical algorithm for n-dimensional data cubes with arbitrary distribution.
Simulatability of Algorithms A and B
We now show that both Algorithms A and B satisfy the simulatable auditing model Note that either algorithm makes the answer/reject decision for a received query q based on q, the query history Qk, the preknown set Gk, the prior distribution of the cells, and nothing else. Neither algorithm uses the values of sensitive cells or the answer to q as input. As such, all inputs to Algorithms A and B are public information available to the users. Thus, the answer/ reject decisions made by the data warehouse server are fully simulatable by the users. Due to the definition of simula-table auditing the auditor (i.e., the data warehouse server) is simulatable with Algorithms A and B.
Processing of Nonskeleton Queries
For the ease of understanding, our previous examples and discussions have been focused on skeleton queries. Although skeleton queries are the main focus of an OLAP system, a user may also issue nonskeleton queries, such as range queries, each of which covers more than one value (but fewer than all values) of an attribute. Since the cells covered by a nonskeleton query can always be separated into multiple nonoverlapping skeleton queries, an easy method to process a nonskeleton query with distributive aggregate function (e.g., MIN, MAX, and SUM) is to transform the nonskeleton query into a sequence of skeleton ones, and to answer the nonskeleton query iff all (trans-formed) skeleton ones can be answered.
Query Availability Analysis
In this section, we analyze the performance of Algorithms A and B on Objective O2 in terms of the query availability level. Since our algorithms deal with MIN-like and SUM-like functions separately, we consider the following three cases, respectively: 1) all queries are MIN-like, 2) all queries are SUM-like, and 3) a combination of MIN-like and SUM-like queries.
SIMULATION RESULTS
In this section, we present simulation results on the performance of Algorithms A and B, respectively, and compare the performance of our approach with previous
ones proposed for inference control in OLAP. As shown in the previous approaches in achieve their best performance in two-dimensional data cubes. Thus, we compare the performance of our approach with the previous ones in the two-dimensional case addressed by Algorithm A.
In particular, for Algorithm A, we conduct the simulation on a 1;000*1;000 data cube, each cell of which is chosen uniformly at random from {0; 1}. We consider various numbers of preknown cells ranging from 0 to 100 percent of all cells, and assume that the preknown cells are randomly distributed in the data cube.