
Online Analytical Processing (OLAP)



Online Analytical Processing (OLAP) means analysing large quantities of data in real time. Unlike Online Transaction Processing (OLTP), where typical operations read and modify individual records or small numbers of records, OLAP deals with data in bulk, and its operations are generally read-only. The term 'online' implies that even though huge quantities of data are involved (typically many millions of records, occupying several gigabytes) the system must respond to queries fast enough to allow interactive exploration of the data. As we shall see, that presents considerable technical challenges.
OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Whereas a relational database stores all data in the form of rows and columns, a multidimensional dataset consists of axes and cells. Consider the following dataset:

[Table: product sales, with rows for the product hierarchy ('All products', 'Books', 'Fiction', ...) and columns for the years 2000 and 2001, a calculated 'Growth' member, and the measures 'Unit sales' and 'Dollar sales'; the table itself is not reproduced in this version]

The rows axis consists of the members 'All products', 'Books', 'Fiction', and so forth, and the columns axis consists of the Cartesian product of the years '2000' and '2001' plus the calculation 'Growth', crossed with the measures 'Unit sales' and 'Dollar sales'. Each cell represents the sales of a product category in a particular year; for example, the dollar sales of Magazines in 2001 were $2,426.
This is a richer view of the data than would be presented by a relational database. The members of a multidimensional dataset are not always values from a relational column. 'All products', 'Books' and 'Fiction' are members at successive levels in a hierarchy, each of which is rolled up to the next. And even though it sits alongside the years '2000' and '2001', 'Growth' is a calculated member, which introduces a formula for computing cells from other cells.
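The calculated-member idea can be sketched in a few lines of Python. The cube below is a plain dictionary; only the $2,426 Magazines figure for 2001 comes from the text, and the remaining values are hypothetical.

```python
# Minimal sketch of a calculated member. Only the 2001 dollar figure for
# Magazines ($2,426) comes from the text; the other values are hypothetical.
sales = {
    # (member, year) -> {"units": ..., "dollars": ...}
    ("Magazines", 2000): {"units": 1000, "dollars": 2000.0},  # hypothetical
    ("Magazines", 2001): {"units": 1200, "dollars": 2426.0},  # from the text
}

def growth(member, measure):
    """'Growth' is not stored: its cells are computed from other cells."""
    prev = sales[(member, 2000)][measure]
    curr = sales[(member, 2001)][measure]
    return (curr - prev) / prev

print(f"Dollar growth for Magazines: {growth('Magazines', 'dollars'):.1%}")
```

A real OLAP engine would express 'Growth' as a formula over the time dimension rather than hard-coding the two years, but the principle is the same.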



Who uses OLAP and Why?

OLAP applications span a variety of organizational functions. Finance departments use OLAP for applications such as budgeting, activity-based costing (allocations), financial performance analysis, and financial modeling. Sales analysis and forecasting are two of the OLAP applications found in sales departments. Among other applications, marketing departments use OLAP for market research analysis, sales forecasting, promotions analysis, customer analysis, and market/customer segmentation. Typical manufacturing OLAP applications include production planning and defect analysis.
Important to all of the above applications is the ability to provide managers with the information they need to make effective decisions about an organization’s strategic directions. The key indicator of a successful OLAP application is its ability to provide information as needed, i.e., its ability to provide “just-in-time” information for effective decision-making. This requires more than a base level of detailed data.
Just-in-time information is computed data that usually reflects complex relationships and is often calculated on the fly. Analyzing and modeling complex relationships are practical only if response times are consistently short. In addition, because the nature of data relationships may not be known in advance, the data model must be flexible. A truly flexible data model ensures that OLAP systems can respond to changing business requirements as needed for effective decision making. Although OLAP applications are found in widely divergent functional areas, they all require the following key features:




Calculation-intensive capabilities
The real test of an OLAP database is its ability to perform complex calculations. OLAP databases must be able to do more than simple aggregation. While aggregation along a hierarchy is important, there is more to analysis than simple data roll-ups. Examples of more complex calculations include share calculations (percentage of total) and allocations (which use hierarchies from a top-down perspective).
Key performance indicators often require involved algebraic equations. Sales forecasting uses trend algorithms such as moving averages and percentage growth. Analyzing the sales and promotions of a given company and its competitors requires modeling complex relationships among the players. The real world is complicated – the ability to model complex relationships is key in analytical processing applications.
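Two of the calculations named above, share of total and a trailing moving average, can be sketched as follows (the helper names and sample figures are illustrative, not taken from any OLAP product):

```python
def share_of_total(values):
    """Share calculation: each member's value as a percentage of the total."""
    total = sum(values.values())
    return {member: v / total * 100 for member, v in values.items()}

def moving_average(series, window):
    """Trailing moving average, a common trend algorithm in sales forecasting."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

category_sales = {"Books": 300.0, "Magazines": 200.0, "Music": 500.0}
print(share_of_total(category_sales))       # Books 30%, Magazines 20%, Music 50%
print(moving_average([10, 20, 30, 40], 2))  # [15.0, 25.0, 35.0]
```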
Whereas transaction processing systems are judged on their ability to collect and manage data, analytical processing systems are judged on their ability to create information from data.

Time intelligence

Time is an integral component of almost any analytical application. Time is a unique dimension because it is sequential in character (January always comes before February). True OLAP systems understand the sequential nature of time. Business performance is almost always judged over time, for example, this month vs. last month, this month vs. the same month last year. The time hierarchy is not always used in the same manner as other hierarchies. For example, a manager might ask to see the sales for May or the sales for the first five months of 1995. The same manager might also ask to see the sales for blue shirts but would never ask to see the sales for the first five shirts. Concepts such as year-to-date and period over period comparisons must be easily defined in an OLAP system.
In addition, OLAP systems must understand the concept of balances over time. For example, if a company sold 10 shirts in January, five shirts in February, and 10 shirts in March, then the total balance sold for the quarter would be 25 shirts. If, on the other hand, a company had a head count of 10 employees in January, only five employees in February, and 10 employees again in March, what was the company’s employee head count for the quarter? Most companies would use an average balance. In the case of cash, most companies use an ending balance.
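The three balance semantics in the example (total for shirts sold, average for head count, ending balance for cash) can be sketched as:

```python
monthly = [10, 5, 10]  # January, February, March values from the example

def quarter_balance(values, kind):
    """How a measure rolls up over time depends on its balance semantics:
    flows (shirts sold) sum, head counts are usually averaged, and cash
    typically uses the ending balance."""
    if kind == "total":
        return sum(values)                # 10 + 5 + 10 = 25 shirts
    if kind == "average":
        return sum(values) / len(values)  # average head count over the quarter
    if kind == "ending":
        return values[-1]                 # ending cash balance
    raise ValueError(f"unknown balance kind: {kind}")

print(quarter_balance(monthly, "total"))    # 25
print(quarter_balance(monthly, "average"))  # 8.33...
print(quarter_balance(monthly, "ending"))   # 10
```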
Online Analytical Processing (OLAP)


INTRODUCTION

Online analytical processing (OLAP) is one of the most popular decision support and knowledge discovery techniques in business-intelligence systems. However, it is a challenge to enable OLAP on private data without violating the data owners' privacy. Traditional studies on database security provide access control and data sanitization (e.g., removal of personally identifiable information) tools for the relational back end of OLAP systems. Nonetheless, since an essential feature of OLAP is to compute the multidimensional aggregates of stored data, a major privacy concern in OLAP is the adversarial inference of (individual) private data points from the aggregate information. This inference problem cannot be fully addressed by access control and data sanitization techniques.
In this paper, we address privacy protection against adversarial inference in an OLAP system consisting of a data warehouse server and a number of users. The data warehouse server holds private data, and is supposed to answer OLAP queries issued by users on the multidimensional aggregates of the private data. A user may not have the right to access all individual data points in the data warehouse, but might be allowed to issue OLAP queries on the aggregates of data that it has no right to access. For example, in a hospital system, the accounting department (as a user) can access each patient's financial data, but not the patients' medical records (an actual HIPAA requirement). Nonetheless, the accounting department may query aggregate information related to the medical records, such as the total expense for patients with Alzheimer's disease.

OLAP System Model

We consider an OLAP system where the data warehouse server stores data in an n-dimensional data cube, in order to support aggregate queries on an attribute-of-interest over selected data. We refer to such an attribute-of-interest as the measure attribute. Besides the measure attribute, there are n dimension attributes, each of which is represented by a dimension of the data cube. Each (base) cell of the data cube is the value of the measure attribute for the corresponding value combination of dimension attributes. For example, Table 1 shows a two-dimensional data cube with a measure attribute of sales and two dimension attributes of product and time. Each cell of the data cube in Table 1 is the sales amount (i.e., measure attribute) of a product (e.g., Book) in a month (e.g., April). As in real cases, some cells in the data cube can be missing or not applicable (i.e., N/A in Table 1).
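A minimal Python sketch of this model might represent the cube as a dictionary keyed by dimension values and skip N/A cells when aggregating. The cell values below are illustrative, not those of the paper's Table 1.

```python
# Two-dimensional data cube: measure attribute 'sales', dimension
# attributes 'product' and 'time'. Values are illustrative only.
cube = {
    ("Book", "April"): 120.0,
    ("Book", "May"): 150.0,
    ("Toy", "April"): None,   # None models a missing / N/A cell
    ("Toy", "May"): 80.0,
}

def aggregate(cube, func, product=None, month=None):
    """Aggregate the measure over the selected cells, skipping N/A cells."""
    selected = [v for (p, m), v in cube.items()
                if v is not None
                and (product is None or p == product)
                and (month is None or m == month)]
    return func(selected) if selected else None

print(aggregate(cube, sum, product="Book"))  # total Book sales
print(aggregate(cube, max, month="May"))     # largest sale in May
```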

Privacy Requirements

Due to privacy concerns, the owner of a data cube may not want a user to have access to all the information stored in it. Privacy requirements on a data cube may be defined over the cells in the data cube, and/or the data tuples in its relational back end. In this paper, we choose a cell as the granularity for specification of privacy requirements due to our focus on the processing of OLAP queries. Nonetheless, as evidenced by our privacy measure in Section 3, privacy requirements on tuples can also be transformed to requirements on data cube cells. In our model, a user does not have the right to access all cells in the data cube. We refer to all cells that a user cannot access as sensitive cells for the user.
A user should not be capable of compromising the value of a sensitive cell from the query answers it receives. There are two measures for the compromise of a sensitive cell: exact and partial disclosure. While exact disclosure occurs iff the exact value of a cell is known to a user, partial disclosure occurs whenever a user has a significant change between its prior and posterior belief on the value of a cell. We shall introduce the formal definitions for exact and partial disclosure in Section 3.3.
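The distinction can be illustrated with a toy calculation, assuming two independent cells that are uniform on {0, 1}. This is only an illustration of the two concepts, not the paper's formal definitions.

```python
from fractions import Fraction

# Sensitive cell x and another cell y are independent uniform bits.
# Observing the aggregate x + y may or may not change the belief on x.
def posterior_of_x(observed_sum):
    """P(x = 1 | x + y = observed_sum) for independent uniform bits x, y."""
    outcomes = [(x, y) for x in (0, 1) for y in (0, 1)
                if x + y == observed_sum]
    return Fraction(sum(x for x, _ in outcomes), len(outcomes))

print(posterior_of_x(1))  # 1/2: posterior equals the prior, no disclosure
print(posterior_of_x(2))  # 1: exact disclosure, x must be 1
print(posterior_of_x(0))  # 0: exact disclosure, x must be 0
```

Any observed sum that moves the posterior away from the 1/2 prior without pinning the value down exactly would count as partial disclosure.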

External Knowledge

In practice, besides receiving aggregate query answers, a user may also learn certain information about a sensitive cell from sources outside the OLAP system. We refer to such information as the external knowledge of the user. There are various types of external knowledge which may lead to the disclosure of sensitive cells. For example, if a user knows that the maximum employee salary of a company belongs to Alice, its CEO, then the salary of Alice can be inferred from just one aggregate query on the maximum salary in the data cube. Even a MAX query on the salary of a subset of the employees can be privacy-divulging, as the answer can be used by an adversary as a lower-bound estimate on Alice's salary. It is difficult, if not impossible, to have a universal model that captures all types of external knowledge.

System Settings

Let there be one data warehouse server and a number of users C1, ..., CU in the system. For each user Ck (k ∈ [1, U]), let Gk be the set of cells in the data cube that Ck has the right to access. We refer to Gk as the set of preknown cells of Ck. Since the data warehouse server needs to properly enforce access control on the sensitive cells, we assume that it knows Gk for all k ∈ [1, U] as preknowledge.
To model the processing of OLAP queries in many real systems, we follow an online setting of the inference control problem which assumes that each user issues multiple OLAP queries sequentially, and the data warehouse server has no knowledge of the upcoming queries. In Section 8.2, we shall discuss an extension of our results to cases where a user can submit multiple queries in a batch. We shall show that while all results in the paper still apply to these cases, the simultaneous submission of multiple queries may allow us to further optimize the performance of inference control.

Models of Privacy Intrusion Attacks and Defensive Countermeasures

A user Ck can be an adversary intending to compromise the private information about sensitive cells x (x ∉ Gk) that it has no right to access. In order to do so, Ck may launch an inference attack by inferring private information about x from the preknown cells in Gk as well as the historic query answers in Qk. Throughout this paper, we make a worst-case assumption that each (malicious) user is computationally unbounded in that its inference attack may compromise the maximum possible private information about x (x ∉ Gk) from Qk and Gk irrespective of the computational cost. In the majority of the paper, we follow a common assumption in the literature that the users do not share Gk and Qk with each other. Nevertheless, we shall extend our approach to address the threats from colluding users in Section 8.1.

OUR NEW APPROACH

In this section, we present the basic idea of our information-theoretic approach for inference control. We first introduce the basic workflow of our approach. Then, we define two categories of aggregate functions, namely MIN-like and SUM-like functions, based on their information-theoretic properties. We use examples to illustrate our basic ideas of dealing with these two types of functions, respectively. The detailed algorithms of our approach will be presented in Section 5.

INFERENCE CONTROL ALGORITHM

In this section, we present the detailed algorithm of our information-theoretic inference control approach. In order to help readers better understand the methodology of our proposed approach, we first show the algorithm on a simple case of a two-dimensional data cube in which each cell follows a uniform distribution on {0, 1} and has a constant l(x). After that, we present a generic algorithm for n-dimensional data cubes with cells of arbitrary distribution and heterogeneous l(x). At the end of this section, we shall show that both algorithms satisfy the simulatable auditing model and disclose no private information through query rejections.

Algorithm A: A Simple Two-Dimensional Case

We first consider a simple case with a two-dimensional d1 × d2 data cube, in which each sensitive cell x is uniformly distributed on {0, 1}, and has a constant owner-specified threshold l(x) = l ∈ [0, 1]. Note that the assumptions of uniform distribution and constant l(x) (for all cells) are certainly not realistic models of real-world data. Nonetheless, we would like to remark that the introduction of Algorithm A is not intended to promote its practical usage. Instead, we use it as a simple demonstration of the power of our information-theoretic approach, and a foundation for the introduction of Algorithm B, which is a generic and practical algorithm for n-dimensional data cubes with arbitrary distribution.

Simulatability of Algorithms A and B

We now show that both Algorithms A and B satisfy the simulatable auditing model. Note that either algorithm makes the answer/reject decision for a received query q based on q, the query history Qk, the preknown set Gk, the prior distribution of the cells, and nothing else. Neither algorithm uses the values of sensitive cells or the answer to q as input. As such, all inputs to Algorithms A and B are public information available to the users. Thus, the answer/reject decisions made by the data warehouse server are fully simulatable by the users. By the definition of simulatable auditing, the auditor (i.e., the data warehouse server) is simulatable with Algorithms A and B.
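The key property, that every input to the decision is public, can be sketched with a toy auditor. The disclosure test below is a simplified stand-in for illustration, not Algorithm A or B.

```python
def audit_decision(query_cells, answered_queries, preknown_cells):
    """Toy simulatable auditor for SUM queries: the decision depends only
    on public inputs (the query, the answered-query history, the preknown
    set), never on cell values or the query answer, so users can replay
    every decision and a rejection leaks nothing about the data itself."""
    q = set(query_cells) - set(preknown_cells)
    if len(q) <= 1:  # the query alone would isolate one sensitive cell
        return "reject"
    for answered in answered_queries:
        a = set(answered) - set(preknown_cells)
        if len(q.symmetric_difference(a)) == 1:  # difference isolates a cell
            return "reject"
    return "answer"

history = [("c1", "c2", "c3")]
print(audit_decision(("c1", "c2"), history, preknown_cells=()))  # reject
print(audit_decision(("c4", "c5"), history, preknown_cells=()))  # answer
```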

Processing of Nonskeleton Queries

For ease of understanding, our previous examples and discussions have focused on skeleton queries. Although skeleton queries are the main focus of an OLAP system, a user may also issue nonskeleton queries, such as range queries, each of which covers more than one value (but fewer than all values) of an attribute. Since the cells covered by a nonskeleton query can always be separated into multiple nonoverlapping skeleton queries, an easy method to process a nonskeleton query with a distributive aggregate function (e.g., MIN, MAX, and SUM) is to transform the nonskeleton query into a sequence of skeleton queries, and to answer the nonskeleton query iff all (transformed) skeleton queries can be answered.
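This decomposition can be sketched for a one-dimensional range SUM query. The names are illustrative, and the per-skeleton answerability check is left as a pluggable predicate standing in for the inference control algorithm.

```python
def answer_range_sum(cube, lo, hi, can_answer):
    """Transform a range (nonskeleton) SUM query over years in [lo, hi]
    into one skeleton query per year, answer iff every skeleton query is
    answerable, and combine the answers (valid because SUM is distributive)."""
    skeleton_years = [y for y in cube if lo <= y <= hi]
    if not all(can_answer(y) for y in skeleton_years):
        return None  # reject the whole range query
    return sum(cube[y] for y in skeleton_years)

sales_by_year = {2000: 10.0, 2001: 15.0, 2002: 20.0, 2003: 5.0}
print(answer_range_sum(sales_by_year, 2001, 2002, lambda y: True))       # 35.0
print(answer_range_sum(sales_by_year, 2001, 2002, lambda y: y != 2002))  # None
```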

Query Availability Analysis

In this section, we analyze the performance of Algorithms A and B on Objective O2 in terms of the query availability level. Since our algorithms deal with MIN-like and SUM-like functions separately, we consider the following three cases, respectively: 1) all queries are MIN-like, 2) all queries are SUM-like, and 3) a combination of MIN-like and SUM-like queries.

SIMULATION RESULTS

In this section, we present simulation results on the performance of Algorithms A and B, respectively, and compare the performance of our approach with previous approaches proposed for inference control in OLAP. The previous approaches achieve their best performance in two-dimensional data cubes. Thus, we compare the performance of our approach with the previous ones in the two-dimensional case addressed by Algorithm A.
In particular, for Algorithm A, we conduct the simulation on a 1,000 × 1,000 data cube, each cell of which is chosen uniformly at random from {0, 1}. We consider various numbers of preknown cells ranging from 0 to 100 percent of all cells, and assume that the preknown cells are randomly distributed in the data cube.
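The setup described here can be sketched as follows. A smaller cube is generated so the sketch runs quickly, and the quantities measured in the paper are not reproduced.

```python
import random

def make_cube(d1, d2, preknown_fraction, seed=0):
    """Generate a d1 x d2 data cube with cells drawn uniformly from {0, 1},
    and mark a random fraction of the cells as preknown, mirroring the
    simulation setup described in the text (seeded for reproducibility)."""
    rng = random.Random(seed)
    cells = {(i, j): rng.randint(0, 1) for i in range(d1) for j in range(d2)}
    n_preknown = int(preknown_fraction * d1 * d2)
    preknown = set(rng.sample(sorted(cells), n_preknown))
    return cells, preknown

# 100 x 100 instead of the paper's 1,000 x 1,000, with 10% preknown cells.
cells, preknown = make_cube(100, 100, 0.1)
print(len(cells), len(preknown))
```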