16-02-2013, 03:04 PM
Measuring the Sky: On Computing Data Cubes via Skylining the Measures
Measuring the Sky.pdf (Size: 905.52 KB / Downloads: 47)
INTRODUCTION
In the data warehousing environment, OLAP tools have been
extensively used for a wide range of decision support applications
such as sales analysis, customer analysis, marketing, and
services planning. These OLAP tools are built upon a multidimensional
data model, in which data tuples are partitioned into
different cells based on the values of their dimension attributes.
For each cell of tuples, an aggregate function (e.g., SUM()) is
applied to their measure attributes (e.g., sales) and the resulting
aggregated value contributes to the value of that cell. A data
cuboid is then formed by constituting cells that are created
from the same set of dimension attributes. Each data cuboid
represents a unique view of the underlying data. The collection
of data cuboids, which are based on different combinations of
dimension attributes, forms a data cube.
In traditional data cubes, an aggregate function takes as
input a set of measure values and returns a single numeric
value. In this paper, we extend the notion of data cube with a
new perspective. Specifically, we study the issues of building
data cubes that exploit the skyline operator [4] as the postoperation
instead of the traditional aggregate functions. We
name this type of data cubes as group-by skyline cubes. The
skyline operator has been well recognized as a very important
decision-support operator in recent literature and has started to
be implemented in commercial query engines [7]. Given a set
S of skyline attributes, a tuple t is said to dominate another
tuple t0, denoted by t S t0, if
(9 Ai 2 S; t[Ai] < t0[Ai]) ^ (8 Ai 2 S; t[Ai] t0[Ai]) (1)
assuming that smaller values are preferable over larger ones.
Here, we use t[Ai] to represent the value of the attribute Ai
of the tuple t. Given a set D of tuples, the skyline operation
The research was supported by grant PolyU 525009E from Hong Kong RGC.
M. L. Yiu, E. Lo and D. Yung are with the Department of Computing, Hong
Kong Polytechnic University, Hong Kong.
E-mail: fcsmlyiu, ericlo, cskwyungg[at]comp.polyu.edu.hk
on D is defined as:
(D; S) = ft 2 D j @t0 2 D; t0 S tg (2)
In other words, a tuple t belongs to the skyline result set if
no other tuple dominates it.
We observe that a group-by skyline cube has two distinguished
features when compared to a traditional data cube.
First, a cell of a group-by skyline cube stores a set of
skyline tuples rather than a single numeric aggregate value.
Second, since the skyline operation can be applied to multiple
measure attributes, a group-by skyline cube includes not only
a set of usual grouping dimensions, but also a set of skyline
dimensions, in which the latter is defined on the measure
attributes.
Name Region Position Mentality Attitude Flexibility
a Asia Admin 5 5 6
b Asia Admin 4 8 7
c Asia Technical 3 3 4
d Asia Technical 6 6 6
e Europe Technical 2 4 8
f Europe Technical 4 4 3
g Europe Technical 7 7 4
Fig. 1. Employee Database D
We illustrate the concept of group-by skyline cube by the
example in Figure 1, which is a table D of employee records.
Each tuple represents an employee, with the employee’s
‘Region’ and ‘Position’, as well as his/her evaluated scores
‘Mentality’, ‘Attitude’, and ‘Flexibility’. Suppose that lower
scores are preferable over higher ones. If we want to find the
outstanding employees of each position, the table in Figure 2a
can be used to answer the query directly. That table is called
a group-by skyline cuboid with grouping dimension set G =
f‘Position’g and skyline dimension set S = f‘Mentality’,
‘Attitude’, ‘Flexibility’g. That essentially means that the employee
in D are first grouped based on their positions and
then the skyline operation is applied to each group to find
those who are not dominated by the others on all three scores.
G=fPositiong, S=fMentality, Attitude, Flexibilityg
Position Name Mentality Attitude Flexibility
(a) Group-by skyline cuboid C1
G=fRegion, Positiong, S=fMentality, Attitudeg
Region Position Name Mentality Attitude
Asia Admin a 5 5
b 4 8
Asia Technical c 3 3
Europe Technical e 2 4
(b) Group-by skyline cuboid C2
G=fRegion, Positiong, S=fMentalityg
Region Position Name Attitude
Asia Admin a 5
Asia Technical c 3
Europe Technical e 4
f 4
© Group-by skyline cuboid C3
Group-by skyline cuboids
Figures 2b and 2c show two more examples of group-by
skyline cuboids for the employee dataset. Cuboid C2 groups
the employees based on their regions and positions and the
skyline operation is applied to their ‘Mentality’ and ‘Attitude’
attributes. Cuboid C3 essentially partitions the employees in
the same way as C2 but the skyline operation is applied
to only the attribute ‘Attitude’. Hence, for the dataset D
in Figure 1, its corresponding group-by skyline cube is a
collection of all cuboids, whereas each such cuboid Ci(Gi; Si)
has its grouping dimension set Gi as a non-empty subset of
f‘Region’, ‘Position’g and its skyline dimension set Si as a
non-empty subset of f‘Mentality’, ‘Attitude’, ‘Flexibility’g.
We find group-by skyline cube useful for other data analysis
applications as well. One can extend a supermarket’s transactional
data warehouse with group-by skyline cube so that
attributes like ‘Location’, ‘Time’ and ‘Product’ still serve as
the grouping dimensions and the set of measure attributes
like ‘Sales’ and ‘Number of visitors’ serve as the skyline
dimensions. For hotel analysis, we can build a group-by
skyline cube with respect to all dimensions such as ‘Star’ (e.g.,
5-star, 4-star) and ‘City’ and all measure attributes such as
‘Room price’, ‘Room quality’, ‘Transportation convenience’
and ‘Entertainment quality’.
Using the skyline operation as an “aggregate function” in
data cubes and the notion of group-by skyline cubes is a
completely new concept. First, while the skyline operator and
its variants have been extensively studied (e.g., [23], [10]),
none of them has ever discussed the incorporation of the
skyline operation into traditional data warehouses as a postoperation.
Second, although the concept of skycubes [36], [26]
exists, that is fundamentally different from our concept of
group-by skyline cube. Skycubes are designed for the efficient
evaluation of skyline queries in any non-empty subspaces.
In other words, that does not consider any grouping at all.
In practice, however, skyline analysis is often to be more
meaningful if the data tuples are first grouped based on
their dimensions before finding their skylines. Let’s take the
employee dataset in Figure 1 as an example. Without grouping,
finding the skyline of all employee, no matter whether it is
in the full space (i.e., all three measure attributes) or in any
sub-space (e.g., ‘Mentality’ and ‘Attitude’), may not be that
meaningful because it is unfair to compare an Admin staff
with a Technical staff. In contrast, building a group-by skyline
cube for that dataset allows the management to find out the
outstanding employee of each region and/or position. Consider
the NBA player statistics as another example. Obviously,
finding the skyline, despite that is full-space skyline or subspace
skyline, of all NBA players is not that meaningful
because it is unfair to compare Michael Jordon with Wilt
Chamberlain, which were active in NBA in different periods
and played different positions. In contrast, we can analyze
the NBA data using a group-by skyline cube, treating all
dimensions such as ‘Year’, ‘Team’ and ‘Position’ as the set
of all possible grouping dimensions and treating all measure
attributes such as ‘Points’, ‘Assists’ and ‘Rebounds’ as the set
of all possible skyline dimensions.
Implementing the concept of group-by skyline cube in
today’s data warehouses poses several technical challenges.
First, the skyline operation is much more computational intensive
[4], [7] when compared to simple aggregations such
as COUNT(), SUM() and AVG(). Consequently, building or
querying group-by skyline cubes needs to consider not only
the I/O cost, but also the computation (CPU) cost. Second, the
skyline operation is holistic in nature as the skyline of a cuboid
is not necessarily derivable from another cuboid. For example,
consider cuboids C2 and C3 in Figure 2, although they share the
same set of grouping dimensions and the skyline dimension
set of C3 is a subset of that of C2, their skyline results are
not derivable from each other. Specifically, we can see that
C2 is not derivable from C3 because the skyline employee
b in C2 does not exist in C3. Moreover, C3 is not derivable
from C2 because the skyline employee f in C3 does not exist
in C2 as well. Third, the dimensions that define a group-by
skyline cube include not only the usual grouping dimensions
but also the measure attributes. For a group-by skyline cube
that is built on a dataset with jGj grouping dimensions and
jSj measure attributes, there would be (2jGj