Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
Abstract
Preparing a data set for analysis is generally the
most time consuming task in a data mining project, requiring
many complex SQL queries, joining tables and aggregating
columns. Existing SQL aggregations have limitations to prepare
data sets because they return one column per aggregated
group. In general, a significant manual effort is required to
build data sets, where a horizontal layout is required. We
propose simple, yet powerful, methods to generate SQL code
to return aggregated columns in a horizontal tabular layout,
returning a set of numbers instead of one number per row.
This new class of functions is called horizontal aggregations.
Horizontal aggregations build data sets with a horizontal
denormalized layout (e.g. point-dimension, observation-variable,
instance-feature), which is the standard layout required by
most data mining algorithms. We propose three fundamental
methods to evaluate horizontal aggregations: CASE: Exploiting
the programming CASE construct; SPJ: Based on standard
relational algebra operators (SPJ queries); PIVOT: Using the
PIVOT operator, which is offered by some DBMSs. Experiments
with large tables compare the proposed query evaluation
methods. Our CASE method has similar speed to the PIVOT
operator and it is much faster than the SPJ method. In general,
the CASE and PIVOT methods exhibit linear scalability, whereas
the SPJ method does not.
Index terms: aggregation; data preparation; pivoting; SQL
I. INTRODUCTION
In a relational database, especially with normalized tables,
a significant effort is required to prepare a summary
data set [16] that can be used as input for a data mining
or statistical algorithm [17], [15]. Most algorithms require
as input a data set with a horizontal layout, with several
records and one variable or dimension per column. That is
the case with models like clustering, classification, regression
and PCA; consult [10], [15]. Each research discipline uses
different terminology to describe the data set. In data mining
the common terms are point-dimension. Statistics literature
generally uses observation-variable. Machine learning research
uses instance-feature. This article introduces a new class of
aggregate functions that can be used to build data sets in a
horizontal layout (denormalized with aggregations), automating
SQL query writing and extending SQL capabilities. We
show evaluating horizontal aggregations is a challenging and
interesting problem and we introduce alternative methods and
optimizations for their efficient evaluation.
A. Motivation
As mentioned above, building a suitable data set for data
mining purposes is a time-consuming task. This task generally
requires writing long SQL statements or customizing SQL
code if it is automatically generated by some tool. There are
two main ingredients in such SQL code: joins and aggregations
[16]; we focus on the second one. The most widely-known
aggregation is the sum of a column over groups of rows. Some
other aggregations return the average, maximum, minimum or
row count over groups of rows. There exist many aggregation
functions and operators in SQL. Unfortunately, all these aggregations
have limitations to build data sets for data mining
purposes. The main reason is that, in general, data sets that
are stored in a relational database (or a data warehouse) come
from On-Line Transaction Processing (OLTP) systems where
database schemas are highly normalized. But data mining,
statistical or machine learning algorithms generally require aggregated
data in summarized form. Based on current available
functions and clauses in SQL, a significant effort is required
to compute aggregations when they are desired in a cross-tabular
(horizontal) form, suitable to be used by a data mining
algorithm. Such effort is due to the amount and complexity of
SQL code that needs to be written, optimized and tested. There
are further practical reasons to return aggregation results in
a horizontal (cross-tabular) layout. Standard aggregations are
hard to interpret when there are many result rows, especially
when grouping attributes have high cardinalities. To perform
analysis of exported tables into spreadsheets it may be more
convenient to have aggregations on the same group in one row
(e.g. to produce graphs or to compare data sets with repetitive
information). OLAP tools generate SQL code to transpose
results (sometimes called PIVOT [5]). Transposition can be
more efficient if there are mechanisms combining aggregation
and transposition together.
With such limitations in mind, we propose a new class
of aggregate functions that aggregate numeric expressions
and transpose results to produce a data set with a horizontal
layout. Functions belonging to this class are called horizontal
aggregations. Horizontal aggregations represent an extended
form of traditional SQL aggregations, which return a set of
values in a horizontal layout (somewhat similar to a multidimensional
vector), instead of a single value per row. This
article explains how to evaluate and optimize horizontal aggregations
generating standard SQL code.
B. Advantages
Our proposed horizontal aggregations provide several
unique features and advantages. First, they represent a template
Digital Object Identifier 10.1109/TKDE.2011.16 1041-4347/11/$26.00 © 2011 IEEE
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Table F:
 K  D1  D2  A
 1  3   X   9
 2  2   Y   6
 3  1   Y   10
 4  1   Y   0
 5  2   X   1
 6  1   X   null
 7  3   X   8
 8  2   X   7

Table FV:
 D1  D2  A
 1   X   null
 1   Y   10
 2   X   8
 2   Y   6
 3   X   17

Table FH:
 D1  D2X   D2Y
 1   null  10
 2   8     6
 3   17    null

Fig. 1. Example of F, FV and FH.
to generate SQL code from a data mining tool. Such SQL code
automates writing SQL queries, optimizing them and testing
them for correctness. This SQL code reduces manual work in
the data preparation phase in a data mining project. Second,
since the SQL code is automatically generated, it is likely to be
more efficient than SQL code written by an end user, for
instance a person who does not know SQL well or someone
who is not familiar with the database schema (e.g. a data
mining practitioner). Therefore, data sets can be created in
less time. Third, the data set can be created entirely inside
the DBMS. In modern database environments it is common
to export denormalized data sets to be further cleaned and
transformed outside a DBMS in external tools (e.g. statistical
packages). Unfortunately, exporting large tables outside a
DBMS is slow, creates inconsistent copies of the same data and
compromises database security. Therefore, we provide a more
efficient, better integrated and more secure solution compared
to external data mining tools. Horizontal aggregations just
require a small syntax extension to aggregate functions called
in a SELECT statement. Alternatively, horizontal aggregations
can be used to generate SQL code from a data mining tool to
build data sets for data mining analysis.
C. Article Organization
This article is organized as follows. Section II introduces
definitions and examples. Section III introduces horizontal
aggregations. We propose three methods to evaluate horizontal
aggregations using existing SQL constructs, we prove the
three methods produce the same result and we analyze time
complexity and I/O cost. Section IV presents experiments
comparing evaluation methods, evaluating the impact of optimizations,
assessing scalability and understanding I/O cost
with large tables. Related work is discussed in Section V, comparing
our proposal with existing approaches and positioning
our work within data preparation and OLAP query evaluation.
Section VI gives conclusions and directions for future work.
II. DEFINITIONS
This section defines the table that will be used to explain
SQL queries throughout this work. In order to present
definitions and concepts in an intuitive manner, we present
our definitions in OLAP terms. Let F be a table having a
simple primary key K represented by an integer, p discrete
attributes and one numeric attribute: F(K,D1, . . . , Dp,A).
Our definitions can be easily generalized to multiple numeric
attributes. In OLAP terms, F is a fact table with one column
used as primary key, p dimensions and one measure column
passed to standard SQL aggregations. That is, table F will
be manipulated as a cube with p dimensions [9]. Subsets
of dimension columns are used to group rows to aggregate
the measure column. F is assumed to have a star schema to
simplify exposition. Column K will not be used to compute
aggregations. Dimension lookup tables will be based on simple
foreign keys. That is, one dimension column Dj will be a
foreign key linked to a lookup table that has Dj as primary
key. Input table F size is called N (not to be confused with
n, the size of the answer set). That is, |F| = N. Table F
represents a temporary table or a view based on a “star join”
query on several tables.
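A minimal sketch of this schema in SQL may make the definitions concrete. The script below (using Python's sqlite3 module; the lookup table dim1 and its descr column are hypothetical names chosen for illustration) declares F with a simple integer primary key K, p = 2 dimension columns, one measure column A, and a foreign key from D1 to its dimension lookup table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One dimension lookup table per dimension column, keyed by Dj.
conn.execute("CREATE TABLE dim1 (D1 INTEGER PRIMARY KEY, descr TEXT)")
# Fact table F(K, D1, ..., Dp, A) with p = 2: integer primary key K
# (never used to compute aggregations), p dimension columns,
# and one measure column A passed to aggregate functions.
conn.execute("""CREATE TABLE F (
    K  INTEGER PRIMARY KEY,
    D1 INTEGER REFERENCES dim1(D1),
    D2 TEXT,
    A  INTEGER)""")
# |F| = N; the table is empty at this point, so N = 0.
N = conn.execute("SELECT count(*) FROM F").fetchone()[0]
```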
We now explain tables FV (vertical) and FH (horizontal)
that are used throughout the article. Consider a standard SQL
aggregation (e.g. sum()) with the GROUP BY clause, which
returns results in a vertical layout. Assume there are j + k
GROUP BY columns and the aggregated attribute is A. The
results are stored on table FV having j + k columns making
up the primary key and A as a non-key attribute. Table FV
has a vertical layout. The goal of a horizontal aggregation
is to transform FV into a table FH with a horizontal layout
having n rows and j+d columns, where each of the d columns
represents a unique combination of the k grouping columns.
Table FV may be more efficient than FH to handle sparse
matrices (having many zeroes), but some DBMSs like SQL
Server [2] can handle sparse columns in a horizontal layout.
The n rows represent records for analysis and the d columns
represent dimensions or features for analysis. Therefore, n is
data set size and d is dimensionality. In other words, each
aggregated column represents a numeric variable as defined
in statistics research or a numeric feature as typically defined
in machine learning research.
A. Examples
Figure 1 gives an example showing the input table F, a
traditional vertical sum() aggregation stored in FV and a horizontal
aggregation stored in FH. The basic SQL aggregation
query is:
SELECT D1,D2,sum(A)
FROM F
GROUP BY D1,D2
ORDER BY D1,D2;
Notice table FV has only five rows because D1=3 and
D2=Y do not appear together. Also, the first row in FV has null
in A following SQL evaluation semantics. On the other hand,
table FH has three rows and two (d = 2) non-key columns,
effectively storing six aggregated values. In FH it is necessary
to populate the last row with null. Therefore, nulls may come
from F or may be introduced by the horizontal layout.
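The Figure 1 example can be reproduced end to end with standard SQL. The sketch below uses Python's sqlite3 module and an in-memory database (illustrative choices, not part of the article's proposal): FV is computed with a plain GROUP BY, and FH with CASE expressions, previewing the CASE evaluation method of Section III.

```python
import sqlite3

# Build the example fact table F from Figure 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INTEGER, D2 TEXT, A INTEGER)")
rows = [(1, 3, 'X', 9), (2, 2, 'Y', 6), (3, 1, 'Y', 10), (4, 1, 'Y', 0),
        (5, 2, 'X', 1), (6, 1, 'X', None), (7, 3, 'X', 8), (8, 2, 'X', 7)]
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", rows)

# FV: standard vertical aggregation, one row per (D1, D2) group.
fv = conn.execute("""
    SELECT D1, D2, sum(A)
    FROM F
    GROUP BY D1, D2
    ORDER BY D1, D2""").fetchall()

# FH: horizontal aggregation via CASE, one result column per
# distinct value of D2; sum() over an all-null group yields null.
fh = conn.execute("""
    SELECT D1,
           sum(CASE WHEN D2 = 'X' THEN A END) AS D2X,
           sum(CASE WHEN D2 = 'Y' THEN A END) AS D2Y
    FROM F
    GROUP BY D1
    ORDER BY D1""").fetchall()
```

Note how the null in FH for D1=3 is introduced by the horizontal layout (the group D1=3, D2=Y is empty), whereas the null for D1=1, D2=X comes from F itself.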
We now give other examples with a store (retail) database
that requires data mining analysis. To give examples of F, we
will use a table transactionLine that represents the transaction
table from a store. Table transactionLine has dimensions
grouped in three taxonomies (product hierarchy, location,
time), used to group rows, and three measures represented
by itemQty, costAmt and salesAmt, to pass as arguments
to aggregate functions.
We want to compute queries like "summarize sales for each
store by each day of the week" or "compute the total number of
items sold by department for each store". These queries can
be answered with standard SQL, but additional code needs to
be written or generated to return results in tabular (horizontal)
form. Consider the following two queries.
SELECT storeId,dayofweekNo,sum(salesAmt)
FROM transactionLine
GROUP BY storeId,dayofweekNo
ORDER BY storeId,dayofweekNo;
SELECT storeId,deptId,sum(itemQty)
FROM transactionLine
GROUP BY storeId,deptId
ORDER BY storeId,deptId;
Assume there are 200 stores, 30 store departments and
stores are open 7 days a week. The first query returns 1,400
rows, which may be time-consuming to compare across
days of the week to detect trends. The second query
returns 6,000 rows, which similarly makes it difficult
to compare store performance across departments. Even
further, if we want to build a data mining model by store
(e.g. clustering, regression), most algorithms require store id
as primary key and the remaining aggregated columns as
non-key columns. That is, data mining algorithms expect a
horizontal layout. In addition, a horizontal layout is generally
more I/O efficient than a vertical layout for analysis. Notice
these queries have ORDER BY clauses to make output easier
to understand, but such order is irrelevant for data mining
algorithms. In general, we omit ORDER BY clauses.
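To illustrate, the first store query can be transposed with one CASE term per day of the week, yielding one row per store with seven aggregated columns. The tiny transactionLine sample below is synthetic (the paper only assumes 200 stores open 7 days a week), and the CASE rewriting previews the evaluation method of Section III.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactionLine
    (storeId INTEGER, dayofweekNo INTEGER, salesAmt REAL)""")
# Tiny synthetic sample for illustration only.
conn.executemany("INSERT INTO transactionLine VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (1, 1, 50.0), (1, 2, 80.0),
                  (2, 1, 30.0), (2, 3, 70.0)])

# One CASE term per day of the week: the vertical per-store-per-day
# result becomes one row per store with 7 aggregated columns.
case_terms = ",\n       ".join(
    f"sum(CASE WHEN dayofweekNo = {d} THEN salesAmt END) AS day{d}"
    for d in range(1, 8))
query = (f"SELECT storeId,\n       {case_terms}\n"
         "FROM transactionLine\nGROUP BY storeId\nORDER BY storeId")
result = conn.execute(query).fetchall()
```

With 200 real stores this query would return 200 rows of 8 columns, one row per store, instead of the 1,400-row vertical result.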
B. Typical Data Mining Problems
Let us consider data mining problems that may be solved
by typical data mining or statistical algorithms, which assume
each non-key column represents a dimension, variable (statistics)
or feature (machine learning). Stores can be clustered
based on sales for each day of the week. On the other hand,
we can predict sales per store department based on the sales
in other departments using decision trees or regression. PCA
analysis on department sales can reveal which departments
tend to sell together. We can find out potential correlation
of number of employees by gender within each department.
Most data mining algorithms (e.g. clustering, decision trees,
regression, correlation analysis) require result tables from
these queries to be transformed into a horizontal layout. We
must mention there exist data mining algorithms that can
directly analyze data sets having a vertical layout (e.g. in
transaction format) [14], but they require reprogramming the
algorithm to have a better I/O pattern and they are efficient
only when there are many zero values (i.e. sparse matrices).
III. HORIZONTAL AGGREGATIONS
We introduce a new class of aggregations that have similar
behavior to SQL standard aggregations, but which produce
tables with a horizontal layout. In contrast, we call standard
SQL aggregations vertical aggregations since they produce
tables with a vertical layout. Horizontal aggregations just
require a small syntax extension to aggregate functions called
in a SELECT statement. Alternatively, horizontal aggregations
can be used to generate SQL code from a data mining tool to
build data sets for data mining analysis. We start by explaining
how to automatically generate SQL code.
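As a minimal sketch of such code generation (the helper gen_horizontal_agg and its parameters are hypothetical; the article defines its own syntax extension), a generator can read the distinct values of the column to transpose at generation time and emit one CASE term per value:

```python
import sqlite3

def gen_horizontal_agg(conn, table, group_cols, pivot_col, agg_col, func="sum"):
    # Read the distinct values of the transposed column at code
    # generation time; each value becomes one result column.
    values = [r[0] for r in conn.execute(
        f"SELECT DISTINCT {pivot_col} FROM {table} "
        f"WHERE {pivot_col} IS NOT NULL ORDER BY {pivot_col}")]
    terms = ",\n       ".join(
        f"{func}(CASE WHEN {pivot_col} = '{v}' THEN {agg_col} END) "
        f"AS {pivot_col}{v}" for v in values)
    groups = ", ".join(group_cols)
    return (f"SELECT {groups},\n       {terms}\n"
            f"FROM {table}\nGROUP BY {groups}\nORDER BY {groups}")

# Demonstration on the Figure 1 table F.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INTEGER, D2 TEXT, A INTEGER)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)",
                 [(1, 3, 'X', 9), (2, 2, 'Y', 6), (3, 1, 'Y', 10), (4, 1, 'Y', 0),
                  (5, 2, 'X', 1), (6, 1, 'X', None), (7, 3, 'X', 8), (8, 2, 'X', 7)])
sql = gen_horizontal_agg(conn, "F", ["D1"], "D2", "A")
fh = conn.execute(sql).fetchall()
```

Running the generated query reproduces table FH of Figure 1, with result columns named after the distinct values of D2.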