24-01-2013, 01:01 PM
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
Abstract
Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many
complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets
because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a
horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a
horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal
aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable,
instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental
methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ: Based on standard relational
algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables
compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster
than the SPJ method.
INTRODUCTION
In a relational database, especially with normalized tables,
a significant effort is required to prepare a summary data
set [16] that can be used as input for a data mining or
statistical algorithm [17], [15]. Most algorithms require as
input a data set with a horizontal layout, with several
records and one variable or dimension per column. That is
the case with models like clustering, classification, regression,
and PCA; consult [10], [15]. Each research discipline
uses different terminology to describe the data set. In data
mining the common terms are point-dimension. Statistics
literature generally uses observation-variable. Machine
learning research uses instance-feature. This paper introduces
a new class of aggregate functions that can be used to
build data sets in a horizontal layout (denormalized with
aggregations), automating SQL query writing and extending
SQL capabilities. We show that evaluating horizontal
aggregations is a challenging and interesting problem, and
we introduce alternative methods and optimizations for
their efficient evaluation.
Motivation
As mentioned above, building a suitable data set for data
mining purposes is a time-consuming task. This task
generally requires writing long SQL statements or customizing
SQL code if it is automatically generated by some
tool. There are two main ingredients in such SQL code: joins
and aggregations [16]; we focus on the second one. The
most widely known aggregation is the sum of a column
over groups of rows.
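As a minimal sketch of this standard (vertical) aggregation, the snippet below runs a SUM over groups of rows with Python's sqlite3 module. The fact table F(K, D1, D2, A) and its column names are illustrative, not taken from the paper: K is a row id, D1 and D2 are grouping dimensions, and A is the measure to aggregate. Note that the result has one row per group, which is exactly the vertical layout the paper argues is unsuitable as direct input for data mining algorithms.

```python
import sqlite3

# Hypothetical fact table F(K, D1, D2, A); names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F (K INTEGER, D1 TEXT, D2 TEXT, A REAL)")
cur.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", [
    (1, "x1", "a", 10.0),
    (2, "x1", "b", 20.0),
    (3, "x2", "a", 40.0),
    (4, "x2", "a", 50.0),
])

# Standard (vertical) aggregation: one output ROW per (D1, D2) group.
rows = cur.execute(
    "SELECT D1, D2, SUM(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2"
).fetchall()
print(rows)  # [('x1', 'a', 10.0), ('x1', 'b', 20.0), ('x2', 'a', 90.0)]
```

A data mining tool would instead want one row per D1 value, with one column per distinct D2 value, which is the horizontal layout the proposed aggregations produce.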
Advantages
Our proposed horizontal aggregations provide several
unique features and advantages. First, they represent a
template to generate SQL code from a data mining tool. Such
SQL code automates writing SQL queries, optimizing them,
and testing them for correctness. This SQL code reduces
manual work in the data preparation phase in a data mining
project. Second, since the SQL code is automatically generated, it
is likely to be more efficient than SQL code written by an end
user, for instance a person who does not know SQL well or
who is not familiar with the database schema (e.g.,
a data mining practitioner). Therefore, data sets can be
created in less time. Third, the data set can be created entirely
inside the DBMS. In modern database environments, it is
common to export denormalized data sets to be further
cleaned and transformed outside a DBMS in external tools
(e.g., statistical packages). Unfortunately, exporting large
tables outside a DBMS is slow, creates inconsistent copies of
the same data and compromises database security.
Paper Organization
This paper is organized as follows: Section 2 introduces
definitions and examples. Section 3 introduces horizontal
aggregations. We propose three methods to evaluate
horizontal aggregations using existing SQL constructs, we
prove the three methods produce the same result and
we analyze time complexity and I/O cost. Section 4
presents experiments comparing evaluation methods, evaluating
the impact of optimizations, assessing scalability,
and understanding I/O cost with large tables. Related work
is discussed in Section 5, comparing our proposal with
existing approaches and positioning our work within data
preparation and OLAP query evaluation. Section 6 gives
conclusions and directions for future work.
HORIZONTAL AGGREGATIONS
We introduce a new class of aggregations that have similar
behavior to SQL standard aggregations, but which produce
tables with a horizontal layout. In contrast, we call standard
SQL aggregations vertical aggregations since they produce
tables with a vertical layout. Horizontal aggregations just
require a small syntax extension to aggregate functions
called in a SELECT statement. Alternatively, horizontal
aggregations can be used to generate SQL code from a data
mining tool to build data sets for data mining analysis. We
start by explaining how to automatically generate SQL code.
SQL Code Generation: Query Evaluation Methods
We propose three methods to evaluate horizontal aggregations.
The first method relies only on relational operations.
That is, only doing select, project, join, and aggregation
queries; we call it the SPJ method. The second form relies on
the SQL “case” construct; we call it the CASE method. Each
table has an index on its primary key for efficient join
processing. We do not consider additional indexing
mechanisms to accelerate query evaluation. The third
method uses the built-in PIVOT operator, which transforms
rows to columns (e.g., transposing). Figs. 2 and 3 show an
overview of the main steps to be explained below (for a
sum() aggregation).
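A rough sketch of the CASE method is shown below, again using Python's sqlite3 and the same hypothetical fact table F(K, D1, D2, A) (table and column names are illustrative, not the paper's). The generator first collects the distinct values of the transposing column D2 and then emits one SUM(CASE ...) term per value, so that each distinct D2 value becomes an output column. In production code the interpolated values would need proper quoting/escaping; this is omitted here for brevity.

```python
import sqlite3

# Hypothetical fact table F(K, D1, D2, A); names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F (K INTEGER, D1 TEXT, D2 TEXT, A REAL)")
cur.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", [
    (1, "x1", "a", 10.0),
    (2, "x1", "b", 20.0),
    (3, "x2", "a", 40.0),
    (4, "x2", "a", 50.0),
])

# Step 1: distinct values of the transposing column D2;
# each one becomes an output column.
d2_values = [r[0] for r in cur.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]

# Step 2: generate one SUM(CASE ...) term per distinct D2 value.
# (Values are interpolated directly for brevity; real generated SQL
# should escape them.)
terms = ", ".join(
    f"SUM(CASE WHEN D2 = '{v}' THEN A END) AS A_{v}" for v in d2_values
)
sql = f"SELECT D1, {terms} FROM F GROUP BY D1 ORDER BY D1"

rows = cur.execute(sql).fetchall()
print(rows)  # [('x1', 10.0, 20.0), ('x2', 90.0, None)]
```

Note that the missing (x2, b) combination comes out as NULL (None), since the CASE expression yields NULL for non-matching rows and SUM ignores them; groups with no matching rows at all therefore produce a NULL cell in the horizontal result.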
Query Optimizations
Table 3 analyzes our first query optimization, applied to the
three methods. Our goal is to assess the acceleration
obtained by precomputing a cube and storing it in FV.
We can see this optimization uniformly accelerates all
methods, but the gain depends on the method: for SPJ the
optimization is best for small n, for PIVOT it is best for
large n, and for CASE the improvement is less dramatic
across all n. It is noteworthy that PIVOT is accelerated by
our optimization, despite the fact that it is handled by the
query optimizer. Since this optimization produces significant
acceleration for all three methods (at least 2x faster), we
will use it by default. Notice that precomputing FV takes
the same time for each method, so the comparisons are fair.
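The precomputation step can be sketched as follows: the vertical aggregation is materialized once into FV, and the horizontal query then reads the (much smaller) FV instead of the fact table. The snippet reuses the hypothetical fact table F(K, D1, D2, A) from above; all names are illustrative, and the hard-coded CASE terms stand in for generated SQL.

```python
import sqlite3

# Hypothetical fact table F(K, D1, D2, A); names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F (K INTEGER, D1 TEXT, D2 TEXT, A REAL)")
cur.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", [
    (1, "x1", "a", 10.0),
    (2, "x1", "b", 20.0),
    (3, "x2", "a", 40.0),
    (4, "x2", "a", 50.0),
])

# Optimization: materialize the vertical aggregation once in FV.
# Every evaluation method then scans FV instead of the large fact table.
cur.execute("""
    CREATE TABLE FV AS
    SELECT D1, D2, SUM(A) AS A FROM F GROUP BY D1, D2
""")

# The CASE method now runs against FV; each CASE term picks the
# already-aggregated value for its D2 column.
rows = cur.execute("""
    SELECT D1,
           SUM(CASE WHEN D2 = 'a' THEN A END) AS A_a,
           SUM(CASE WHEN D2 = 'b' THEN A END) AS A_b
    FROM FV GROUP BY D1 ORDER BY D1
""").fetchall()
print(rows)  # [('x1', 10.0, 20.0), ('x2', 90.0, None)]
```

Since SUM is distributive, aggregating the pre-aggregated FV gives the same result as aggregating F directly, which is why the optimization preserves correctness for all three methods.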
Time Complexity
We now verify the time complexity analysis given in
Section 3.7. We plot time complexity varying one
parameter and keeping the remaining parameters fixed. In
these experiments, we generated synthetic data sets similar
to the fact table of TPC-H of different sizes with grouping
columns of varying selectivities (number of distinct values).
We consider two basic probability distributions of values:
uniform (unskewed) and Zipf (skewed). The uniform
distribution is used by default.
Fig. 4 shows the impact of increasing the size of the fact
table (N). The left plot analyzes a small FH table (n = 1k),
whereas the right plot presents a much larger FH table
(n = 128k). Recall from Section 3.7 that time is O(N log2 N) + N
when N grows and the other sizes (n, d) are fixed (the first
term corresponds to SELECT DISTINCT and the second one
to the method). The trend indicates all evaluation methods
are indeed impacted by N. Times tend to be linear as N
grows for the three methods, which shows the method
queries have more weight than getting the distinct dimension
combinations. The PIVOT and CASE methods show
very similar performance.
CONCLUSIONS
We introduced a new class of extended aggregate functions,
called horizontal aggregations, which help prepare data
sets for data mining and OLAP cube exploration. Specifically,
horizontal aggregations are useful to create data sets
with a horizontal layout, as commonly required by data
mining algorithms and OLAP cross-tabulation. Basically, a
horizontal aggregation returns a set of numbers instead of a
single number for each group, resembling a multidimensional
vector. We proposed an abstract, but minimal,
extension to SQL standard aggregate functions to compute
horizontal aggregations, which just requires specifying
subgrouping columns inside the aggregation function call.
From a query optimization perspective, we proposed three
query evaluation methods. The first one (SPJ) relies on
standard relational operators. The second one (CASE) relies
on the SQL CASE construct. The third (PIVOT) uses a built-in
operator in a commercial DBMS that is not widely
available.