13-09-2014, 01:55 PM
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis Project Report
Horizontal Aggregations.pdf (Size: 265.16 KB / Downloads: 19)
Abstract
Preparing a data set for analysis is generally the
most time consuming task in a data mining project, requiring
many complex SQL queries, joining tables and aggregating
columns. Existing SQL aggregations have limitations to prepare
data sets because they return one column per aggregated
group. In general, a significant manual effort is required to
build data sets, where a horizontal layout is required. We
propose simple, yet powerful, methods to generate SQL code
to return aggregated columns in a horizontal tabular layout,
returning a set of numbers instead of one number per row.
This new class of functions is called horizontal aggregations.
Horizontal aggregations build data sets with a horizontal
denormalized layout (e.g. point-dimension, observation-variable,
instance-feature), which is the standard layout required by
most data mining algorithms. We propose three fundamental
methods to evaluate horizontal aggregations: CASE: Exploiting
the programming CASE construct; SPJ: Based on standard
relational algebra operators (SPJ queries); PIVOT: Using the
PIVOT operator, which is offered by some DBMSs. Experiments
with large tables compare the proposed query evaluation
methods. Our CASE method has similar speed to the PIVOT
operator and it is much faster than the SPJ method. In general,
the CASE and PIVOT methods exhibit linear scalability, whereas
the SPJ method does not.
Advantages
Our proposed horizontal aggregations provide several
unique features and advantages. First, they represent a template
to generate SQL code from a data mining tool. Such SQL code
automates writing SQL queries, optimizing them and testing
them for correctness. This SQL code reduces manual work in
the data preparation phase in a data mining project. Second,
since SQL code is automatically generated it is likely to be
more efficient than SQL code written by an end user. For
instance, a person who does not know SQL well or someone
who is not familiar with the database schema (e.g. a data
mining practitioner). Therefore, data sets can be created in
less time. Third, the data set can be created entirely inside
the DBMS. In modern database environments it is common
to export denormalized data sets to be further cleaned and
transformed outside a DBMS in external tools (e.g. statistical
packages). Unfortunately, exporting large tables outside a
DBMS is slow, creates inconsistent copies of the same data and
compromises database security. Therefore, we provide a more
efficient, better integrated and more secure solution compared
to external data mining tools. Horizontal aggregations just
require a small syntax extension to aggregate functions called
in a SELECT statement. Alternatively, horizontal aggregations
can be used to generate SQL code from a data mining tool to
build data sets for data mining analysis
HORIZONTAL AGGREGATIONS
We introduce a new class of aggregations that have similar
behavior to SQL standard aggregations, but which produce
tables with a horizontal layout. In contrast, we call standard
SQL aggregations vertical aggregations since they produce
tables with a vertical layout. Horizontal aggregations just
require a small syntax extension to aggregate functions called
in a SELECT statement. Alternatively, horizontal aggregations
can be used to generate SQL code from a data mining tool to
build data sets for data mining analysis. We start by explaining
how to automatically generate SQL code
Proposed Syntax in Extended SQL
We now turn our attention to a small syntax extension to the
SELECT statement, which allows understanding our proposal
in an intuitive manner. We must point out the proposed
extension represents non-standard SQL because the columns
in the output table are not known when the query is parsed. We
assume F does not change while a horizontal aggregation is
evaluated because new values may create new result columns.
Conceptually, we extend standard SQL aggregate functions
with a “transposing” BY clause followed by a list of columns
(i.e. R1, . . . , Rk), to produce a horizontal set of numbers
instead of one number. Our proposed syntax is as follows
CONCLUSIONS
We introduced a new class of extended aggregate functions,
called horizontal aggregations which help preparing data sets
for data mining and OLAP cube exploration. Specifically,
horizontal aggregations are useful to create data sets with
a horizontal layout, as commonly required by data mining
algorithms and OLAP cross-tabulation. Basically, a horizontal
aggregation returns a set of numbers instead of a single number
for each group, resembling a multi-dimensional vector. We
proposed an abstract, but minimal