13-11-2012, 02:07 PM
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
ABSTRACT
Preparing a data set for analysis is generally the most
time consuming task in a data mining project, requiring
many complex SQL queries, joining tables, and
aggregating columns. Existing SQL aggregations have
limitations to prepare data sets because they return one
column per aggregated group. In general, a significant
manual effort is required to build data sets, where a
horizontal layout is required. We propose simple, yet
powerful, methods to generate SQL code to return
aggregated columns in a horizontal tabular layout,
returning a set of numbers instead of one number per
row. This new class of functions is called horizontal
aggregations. Horizontal aggregations build data sets
with a horizontal denormalized layout (e.g., pointdimension,
observation variable, instance-feature), which
is the standard layout required by most data mining
algorithms. We propose three fundamental methods to
evaluate horizontal aggregations: CASE: Exploiting the
programming CASE construct; SPJ: Based on standard
relational algebra operators (SPJ queries); PIVOT: Using
the PIVOT operator, which is offered by some DBMSs.
Experiments with large tables compare the proposed
query evaluation methods. Our CASE method has similar
speed to the PIVOT operator and it is much faster than
the SPJ method. In general, the CASE and PIVOT
methods exhibit linear scalability, whereas the SPJ
method does not.