Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract

Full Version: STHoles: A Multidimensional Workload Aware Histogram
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
STHoles: A Multidimensional Workload Aware Histogram

[attachment=30111]

ABSTRACT

Attributes of a relation are not typically independent. Mul-
tidimensional histograms can be an e ective tool for accu-
rate multiattribute query selectivity estimation. In this pa-
per, we introduce STHoles, a \workload-aware" histogram
that allows bucket nesting to capture data regions with rea-
sonably uniform tuple density. STHoles histograms are built
without examining the data sets, but rather by just analyz-
ing query results. Buckets are allocated where needed the
most as indicated by the workload, which leads to accurate
query selectivity estimations. Our extensive experiments
demonstrate that STHoles histograms consistently produce
good selectivity estimates across synthetic and real-world
data sets and across query workloads, and, in many cases,
outperform the best multidimensional histogram techniques
that require access to and processing of the full data sets
during histogram construction.

INTRODUCTION

A variety of problems require succinct summary repre-
sentations of large data sets. Histograms are an impor-
tant example of such summary representation structures.
In the database eld, they are mainly used for selectivity
estimation during query optimization [8] and also for ap-
proximate query processing [10, 20] to give rough and fast
responses to expensive queries. Query optimization in re-
lational database systems has traditionally relied on single-
attribute histograms to compute the selectivity of queries.
For queries that involve multiple attributes, most database
systems make the attribute value independence assumption,
i.e., assume that p(A1=v1;A2=v2) = p(A1=v1)p(A2=v2),
which may of course lead to signi cant inaccuracy in selec-
tivity estimation (e.g., see [21]).

RELATED WORK

Several techniques exist in the literature to compute selec-
tivity estimators of multidimensional data sets. These tech-
niques include wavelets (e.g., [13]) and discrete cosine trans-
formations (e.g., [12]), sampling (e.g., [17]), and multidimen-
sional histograms. The focus of this paper is on multidimen-
sional histograms, which have been the topic of much theo-
retical and experimental work in the last few years. A con-
ceptually interesting class of histograms is the V-optimal(f,f)
family [22], which groups contiguous sets of frequencies into
buckets and minimizes the variance of the overall frequency
approximation. These histograms are optimal for estimating
the result size of equality join and selection queries under a
de nition of optimality that captures the average error over
all possible queries and databases [9]. However, these his-
tograms need to record every distinct attribute value inside
each bucket, which is clearly impractical and makes these
techniques be only of theoretical interest. Moreover, the
construction algorithm involves an exhaustive and exponen-
tial enumeration of all possible histograms. A more practical
approach is to restrict the attention to V-optimal(v,f) his-
tograms, which group contiguous sets of values into buckets,
minimizing the variance of the overall frequency approxima-
tion. A dynamic programming algorithm is presented in [11]
for building unidimensional V-optimal(v,f) histograms in
O(N2b) time, where N is the number of tuples in the data
set and b is the number of buckets.

ANALYSISOFEXISTINGMULTIDIMENSIONAL
HISTOGRAMS


Good histograms partition data sets into \smooth" buck-
ets with close-to-uniform internal tuple density. In other
words, the frequency variance of the tuples enclosed by such
buckets is minimized, leading to accurate selectivity esti-
mations for range queries. Unfortunately, current multidi-
mensional histogram techniques do not always manage to
produce close-to-uniform partitions of the data sets, as we
discuss next. Later, Section 6 reports a thorough experi-
mental evaluation of these techniques that complements the
discussion in this section.
A partition of a multidimensional data domain results in
a set of disjoint rectangular buckets that cover all the points
in the domain and assigns to each bucket some aggregated
information, usually the number of tuples enclosed. The
choice of rectangular buckets is justi ed by two main reasons:
First, rectangular buckets make it easy and eƆcient to inter-
sect each bucket and a given range query to estimate selec-
tivities. Second, rectangular buckets can be represented con-
cisely, which allows a large number of buckets to be stored
using the given budget constraints. Reference [16] presents
a taxonomy of partitioning schemes for building multidi-
mensional histograms, which we illustrate in Figure 2. In
the grid partitioning scheme (Figure 2(a)), each dimension
di is divided into pi disjoint intervals, which induce a grid
of Qi pi buckets. A recursive partition (Figure 2(b)) starts
with one bucket covering the whole domain.

Histogram Definition

As explained in the previous section, the inclusion rela-
tionship among buckets provides an extra degree of
ex-
ibility compared to partitioning schemes that use disjoint
buckets. Each bucket b in an STHoles histogram is com-
posed of a rectangular bounding box, denoted box(b), and
a real valued frequency, denoted f(b), which indicates the
number of tuples enclosed by bucket b. In a traditional his-
togram (see Section 2), a bucket b would be \solid," with
no \holes," and hence the region that b covers would be
regarded as having uniform tuple density. In contrast, an
STHoles histogram identi es sub-regions of b with di er-
ent tuple density and \pulls" them out from b. Hence a
bucket b can have holes, which are themselves rst-class his-
togram buckets. These holes are bucket b's children, and
their bounding boxes are disjoint and completely enclosed
in b's bounding box 2. Therefore, an STHoles histogram can
be conceptually seen as a tree structure, where each node
represents a bucket.