Managing Multidimensional Historical Aggregate Data in Unstructured P2P Networks
1 INTRODUCTION
PEER-TO-PEER (P2P) networks have become very popular in
the last few years. Nowadays, they are the most
widespread approach for exchanging data among large
communities of users in the file sharing context. In this
scenario, the success of P2P-based solutions is strictly
related to the use of lossy data compression techniques
(such as MPEG formats), which yield reasonable detail
levels in representing large amounts of information and
make data exchange feasible in practice by significantly
reducing data transmission costs. However, the problem of
suitably extending data-compression-based solutions to
application contexts other than file sharing has not been
deeply investigated yet.
Specifically, no P2P-based solution has established itself as
an effective evolution of traditional distributed databases.
This is quite surprising, as the huge amount of resources
provided by P2P networks (in terms of storage capacity,
computing power, and data transmission capability) could
effectively support data management. From this standpoint,
one of the application contexts which are likely to benefit
from the support of a P2P network is the analysis of
multidimensional data. In this scenario, information is
represented as points in a multidimensional space whose
dimensions correspond to different perspectives over data:
users explore data and retrieve aggregates by issuing range
queries, i.e., queries specifying an aggregate operator and
the range of the data domain from which the aggregate
information should be retrieved. Specifically, we will
consider the case of analytical applications dealing with
historical data, which typically require huge computation
and storage capabilities, due to the large amount of data
which need to be accessed to evaluate queries.
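As a concrete illustration (a minimal sketch; the array contents and the dimension semantics are invented for the example), a range query over a two-dimensional data set is an aggregate over a hyper-rectangle of the domain:

```python
import numpy as np

# Hypothetical counts indexed by (month, region): each dimension is a
# different perspective over the data.
data = np.arange(12 * 4).reshape(12, 4)

def range_sum(cube, lo, hi):
    """SUM range query: aggregate over the hyper-rectangle
    [lo[0]:hi[0], lo[1]:hi[1]) of the data domain."""
    return int(cube[lo[0]:hi[0], lo[1]:hi[1]].sum())

# Total over months 3..5 and regions 1..2 (exclusive upper bounds).
print(range_sum(data, (3, 1), (6, 3)))  # → 105
```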
Although the multidimensional data model is substantially
more complex than the representation paradigm
adopted in the file sharing context (where data are
organized according to ⟨name, file⟩ pairs), analytical
applications dealing with historical multidimensional data
and file-sharing applications share a fundamental aspect:
they can rely on lossy data compression. In fact, analogously
to tools for reproducing audio and/or video files, a
lot of applications dealing with multidimensional data can
effectively accomplish their tasks even in the case that only
an approximate representation of data is available. For
instance, in Decision Support Systems (DSSs) or statistical
databases, users are often concerned with performing data
exploration with the aim of discovering interesting trends
rather than extracting fine-grained information. In this
scenario, high accuracy in less relevant digits of query
answers is not needed, as providing their order of
magnitude suffices to locate the regions of the database
containing relevant information. At the same time, fast
answers to these preliminary queries allow users to focus
their explorations quickly and effectively, thus saving large
amounts of system resources.
1.1 Goal of the Paper
Our aim is devising a P2P-based framework supporting the
analysis of multidimensional historical data. Specifically,
our efforts will be devoted to combining the amenities of
P2P networks and data compression to provide a support
for the evaluation of range queries, possibly trading off
efficiency with accuracy of answers. The framework should
enable members of an organization to cooperate by sharing
their resources (both storage and computational) to host
(compressed) data and perform aggregate queries on them,
while preserving their autonomy.
A framework with these characteristics can be useful in
different application contexts. For instance, consider the case
of a worldwide virtual organization with users interested in
geographical data, as well as the case of a real organization
on an enterprise network. In both cases, even users who are
not continuously interested in performing data analysis can
make a part of their resources available for supporting the
analysis tasks needed by others, provided that their own
capability of performing local tasks is preserved. This is
analogous to the idea on which several popular applications
for public-resource computing are based. For instance, within
the SETI@home project [39], members of a worldwide community
offer their CPUs, when idle, to analyze radio telescope
readings in search of nonrandom patterns, such as spikes in
power spectra.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 9, SEPTEMBER 2010, p. 1313.
F. Furfaro and A. Pugliese are with the DEIS Department, University of Calabria, Via P. Bucci, 41C, 87036 Rende (CS), Italy. E-mail: {furfaro, apugliese}[at]deis.unical.it.
G.M. Mazzeo is with the Institute of High Performance Computing and Networking of the National Research Council (ICAR-CNR), Via P. Bucci, 41C, 87036 Rende (CS), Italy. E-mail: mazzeo[at]icar.cnr.it.
Manuscript received 28 May 2008; revised 11 Feb. 2009; accepted 21 June 2009; published online 2 July 2009. Recommended for acceptance by D. Srivastava. For information on obtaining reprints of this article, please send e-mail to: tkde[at]computer.org, and reference IEEECS Log Number TKDE-2008-05-0288. Digital Object Identifier no. 10.1109/TKDE.2009.160.
In order to make participants truly autonomous, no
constraints should be imposed on the storage and computational
resources they share, or on the reliability of
their network connection. These requirements make traditional
distributed frameworks unsuitable and suggest the
adoption of a solution based on an unstructured P2P
network, where peers are neither responsible for coordination
tasks (as super peers are, which must provide a certain
amount of resources and reliability), nor forced to host
specific pieces of data (as in DHT-based networks).
1.2 Challenges
The management of compressed data on unstructured P2P
networks is an intriguing issue, but poses several research
challenges, which we discuss in the following.
1.2.1 Compression
A compression technique must be devised which is able to
create “prone-to-be-distributed” data synopses supporting
the efficient evaluation of aggregates, possibly affected by
tolerable error rates. Several compression techniques have
been proposed in the literature in the centralized scenario,
which are able to compute synopses that enable aggregate
queries to be estimated with very good accuracy. These
techniques could be employed in the P2P scenario to build a
single synopsis to be replicated over the network. However,
in this case,
. although the cost of disk storage is continuously
and rapidly decreasing, it may still be difficult to
find peers for which hosting replicas of synopses
has a negligible cost, while autonomy is a requirement
in our setting: using traditional compression
techniques, synopses providing reasonable error
rates may have a non-negligible size (usually no less
than 1 percent of the size of the original data set,
e.g., 100 GB for a 10 TB data set).
. although compressing the data certainly makes
replication less resource consuming, replicating the
entire synopsis each time would require storage and
network resources that could be saved if only some
specific portion of the synopsis could be replicated.
We recall that replication is mandatory in the P2P
setting, both to contrast the volatility of peers (which
threatens data availability) and to prevent peers
from being overloaded (in the presence of many
users interested in a data set, if the peers hosting
these data were too few, they would be required to
process a large amount of queries).
These drawbacks would be overcome if the compressed
synopsis were subdivided into tiny subsynopses which are
independently replicated and disseminated on the network
when needed. Peers would, therefore, be asked to host
replicas of small chunks of data. This way, the autonomy
requirement would not result in a limit on the overall size of
the synopsis (since the whole storage capacity of the
network could be employed to store the whole synopsis),
thus enabling the construction of synopses that provide
high-quality estimates. Moreover, the fragmentation could
be exploited to make replication more fine-grained, thus
even less resource-consuming.
An intriguing challenge in this case is investigating
whether it is possible to construct a fragmented synopsis
whose effectiveness is similar to that of a single synopsis
with similar compression factor. In fact, it is well known that
the naive solution of obtaining independent subsynopses by
first partitioning the data and then compressing each portion
separately is less effective than compressing the data
altogether. This is analogous to what generally happens
with common lossless compression algorithms, such as LZW
[43]: compressing n pieces of a text separately results in
n independent synopses whose overall size is larger than the
synopsis obtained by compressing the whole text.
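The effect is easy to observe with a standard lossless compressor (here zlib's DEFLATE, used as a stand-in for LZW; the text and the four-way split are arbitrary):

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog. " * 200
n = 4
chunk = len(text) // n
pieces = [text[i * chunk:(i + 1) * chunk] for i in range(n)]

whole_size = len(zlib.compress(text))
pieces_size = sum(len(zlib.compress(p)) for p in pieces)

# Compressing the pieces independently loses the redundancy spanning
# chunk boundaries, so the total compressed size is larger.
print(whole_size, pieces_size)
```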
1.2.2 Indexing
Once a compression technique yielding subsynopses with
the desired properties has been defined, the problem of
making the compressed data efficient to be located over the
network must be tackled. Appropriate techniques are thus
needed to distribute the compressed data and index them for
efficient access. In unstructured P2P systems, no indexing
mechanism is generally provided, and data accessibility is
achieved by disseminating replicas. Thus, users explore the
network (through flooding or random walking) until the
desired data are located. This approach could be adapted to
our context by simply disseminating several copies of all the
subsynopses; however, this way, answering a range query
could require several subsynopses to be located, thus
making query evaluation bandwidth-consuming.
A better way to address this issue is to design an
indexing mechanism that supports the efficient location of
the subsynopses involved in the query evaluation. In the
literature, there are several works proposing distributed
indexing techniques, where indexes are variants of R-Trees
which are partitioned and distributed among the nodes of
the network. According to these approaches, each node of the
network is assigned a group of R-tree nodes, and
maintains references to the hosts assigned other nodes
of the R-tree. The association between hosts and R-tree
nodes is fixed, and the maintenance of the index is
centralized. These solutions, as they stand, were devised for
relatively static scenarios, and they are not suitable for the
dynamic scenario addressed by our proposal, where:
. in order to guarantee peer autonomy, peers cannot
be constrained to host a certain portion of the index
or to be always connected to the network; and
. peers are volatile, so the framework must be capable
of promptly reacting to peer disconnections, preventing
dangling references in the index.
Hence, the challenge is to define an indexing technique
supporting the location of our subsynopses in an unstructured
network, such that the portions of the index and the
responsibility of hosting it can be dynamically distributed
among the peers, while preserving their autonomy.
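A minimal sketch of the intended behavior (all names and structures below are invented for illustration; the actual index is defined in Sections 3 and 4): each peer holds an index fragment mapping regions of the domain either to peers hosting subsynopses or to peers hosting further index fragments, and routing a query means following these references.

```python
# Each entry maps a region to either a peer hosting the subsynopsis
# ("data") or a child index fragment ("index"). Illustrative only.
fragments = {
    "root": [((0, 50), "index", "frag_left"),
             ((50, 100), "index", "frag_right")],
    "frag_left": [((0, 25), "data", "peer_A"),
                  ((25, 50), "data", "peer_B")],
    "frag_right": [((50, 100), "data", "peer_C")],
}

def route(frag_id, lo, hi, out):
    """Collect the peers hosting subsynopses overlapping [lo, hi)."""
    for (rlo, rhi), kind, target in fragments[frag_id]:
        if rlo < hi and lo < rhi:  # the region overlaps the query range
            if kind == "data":
                out.add(target)
            else:
                route(target, lo, hi, out)
    return out

print(sorted(route("root", 20, 60, set())))  # → ['peer_A', 'peer_B', 'peer_C']
```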
1.2.3 Replication
A replication scheme capable of maintaining appropriate
levels of coverage w.r.t. the evolution of user interests and
network conditions must be designed, to ensure accessibility
and robustness. Existing replication strategies for unstructured
P2P networks treat data sets as atomic objects: they
create a number of replicas of a data set each time a query is
posed on it. As explained before, in our scenario, this would
limit both the size of the synopsis (thus affecting the accuracy
of the compressed data) and the frequency of replica
creations (thus limiting the responsiveness w.r.t. volatility);
moreover, the index itself must be properly replicated too.
Hence, the challenge is to exploit the fragmentation of
the synopsis and the multidimensionality of data to detect
the specific portions of data and index which need to be
replicated, that is, the regions of the data in which users are
interested most or whose accessibility is not satisfactory, as
well as the portions of the index referencing these regions.
Based on this, a fine-grained replication strategy must be
devised which, instead of “blindly” creating replicas of the
whole synopsis, makes new copies of specific portions of
data and index when needed.
1.3 Our Proposal: The Framework in a Nutshell
Our proposal is a framework supporting the sharing and the
analysis of compressed historical multidimensional data
over an unstructured P2P network. From the user standpoint,
two tasks are supported: data publication and data querying.
1.3.1 Data Publication
Let p be a peer which is willing to share a historical
multidimensional data set D so that the other peers can
pose aggregate range queries against it. In order to make its
data suitable for being distributed across the network, p
builds a synopsis of D by first appropriately partitioning D,
and then, compressing each portion of data in the partition.
Peer p also builds an index over these subsynopses, which,
again, is properly fragmented in order to make it prone to
be distributed. Finally, the subsynopses and the index
portions are disseminated across the network, along with
metadata about D. The assignment of data and index
portions to peers takes into account the willingness of peers
to share their resources.
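The publication step can be sketched as follows (the "compression" here, storing only a sum and a cell count per block, is a crude placeholder standing in for the actual technique of Section 2; all names are invented):

```python
import numpy as np

def publish(data, block_size):
    """Partition the data set, build a placeholder subsynopsis for each
    block, and build an index mapping each region to its subsynopsis."""
    subsynopses, index = {}, []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        sid = f"syn_{start}"
        # Placeholder synopsis: (sum, cell count) of the block.
        subsynopses[sid] = (float(block.sum()), len(block))
        index.append(((start, start + len(block)), sid))
    return subsynopses, index

subsynopses, index = publish(np.arange(100, dtype=float), 25)
print(len(subsynopses), index[0])  # → 4 ((0, 25), 'syn_0')
```

In the framework proper, the subsynopses and index portions would then be disseminated to peers, along with the data set's metadata.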
1.3.2 Data Querying
Exploration queries can be issued by peers to discover
the shared data sets in which they may be interested. These
queries specify criteria that are matched against the
metadata associated with each available data set. The result
of the exploration process is a set of matching data sets and
for each of them, a set of peers that should be contacted to
start the evaluation of range queries, i.e., peers hosting
portions of the distributed index, which are thus capable of
appropriately routing range queries.
After issuing an exploration query, a peer p can decide to
pose range queries against a matching data set. To do so, p
contacts one of the peers that can route the query toward
the peers hosting the subsynopses needed for evaluating
the query. Finally, p gathers the partial results obtained by
these peers and combines them to compute the final
answer. The completeness of the answer can be checked
through an appropriate mechanism which, if some partial
result has not been received yet, allows p to complete the
answer without issuing the range query from scratch.
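How partial results combine into a final answer can be sketched as follows (the subsynopses here store only a per-region sum, and the uniform-distribution estimate is a common but here merely illustrative assumption; the real compressed representation is defined in Section 2):

```python
# Each subsynopsis covers a region and supports a partial SUM estimate.
# Region bounds and totals are invented values.
subsynopses = {
    "syn_0": ((0, 25), 300.0),
    "syn_1": ((25, 50), 925.0),
    "syn_2": ((50, 75), 1550.0),
}

def partial_sum(region, total, lo, hi):
    """Estimate the sum over [lo, hi) ∩ region, assuming values are
    uniformly distributed within the region."""
    rlo, rhi = region
    overlap = max(0, min(hi, rhi) - max(lo, rlo))
    return total * overlap / (rhi - rlo)

def range_query(lo, hi):
    # Gather the partial answers from the relevant subsynopses
    # (in the framework, hosted by different peers) and combine them.
    return sum(partial_sum(region, total, lo, hi)
               for region, total in subsynopses.values())

print(range_query(0, 50))  # → 1225.0
```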
The choice of an unstructured P2P network preserves the
autonomy of peers in deciding the amount of storage and
computational resources they wish to make available to
others. Moreover, participants are allowed to join/leave the
system when needed, as suitable mechanisms are designed
which dynamically assign the responsibilities of hosting
data and index portions, properly reacting to peer departures.
The framework adapts the distribution of the index
and the data to both the interest exhibited by the users and
the network conditions. Specifically, with the objective of
reducing the number of peers to be accessed for evaluating
queries, the observed query workload is exploited to drive
the dissemination of replicas across the network: the larger
the number of queries involving a certain index or data
portion, the larger the number of replicas maintained for
that portion. Moreover, in order to take into account the
limited storage and computational capabilities of peers and
their volatility, specific load-balancing and fault-tolerant
techniques are employed.
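The workload-driven rule can be sketched as follows (the base value, the queries-per-replica ratio, and the cap are invented parameters for illustration, not the framework's actual policy):

```python
def target_replicas(query_count, base=1, per_queries=100, cap=20):
    """More queries observed on a data or index portion → more replicas
    of that portion, up to a cap bounding the storage spent on it."""
    return min(cap, base + query_count // per_queries)

for q in (0, 250, 5000):
    print(q, target_replicas(q))  # → 0 1 / 250 3 / 5000 20
```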
1.4 Main Contributions and Plan of the Paper
The main contributions of this paper and its organization
may be summed up as follows:
1. a compression technique for building an indexed
aggregate structure over a multidimensional data
population, prone to be distributed and accessed
across a P2P network (Section 2);
2. a storage model which employs additional data
structures to support efficient and robust query
answering over compressed data in an unstructured
P2P network (Sections 3 and 4); and
3. a dynamic replication scheme capable of maintaining
appropriate levels of coverage w.r.t. the evolution
of the query workload and the network
conditions (Section 5).
The effectiveness of the proposed approaches is assessed
through a thorough experimental analysis in Section 6.
Finally, we discuss related work (Section 7) and outline
conclusions (Section 8).