19-09-2014, 02:48 PM
On Data Staging Algorithms for Shared
Data Accesses in Clouds
On Data.pdf (Size: 1.24 MB / Downloads: 53)
Abstract—
In this paper, we study the strategies for efficiently achieving data staging and caching on a set of vantage sites in a cloud
system with a minimum cost. Unlike the traditional research, we do not intend to identify the access patterns to facilitate the future
requests. Instead, with such a kind of information presumably known in advance, our goal is to efficiently stage the shared data items
to predetermined sites at advocated time instants to align with the patterns while minimizing the monetary costs for caching and
transmitting the requested data items. To this end, we follow the cost and network models in [1] and extend the analysis to multiple
data items, each with single or multiple copies. Our results show that under homogeneous cost model, when the ratio of transmission
cost and caching cost is low, a single copy of each data item can efficiently serve all the user requests. While in multicopy situation, we
also consider the tradeoff between the transmission cost and caching cost by controlling the upper bounds of transmissions and
copies. The upper bound can be given either on per-item basis or on all-item basis. We present efficient optimal solutions based on
dynamic programming techniques to all these cases provided that the upper bound is polynomially bounded by the number of service
requests and the number of distinct data items. In addition to the homogeneous cost model, we also briefly discuss this problem under
a heterogeneous cost model with some simple yet practical restrictions and present a 2-approximation algorithm to the general case.
We validate our findings by implementing a data staging solver, whereby conducting extensive simulation studies on the behaviors of
the algorithms.
PROBLEM FORMULATION
Suppose there are k distinct shared data items initially
stored at one node, say p1 and later migrated, replicated,
and cached in a fully connected network with m nodes
(p1; p2; :::; pm) to serve a sequence of requests ¼
4 12:::n in
which each i ¼
4 ðti; pi; RiÞ; 1 i n is specified by the
predicted access pattern and represents a request made for
a data subset Ri by node pi at time ti. We therefore have the
complete knowledge of all such information as an input to
our algorithms. Further, for simplicity we also assume that
there exists only one request per stage. However, for
multiple-request case we also briefly discuss it as a
supplemental result presented in Appendix A, available
in the online supplemental material.
To s
OPTIMAL DATA STAGING ALGORITHMS
In this section, with the homogeneous cost model, we first
present the multicopy algorithm to handle multiple distinct
data items, and then propose our extension to address the
constraints on the maximum number of transmissions and
copies. Note that we do not discuss the single-copy
algorithm separately as it can be treated as a special case
of multicopy algorithm
7 CONCLUSIONS
In this paper, we studied the problem of staging a set of
distinct data items in a fully connected network to facilitate
cloud-based services with minimum cost. To this end, we
investigated the optimal staging strategies based on the cost
models in the paper [1] to minimize the total staging cost.
We also considered some practical constraints on this
problem in terms of the maximum number of transmissions
and copies.