Data Replication in Data Intensive Scientific Applications with Performance Guarantee
Abstract
Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and
bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive
applications, has proven to be NP-hard and even non-approximable, making it difficult to solve. Meanwhile, most of the
previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no
theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical
performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm that achieves at least half of the total data file access delay reduction obtained by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a
distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms.
Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other
intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the
distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and adaptive to dynamic changes of file access patterns.
INTRODUCTION
Data intensive scientific applications, which mainly aim
to answer some of the most fundamental questions
facing human beings, are becoming increasingly prevalent in
a wide range of scientific and engineering research domains.
Examples include human genome mapping [38], high-energy particle physics and astronomy [1], [24], and climate
change modeling [30]. In such applications, large data sets are generated, accessed, and analyzed by scientists
worldwide. The Data Grid [46], [4], [20] is an enabling
technology for data intensive applications. It is composed of
hundreds of geographically distributed computation, storage,
and networking resources to facilitate data sharing and
management in data intensive applications. One distinct
feature of Data Grids is that they produce and manage very large amounts of data, on the order of terabytes and petabytes. For example, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, the largest scientific instrument on the planet, produces experimental data on the petabyte scale every year.
RELATED WORK
Replication has been an active research topic for many years
in the World Wide Web [33], peer-to-peer networks [3], ad hoc
and sensor networking [23], [41], and mesh networks [26].
In Data Grids, enormous scientific data and complex
scientific applications call for new replication algorithms,
which have attracted much research recently.
The most closely related work to ours is by Cibej et al.
[44]. The authors study data replication on Data Grids as a
static optimization problem. They show that this problem is
NP-hard and non-approximable, which means that no polynomial-time approximation algorithm exists for it if P ≠ NP. The authors discuss two solutions:
integer programming and simplifications. They only consider
static data replication for the purpose of formal
analysis. The limitation of the static approach is that the
replication cannot adjust to the dynamically changing user
access pattern. Furthermore, their centralized integer programming
technique cannot be easily implemented in a
distributed Data Grid. Moreover, Baev et al. [5], [6] show
that if all the data files have uniform size, then the problem is indeed approximable, and they give 20.5-approximation and 10-approximation algorithms. However, their approach,
which is based on rounding an optimal solution
to the linear programming relaxation of the problem, cannot
be easily implemented in a distributed way. In this work,
we follow the same direction (i.e., uniform data size), but
design a polynomial-time approximation algorithm, which
can also be easily implemented in a distributed environment
like Data Grids.
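
To illustrate the greedy flavor of the centralized algorithm, the sketch below repeatedly places the single replica that yields the largest remaining reduction in total access delay, subject to each site's storage capacity. The network model, the `total_delay` helper, and all parameter names are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Minimal sketch of a greedy replica-placement loop (illustrative only).
# Assumptions: uniform file sizes, a distance matrix dist[i][j] between
# sites, demand[i][f] = access frequency of file f at site i, and a
# per-site storage capacity measured in files.

def total_delay(placement, dist, demand):
    """Sum over sites and files of (frequency x distance to nearest replica)."""
    cost = 0
    for i, freqs in enumerate(demand):
        for f, freq in enumerate(freqs):
            nearest = min(dist[i][j] for j in placement[f])
            cost += freq * nearest
    return cost

def greedy_replication(num_sites, sources, dist, demand, capacity):
    # placement[f] = set of sites currently holding file f (source included)
    placement = {f: {s} for f, s in enumerate(sources)}
    used = [0] * num_sites
    while True:
        base = total_delay(placement, dist, demand)
        best = None  # (delay reduction, site, file)
        for i in range(num_sites):
            if used[i] >= capacity[i]:
                continue  # this site's storage is full
            for f in placement:
                if i in placement[f]:
                    continue  # site already holds a replica of f
                placement[f].add(i)  # tentatively place a replica
                gain = base - total_delay(placement, dist, demand)
                placement[f].remove(i)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, f)
        if best is None:
            break  # no remaining placement reduces total delay
        _, i, f = best
        placement[f].add(i)
        used[i] += 1
    return placement
```

Each iteration is polynomial, and the loop runs at most as many times as there are free storage slots, so the whole procedure runs in polynomial time; the half-of-optimal guarantee stated in the abstract is proved for the paper's exact greedy formulation, which this sketch only approximates.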
DISTRIBUTED DATA CACHING ALGORITHM IN
DATA GRIDS
In this section, we design a localized distributed caching
algorithm based on the centralized algorithm. In the
distributed algorithm, each Grid site observes the local Data
Grid traffic to make an intelligent caching decision. Our
distributed caching algorithm is advantageous since it does
not need global information such as the network topology of
the Grid, and it is more reactive to network states such as the
file distribution, user access pattern, and job distribution in
the Data Grids. Therefore, our distributed algorithm can adapt well to such dynamic changes.
The distributed algorithm is composed of two components: a nearest replica catalog (NRC) maintained at each site, and a localized data caching algorithm running at each site. Again, as stated in Section 3, the top-level site maintains a Centralized Replica Catalogue (CRC), which is essentially a replica site list Cj for each data file Dj. The replica site list Cj contains the set of sites (including the source site Sj) that have a copy of Dj.
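
To make the two components concrete, here is a minimal sketch of how the CRC and a per-site NRC might be represented, with a toy caching decision driven by locally observed access counts. The data structures, class names, and the eviction rule are assumptions for illustration; the paper's actual protocol for keeping NRC entries consistent as replicas appear and disappear is omitted here.

```python
# Illustrative data structures only; names such as CRC and Site are assumptions.

class CRC:
    """Centralized Replica Catalogue at the top-level site:
    maps each data file Dj to its replica site list Cj (source included)."""
    def __init__(self):
        self.replica_sites = {}  # file id -> set of site ids

    def register(self, file_id, site_id):
        self.replica_sites.setdefault(file_id, set()).add(site_id)

    def deregister(self, file_id, site_id):
        self.replica_sites.get(file_id, set()).discard(site_id)

class Site:
    """A Grid site with a bounded cache and a nearest replica catalog (NRC)."""
    def __init__(self, site_id, capacity, crc):
        self.site_id = site_id
        self.capacity = capacity
        self.crc = crc
        self.cache = set()      # file ids cached locally
        self.access_count = {}  # locally observed accesses per file
        self.nrc = {}           # file id -> nearest site known to hold it

    def observe_access(self, file_id):
        self.access_count[file_id] = self.access_count.get(file_id, 0) + 1
        self.maybe_cache(file_id)

    def maybe_cache(self, file_id):
        if file_id in self.cache:
            return
        if len(self.cache) < self.capacity:
            self._insert(file_id)
            return
        # Toy eviction rule: replace the least locally accessed file if the
        # new file is accessed more often (an assumption, not the paper's
        # exact criterion).
        victim = min(self.cache, key=lambda f: self.access_count.get(f, 0))
        if self.access_count[file_id] > self.access_count.get(victim, 0):
            self.cache.remove(victim)
            self.crc.deregister(victim, self.site_id)
            self._insert(file_id)

    def _insert(self, file_id):
        self.cache.add(file_id)
        self.crc.register(file_id, self.site_id)
        self.nrc[file_id] = self.site_id  # the local copy is the nearest replica
```

A site that needs a file it does not cache would consult its NRC for the nearest known replica; how NRC entries are updated across sites is the subject of the maintenance protocol, which this sketch leaves out.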
CONCLUSIONS AND FUTURE WORK
In this article, we study how to replicate data files in data intensive scientific applications in order to reduce the file access time, taking into account the limited storage space of Grid sites. Our goal is to effectively reduce the access time of data files needed for job executions at Grid sites. We propose a centralized greedy algorithm with a provable performance guarantee, and a distributed caching algorithm that can be readily deployed in Data Grid environments.