Data Replication in Data Intensive Scientific Applications with Performance Guarantee
Abstract
Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and
bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive
applications, has proven to be NP-hard and even non-approximable, making it difficult to solve. Meanwhile, most of the
previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no
theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical
performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm that achieves at least half of the total data file access delay reduction obtained by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a
distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms.
Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other
intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the
distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and adaptive to dynamic changes of file access patterns.
INTRODUCTION
Data intensive scientific applications, which mainly aim
to answer some of the most fundamental questions
facing human beings, are becoming increasingly prevalent in
a wide range of scientific and engineering research domains.
Examples include human genome mapping [38], high-energy particle physics and astronomy [1], [24], and climate
change modeling [30]. In such applications, large data sets are generated, accessed, and analyzed by scientists
worldwide. The Data Grid [46], [4], [20] is an enabling
technology for data intensive applications. It is composed of
hundreds of geographically distributed computation, storage,
and networking resources to facilitate data sharing and
management in data intensive applications. One distinct
feature of Data Grids is that they produce and manage very large amounts of data, on the order of terabytes and petabytes. For example, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, the largest scientific instrument on the planet, produces experimental data on the petabyte scale every year.
RELATED WORK
Replication has been an active research topic for many years
in the World Wide Web [33], peer-to-peer networks [3], ad hoc
and sensor networking [23], [41], and mesh networks [26].
In Data Grids, enormous scientific data and complex
scientific applications call for new replication algorithms,
which have attracted much research recently.
The most closely related work to ours is by Cibej et al.
[44]. The authors study data replication on Data Grids as a
static optimization problem. They show that this problem is
NP-hard and non-approximable, which means that no polynomial-time approximation algorithm exists for it if P ≠ NP. The authors discuss two solutions:
integer programming and simplifications. They only consider
static data replication for the purpose of formal
analysis. The limitation of the static approach is that the
replication cannot adjust to the dynamically changing user
access pattern. Furthermore, their centralized integer programming
technique cannot be easily implemented in a
distributed Data Grid. Moreover, Baev et al. [5], [6] show
that if all the data files have uniform size, then the problem is indeed approximable, and they give 20.5-approximation and 10-approximation algorithms. However, their approach,
which is based on rounding an optimal solution
to the linear programming relaxation of the problem, cannot
be easily implemented in a distributed way. In this work,
we follow the same direction (i.e., uniform data size), but
design a polynomial-time approximation algorithm, which
can also be easily implemented in a distributed environment
like Data Grids.
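
To illustrate the greedy flavor of the centralized algorithm, the sketch below repeatedly places the single replica that yields the largest remaining reduction in total access delay, subject to each site's storage capacity. The network model, the `total_delay` helper, and all parameter names are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Minimal sketch of a greedy replica-placement loop (illustrative only).
# Assumptions: uniform file sizes, a distance matrix dist[i][j] between
# sites, demand[i][f] = access frequency of file f at site i, and a
# per-site storage capacity measured in files.

def total_delay(placement, dist, demand):
    """Sum over sites and files of (frequency x distance to nearest replica)."""
    cost = 0
    for i, freqs in enumerate(demand):
        for f, freq in enumerate(freqs):
            nearest = min(dist[i][j] for j in placement[f])
            cost += freq * nearest
    return cost

def greedy_replication(num_sites, sources, dist, demand, capacity):
    # placement[f] = set of sites currently holding file f (source included)
    placement = {f: {s} for f, s in enumerate(sources)}
    used = [0] * num_sites
    while True:
        base = total_delay(placement, dist, demand)
        best = None  # (delay reduction, site, file)
        for i in range(num_sites):
            if used[i] >= capacity[i]:
                continue  # this site's storage is full
            for f in placement:
                if i in placement[f]:
                    continue  # site already holds a replica of f
                placement[f].add(i)  # tentatively place a replica
                gain = base - total_delay(placement, dist, demand)
                placement[f].remove(i)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, f)
        if best is None:
            break  # no remaining placement reduces total delay
        _, i, f = best
        placement[f].add(i)
        used[i] += 1
    return placement
```

Each iteration is polynomial, and the loop runs at most as many times as there are free storage slots, so the whole procedure runs in polynomial time; the half-of-optimal guarantee stated in the abstract is proved for the paper's exact greedy formulation, which this sketch only approximates.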
DISTRIBUTED DATA CACHING ALGORITHM IN
DATA GRIDS
In this section, we design a localized distributed caching
algorithm based on the centralized algorithm. In the
distributed algorithm, each Grid site observes the local Data
Grid traffic to make an intelligent caching decision. Our
distributed caching algorithm is advantageous since it does
not need global information such as the network topology of
the Grid, and it is more reactive to network states such as the
file distribution, user access pattern, and job distribution in
the Data Grids. Therefore, our distributed algorithm can adapt well to such dynamic changes.
The distributed algorithm is composed of two components: a nearest replica catalog (NRC) maintained at each site, and a localized data caching algorithm running at each site. Again, as stated in Section 3, the top-level site maintains a Centralized Replica Catalogue (CRC), which is essentially a replica site list Cj for each data file Dj. The replica site list Cj contains the set of sites (including the source site Sj) that have a copy of Dj.
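
To make the two components concrete, here is a minimal sketch of how the CRC and a per-site NRC might be represented, with a toy caching decision driven by locally observed access counts. The data structures, class names, and the eviction rule are assumptions for illustration; the paper's actual protocol for keeping NRC entries consistent as replicas appear and disappear is omitted here.

```python
# Illustrative data structures only; names such as CRC and Site are assumptions.

class CRC:
    """Centralized Replica Catalogue at the top-level site:
    maps each data file Dj to its replica site list Cj (source included)."""
    def __init__(self):
        self.replica_sites = {}  # file id -> set of site ids

    def register(self, file_id, site_id):
        self.replica_sites.setdefault(file_id, set()).add(site_id)

    def deregister(self, file_id, site_id):
        self.replica_sites.get(file_id, set()).discard(site_id)

class Site:
    """A Grid site with a bounded cache and a nearest replica catalog (NRC)."""
    def __init__(self, site_id, capacity, crc):
        self.site_id = site_id
        self.capacity = capacity
        self.crc = crc
        self.cache = set()      # file ids cached locally
        self.access_count = {}  # locally observed accesses per file
        self.nrc = {}           # file id -> nearest site known to hold it

    def observe_access(self, file_id):
        self.access_count[file_id] = self.access_count.get(file_id, 0) + 1
        self.maybe_cache(file_id)

    def maybe_cache(self, file_id):
        if file_id in self.cache:
            return
        if len(self.cache) < self.capacity:
            self._insert(file_id)
            return
        # Toy eviction rule: replace the least locally accessed file if the
        # new file is accessed more often (an assumption, not the paper's
        # exact criterion).
        victim = min(self.cache, key=lambda f: self.access_count.get(f, 0))
        if self.access_count[file_id] > self.access_count.get(victim, 0):
            self.cache.remove(victim)
            self.crc.deregister(victim, self.site_id)
            self._insert(file_id)

    def _insert(self, file_id):
        self.cache.add(file_id)
        self.crc.register(file_id, self.site_id)
        self.nrc[file_id] = self.site_id  # the local copy is the nearest replica
```

A site that needs a file it does not cache would consult its NRC for the nearest known replica; how NRC entries are updated across sites is the subject of the maintenance protocol, which this sketch leaves out.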
CONCLUSIONS AND FUTURE WORK
In this article, we study how to replicate data files in data intensive scientific applications in order to reduce the file access time, taking into account the limited storage space of Grid sites. Our goal is to effectively reduce the access time of data files needed for job executions at Grid sites. We propose a centralized greedy algorithm with a provable performance guarantee, and a distributed caching algorithm that can be readily deployed in Data Grid environments.