08-12-2012, 01:47 PM
Secure Data Objects Replication in Data Grid
1Secure Data.pdf (Size: 2.07 MB / Downloads: 23)
Abstract
Secret sharing and erasure coding-based approaches have been used in distributed storage systems to ensure the
confidentiality, integrity, and availability of critical information. To achieve performance goals in data accesses, these data
fragmentation approaches can be combined with dynamic replication. In this paper, we consider data partitioning (both secret sharing
and erasure coding) and dynamic replication in data grids, in which security and data access performance are critical issues. More
specifically, we investigate the problem of optimal allocation of sensitive data objects that are partitioned by using secret sharing
scheme or erasure coding scheme and/or replicated. The grid topology we consider consists of two layers. In the upper layer, multiple
clusters form a network topology that can be represented by a general graph. The topology within each cluster is represented by a tree
graph. We decompose the share replica allocation problem into two subproblems: the Optimal Intercluster Resident Set Problem
(OIRSP) that determines which clusters need share replicas and the Optimal Intracluster Share Allocation Problem (OISAP) that
determines the number of share replicas needed in a cluster and their placements. We develop two heuristic algorithms for the two
subproblems. Experimental studies show that the heuristic algorithms achieve good performance in reducing communication cost and
are close to optimal solutions.
INTRODUCTION
DATA grid is a distributed computing architecture that
integrates a large number of data and computing
resources into a single virtual data management system [2].
It enables the sharing and coordinated use of data from
various resources and provides various services to fit the
needs of high-performance distributed and data-intensive
computing. Many data grid applications are being developed
or proposed, such as DoD’s Global Information Grid
(GIG) for both business and military domains [8], NASA’s
Information Power Grid [22], GMESS Health-Grid for
medical services [7], data grids for Federal Disaster Relief
[33], etc. These data grid applications are designed to
support global collaborations that may involve large
amount of information, intensive computation, real time,
or nonreal time communication. Success of these projects
can help to achieve significant advances in business,
medical treatment, disaster relief, research, and military
and can result in dramatic benefits to the society.
SYSTEM MODEL AND PROBLEM SPECIFICATION
In this paper, we consider achieving secure, survivable, and
high-performance data storage in data grids. To facilitate
scalability, we model the peer-to-peer data grid as a twolevel
topology (shown in Fig. 1). Studies show that the
internet can be decomposed into interconnected autonomous
systems [11], [23]. One or several such autonomous
systems that are geographically close to each other can be
considered as a cluster. The system consists of M clusters,
H1; . . .;HM, which are linked together and form a general
graph topology GC ¼ ðHC; ECÞ. Here, HC ¼ fH1; . . .;HMg
and EC is the set of edges connecting the clusters. Each edge
represents a logical link which may be multiple hops of the
physical links. It is likely that the clusters are linked to the
backbone and should be modeled by a general graph.
EXPERIMENTAL STUDY
We conduct experiments to evaluate the performance of the
heuristic algorithms for secret share allocation. The OIRSP
heuristic algorithm is compared with the randomized
K-replication, no-replication allocation, and complete replication
strategies, to study its performance and effectiveness
on reducing communication cost at the cluster level. In the
randomized K-replication, share replicas are randomly
allocated among K clusters, where K is the number of
clusters holding replicas computed by the OIRSP heuristic
algorithm. In the complete replication strategy, share
replicas are allocated in every cluster. In the no-replication
allocation strategy, there is no replication in any cluster.
The SDP-tree algorithm for OISAP is compared with
the optimal allocation algorithm and randomized
M-replication, to see how well the SDP-tree algorithm
performs in terms of reducing communication cost within
a cluster. In the randomized M-replication, M shares are
randomly allocated among the nodes in a cluster, where
M is computed by SDP-tree and it is the number of nodes
that hold shares.
STORAGE LIMITATIONS AND LOAD BALANCING
Replication is a natural solution for reducing the communication
cost (as we have discussed) as well as sharing the
access load. In peer-to-peer data grids, replica can be placed
on widely distributed nodes to achieve better access
performance and load sharing. In cluster-based data grids,
caching data on widely distributed nodes is necessary (in
addition to replication on cluster nodes) to achieve
improved access performance and load sharing. Data
partitioning can contribute to reduced storage cost. It has
been shown that erasure coding-based schemes can greatly
reduce the overall storage cost and effectively share the
storage consumption [16], [24], [37]. However, with these
schemes, it is still possible to have unbalanced access load
or storage requirements due to very unbalanced access
patterns. There may be too many requests inside a cluster or
a server inside a cluster may be a hot spot for many data
objects. In these cases, it is necessary to adapt the algorithms
proposed in this paper to bound the load and storage cost.
RELATED WORK
Replication techniques are frequently used to improve data
availability and reduce client response time and communication
cost. One major advantage of replication is performance
improvement, which is achieved by moving data objects close
to clients. In full replication [34], [35], all servers keep a
complete set of the data objects. This is frequently not feasible
for a large data set. More importantly, full replication
schemes might incur unnecessary communication overhead
whenboth readand update accesses are considered. In partial
replication, each server maintains a subset of data objects,
depending on the tradeoff of the read and update costs. For
both full and partial replication, an important issue is where
to place the replicas.Manyresearch efforts have been devoted
to the optimal placement of distributed files, data objects, and
other network services. Facility location [12] problem is one
big branch. It considers the optimal placement of one file or
data object (set) and supports read operations only. Many
research works are done along this direction, such as
p-median problem [13] and the approximate p-median
problem [1]. Another branch of research on optimal data
placement is the file allocation (optimal resident set) problem
[12]. It considers both read and update operations. A lot of
research efforts have been devoted to this research branch, the
optimal file allocation problem in tree networks, general
graph networks, completely connected networks, and ring
networks, respectively [34], the ADR algorithm [35], and the
optimal placement of k replicas in a tree network [12].
CONCLUSION AND FUTURE RESEARCH
We have combined data partitioning schemes (secret sharing
scheme or erasure coding scheme) with dynamic replication
to achieve data survivability, security, and access performance
in data grids. The replicas of the partitioned data need
to be properly allocated to achieve the actual performance
gains. We have developed algorithms to allocate correlated
data shares in large-scale peer-to-peer data grids. To support
scalability, we represent the data grid as a two-level clusterbased
topology and decompose the allocation problem into
two subproblems: the OIRSP and OISAP. The OIRSP
determines which clusters need to maintain share replicas,
and the OISAP determines the number of share replicas
needed in a cluster and their placements. Heuristic algorithms
are developed for the two subproblems. Experimental
studies show that the heuristic algorithms achieve good
performance in reducing communication cost and are close to
optimal solutions.