27-11-2012, 05:32 PM
Load Rebalancing for Distributed File Systems in Clouds
1 Load Rebalancing for Distributed File Systems in Clouds.pdf (Size: 640.48 KB / Downloads: 196)
Abstract—
Distributed file systems are key building blocks for
cloud computing applications based on the MapReduce programming
paradigm. In such file systems, nodes simultaneously serve
computing and storage functions; a file is partitioned into a number
of chunks allocated in distinct nodes so that MapReduce tasks can
be performed in parallel over the nodes. However, in a cloud
computing environment, failure is the norm, and nodes may be
upgraded, replaced, and added to the system. Files can also be
dynamically created, deleted, and appended. This results in load
imbalance in a distributed file system; that is, the file chunks are
not distributed as uniformly as possible among the nodes. Emerging
distributed file systems in production systems strongly depend
on a central node for chunk reallocation. This dependence is clearly
inadequate in a large-scale, failure-prone environment because the
central load balancer bears a workload that scales linearly with
the system size, and may thus become a performance bottleneck and
a single point of failure. In this
paper, a fully distributed load rebalancing algorithm is presented to
cope with the load imbalance problem. Our algorithm is compared
against a centralized approach in a production system and a
competing distributed solution presented in the literature. The
simulation results indicate that our proposal is comparable with
the existing centralized approach and considerably outperforms
the prior distributed algorithm in terms of load imbalance factor,
movement cost, and algorithmic overhead. The performance of
our proposal implemented in the Hadoop distributed file system
is further investigated in a cluster environment.
INTRODUCTION
Cloud computing (or cloud for short) is a compelling technology.
In clouds, clients can dynamically allocate resources on demand
without the burden of sophisticated resource deployment and
management. Key enabling technologies for clouds include the
MapReduce programming paradigm [1], distributed file systems
(e.g., [3], [4]), virtualization (e.g., [4], [5]), and so forth. These
techniques emphasize scalability, so clouds (e.g., [6]) can be
large in scale, and the entities comprising them can fail and
join arbitrarily while the system remains reliable.
Distributed file systems are key building blocks for cloud
computing applications based on the MapReduce programming
paradigm. In such file systems, nodes simultaneously serve
computing and storage functions; a file is partitioned into a
number of chunks allocated in distinct nodes so that MapReduce
tasks can be performed in parallel over the nodes. For example,
consider a wordcount application that counts the number of
distinct words and the frequency of each unique word in a
large file. In such an application, a cloud partitions the file
into a large number of disjoint, fixed-size pieces (or
file chunks) and assigns them to different cloud storage nodes
(i.e., chunkservers). Each storage node (or node for short) then
calculates the frequency of each unique word by scanning and
parsing its local file chunks.
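As a concrete illustration of the chunking-then-local-count step described above (a minimal sketch of our own, not code from the paper; the chunk size is an arbitrary choice, and splitting on raw byte offsets can cut a word at a chunk boundary, which real MapReduce input formats handle via record boundaries):

```python
from collections import Counter

CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size; 64 MB is a common default


def partition(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a file's bytes into disjoint, fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def local_wordcount(chunk: bytes) -> Counter:
    """What each chunkserver does: scan and parse its own local chunk."""
    return Counter(chunk.decode(errors="ignore").split())


def wordcount(data: bytes) -> Counter:
    """Merge the per-chunk counts (the reduce step of the application)."""
    total = Counter()
    for chunk in partition(data):
        total += local_wordcount(chunk)
    return total
```

Because each chunk is counted independently, the `local_wordcount` calls can run in parallel on the nodes that store the chunks, which is exactly why uniform chunk placement matters for MapReduce throughput.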
SIMULATIONS
Simulation Setup and Workloads
The performance of our algorithm is evaluated through computer
simulations. Our simulator is implemented with Pthreads.
In the simulations, we implement our proposal on top of the Chord
DHT protocol [18] and the gossip-based aggregation protocol
in [26], [27]. In the default setting, the number of nodes in
the system is n = 1,000, and the number of file chunks is
m = 10,000. To the best of our knowledge, there are no
representative realistic workloads available. Thus, the number of
[IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS: this article has been accepted for publication in a future issue of this journal, but has not been fully edited; content may change prior to final publication.]
file chunks initially hosted by a node in our simulations follows
the geometric distribution, enabling stress tests as suggested
in [34] for various load rebalancing algorithms. Fig. 3 shows
the cumulative distribution functions (CDFs) of the file chunks in
the simulations, where workloads A, B, C, and D represent four
distinct geometric distributions. Specifically, these distributions
indicate that a small number of nodes initially possess a
large number of chunks. The four workloads exhibit different
variations of the geometric distribution.
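The skewed initial placement described above can be synthesized roughly as follows (an illustrative sketch; the success probability `p` and the renormalization step are our assumptions, not the paper's exact procedure):

```python
import math
import random


def geometric(rng: random.Random, p: float) -> int:
    """Number of Bernoulli(p) trials until the first success (support 1, 2, ...)."""
    return max(1, int(math.ceil(math.log(1.0 - rng.random()) / math.log(1.0 - p))))


def synthesize_workload(n: int, m: int, p: float, seed: int = 0) -> list[int]:
    """Per-node initial chunk counts: geometric draws rescaled to sum to m.

    Smaller p gives a heavier tail, i.e., a few nodes initially
    possess a large number of chunks (the stress-test scenario).
    """
    rng = random.Random(seed)
    raw = [geometric(rng, p) for _ in range(n)]
    total = sum(raw)
    counts = [r * m // total for r in raw]
    # Hand out the rounding remainder one chunk at a time.
    for i in range(m - sum(counts)):
        counts[i % n] += 1
    return counts
```

Varying `p` across runs would yield different variants of the geometric distribution, analogous to workloads A through D.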
SUMMARY
This paper has presented a novel load-balancing algorithm for
the load rebalancing problem in large-scale, dynamic, distributed
file systems in clouds. Our proposal
strives to balance the loads of nodes and reduce the demanded
movement cost as much as possible, while taking advantage
of physical network locality and node heterogeneity. In the
absence of representative real workloads (i.e., the distributions
of file chunks in a large-scale storage system) in the public
domain, we have investigated the performance of our proposal
and compared it against competing algorithms through synthesized
probabilistic distributions of file chunks.

[Figure: The experimental environment and performance results, where (a) shows the setup of the experimental environment, (b) indicates the time elapsed in performing the HDFS load balancer and our proposal, and (c) and (d) show the distributions of file chunks for the HDFS load balancer and our proposal, respectively.]

The synthesized workloads stress test the load balancing algorithms by creating
a few storage nodes that are heavily loaded. The computer
simulation results are encouraging, indicating that our proposed
algorithm performs very well. Our proposal is comparable
to the centralized algorithm in the Hadoop HDFS production
system and dramatically outperforms the competing distributed
algorithm in [33] in terms of load imbalance factor, movement
cost, and algorithmic overhead. In particular, our load balancing
algorithm converges quickly. The efficiency and
effectiveness of our design are further validated by analytical
models and a real implementation with a small-scale cluster
environment.
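For concreteness, the two placement metrics named above can be computed as follows. These exact definitions (max-to-mean load ratio for the imbalance factor, total chunks migrated for the movement cost) are common conventions we assume for illustration; the paper may define them differently:

```python
def load_imbalance_factor(loads: list[int]) -> float:
    """Heaviest node's load divided by the mean load (1.0 = perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def movement_cost(before: list[int], after: list[int]) -> int:
    """Total chunks migrated: half the L1 distance between the two
    placements, assuming the total number of chunks is unchanged."""
    return sum(abs(a - b) for a, b in zip(before, after)) // 2
```

A rebalancer thus trades off driving the imbalance factor toward 1.0 against keeping the movement cost (network traffic) low.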