07-07-2014, 10:47 AM
Load Rebalancing
for Distributed File Systems in Clouds
Load Rebalancing.pdf (Size: 640.48 KB / Downloads: 111)
Abstract
—Distributed file systems are key building blocks for
cloud computing applications based on the MapReduce programming
paradigm. In such file systems, nodes simultaneously serve
computing and storage functions; a file is partitioned into a number
of chunks allocated in distinct nodes so that MapReduce tasks can
be performed in parallel over the nodes. However, in a cloud
computing environment, failure is the norm, and nodes may be
upgraded, replaced, and added in the system. Files can also be
dynamically created, deleted, and appended. This results in load
imbalance in a distributed file system; that is, the file chunks are
not distributed as uniformly as possible among the nodes. Emerging
distributed file systems in production systems strongly depend
on a central node for chunk reallocation. This dependence is clearly
inadequate in a large-scale, failure-prone environment because the
central load balancer is subjected to a workload that scales
linearly with the system size, and may thus become the
performance bottleneck and the single point of failure. In this
paper, a fully distributed load rebalancing algorithm is presented to
cope with the load imbalance problem. Our algorithm is compared
against a centralized approach in a production system and a
competing distributed solution presented in the literature. The
simulation results indicate that our proposal is comparable with
the existing centralized approach and considerably outperforms
the prior distributed algorithm in terms of load imbalance factor,
movement cost, and algorithmic overhead. The performance of
our proposal implemented in the Hadoop distributed file system
is further investigated in a cluster environment.
Index Terms—Load balance, Distributed file systems, Clouds
I. INTRODUCTION
Cloud computing (or cloud for short) is a compelling technology.
In clouds, clients can dynamically allocate resources on demand
without sophisticated deployment and management overhead. Key
enabling technologies for clouds include the MapReduce programming
paradigm [1], distributed file systems (e.g., [3], [4]), virtualization
(e.g., [4], [5]), and so forth. These techniques emphasize
scalability, so clouds (e.g., [6]) can be large in scale, and their
constituent entities can arbitrarily fail and join while the system
maintains reliability.
Distributed file systems are key building blocks for cloud
computing applications based on the MapReduce programming
paradigm. In such file systems, nodes simultaneously serve
computing and storage functions; a file is partitioned into a
number of chunks allocated in distinct nodes so that MapReduce
tasks can be performed in parallel over the nodes. For example,
consider a wordcount application that counts the number of
distinct words and the frequency of each unique word in a
large file. In such an application, a cloud partitions the file
into a large number of disjoint, fixed-size pieces (or file
chunks) and assigns them to different cloud storage nodes
(i.e., chunkservers). Each storage node (or node for short) then
calculates the frequency of each unique word by scanning and
parsing its local file chunks.

†Corresponding author. Department of Computer Science and Information
Engineering, National Cheng Kung University, Tainan 701, Taiwan,
hchsiao@csie.ncku.edu.tw.
¶Department of Computer Science and Information Engineering, National
Cheng Kung University, Tainan 701, Taiwan, p7697138[at]mail.ncku.edu.tw.
‡Department of Electrical and Computer Engineering, Clemson University,
Clemson, SC 29634, USA, shenh[at]clemson.edu.
§Digital Home Systems Technology Department, Home Networking Technology
Center, Industrial Technology Research Institute South, Tainan 709, Taiwan,
ycchao[at]itri.org.tw.
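As a minimal sketch of this pattern (our illustration, not the paper's code; `count_chunk` and `wordcount` are hypothetical names), the chunk-parallel wordcount is a map over chunks followed by a merge:

```python
from collections import Counter

def count_chunk(chunk_text):
    # Map phase: each storage node scans and parses its local chunk,
    # counting the words it contains.
    return Counter(chunk_text.split())

def wordcount(chunks):
    # Reduce phase: merge the per-chunk partial counts. In a real cloud,
    # the map calls run in parallel on the chunkservers holding the chunks.
    total = Counter()
    for partial in map(count_chunk, chunks):
        total += partial
    return total
```

Because each partial count depends only on a node's local chunks, the work parallelizes across however many nodes hold chunks of the file.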
In such a distributed file system, the load of a node is
typically proportional to the number of file chunks the node
possesses [4]. Because the files in a cloud can be arbitrarily
created, deleted, and appended, and nodes can be upgraded,
replaced and added in the file system [7], the file chunks are
not distributed as uniformly as possible among the nodes. Load
balance among storage nodes is a critical function in clouds. In
a load-balanced cloud, the resources can be well utilized and
provisioned, maximizing the performance of MapReduce-based
applications.
State-of-the-art distributed file systems (e.g., Google GFS [3]
and Hadoop HDFS [4]) in clouds rely on central nodes to manage
the metadata information of the file systems and to balance
the loads of storage nodes based on that metadata. The centralized
approach simplifies the design and implementation of
a distributed file system. However, recent experience (e.g., [8])
concludes that when the number of storage nodes, the number
of files and the number of accesses to files increase linearly,
the central nodes (e.g., the master in Google GFS) become a
performance bottleneck, as they are unable to accommodate a
large number of file accesses due to clients and MapReduce
applications. Thus, depending on the central nodes to tackle the
load imbalance problem exacerbates their heavy loads. Even with
the latest development in distributed file systems, the central
nodes may still be overloaded. For example, HDFS federation
[15] suggests an architecture with multiple namenodes (i.e.,
the nodes managing the metadata information). Its file system
namespace is statically and manually partitioned to a number
of namenodes. However, as the workload experienced by the
namenodes may change over time and no adaptive workload
consolidation and/or migration scheme is offered to balance the
loads among the namenodes, any of the namenodes may become
the performance bottleneck.
In this paper, we are interested in studying the load rebalancing
problem in distributed file systems specialized for
large-scale, dynamic and data-intensive clouds. (The terms
“rebalance” and “balance” are interchangeable in this paper.)
[Digital Object Identifier 10.1109/TPDS.2012.196. 1045-9219/12/$31.00 © 2012 IEEE. This article has been accepted for publication in a future issue of IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, but has not been fully edited. Content may change prior to final publication.]

Such a large-scale cloud has hundreds or thousands of nodes
(and may reach tens of thousands in the future). Our objective
is to allocate the chunks of files as uniformly as possible among
the nodes such that no node manages an excessive number
of chunks. Additionally, we aim to reduce network traffic (or
movement cost) caused by rebalancing the loads of nodes as
much as possible to maximize the network bandwidth available
to normal applications. Moreover, as failure is the norm, nodes
are newly added to sustain the overall system performance [3],
[4], resulting in heterogeneity among nodes. Exploiting capable
nodes to improve the system performance is thus desirable.
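For intuition about this objective, a centralized greedy sketch (our illustration under simplifying assumptions; the paper's actual algorithm is distributed and `greedy_rebalance` is a hypothetical name) moves chunks from overloaded to underloaded nodes, weighting each node's ideal share by its capacity and counting every migrated chunk as one unit of movement cost:

```python
def greedy_rebalance(loads, capacities=None):
    """Sketch only: `loads` is the chunk count per node; a node's ideal
    share is proportional to its capacity, so capable nodes host more."""
    n = len(loads)
    capacities = capacities or [1] * n
    total, cap_sum = sum(loads), sum(capacities)
    ideal = [total * c / cap_sum for c in capacities]
    new, moves = list(loads), 0
    for h in range(n):                      # nodes above their ideal share
        for l in range(n):                  # nodes below their ideal share
            while new[h] > ideal[h] and new[l] < ideal[l]:
                new[h] -= 1                 # migrate one chunk h -> l
                new[l] += 1
                moves += 1                  # one unit of movement cost
    return new, moves
```

A node with twice the capacity ends up hosting twice the chunks; the challenge the paper addresses is reaching a comparable allocation without any node holding the global `loads` vector.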
Specifically, in this study we suggest offloading the load rebalancing
task to storage nodes by having the storage nodes balance
their loads spontaneously. This eliminates the dependence
on central nodes. The storage nodes are structured as a network
based on distributed hash tables (DHTs), e.g., [18], [19], [22];
discovering a file chunk can simply refer to rapid key lookup
in DHTs, given that a unique handle (or identifier) is assigned
to each file chunk. DHTs enable nodes to self-organize and
self-repair while constantly offering lookup functionality under
node dynamism, simplifying system provisioning and management.
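A minimal consistent-hashing sketch (an assumption of ours for illustration; the paper builds on DHTs such as those in [18], [19], [22], and `ChunkRing` is a hypothetical name) shows how a unique chunk handle resolves to a node by a key lookup on the identifier ring:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key):
    # Hash a string identifier onto the DHT identifier ring.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ChunkRing:
    def __init__(self, nodes):
        # Each node's ring identifier is its hashed name, kept sorted.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, chunk_handle):
        # A chunk lives at the first node clockwise from its hashed handle.
        keys = [h for h, _ in self.ring]
        i = bisect_right(keys, ring_hash(chunk_handle)) % len(self.ring)
        return self.ring[i][1]
```

When a node joins or leaves, only the chunks between adjacent ring identifiers need to be remapped, which is what lets nodes self-organize without a central directory.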
In summary, our contributions are threefold as follows:
• By leveraging DHTs, we present a load rebalancing algorithm
for distributing file chunks as uniformly as possible
and minimizing the movement cost as much as
possible. Particularly, our proposed algorithm operates in
a distributed manner in which nodes perform their load
balancing tasks independently without synchronization or
global knowledge regarding the system.
• Load balancing algorithms based on DHTs have been
extensively studied (e.g., [27]–[30], [33]–[38]). However,
most existing solutions are designed without simultaneously
considering movement cost and node heterogeneity and may
introduce significant maintenance network traffic to the
DHTs. In contrast, our proposal not only takes advantage
of physical network locality in the reallocation of file
chunks to reduce the movement cost but also exploits
capable nodes to improve the overall system performance.
Additionally, our algorithm reduces algorithmic overhead
introduced to the DHTs as much as possible.
• Our proposal is assessed through computer simulations.
The simulation results indicate that although each node
performs our load rebalancing algorithm independently
without acquiring global knowledge, our proposal is comparable
with the centralized approach in Hadoop HDFS [4]
and remarkably outperforms the competing distributed
algorithm in [33] in terms of load imbalance factor,
movement cost, and algorithmic overhead. Additionally,
our load balancing algorithm exhibits a fast convergence
rate. We derive analytical models to validate the efficiency
and effectiveness of our design. Moreover, we have implemented
our load balancing algorithm in HDFS and investigated its
performance in a cluster environment.
The remainder of the paper is organized as follows. The load
rebalancing problem is formally specified in Section II. Our load
balancing algorithm is presented in Section III. We evaluate
our proposal through computer simulations and discuss the
simulation results in Section IV. In Section V, the performance
of our proposal is further investigated in a cluster environment.
Our study is summarized in Section VI. Due to space limitations,
we defer the extensive discussion of related works to the
appendix.
IMPLEMENTATION AND MEASUREMENT
A. Experimental Environment Setup
We have implemented our proposal in Hadoop HDFS 0.21.0,
and assessed our implementation against the load balancer in
HDFS. Our implementation is demonstrated through a small-scale
cluster environment (Fig. 11(a)) consisting of a single,
dedicated namenode and 25 datanodes, each running Ubuntu
10.10 [34]. Specifically, the namenode is equipped with an Intel
Core 2 Duo E7400 processor and 3 Gbytes of RAM. As the
number of file chunks in our experimental environment is small,
the RAM size of the namenode is sufficient to cache the entire
namenode process and the metadata information, including the
directories and the locations of file chunks.
In the experimental environment, a number of clients are
established to issue requests to the namenode. The requests include
commands to create directories with randomly designated
names, to remove arbitrarily chosen directories, etc. Due to the
scarce resources in our environment, we have deployed 4 clients
to generate requests to the namenode. However, this cannot
overload the namenode enough to mimic the situation reported
in [8]. To emulate the load of the namenode in a production
system and investigate the effect of the namenode's load on
the performance of a load balancing algorithm, we additionally
limit the processor cycles available to the namenode by varying
the maximum processor utilization, denoted by M, available to
the namenode over M = 1%, 2%, 8%, 16%, 32%, 64%, 99%.
The lower the processor availability to the namenode, the fewer
CPU cycles the namenode can allocate to handle the clients'
requests and to talk to the load balancer.
As recently proposed data center networks (e.g., [29]) can
offer full bisection bandwidth, the total number of chunks
scattered in the file system in our experiments is limited to 256
such that the network bandwidth in our environment (i.e., all
nodes are connected with a 100 Mbps fast Ethernet switch) is
not the performance bottleneck. Particularly, the size of a file
chunk in the experiments is set to 16 Mbytes. Compared to
each experimental run requiring 20 ∼ 60 minutes, transferring
these chunks takes no more than (16 × 256 × 8)/100 ≈ 328
seconds ≈ 5.5 minutes in case the network bandwidth is fully
utilized. The initial placement of the 256 file chunks follows the
exponential distribution discussed in Section IV.
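The transfer-time bound follows directly from the stated parameters (16-Mbyte chunks, 256 chunks, one 100 Mbps link):

```python
chunk_mbytes, num_chunks, link_mbps = 16, 256, 100
megabits = chunk_mbytes * num_chunks * 8   # total data volume in megabits
seconds = megabits / link_mbps             # at full link utilization
print(seconds, round(seconds / 60, 1))     # about 328 seconds, 5.5 minutes
```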
For each experimental run, we quantify the time elapsed to
complete the load balancing algorithms, including the HDFS
load balancer and our proposal. We perform 20 runs for a given
M and average the time required for executing a load balancing
algorithm. Additionally, the 5- and 95-percentiles are reported.
For our proposal, we let ΔU = ΔL = 0.2. Each datanode
performs 10 random samples.
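Our reading of these parameters can be sketched as follows (a hypothetical helper; the paper defines the precise conditions): each datanode estimates the average load from its random samples and compares its own load against thresholds derived from ΔU and ΔL:

```python
def classify(own_load, sampled_loads, delta_u=0.2, delta_l=0.2):
    # Estimate the global average load from this node's random samples.
    avg = sum(sampled_loads) / len(sampled_loads)
    if own_load > (1 + delta_u) * avg:
        return "heavy"        # candidate to shed chunks
    if own_load < (1 - delta_l) * avg:
        return "light"        # candidate to take on chunks
    return "balanced"
```

With ΔU = ΔL = 0.2, a node holding more than 120% of the estimated average is treated as heavy, and one holding less than 80% as light; 10 samples per datanode keep the estimate cheap while remaining reasonably accurate.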
Note that (1) in the experimental results discussed later,
we favor HDFS by dedicating a standalone node to perform
the HDFS load balancing function. By contrast, our proposal
excludes the extra, standalone node. (2) The datanodes in our
cluster environment are homogeneous, each with Intel Celeron
430 and 3 Gbytes RAM. We thus do not study the effect of
the node heterogeneity on our proposal. (3) We also do not
investigate the effect of network locality on our proposal as the
nodes in our environment are only linked with a single switch.
B. Experimental Results
We demonstrate in Fig. 11 the experimental results. Fig. 11(b)
shows the time required for performing the HDFS load balancer
and our proposal. Our proposal clearly outperforms the HDFS
load balancer. When the namenode is heavily loaded (i.e., for
small M's), our proposal performs remarkably better than the
HDFS load balancer. For example, when M = 1%, the HDFS
load balancer takes approximately 60 minutes to balance the
loads of the datanodes, whereas our proposal takes nearly 20
minutes. Notably, unlike the HDFS load balancer, our proposal
is independent of the load of the namenode.
In Figs. 11(c) and (d), we further show the distributions of
chunks after performing the HDFS load balancer and our proposal.
As there are 256 file chunks and 25 datanodes, the ideal
number of chunks that each datanode hosts is 256/25 ≈ 10.
Due to space limitations, we only offer the experimental results
for M = 1%; the results for other values of M lead to similar
conclusions. Figs. 11(c) and (d) indicate that our proposal is
comparable to the HDFS load balancer and balances the loads
of the datanodes effectively.
Fig. 11. The experimental environment and performance results, where (a) shows the setup of the experimental environment, (b) indicates the time elapsed in performing the HDFS load balancer and our proposal, and (c) and (d) show the distributions of file chunks for the HDFS load balancer and our proposal, respectively.

SUMMARY
A novel load balancing algorithm to deal with the load
rebalancing problem in large-scale, dynamic, and distributed file
systems in clouds has been presented in this paper. Our proposal
strives to balance the loads of nodes and reduce the demanded
movement cost as much as possible, while taking advantage
of physical network locality and node heterogeneity. In the
absence of representative real workloads (i.e., the distributions
of file chunks in a large-scale storage system) in the public
domain, we have investigated the performance of our proposal
and compared it against competing algorithms through synthesized
probabilistic distributions of file chunks. The synthesized
workloads stress-test the load balancing algorithms by creating
a few storage nodes that are heavily loaded. The computer
simulation results are encouraging, indicating that our proposed
algorithm performs very well. Our proposal is comparable
to the centralized algorithm in the Hadoop HDFS production
system and dramatically outperforms the competing distributed
algorithm in [33] in terms of load imbalance factor, movement
cost, and algorithmic overhead. Particularly, our load balancing
algorithm exhibits a fast convergence rate. The efficiency and
effectiveness of our design are further validated by analytical
models and a real implementation with a small-scale cluster
environment.