13-11-2012, 03:50 PM
BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing
BAR.pdf (Size: 662.61 KB / Downloads: 27)
INTRODUCTION
In recent years, large-scale data processing has emerged
as an important part of state-of-the-art internet applications
such as search engines, online map services and social networking services.
These applications not only handle vast amounts
of data, but also generate a large quantity of data every day.
Cloud computing systems like MapReduce[1], Hadoop[2] and
Dryad[3], which are based on simplified parallel programming
models, have been designed for data-intensive applications.
For example, Facebook's Hadoop data warehouse stores more
than 15 PB of data (2.5 PB after compression); on a single day,
more than 10,000 jobs are submitted to process a large amount
of data, and meanwhile more than 60 TB of new data (10 TB after
compression) is loaded[4].
RELATED WORK
A. Data-aware scheduling on distributed systems
Over the past decade, data-intensive applications have
emerged as an important part of distributed computing. Meanwhile,
considerable work has been done on data-aware scheduling
on distributed systems. Stork[12] is a specialized scheduler
for data placement and data movement in Grid environments. The main
idea of Stork is to place data close to computational resources.
Though Stork can be coupled with a computational task
scheduler, no attempt is made to exploit data locality to reduce
data transfer cost. The Gfarm[13] architecture is designed for
petascale data-intensive computing. Its model specifically
targets applications where the data primarily consist of a set
of records or objects that are analysed independently. In
Gfarm, several greedy scheduling algorithms are implemented
to improve data locality. However, these algorithms do not
account for the global optimization of all tasks.
PROBLEM FORMALIZATION
In this section, the system model is formalized. We consider
scheduling a set of independent tasks on a homogeneous
platform. As shown in Fig. 1, there are m (m = 7) tasks
and n (n = 3) servers, where each task processes an input
block on a server. On one hand, since input blocks are of fixed size,
we assume that data-local tasks incur an identical, constant local
cost. On the other hand, since a larger number of remote tasks
causes higher network contention, the remote cost increases
as the number of remote tasks grows. A job is not
completed until all of its tasks are finished. In addition, we take
cluster workload into account: at the start time, if most servers
are idle, the cluster is underloaded; in an overloaded cluster,
many servers cannot become idle in a short time. Based on these
assumptions, our goal is to find an allocation strategy that
minimizes the job completion time.
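The cost model above can be illustrated with a minimal sketch. This is not the paper's actual formulation: the constant LOCAL_COST, the linear remote_cost function, and the per-server summation are illustrative assumptions chosen only to show how local cost, contention-dependent remote cost, and the completion-time objective fit together.

```python
# Illustrative sketch of the completion-time model (assumed parameters,
# not the paper's actual cost values or remote-cost function).

LOCAL_COST = 1.0  # constant cost of a data-local task (fixed-size blocks)

def remote_cost(num_remote):
    # Remote cost grows with the total number of remote tasks,
    # modeling increased network contention (assumed linear here).
    return 1.5 + 0.5 * num_remote

def completion_time(allocation, data_location):
    """Compute the job completion time for a given allocation.

    allocation: dict mapping task -> server it runs on.
    data_location: dict mapping task -> server holding its input block.
    The job completes only when every server has finished its tasks,
    so the completion time is the maximum per-server load.
    """
    remote_tasks = [t for t, s in allocation.items()
                    if data_location[t] != s]
    per_server = {}
    for task, server in allocation.items():
        if data_location[task] == server:
            cost = LOCAL_COST            # data-local task
        else:
            cost = remote_cost(len(remote_tasks))  # remote task
        per_server[server] = per_server.get(server, 0.0) + cost
    return max(per_server.values())
```

For example, running two data-local tasks on one server gives a completion time of 2 * LOCAL_COST, while moving one of them to a remote server trades queueing delay for a higher per-task remote cost; an allocation strategy such as BAR would search over such trade-offs to minimize the maximum.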