05-05-2012, 12:11 PM
Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud
INTRODUCTION
Today a growing number of companies have to process
huge amounts of data in a cost-efficient manner. Classic
representatives for these companies are operators of
Internet search engines, like Google, Yahoo, or Microsoft.
The vast amount of data they have to deal with every
day has made traditional database solutions prohibitively
expensive [5]. Instead, these companies have
popularized an architectural paradigm based on a large
number of commodity servers. Problems like processing
crawled documents or regenerating a web index are split
into several independent subtasks, distributed among
the available nodes, and computed in parallel.
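The split, distribute, and compute pattern described above can be sketched in a few lines. This is only an illustration, not code from any of the companies or frameworks named here: the word-count task, the function names, and the use of local threads in place of commodity servers are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(documents):
    """Subtask: count word occurrences in one independent chunk of documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks roughly equal, independent parts."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_word_count(documents, num_workers=4):
    """Distribute chunks among workers, compute in parallel, merge the results."""
    chunks = split_into_chunks(documents, num_workers)
    total = Counter()
    # Threads stand in for separate nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(count_words, chunks):
            total += partial
    return total
```

Because the chunks share no state, each subtask could run on a different machine; only the final merge step needs to see all partial results.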
CHALLENGES AND OPPORTUNITIES
Current data processing frameworks like Google’s
MapReduce or Microsoft’s Dryad engine have been designed
for cluster environments. This is reflected in a
number of assumptions they make which are not necessarily
valid in cloud environments. In this section we
discuss how abandoning these assumptions raises new
opportunities but also challenges for efficient parallel
data processing in clouds.
Opportunities
Today’s processing frameworks typically assume the resources
they manage consist of a static set of homogeneous
compute nodes. Although designed to deal with individual
node failures, they consider the number of available
machines to be constant, especially when scheduling
the processing job’s execution. While IaaS clouds can
certainly be used to create such cluster-like setups, much
of their flexibility remains unused.
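To make the unused flexibility concrete, the following rough sketch compares holding a fixed set of machines for a whole job with allocating machines per processing phase. The phase list, the per-node-hour price, and both function names are hypothetical, and the model ignores startup delays and billing granularity.

```python
def static_cost(phases, price_per_node_hour):
    """Cost when the peak number of nodes is held for the entire job.

    phases is a list of (nodes_needed, hours) tuples (hypothetical input).
    """
    peak_nodes = max(nodes for nodes, _ in phases)
    total_hours = sum(hours for _, hours in phases)
    return peak_nodes * total_hours * price_per_node_hour


def dynamic_cost(phases, price_per_node_hour):
    """Cost when nodes are allocated and released phase by phase."""
    node_hours = sum(nodes * hours for nodes, hours in phases)
    return node_hours * price_per_node_hour
```

For a job with one hour on 10 nodes followed by three hours on 2 nodes, at a price of 1 per node-hour, the static setup costs 40 while per-phase allocation costs 16.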
Challenges
The cloud’s virtualized nature helps to enable promising
new use cases for efficient parallel data processing. However,
it also imposes new challenges compared to classic
cluster setups. The major challenge we see is the cloud’s
opaqueness with respect to exploiting data locality:
DESIGN
Based on the challenges and opportunities outlined
in the previous section, we have designed Nephele, a
new data processing framework for cloud environments.
Nephele takes up many ideas of previous processing
frameworks but refines them to better match the dynamic
and opaque nature of a cloud.