13-08-2012, 03:39 PM
A Platform for Scalable One-Pass Analytics using MapReduce
ABSTRACT
Today’s one-pass analytics applications tend to be data-intensive in
nature and require the ability to process high volumes of data efficiently.
MapReduce is a popular programming model for processing
large datasets using a cluster of machines. However, the traditional
MapReduce model is not well-suited for one-pass analytics, since
it is geared towards batch processing and requires the data set to
be fully loaded into the cluster before running analytical queries.
This paper examines, from a systems standpoint, what architectural
design changes are necessary to bring the benefits of the MapReduce
model to incremental one-pass analytics. Our empirical and
theoretical analyses of Hadoop-based MapReduce systems show
that the widely-used sort-merge implementation for partitioning
and parallel processing poses a fundamental barrier to incremental
one-pass analytics, despite various optimizations. To address these
limitations, we propose a new data analysis platform that employs
hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space.
INTRODUCTION
Today, real-time analytics on large, continuously-updated datasets
has become essential to meet many enterprise business needs. Like
traditional warehouse applications, real-time analytics using incremental
one-pass processing tends to be data-intensive in nature and
requires the ability to collect and analyze enormous datasets efficiently.
At the same time, MapReduce has emerged as a popular
model for parallel processing of large datasets using a commodity
cluster of machines. The key benefits of this model are that it harnesses
compute and I/O parallelism on commodity hardware and can
easily scale as the datasets grow in size. However, the MapReduce
model is not well-suited for incremental one-pass analytics since it is
primarily designed for batch processing of queries on large datasets.
Furthermore, MapReduce implementations require the entire data
set to be loaded into the cluster before running analytical queries,
thereby incurring long latencies and making them unsuitable for
producing incremental results.
Common MapReduce Implementations
Hadoop. We first consider Hadoop, the most popular open-source
implementation of MapReduce. Hadoop uses block-level scheduling
and a sort-merge technique [21] to implement the group-by
functionality for parallel processing (Google’s MapReduce system
is reported to use a similar implementation [7], but further details
are lacking due to the use of proprietary code).
The Hadoop Distributed File System (HDFS) handles the reading
of job input data and writing of job output data. The unit of data storage in HDFS is a block, 64 MB by default and configurable to other values. These blocks serve as the task granularity for MapReduce jobs.
Given a query job, several map tasks (mappers) and reduce tasks
(reducers) are started to run concurrently on each node. As Fig. 1
shows, each mapper reads a chunk of input data, applies the map function to extract (key, value) pairs, then assigns these pairs to partitions that correspond to different reducers, and finally sorts the pairs in each partition by the key. Hadoop currently performs a single sort on the compound (partition, key) to achieve both partitioning and sorting within each partition. Given the relatively small block size, a
properly-tuned buffer will allow such sorting to complete in memory.
Then the sorted map output is written to disk for fault tolerance. A
mapper completes after the write finishes.
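The map-side behavior described above can be sketched as follows. This is a minimal illustrative model, not Hadoop's actual Java implementation; the helper names (run_mapper, word_count_map) are our own, and we use Python's built-in hash in place of Hadoop's partitioner.

```python
def run_mapper(input_chunk, map_fn, num_reducers):
    """Apply map_fn to each record, assign each (key, value) pair to a
    reducer partition, then sort the buffer on the compound
    (partition, key) -- one sort yields both partitioning and
    per-partition key order, as Hadoop does."""
    buffer = []
    for record in input_chunk:
        for key, value in map_fn(record):
            partition = hash(key) % num_reducers  # reducer assignment
            buffer.append((partition, key, value))
    # Single sort on (partition, key): groups pairs by destination
    # reducer and orders each partition's pairs by key.
    buffer.sort(key=lambda t: (t[0], t[1]))
    return buffer  # in Hadoop, this sorted output is written to disk

# Example: word-count style map function
def word_count_map(line):
    for word in line.split():
        yield word, 1

out = run_mapper(["b a", "a c"], word_count_map, num_reducers=2)
```

After the sort, every pair destined for reducer 0 precedes every pair destined for reducer 1, and within each run the keys are in order, which is exactly what the sort-merge shuffle relies on.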
Summary of Benchmarking Results
The requirements for scalable streaming analytics—incremental
processing and fast in-memory processing whenever possible—require
the MapReduce program of a query to be non-blocking and have
low CPU and I/O overheads. In our recent benchmarking study, we
examined whether current MapReduce systems meet these requirements.
We considered applications such as click stream analysis and
web document analysis in our benchmark. Due to space constraints,
we mainly report results on click stream analysis in this section.
Given a click stream, an important task is sessionization, which reorders page clicks into individual user sessions. In the corresponding MapReduce program, the map function extracts the user id from each click, and the clicks are then grouped by user id.
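A sessionization job of this shape might look like the sketch below. The 30-minute inactivity gap used to cut sessions is an illustrative assumption, not a detail from the benchmark.

```python
SESSION_GAP = 30 * 60  # assumed session timeout, in seconds

def session_map(click):
    # click = (user_id, timestamp, url); the map function simply
    # emits the user id as the key, so clicks group per user
    user_id, ts, url = click
    return user_id, (ts, url)

def session_reduce(user_id, clicks):
    # clicks arrive grouped by user; order them by time and start a
    # new session whenever the gap between consecutive clicks
    # exceeds SESSION_GAP
    sessions, current = [], []
    for ts, url in sorted(clicks):
        if current and ts - current[-1][0] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return user_id, sessions

# The third click arrives 3900 s after the second, past the gap,
# so it opens a second session.
uid, sessions = session_reduce("u1", [(0, "/a"), (100, "/b"), (4000, "/c")])
```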
An Analytical Model for Hadoop
The Hadoop system has a large number of parameters. While our
previous experiments used the default settings, we examine these
parameters more carefully in this study. After a nearly year-long
effort to experiment with Hadoop, we identified several parameters
that impact performance from the standpoint of incremental one-pass analytics, which are listed in Part (1) of Table 2. Our analysis
below focuses on the effects of these parameters on I/O and startup
costs. We do not aim to model the actual running time because it
depends on numerous factors such as the actual server configuration,
how map and reduce tasks are interleaved, how CPU and I/O
operations are interleaved, and even how simultaneous I/O requests
are served. Once we optimize these parameters based on our model,
we will evaluate performance empirically using the actual running
time and the progress with respect to incremental processing.
A NEW HASH-BASED PLATFORM
Based on the insights from our experimental and analytical evaluation
of current MapReduce systems, we next propose a new data
analysis platform that transforms MapReduce computation into incremental
one-pass processing. Our first mechanism replaces the
widely used sort-merge implementation for partitioning and parallel
processing with a purely hash-based framework to minimize
computational and I/O bottlenecks as well as blocking. Two hash
techniques, designed for different types of reduce functions, are
described in §4.1 and §4.2, respectively. These techniques enable
fast in-memory processing when there is sufficient memory for the
current workload. Our second mechanism further brings the benefits
of fast in-memory processing to workloads that require a large
key-state space that far exceeds available memory. Our technique
efficiently identifies popular keys and updates their states using a
full in-memory processing path. This mechanism is detailed in §4.3.
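The core idea of the hash-based framework, as opposed to sort-merge, can be sketched in a few lines: reducer states live in an in-memory hash table and are updated incrementally as pairs arrive, so no blocking sort phase is needed. This is only a toy model under the assumption that all states fit in memory; the actual techniques of §4.1 and §4.2 additionally handle spilling and arbitrary reduce functions.

```python
def hash_group_by(pairs, init, update):
    """Group-by via hashing: each incoming (key, value) pair updates
    the key's state immediately, enabling incremental, non-blocking
    processing."""
    states = {}
    for key, value in pairs:
        if key not in states:
            states[key] = init()  # create state on first sight of key
        states[key] = update(states[key], value)  # incremental update
    return states

# Example: incremental per-key sum
counts = hash_group_by([("a", 1), ("b", 2), ("a", 3)],
                       init=lambda: 0,
                       update=lambda s, v: s + v)
# counts == {"a": 4, "b": 2}
```

Because each pair is folded into its key's state on arrival, partial answers are available at any point during the run, which is exactly what incremental one-pass analytics requires.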
A Dynamic Incremental Hash Technique
Our last technique is an extension of the incremental hash approach
where we dynamically determine which keys should be
processed in memory and which keys will be written to disk for
subsequent processing. The basic idea behind the new technique
is to recognize hot keys that appear frequently in the data set and
hold their states in memory, hence providing incremental in-memory
processing for these keys. The benefits of doing so are two-fold.
First, prioritizing these keys leads to greater I/O efficiency since
in-memory processing of data items of hot keys can greatly decrease
the volume of data that needs to be first written to disks and then
read back to complete the processing. Second, the answers for the hot keys are often more important to the user than those for colder keys. This technique therefore lets the user terminate the processing before data is read back from disk, provided the keys held in memory cover a sufficiently large fraction of the data.
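The hot-key idea above can be illustrated with a toy sketch: keys observed frequently keep their states in a bounded in-memory table, while pairs for cold keys are deferred for later processing. The capacity bound, the exact counts, and the eviction rule here are illustrative assumptions; they stand in for, but are not, the paper's frequent-key detection technique.

```python
from collections import Counter

def hot_key_process(pairs, capacity, init, update):
    freq = Counter()
    hot_states = {}  # in-memory states for hot keys (bounded size)
    spilled = []     # cold-key data deferred to disk (list stands in
                     # for a spill file here)
    for key, value in pairs:
        freq[key] += 1
        if key in hot_states:
            hot_states[key] = update(hot_states[key], value)
        elif len(hot_states) < capacity:
            hot_states[key] = update(init(), value)
        else:
            # evict the coldest resident key only if the new key is
            # strictly hotter; otherwise defer the pair itself
            coldest = min(hot_states, key=lambda k: freq[k])
            if freq[key] > freq[coldest]:
                spilled.append(("state", coldest, hot_states.pop(coldest)))
                hot_states[key] = update(init(), value)
            else:
                spilled.append(("pair", key, value))
    return hot_states, spilled

# "a" is hot and stays in memory; the lone "b" pair is deferred
hot, spilled = hot_key_process([("a", 1), ("b", 1), ("a", 1), ("a", 1)],
                               capacity=1,
                               init=lambda: 0,
                               update=lambda s, v: s + v)
```

With this policy, the hot key's answer is complete as soon as the input ends, before any deferred data is read back, matching the early-termination benefit described above.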
RELATED WORK
Query Processing using MapReduce [5, 11, 16, 17, 19, 23] has
been a research topic of significant interest lately. To the best of our
knowledge, none of these systems support incremental one-pass analytics
as defined in our work. The closest work to ours is MapReduce
Online [6] which we discussed in detail in Sections 2 and 3. Dryad
[23] uses in-memory hashing to implement MapReduce group-by
but falls back on the sort-merge implementation when the data size
exceeds memory. Merge Reduce Merge [22] implements hash join
using a technique similar to our baseline MR-hash, but lacks further
implementation details. Several other projects run in parallel to ours: the work in [2] focuses on optimizing Hadoop parameters,
and ParaTimer [13] aims to provide an indicator of remaining time
of MapReduce jobs. Neither of them improves MapReduce for incremental
computation. Finally, many of the above systems support
concurrent MapReduce jobs to increase system resource utilization.
However, the resources consumed by each task will not reduce, and
concurrency does not help achieve one-pass incremental processing.
CONCLUSIONS
In this paper, we examined the architectural design changes that
are necessary to bring the benefits of the MapReduce model to
incremental one-pass analytics. Our empirical and theoretical analyses
showed that the widely-used sort-merge implementation for
MapReduce partitioned parallelism poses a fundamental barrier to
incremental one-pass analytics, despite optimizations. We proposed
a new data analysis platform that employs a purely hash-based framework,
with various techniques to enable incremental processing and
fast in-memory processing for frequent keys. Evaluation of our
Hadoop-based prototype showed that it can significantly improve
the progress of map tasks, allows the reduce progress to keep up with the map progress with up to a three-order-of-magnitude reduction in internal data spills, and enables results to be returned early. In future
work, we will extend our one-pass analytics platform to support
a wider range of incremental computation tasks with minimized
I/O, online aggregation with early approximate answers, and stream
query processing with window operations.