06-09-2012, 03:54 PM
Comparing High Level MapReduce Query Languages
Abstract.
The MapReduce parallel computational model is of increasing
importance. A number of High Level Query Languages (HLQLs) have
been constructed on top of the Hadoop MapReduce realization, primarily
Pig, Hive, and JAQL. This paper makes a systematic performance
comparison of these three HLQLs, focusing on scale up, scale out and
runtime metrics. We further make a language comparison of the HLQLs
focusing on conciseness and computational power. The HLQL development
communities are engaged in the study, which has revealed the technical
bottlenecks and limitations described in this document and is influencing
their development.
Introduction
The MapReduce model proposed by Google [8] has become a key data processing
model, with a number of realizations including the open source Hadoop [3]
implementation. A number of HLQLs have been constructed on top of Hadoop
to provide more abstract query facilities than using the low-level Hadoop Java
based API directly. Pig [18], Hive [24], and JAQL [2] are all important HLQLs.
This paper makes a systematic investigation of the HLQLs. Specifically, we
investigate whether the HLQLs are indeed more abstract: that is, how much
shorter are the queries in each HLQL compared with direct use of the API?
What performance penalty do the HLQLs pay to provide more abstract queries?
How expressive are the HLQLs - are they relationally complete, SQL equivalent,
or even Turing complete? More precisely, the paper makes the following research
contributions with respect to Pig, Hive, and JAQL.
Hadoop
Hadoop [3] is an Apache open source MR implementation, which is well suited
for use in large data warehouses, and indeed has gained traction in industrial
datacentres at Yahoo, Facebook and IBM. The Hadoop software stack is packaged
with a set of complementary services and higher-level abstractions over
MR. The core elements of Hadoop, however, are MapReduce, the distributed
data processing model and execution environment, and the Hadoop Distributed
Filesystem (HDFS), a distributed filesystem that runs on large clusters.
HDFS provides high-throughput access to application data and is suitable for
applications that have large data sets.
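To make the MapReduce data processing model concrete, the following is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the names run_job, word_mapper and sum_reducer are hypothetical, chosen only for illustration. It shows the three stages the model relies on: a map over input records, a grouping of intermediate values by key, and a reduce over each group.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce job (illustrative only)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    # Sum the counts emitted for one word.
    return sum(values)


counts = run_job(["a b a", "b c"], word_mapper, sum_reducer)
```

In a real Hadoop deployment the map and reduce calls run on different machines and the input is read from HDFS; here everything runs in one process so that the data flow is easy to follow.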
High Level Query Languages
Justifications for higher level query languages over the MR paradigm are presented
in [15], which outlines MR's lack of support for complex
N-step dataflows that often arise in real-world data analysis scenarios. In addition,
MR provides no explicit support for multiple data sources. A
number of HLQLs have been developed on top of Hadoop, and we review Pig
[18], Hive [24], and JAQL [2] in comparison with raw MapReduce. Their relationship
to Hadoop is depicted in Figure 1. Programs written in these languages
are compiled into a sequence of MapReduce jobs, which are executed in the Hadoop
MapReduce environment.
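The idea of an N-step dataflow expressed as a sequence of MapReduce jobs can be sketched as follows. This is an illustrative Python model only, under the assumption that each job is a (mapper, reducer) pair and that the output of one job is the input of the next. It does not reflect how the Pig, Hive or JAQL compilers actually plan their jobs, and run_job, run_pipeline, count_job and invert_job are hypothetical names.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory MapReduce job and return a list of (key, result) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(key, values)) for key, values in groups.items()]


def run_pipeline(records, jobs):
    """Feed the output of each job into the next, as in an N-step dataflow."""
    for mapper, reducer in jobs:
        records = run_job(records, mapper, reducer)
    return records


# Step 1: count how often each word occurs.
count_job = (
    lambda line: [(word, 1) for word in line.split()],
    lambda key, values: sum(values),
)

# Step 2: group the words by their count.
invert_job = (
    lambda pair: [(pair[1], pair[0])],
    lambda key, values: sorted(values),
)

result = dict(run_pipeline(["a b a", "b c"], [count_job, invert_job]))
```

Written directly against a low-level API, each step would be a separate job with its own driver code; an HLQL lets the whole dataflow be stated as one query and leaves the job sequencing to the compiler.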
HLQL Comparison
Language Design. The language design motivations are reflected by the contrasting
features of each high level query language. Hive provides Hive QL, a
declarative, SQL-like language (Listing 1.3). Pig, by comparison,
provides Pig Latin (Listing 1.2), a dataflow language influenced by both
the declarative style of SQL (it includes SQL-like functions) and the more
procedural MR (Listing 1.1). Finally, JAQL is a functional, higher-order programming
language, in which functions may be assigned to variables and evaluated
later (Listing 1.4). In contrast, Pig and Hive are strictly evaluated during
the compilation process, to identify type errors prior to runtime.
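The higher-order style attributed to JAQL, where functions are assigned to variables and evaluated later, can be illustrated with a short sketch. It is written in Python rather than JAQL, so it shows the general idea only and says nothing about JAQL syntax or semantics; make_filter, compose, keep_even, double and query are hypothetical names.

```python
def make_filter(predicate):
    """Return a function that keeps only the records satisfying the predicate."""
    def apply(records):
        return [record for record in records if predicate(record)]
    return apply


def compose(*steps):
    """Combine several record-transforming functions into one query function."""
    def pipeline(records):
        for step in steps:
            records = step(records)
        return records
    return pipeline


# Functions are ordinary values: they are bound to variables here...
keep_even = make_filter(lambda n: n % 2 == 0)
double = lambda records: [n * 2 for n in records]
query = compose(keep_even, double)  # no data has been processed yet

# ...and only evaluated later, when the query is applied to data.
result = query([1, 2, 3, 4])
```

Because the query is itself a value, it can be passed to other functions, stored, or extended with further steps before any data is read.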