Heuristics Based Query Processing for Large RDF Graphs Using Cloud Computing
Abstract
The Semantic Web is an emerging area that aims to augment human reasoning. Various technologies are being developed in this arena
which have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework
(RDF). Semantic Web technologies can be utilized to build efficient and scalable systems for Cloud Computing. With the explosion of
semantic web technologies, large RDF graphs are commonplace. This poses significant challenges for the storage and retrieval of
RDF graphs. Current frameworks do not scale for large RDF graphs and as a result do not address these challenges. In this paper, we
describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing
paradigm. We describe a scheme to store RDF data in Hadoop Distributed File System. More than one Hadoop job (the smallest unit
of execution in Hadoop) may be needed to answer a query because a single triple pattern in a query cannot simultaneously take part
in more than one join in a single Hadoop job. To determine the jobs, we present a greedy algorithm to generate a query plan, whose worst-case
cost is bounded, to answer a SPARQL Protocol and RDF Query Language (SPARQL) query. We
use Hadoop’s MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters
built with cheap commodity class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large
amounts of RDF data, unlike traditional approaches.
Index Terms—Hadoop, RDF, SPARQL, MapReduce.
1 INTRODUCTION
Cloud computing is an emerging paradigm in the IT and
data processing communities. Enterprises utilize cloud
computing services to outsource data maintenance, which
can result in significant financial benefits. Businesses
store and access data at remote locations in the "cloud".
As the popularity of cloud computing grows, service
providers face ever-increasing challenges. They have to
maintain huge quantities of heterogeneous data while
providing efficient information retrieval. Thus the key
emphasis for cloud computing solutions is scalability
and query efficiency.
Semantic Web technologies are being developed to
present data in a standardized way such that the data
can be retrieved and understood by both humans and
machines. Historically, web pages have been published as plain
HTML files, which are not suitable for reasoning. Instead,
a machine treats these HTML files as a bag of keywords.
Researchers are developing Semantic Web technologies
that have been standardized to address such inadequacies.
The most prominent standards are Resource
Description Framework1 (RDF) and SPARQL Protocol
and RDF Query Language2 (SPARQL). RDF is the standard
for storing and representing data and SPARQL is
a query language to retrieve data from an RDF store.
Cloud Computing systems can utilize the power of these
Semantic Web technologies to provide the user with
1. http://www.w3.org/TR/rdf-primer
2. http://www.w3.org/TR/rdf-sparql-query
capability to efficiently store and retrieve data for data
intensive applications.
Semantic web technologies could be especially useful
for maintaining data in the cloud. Semantic web
technologies provide the ability to specify and query
heterogeneous data in a standardized manner. Moreover,
via OWL (Web Ontology Language) ontologies, different
schemas, classes, data types and relationships can be
specified without sacrificing the standard RDF/SPARQL
interface. Conversely, cloud computing solutions could
be of great benefit to the semantic web community.
Semantic web datasets are growing exponentially. In the
web domain, more than in any other arena, scalability
is paramount. Yet high-speed response time is also
vital in the web community.
computing paradigm offers a solution that can achieve
both of these goals.
Existing commercial tools and technologies do not
scale well in Cloud Computing settings. Researchers
have started to focus on these problems recently. They
are proposing systems built from scratch. In [39],
researchers propose an indexing scheme for a new distributed
database3 which can be used as a Cloud system.
When it comes to semantic web data such as RDF, we
are faced with similar challenges. With storage becoming
cheaper and the need to store and retrieve large amounts
of data, developing systems to handle billions of RDF
triples requiring terabytes of disk space is no longer
3. http://www.comp.nus.edu.sg/~epic/
a distant prospect. Researchers are already working
on billions of triples [30], [33]. Competitions are being
organized to encourage researchers to build efficient
repositories4. At present, there are just a few frameworks
(e.g. RDF-3X [29], Jena [7], Sesame5, BigOWLIM [22])
for Semantic Web technologies, and these frameworks
have limitations for large RDF graphs. Therefore, storing
a large number of RDF triples and efficiently querying
them is a challenging and important problem.
A distributed system can be built to overcome the scalability
and performance problems of current Semantic
Web frameworks. Databases are being distributed in order
to provide such scalable solutions. However, to date,
there is no distributed repository for storing and managing
RDF data. Researchers have only recently begun to
explore the problems and technical solutions which must
be addressed in order to build such a distributed system.
One promising line of investigation involves making
use of readily available distributed database systems
or relational databases. Such database systems can use
relational schema for the storage of RDF data. SPARQL
queries can be answered by converting them to SQL
first [9], [10], [12]. Optimal relational schemas are being
investigated for this purpose [3]. The main disadvantage with
such systems is that they are optimized for relational
data. They may not perform well for RDF data, especially
because RDF data are sets of triples6 (an ordered
tuple of three components called subject, predicate and
object respectively) which form large directed graphs. In
a SPARQL query, any number of triple patterns7 can join
on a single variable8 which makes a relational database
query plan complex. Performance and scalability will
remain a challenging issue due to the fact that these
systems are optimized for relational data schemata and
transactional database usage.
Yet another approach is to build a distributed system
for RDF from scratch. Here, there would be an opportunity
to design and optimize a system specifically
for RDF data. However, in this approach, researchers would be
reinventing the wheel.
Instead of starting with a blank slate, we propose
to build a solution with a generic distributed storage
system which utilizes a Cloud Computing platform. We
then propose to tailor the system and schema specifically
to meet the needs of semantic web data. Finally, we
propose to build a semantic web repository using such
a storage facility.
Hadoop9 is a distributed file system where files can
be saved with replication. It is an ideal candidate for
building a storage system. Hadoop features high fault
tolerance and great reliability. In addition, it also contains
an implementation of the MapReduce [13] programming
4. http://challenge.semanticweb.org
5. http://www.openrdf.org
6. http://www.w3.org/TR/rdf-concepts/#dfn-rdf-triple
7. http://www.w3.org/TR/rdf-sparql-query/#defn_TriplePattern
8. http://www.w3.org/TR/rdf-sparql-query/#defn_QueryVariable
9. http://hadoop.apache.org
model, a functional programming model which is suitable
for the parallel processing of large amounts of data.
By partitioning the data into a number of independent
chunks and running MapReduce processes against those
chunks, parallelization becomes simpler. Moreover, the MapReduce
programming model facilitates and simplifies the
task of joining multiple triple patterns.
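To make this concrete, the sketch below shows how a single join between two triple patterns that share a variable (e.g., ?x rdf:type Student and ?x ub:advisor ?y) could be expressed as a reduce-side join in Hadoop. This is a minimal illustration written against the standard Hadoop MapReduce API, not the code of the framework described in this paper; the input file names "type_Student" and "advisor" are assumed for the example.

// Illustrative sketch (not the actual code of our framework): a generic reduce-side
// join in Hadoop MapReduce for two triple patterns sharing the variable ?x, e.g.
//   ?x rdf:type Student   and   ?x ub:advisor ?y
// The input file names "type_Student" and "advisor" are assumed for this example.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinSketch {

  // Map: emit the binding of the shared variable (the subject) as the key,
  // tagging each line with the file it came from.
  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String subject = value.toString().split("\\s+")[0];
      ctx.write(new Text(subject), new Text(file + "#" + value.toString()));
    }
  }

  // Reduce: a subject produces output only if it appears in both input files.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      boolean isStudent = false;
      List<String> advisorLines = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("type_Student#")) isStudent = true;
        else if (s.startsWith("advisor#")) advisorLines.add(s.substring("advisor#".length()));
      }
      if (isStudent) {
        for (String line : advisorLines) ctx.write(subject, new Text(line));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "triple-pattern join sketch");
    job.setJarByClass(JoinSketch.class);
    job.setMapperClass(JoinMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("type_Student"));
    FileInputFormat.addInputPath(job, new Path("advisor"));
    FileOutputFormat.setOutputPath(job, new Path("join_output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The shuffle phase groups lines from both files by the shared variable's binding, so the reducer simply checks that each side contributed a match.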
In this paper, we will describe a schema to store RDF
data in Hadoop, and we will detail a solution to process
queries against this data. In the preprocessing stage, we
process RDF data and populate files in the distributed
file system. This process includes partitioning and organizing
the data files and executing dictionary encoding.
We will then detail a query engine for information
retrieval. We will specify exactly how SPARQL queries
will be satisfied using MapReduce programming. Specifically,
we must determine the Hadoop ”jobs” that will
be executed to solve the query. We will present a greedy
algorithm that produces a query plan with the minimal
number of Hadoop jobs. This is an approximation algorithm
using heuristics, but we will prove that the worst
case has a reasonable upper bound.
Finally, we will utilize two standard benchmark
datasets to run experiments. We will present results for
datasets ranging from 0.1 to over 6.6 billion triples. We
will show that our solution is exceptionally scalable. We
will show that our solution outperforms leading state-of-the-art
semantic web repositories, using standard benchmark
queries on very large datasets.
Our contributions are as follows:
1) we design a storage scheme to store RDF data in
Hadoop distributed file system (HDFS10).
2) we propose an algorithm that is guaranteed to
provide a query plan whose cost is bounded by
the log of the total number of variables in the
given SPARQL query. It uses summary statistics for
estimating join selectivity to break ties.
3) we build a framework which is highly scalable and
fault tolerant and supports data intensive query
processing.
4) we demonstrate that our approach performs better
than Jena for all queries and BigOWLIM and RDF-
3X for complex queries having large result sets.
The remainder of this paper is organized as follows:
in Section 2, we investigate related work. In Section 3,
we discuss our system architecture. In Section 4, we
discuss how we answer a SPARQL query. In Section
5, we present the results of our experiments. Finally, in
Section 6, we draw some conclusions and discuss areas
we have identified for improvement in the future.
2 RELATED WORK
MapReduce, though a programming paradigm, is
rapidly being adopted by researchers. This technology is
becoming increasingly popular in the community which
10. http://hadoop.apache.org/core/docs/r0.18.3/hdfs_design.html
handles large amounts of data. It is the most promising
technology to solve the performance issues researchers
are facing in Cloud Computing. In [1], Abadi discusses
how MapReduce can satisfy most of the requirements to
build an ideal Cloud DBMS. Researchers and enterprises
are using MapReduce technology for web indexing,
searches and data mining. In this section, we will first
investigate research related to MapReduce. Next, we will
discuss works related to the semantic web.
Google uses MapReduce for web indexing, data storage
and social networking [8]. Yahoo! uses MapReduce
extensively in their data analysis tasks [31]. IBM has successfully
experimented with a scale-up scale-out search
framework using MapReduce technology [27]. In recent
work [35], they reported how they integrated
Hadoop and System R. Teradata did similar work by
integrating Hadoop with a parallel DBMS [42].
Researchers have used MapReduce to scale up classifiers
for mining petabytes of data [28]. They have
worked on data distribution and partitioning for data
mining, and have applied three data mining algorithms
to test the performance. Data mining algorithms are
being rewritten in different forms to take advantage of
MapReduce technology. In [11], researchers rewrite well-known
machine learning algorithms to take advantage of
multicore machines by leveraging the MapReduce programming
paradigm. Another area where this technology
is successfully being used is simulation [25]. In [4],
researchers reported an interesting idea of combining
MapReduce with existing relational database techniques.
These works differ from our research in that we use
MapReduce for semantic web technologies. Our focus is
on developing a scalable solution for storing RDF data
and retrieving them by SPARQL queries.
In the semantic web arena, there has not been much
work done with MapReduce technology. We have found
two related projects: the BioMANTA11 project and SHARD12.
BioMANTA proposes extensions to RDF Molecules
[14] and implements a MapReduce-based Molecule store
[30]. They use MapReduce to answer the queries. They
have queried a maximum of 4 million triples. Our work
differs in the following ways: first, we have queried
1 billion triples. Second, we have devised a storage
schema which is tailored to improve query execution
performance for RDF data. We store RDF triples in files
based on the predicate of the triple and the type of the
object. Finally, we also have an algorithm to determine a
query processing plan whose cost is bounded by the log
of the total number of variables in the given SPARQL
query. By using this, we can determine the input files of
a job and the order in which they should be run. To the
best of our knowledge, we are the first ones to come up
with a storage schema for RDF data using flat files in
HDFS, and a MapReduce job determination algorithm
11. http://www.itee.uq.edu.au/~eresearch/projects/biomanta
12. http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triplestore
to answer a SPARQL query.
SHARD (Scalable, High-Performance, Robust and Distributed)
is an RDF triple store using the Hadoop Cloudera
distribution. This project shows initial results demonstrating
Hadoop’s ability to improve scalability for RDF
datasets. However, SHARD stores its data only in a triple
store schema. It currently does no query planning or
reordering, and its query processor will not minimize
the number of Hadoop jobs.
There has been significant research into semantic web
repositories, with particular emphasis on query efficiency
and scalability. In fact, there are too many such
repositories to fairly evaluate and discuss each. Therefore,
we will pay attention to semantic web repositories
which are open source or available for download, and
which have received favorable recognition in the semantic
web and database communities.
In [2] and [3], researchers reported a vertically partitioned
DBMS for storage and retrieval of RDF data.
Their solution is a schema with a two column table for
each predicate. Their schema is then implemented on
top of a column-store relational database such as C-Store
[37] or MonetDB [6]. They observed performance improvement
with their scheme over traditional relational
database schemes. We have leveraged this technology
in our predicate-based partitioning within the MapReduce
framework. However, in the vertical partitioning
research, only small databases (< 100 million) were used.
Several papers [16], [23], [41] have shown that vertical
partitioning’s performance is drastically reduced as the
dataset size is increased.
Jena [7] is a semantic web framework for Java. True
to its framework design, it allows integration of multiple
solutions for persistence. It also supports inference
through the development of reasoners. However, Jena is
limited to a triple store schema. In other words, all data
is stored in a single three-column table. Jena has very
poor query performance for large datasets. Furthermore,
any change to the dataset requires complete recalculation
of the inferred triples.
BigOWLIM [22] is among the fastest and most scalable
semantic web frameworks available. However, it is not
as scalable as our framework and requires very high end
and costly machines. It requires expensive hardware (a
lot of main memory) to load large datasets and it has
a long loading time. As our experiments show (Section
5.4), it does not perform well when there is no bound
object in a query. However, the performance of our
framework is not affected in such a case.
RDF-3X [29] is considered the fastest existing semantic
web repository. In other words, it has the fastest query
times. RDF-3X uses histograms, summary statistics, and
query optimization to enable high performance semantic
web queries. As a result, RDF-3X is generally able to outperform
any other solution for queries with bound objects
and aggregate queries. However, RDF-3X’s performance
degrades exponentially for unbound queries, and
queries with even simple joins if the selectivity factor
is low. This becomes increasingly relevant for inference
queries, which generally require unions of subqueries
with unbound objects. Our experiments show that RDF-
3X is not only slower for such queries, it often aborts and
cannot complete the query. For example, consider the
simple query "Select all students." This query in LUBM
requires us to select all graduate students, select all
undergraduate students, and union the results together.
However, there are a very large number of results in this
union. While both subqueries complete easily, the union
will abort in RDF-3X for LUBM (30,000) with 3.3 billion
triples.
RDFKB [24] (RDF Knowledge Base) is a semantic web
repository using a relational database schema built upon
bit vectors. RDFKB achieves better query performance
than RDF-3X or vertical partitioning. However, RDFKB
aims to provide knowledge base functions such as inference,
forward chaining, uncertainty reasoning and ontology
alignment. RDFKB prioritizes these goals ahead of
scalability. RDFKB is not able to load LUBM(30,000) with
3 billion triples, so it cannot compete with our solution
for scalability.
Hexastore [41] and BitMat [5] are main-memory data
structures optimized for RDF indexing. These solutions
may achieve exceptional performance on hot runs, but
they are not optimized for cold runs from persistent storage.
Furthermore, their scalability is directly tied
to the quantity of main memory available. These
products are not available for testing and evaluation.
In our previous works [20], [21], we proposed a
greedy and an exhaustive search algorithm to generate a
query processing plan. However, the exhaustive search
algorithm was expensive and the greedy one was not
bounded and its theoretical complexity was not defined.
In this paper, we present a new greedy algorithm with
an upper bound. Also, we did observe scenarios in
which our old greedy algorithm failed to generate the
optimal plan. The new algorithm is able to obtain the
optimal plan in each of these cases. Furthermore, in
our prior research we were limited to text files with
minimal partitioning and indexing. We now utilize dictionary
encoding to increase performance. We have also
now performed comparative evaluations against more alternative
repositories.
3 PROPOSED ARCHITECTURE
Our architecture consists of two components. The upper
part of Figure 1 depicts the data preprocessing component
and the lower part shows the query answering one.
We have three subcomponents for data generation and
preprocessing. We convert RDF/XML13 to N-Triples14
serialization format using our N-Triples Converter component.
The PS component takes the N-Triples data and
splits it into predicate files. The predicate files are then
fed into the POS component which splits the predicate
13. http://www.w3.org/TR/rdf-syntax-grammar
14. http://www.w3.org/2001/sw/RDFCore/ntriples/
Fig. 1. The System Architecture
files into smaller files based on the type of objects. These
steps are described in Section 3.2, 3.3 and 3.4.
Our MapReduce framework has three subcomponents.
It takes the SPARQL query from
the user and passes it to the Input Selector (see Section
4.1) and the Plan Generator. This component selects the
input files using our algorithm described in Section
4.3, decides how many MapReduce jobs are needed, and
passes the information to the Join Executer component,
which runs the jobs using the MapReduce framework. It
then relays the query answer from Hadoop to the user.
3.1 Data Generation and Storage
For our experiments, we use the LUBM [18] dataset. It is
a benchmark dataset designed to enable researchers to
evaluate a semantic web repository’s performance [19].
The LUBM data generator generates data in RDF/XML
serialization format. This format is not suitable for our
purpose because we store data in HDFS as flat files and
so to retrieve even a single triple we would need to parse
the entire file. Therefore, we convert the data to N-Triples,
because with that format we have a
complete RDF triple (Subject, Predicate and Object) in
one line of a file, which is very convenient to use with
MapReduce jobs. The processing steps required to
get the data into our intended format are described in the
following sections.
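For example, a single N-Triples line carries a complete triple and ends with a period; the instance URIs below are hypothetical but follow the general style of LUBM data:

<http://www.University0.edu/GraduateStudent1> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#advisor> <http://www.University0.edu/FullProfessor1> .

Because every line is self-contained, a mapper can process each line independently, with no parsing state carried across lines.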
3.2 File Organization
We do not store the data in a single file because, in
the Hadoop MapReduce framework, a file is the smallest
unit of input to a MapReduce job and, in the absence
of caching, a file is always read from disk. If we have
all the data in one file, the whole file will be input to jobs
for each query. Instead, we divide the data into multiple
smaller files. The splitting is done in two steps which
we discuss in the following sections.
3.3 Predicate Split (PS)
In the first step, we divide the data according to the
predicates. This division immediately enables us to cut
down the search space for any SPARQL query which
does not have a variable15 predicate. For such a query,
we can just pick a file for each predicate and run the
query on those files only. For simplicity, we name the
files with predicates, e.g. all the triples containing a
predicate p1:pred go into a file named p1-pred. However,
if we have a variable predicate in a triple pattern16
and we cannot determine the type of the object, we
have to consider all files. If we can determine the type
of the object, then we consider only the files having that type
of object. We discuss this further in Section 4.1. In
real-world RDF datasets, the number of distinct predicates
is in general not large [36]. However,
there are datasets having many predicates. Our system's
performance does not suffer in such cases because we
select only the files related to the predicates specified in a
SPARQL query.
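A minimal standalone Java sketch of this routing logic is shown below. It is not the parallel Hadoop job our framework runs over HDFS, and the file-naming helper is simplified for illustration (it assumes predicate URIs whose local names follow a '#'):

// Standalone sketch of the Predicate Split (PS) step: route each N-Triples line
// to a file named after its predicate, keeping only the subject and object.
// This illustrates the logic only, not the MapReduce implementation.
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PredicateSplit {

  // Simplified naming helper (an assumption for this sketch): take the part of the
  // predicate URI after '#' and prefix it, e.g. <...#advisor> becomes "ub-advisor".
  static String fileNameFor(String predicate) {
    String local = predicate.substring(predicate.lastIndexOf('#') + 1).replace(">", "");
    return "ub-" + local;
  }

  public static void main(String[] args) throws IOException {
    Map<String, PrintWriter> out = new HashMap<>();
    try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        // An N-Triples line is: subject predicate object .
        String[] parts = line.split("\\s+", 3);
        if (parts.length < 3) continue;
        String subject = parts[0];
        String predicate = parts[1];
        String object = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop the trailing " ."
        PrintWriter w = out.computeIfAbsent(fileNameFor(predicate), f -> {
          try { return new PrintWriter(new FileWriter(f)); }
          catch (IOException e) { throw new UncheckedIOException(e); }
        });
        // The predicate is implicit in the file name, so it is not written again.
        w.println(subject + " " + object);
      }
    }
    out.values().forEach(PrintWriter::close);
  }
}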
3.4 Predicate Object Split (POS)
3.4.1 Split Using Explicit Type Information of Object
In the next step, we work with the explicit type information
in the rdf_type file. The predicate rdf:type is used
in RDF to denote that a resource is an instance of a
class. The rdf_type file is first divided into as many files
as the number of distinct objects the rdf:type predicate
has. For example, if in the ontology the leaves of the
class hierarchy are c1, c2, ..., cn, then we will create a file
for each of these leaves and the file names will be
type_c1, type_c2, ..., type_cn. Please note that the object
values c1, c2, ..., cn no longer need to be stored
within the file as they can easily be retrieved from the file
name. This further reduces the amount of space needed
to store the data. We generate such a file for each distinct
object value of the predicate rdf:type.
3.4.2 Split Using Implicit Type Information of Object
We divide the remaining predicate files according to the
type of the objects. Not all objects are URIs; some
are literals. The literals remain in the file named by the
predicate: no further processing is required for them. The
type information of a URI object is not mentioned in
these files, but it can be retrieved from the type_* files.
The URI objects move into their respective files named
predicate_type. For example, if a triple has the predicate
p and the type of the URI object is ci, then the subject
and object appear in one line in the file p_ci. To do this
split, we need to join a predicate file with the type_* files
to retrieve the type information.
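A simplified standalone sketch of this join follows. In the actual framework the join is performed with MapReduce jobs over HDFS rather than by loading the type_* files into memory, and the handling of untyped objects below is only illustrative (in our scheme, literals simply remain in the original predicate file):

// Standalone sketch of the implicit-type part of the POS step: move each
// (subject, object) pair from a predicate file into predicate_class, where the
// class of the object URI is looked up from the type_* files built earlier.
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PredicateObjectSplit {

  public static void main(String[] args) throws IOException {
    // Build a URI -> class map from every type_* file in the current directory.
    Map<String, String> typeOf = new HashMap<>();
    try (DirectoryStream<Path> types = Files.newDirectoryStream(Paths.get("."), "type_*")) {
      for (Path t : types) {
        String cls = t.getFileName().toString().substring("type_".length());
        for (String subject : Files.readAllLines(t)) typeOf.put(subject.trim(), cls);
      }
    }

    String predicateFile = args[0]; // e.g. "ub-advisor" produced by the PS step
    Map<String, PrintWriter> out = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get(predicateFile))) {
      String[] parts = line.split("\\s+");
      if (parts.length < 2) continue;
      String subject = parts[0];
      String object = parts[1];
      String cls = typeOf.get(object);
      // Typed URI objects move to predicate_class; everything else is set aside here
      // for illustration (literals stay in the plain predicate file in our scheme).
      String target = (cls == null) ? predicateFile + ".untyped" : predicateFile + "_" + cls;
      out.computeIfAbsent(target, f -> {
        try { return new PrintWriter(new FileWriter(f)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
      }).println(subject + " " + object);
    }
    out.values().forEach(PrintWriter::close);
  }
}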
In Table 1, we show the number of files we get after the PS
and POS steps. We can see that eventually we organize
the data into 41 files.
Table 1 shows the number of files and size gain we
get at each step for data from 1000 universities. LUBM
generator generates 20020 small files, with a total size of
24 GB. After splitting the data according to predicates
15. http://www.w3.org/TR/rdf-sparql-query/#sparqlQueryVariables
16. http://www.w3.org/TR/rdf-sparql-query/#sparqlTriplePatterns
TABLE 1
Data size at various steps for 1000 universities

Step        Files    Size (GB)    Space Gain
N-Triples   20020    24           -
PS          17       7.1          70.42%
POS         41       6.6          7.04%
Fig. 2. Partial LUBM Ontology (arrows denote the subClassOf relationship)
the size drastically reduces to only 7.1 GB (a 70.42%
gain). This happens because of the absence of predicate
columns and also the prefix substitution. At this step,
we have only 17 files as there are 17 unique predicates
in the LUBM data set. In the final step, space is reduced
another 7.04%, as the split rdf_type files no longer have
the object column. The number of files increases to 41 as
predicate files are split using the type of the objects.
3.5 Example Data
In Table 2, we show sample data for three predicates.
The leftmost column shows the type file for
student objects after the split using explicit type
information in the POS step. It lists only the subjects of the
triples having the rdf:type predicate and a student object. The
rest of the columns show the advisor, takesCourse
and teacherOf predicate files after the split using
implicit type information in the POS step. The prefix ub:
stands for http://www.lehigh.edu/~zhp2/2004/0401/
univ-bench.owl#. Each row has a pair of subject and
object; in all cases, the predicate can be retrieved from
the filename.
TABLE 2
Sample data for LUBM Query 9

type_Student    ub:advisor_FullProfessor    ub:takesCourse_Course    ub:teacherOf_Course
GS1             GS2 A2                      GS1 C2                   A1 C1
GS2             GS1 A1                      GS3 C1                   A2 C2
GS3             GS3 A3                      GS2 C3                   A3 C3
3.6 Binary Format
Up to this point, we have shown our files in text format.
Text format is the format natively supported by Hadoop.
However, for increased efficiency, storing data in binary
format is an option. We use dictionary encoding to encode
the strings with a long value (64-bit). In this way, we
are able to store up to 2^64 unique strings. We dictionary
encode the data using Hadoop jobs. We build a prefix
tree in each reducer and generate a unique id for a string
by using the reducer id, which is unique across the job.
We generate the dictionary in one job and then run three
jobs to replace the subject, predicate and object of a triple
with their corresponding ids as text. In the final job, we
convert the triples consisting of ids in text to binary data.
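The standalone sketch below illustrates only the id-assignment idea, not the actual multi-job Hadoop pipeline: a hash map and a hash-based partition number stand in for the per-reducer prefix tree and the reducer id, but the key property is the same, namely that ids assigned in different partitions can never collide.

// Standalone sketch of dictionary encoding: each distinct string gets a 64-bit id.
// The partition number (standing in for the reducer id) is packed into the high bits
// and a per-partition counter into the low bits, so ids are unique across partitions.
import java.util.HashMap;
import java.util.Map;

public class DictionaryEncoder {

  private static final int PARTITIONS = 4; // stand-in for the number of reducers
  private final Map<String, Long> dictionary = new HashMap<>();
  private final long[] nextLocalId = new long[PARTITIONS];

  long encode(String s) {
    return dictionary.computeIfAbsent(s, str -> {
      int partition = Math.floorMod(str.hashCode(), PARTITIONS);
      long localId = nextLocalId[partition]++;
      return ((long) partition << 48) | localId; // high bits: partition, low bits: counter
    });
  }

  public static void main(String[] args) {
    DictionaryEncoder enc = new DictionaryEncoder();
    // The repeated term shows that a string always maps to the same id.
    for (String term : new String[] {"<GS1>", "ub:advisor", "<A1>", "<GS1>"}) {
      System.out.println(term + " -> " + enc.encode(term));
    }
  }
}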