29-06-2013, 04:52 PM
Extended XML Tree Pattern Matching: Theories and Algorithms
Extended XML.pdf (Size: 1.16 MB / Downloads: 17)
Abstract
As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient
processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in
XML query processing. Prior works demonstrate that holistic twig pattern matching algorithm is an efficient technique to answer an XML
tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships, as it can effectively control the size of intermediate
results during query processing. However, XML query languages (e.g. XPath, XQuery) define more axes and functions such as negation
function, order-based axis and wildcards. In this article, we research a large set of XML tree pattern, called extended XML tree pattern,
which may include P-C, A-D relationships, negation functions, wildcards and order restriction. We establish a theoretical framework
about “matching cross” which demonstrates the intrinsic reason in the proof of optimality on holistic algorithms. Based on our theorems,
we propose a set of novel algorithms to efficiently process three categories of extended XML tree patterns. A set of experimental results
on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our proposed theories and algorithms.
INTRODUCTION
As business and enterprisers generate and exchange XML
data more often, there is an increasing need for efficient
processing of queries on XML data. An XML query pattern
commonly can be represented as a rooted, labeled tree (or
called twig). For example, Figure 1(a) shows an example
XPath query: A=C and the corresponding XML tree pattern.
This query finds all node C that has the parent A which
has another child B. In Figure 1(b)
[b]Outline
The rest of the article is organized as follows. Section
2 gives the preliminaries about research problem and the
processing model. Section 3 shows a set of theories about
matching cross and Section 4 presents an extended XML
tree pattern matching algorithm called TreeMatch. Section 5
presents thorough experimental studies between the novel
algorithms and the prior methods. Finally, Section 6 presents
previous work related to the XML tree pattern matching and
Section 7 concludes this article.
Labeling Schemes
Most XML query processing algorithms on XML documents
rely on certain labeling schemes, such as region
encoding scheme [27], prefix scheme [13], ORDPATH [19],
extended Dewey scheme [16]. In this article, we use the
extended Dewey labeling scheme, proposed in paper [16], to
assign each node in XML documents a sequence of integers
to capture the structure information of documents.
Basic properties of algorithms
The algorithms for XML tree pattern matching proposed in
this article have two basic yet important properties, as follows.
Single-direction scan We adopt a structure, named label
list, associated with each query node. The label list is a posting
list (or inverted list) containing the extended Dewey labels of
XML elements which have the same name, and all elements
are ordered according to the document order. We use TA to
denote the label list for query node A. There is a cursor for
each list. It moves in the single direction to scan all elements
once in increasing order. Each label in a list can be read only
once.
THEORETICAL ANALYSIS
In this section, we establish a theoretical framework about
“matching cross” which demonstrates the intrinsic reason
for the sub-optimality of existing holistic algorithms. The
purposes of our study are (i) to provide insight into the
characteristics of the holistic algorithms, and thus promotes
our understanding about their behaviors; and (ii) to lead to
novel algorithms that can guarantee a larger optimal query
class and realize better query performance.
Matching Cross
The existing holistic algorithms ([11], [16]) consist of two
phases: (i) in the first phase, a list of path solutions is output
as intermediate path solutions and each solution matches the
individual root-to-leaf path expression; and (ii) in the second
phase, the path solutions are merged to produce the final
answers for the whole twig query. However, for queries with
parent-child (P-C) relationships, the state-of-the-art algorithms
cannot guarantee that each intermediate solution output in the
first phase is useful to merge in the second phase. In other
words, many useless intermediate results (i.e. path solutions)
may be produced in the first phase, which is called the suboptimality
of algorithms, as further illustrated in the following
example.
RELATED WORK
In the context of semi-structured and XML databases, treebased
query pattern is a very practical and important class
of queries. Lore DBMS [9] and Timber [10] systems have
considered various aspects of query processing on such data
and queries. XML data and various issues in their storage
as well as query processing using relational database systems
have recently been considered in [18], [27], [22], [19]. Our
holistic algorithm TreeMatch for extended tree patterns can
leverage these previous techniques.
From the aspect of theoretical research about the optimality
of XML tree pattern matching, Choi et al. [8] developed
theorems to prove that it is impossible to devise a holistic
algorithm to guarantee the optimality for queries with any
combination of P-C and A-D relationships. Shalem et al. [21]
researched the space complexity of processing XML twig
queries. Their paper showed that the upper bound of full-fledge
queries with parent-child and ancestor-descendant edges are
O(D), where D is the document size. In other words, their
results also theoretically prove that there exists no algorithm
to optimally process an arbitrary query Q=;==;¤. Our research
in this article moves the frontier forward by identifying a
large subclass of Q=;==;¤, which can be guaranteed to process
optimally.
CONCLUSIONS
We have introduced a notion of matching cross to address
the problem of the sub-optimality in holistic XML tree patten
matching algorithms. We have identified a large optimal query
classes for three kinds of queries, that is Q=;==;¤, Q=;==;¤;<
and Q=;==;¤;<;:, respectively. Based on these results, we
have proposed a new holistic algorithm called TreeMatch to
achieve such theoretical optimal query classes. Finally, extensive
experiments demonstrate the advantage of our algorithms
and verify the correctness of theoretical results.
Extended XML.pdf (Size: 1.16 MB / Downloads: 17)
Abstract
As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient
processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in
XML query processing. Prior works demonstrate that holistic twig pattern matching algorithm is an efficient technique to answer an XML
tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships, as it can effectively control the size of intermediate
results during query processing. However, XML query languages (e.g. XPath, XQuery) define more axes and functions such as negation
function, order-based axis and wildcards. In this article, we research a large set of XML tree pattern, called extended XML tree pattern,
which may include P-C, A-D relationships, negation functions, wildcards and order restriction. We establish a theoretical framework
about “matching cross” which demonstrates the intrinsic reason in the proof of optimality on holistic algorithms. Based on our theorems,
we propose a set of novel algorithms to efficiently process three categories of extended XML tree patterns. A set of experimental results
on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our proposed theories and algorithms.
INTRODUCTION
As business and enterprisers generate and exchange XML
data more often, there is an increasing need for efficient
processing of queries on XML data. An XML query pattern
commonly can be represented as a rooted, labeled tree (or
called twig). For example, Figure 1(a) shows an example
XPath query: A=C and the corresponding XML tree pattern.
This query finds all node C that has the parent A which
has another child B. In Figure 1(b)
[b]Outline
The rest of the article is organized as follows. Section
2 gives the preliminaries about research problem and the
processing model. Section 3 shows a set of theories about
matching cross and Section 4 presents an extended XML
tree pattern matching algorithm called TreeMatch. Section 5
presents thorough experimental studies between the novel
algorithms and the prior methods. Finally, Section 6 presents
previous work related to the XML tree pattern matching and
Section 7 concludes this article.
Labeling Schemes
Most XML query processing algorithms on XML documents
rely on certain labeling schemes, such as region
encoding scheme [27], prefix scheme [13], ORDPATH [19],
extended Dewey scheme [16]. In this article, we use the
extended Dewey labeling scheme, proposed in paper [16], to
assign each node in XML documents a sequence of integers
to capture the structure information of documents.
Basic properties of algorithms
The algorithms for XML tree pattern matching proposed in
this article have two basic yet important properties, as follows.
Single-direction scan We adopt a structure, named label
list, associated with each query node. The label list is a posting
list (or inverted list) containing the extended Dewey labels of
XML elements which have the same name, and all elements
are ordered according to the document order. We use TA to
denote the label list for query node A. There is a cursor for
each list. It moves in the single direction to scan all elements
once in increasing order. Each label in a list can be read only
once.
THEORETICAL ANALYSIS
In this section, we establish a theoretical framework about
“matching cross” which demonstrates the intrinsic reason
for the sub-optimality of existing holistic algorithms. The
purposes of our study are (i) to provide insight into the
characteristics of the holistic algorithms, and thus promotes
our understanding about their behaviors; and (ii) to lead to
novel algorithms that can guarantee a larger optimal query
class and realize better query performance.
Matching Cross
The existing holistic algorithms ([11], [16]) consist of two
phases: (i) in the first phase, a list of path solutions is output
as intermediate path solutions and each solution matches the
individual root-to-leaf path expression; and (ii) in the second
phase, the path solutions are merged to produce the final
answers for the whole twig query. However, for queries with
parent-child (P-C) relationships, the state-of-the-art algorithms
cannot guarantee that each intermediate solution output in the
first phase is useful to merge in the second phase. In other
words, many useless intermediate results (i.e. path solutions)
may be produced in the first phase, which is called the suboptimality
of algorithms, as further illustrated in the following
example.
RELATED WORK
In the context of semi-structured and XML databases, treebased
query pattern is a very practical and important class
of queries. Lore DBMS [9] and Timber [10] systems have
considered various aspects of query processing on such data
and queries. XML data and various issues in their storage
as well as query processing using relational database systems
have recently been considered in [18], [27], [22], [19]. Our
holistic algorithm TreeMatch for extended tree patterns can
leverage these previous techniques.
From the aspect of theoretical research about the optimality
of XML tree pattern matching, Choi et al. [8] developed
theorems to prove that it is impossible to devise a holistic
algorithm to guarantee the optimality for queries with any
combination of P-C and A-D relationships. Shalem et al. [21]
researched the space complexity of processing XML twig
queries. Their paper showed that the upper bound of full-fledge
queries with parent-child and ancestor-descendant edges are
O(D), where D is the document size. In other words, their
results also theoretically prove that there exists no algorithm
to optimally process an arbitrary query Q=;==;¤. Our research
in this article moves the frontier forward by identifying a
large subclass of Q=;==;¤, which can be guaranteed to process
optimally.
CONCLUSIONS
We have introduced a notion of matching cross to address
the problem of the sub-optimality in holistic XML tree patten
matching algorithms. We have identified a large optimal query
classes for three kinds of queries, that is Q=;==;¤, Q=;==;¤;<
and Q=;==;¤;<;:, respectively. Based on these results, we
have proposed a new holistic algorithm called TreeMatch to
achieve such theoretical optimal query classes. Finally, extensive
experiments demonstrate the advantage of our algorithms
and verify the correctness of theoretical results.