17-11-2012, 02:20 PM
Extended XML Tree Pattern Matching: Theories and Algorithms
1Extended XML Tree Pattern.pdf (Size: 1.27 MB / Downloads: 29)
Abstract
As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient
processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in
XML query processing. Prior works demonstrate that holistic twig pattern matching algorithm is an efficient technique to answer an
XML tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships, as it can effectively control the size of
intermediate results during query processing. However, XML query languages (e.g., XPath and XQuery) define more axes and
functions such as negation function, order-based axis, and wildcards. In this paper, we research a large set of XML tree pattern, called
extended XML tree pattern, which may include P-C, A-D relationships, negation functions, wildcards, and order restriction. We
establish a theoretical framework about “matching cross” which demonstrates the intrinsic reason in the proof of optimality on holistic
algorithms. Based on our theorems, we propose a set of novel algorithms to efficiently process three categories of extended XML tree
patterns. A set of experimental results on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our
proposed theories and algorithms.
INTRODUCTION
AS business and enterprises generate and exchange XML
data more often, there is an increasing need for
efficient processing of queries on XML data. An XML query
pattern commonly can be represented as a rooted, labeled
tree (or called twig). For example, Fig. 1a shows an example
XPath query: A=C and the corresponding XML tree
pattern. This query finds all node C that has the parent A
which has another child B. In Fig. 1b, the query answers are
nodes “C1” and “C2.”
Efficient matching of XML tree patterns has been widely
considered as a core operation in XML query processing. In
recent years, many methods [9], [13], [3], [11], [4], [25] have
been proposed to match XML tree queries efficiently. In
particular, Khalifa et al. [1] proposed a stack-based algorithm
to match binary structural relationship including parentchild
(P-C) and ancestor-descendant (A-D) relationships.
The limitation of their method is that the size of useless
intermediate results may become very large, even if the final
results are small. Bruno et al. [3] proposed a novel holistic
twig join algorithm named TwigStack, which processes the
tree pattern holistically without decomposing it into several
small binary relationships.
[b]Outline
The rest of the paper is organized as follows: Section 2 gives
the preliminaries about research problem and the processing
model. Section 3 shows a set of theories about
matching cross and Section 4 presents an extended XML
tree pattern matching algorithm called TreeMatch. Section 5
presents thorough experimental studies between the novel
algorithms and the prior methods. Finally, Section 6
presents previous work related to the XML tree pattern
matching and Section 7 concludes this paper.
Basic Properties of Algorithms
The algorithms for XML tree pattern matching proposed in
this paper have two basic yet important properties, as
follows:
Single-Direction Scan
We adopt a structure, named label list, associated with each
query node. The label list is a posting list (or inverted list)
containing the extended Dewey labels of XML elements
which have the same name, and all elements are ordered
according to the document order. We use TA to denote the
label list for query node A. There is a cursor for each list. It
moves in the single direction to scan all elements once in
increasing order. Each label in a list can be read only once.
Bounded Main Memory
For a large class of queries, the main memory requirement
of our algorithm is linear to the number of nodes in the
longest path of XML database, which is usually small.
Therefore, our solution would be scalable to a very large
document with a small main memory requirement.
Recall that the existing algorithms, such as TwigStack
[3], TwigStackList [14], TJFast [16], also have the first
property. That is, they keep the single-direction scan of the
document. But for the second property, those algorithms
guarantee the bounded main memory for a small class of
queries. This paper makes the contribution to propose
algorithms to achieve this property for a much larger class
of queries with negation predicates, wildcards, and order
restriction.
THEORETICAL ANALYSIS
In this section, we establish a theoretical framework about
“matching cross” which demonstrates the intrinsic reason for
the suboptimality of existing holistic algorithms. The
purposes of our study are 1) to provide insight into the
characteristics of the holistic algorithms, and thus promotes
our understanding about their behaviors; and 2) to lead to
novel algorithms that can guarantee a larger optimal query
class and realize better query performance.
TreeMatch
Now we go through Algorithm 1. Line 1 locates the first
elements whose pathes match the individual root-leaf path
pattern. In each iteration, a leaf node fact is selected by
getNext function (line 3). The purpose of lines 4 and 5 is to
insert the potential matching elements to outputlist. Line 6
advances the list Tfact and line 7 updates the set encoding.
Line 8 locates the next matching element to the individual
path. Finally, when all data have been processed, we need
to empty all sets in Procedure EmptyAllSets (line 9) to
guarantee the completeness of output solutions.