27-08-2014, 12:50 PM
• PRELIMINARY
PRELIMINARY.docx (Size: 696.58 KB / Downloads: 9)
ABSTRACT
As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core operation in XML query processing. Prior works demonstrate that holistic twig pattern matching algorithm is an efficient technique to answer an XML tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships, as it can effectively control the size of intermediate results during query processing. However, XML query languages (e.g., XPath and XQuery) define more axes and functions such as negation function, order-based axis, and wildcards. In this paper, we research a large set of XML tree pattern, called extended XML tree pattern, which may include P-C, A-D relationships, negation functions, wildcards, and order restriction. We establish a theoretical framework about “matching cross” which demonstrates the intrinsic reason in the proof of optimality on holistic algorithms. Based on our theorems, we propose a set of novel algorithms to efficiently process three categories of extended XML tree patterns. A set of experimental results on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our proposed theories and algorithms.
INTRODUCTION
Business and enterprises generate and exchange XML data more often, there is an increasing need for efficient processing of queries on XML data. An XML query pattern commonly can be represented as a rooted, labeled tree (or called twig). For example, Fig. 1a shows an example XPath query: A=C and the corresponding XML tree pattern. This query finds all node C that has the parent A which has another child B. In Fig. 1b, the query answers are nodes “C1 ” and “C2 .”
Efficient matching of XML tree patterns has been widely considered as a core operation in XML query processing. In recent years, many methods [9], [13], [3], [11], [4], [25] have been proposed to match XML tree queries efficiently. In particular, Khalifa et al. [1] proposed a stack-based algorithm to match binary structural relationship including parent- child (P-C) and ancestor-descendant (A-D) relationships. The limitation of their method is that the size of useless intermediate results may become very large, even if the final results are small. Bruno et al. [3] proposed a novel holistic twig join algorithm named TwigStack, which processes the tree pattern holistically without decomposing it into several small binary relationships. TwigStack guarantees that there are no “useless” intermediate results for queries with only ancestor-descendant (A-D) relationships. In other words, TwigStack is optimal for tree pattern queries with only A-D edges [8]. Many other recent works then examine how to enlarge the optimal query class of holistic algorithms [14], to speed up performance using indexes [5], [11], to devise new data streaming strategies [6], and to propose efficient and dynamic labeling schemes [16]. These algorithms have proven highly promising and make their way into XML query processing applications, in both academic and industrial settings [19]. But we still have the following observations upon the above existing works.
Extended XML tree pattern. Previous algorithms focus on XML tree pattern queries with only P-C and A-D relation- ships. Little work has been done on XML tree queries which may contain wildcards, negation function, and order restric- tion, all of which are frequently used in XML query languages such as XPath and XQuery. In this paper, we call an XML tree pattern with negation function, wildcards, and/ or order restriction as extended XML tree pattern. Fig. 2, for example, shows four extended XML tree patterns. Query (a) includes a wildcard node “*”, which can match any single node in an XML database. Query (b) includes a negative edge, denoted by “:”. This query finds A that has a child B, but has no child C. In XPath language [2], the semantic of negative edge can be presented with “not” boolean function. Query © has the order restriction, which is equivalent to an XPath “//A/B[following-sibling::C].” The “< ” in a box shows that all children under A are ordered. The semantics of order-base tree pattern is captured by a mapping from the pattern nodes to nodes in an XML database such that the structural and ordered relationships are satisfied. Finally, Query (d) is more complicated, which contains wildcards, negation function, and order restriction.
[b]• Outline
The extended XML tree pattern matching algorithm called TreeMatch. Section 5 presents thorough experimental studies between the novel algorithms and the prior methods. Finally, Section 6 presents previous work related to the XML tree pattern The rest of the paper is organized as follows: Section 2 gives the preliminaries about research problem and the proces- sing model. Section 3 shows a set of theories about matching cross and Section 4 presents anmatching and Section 7 concludes this paper.
• Labeling Schemes
Most XML query processing algorithms on XML documents rely on certain labeling schemes, such as region encoding scheme [27], prefix scheme [13], ORDPATH [19], and extended Dewey scheme [16]. In this paper, we use the extended Dewey labeling scheme, proposed in paper [16], to assign each node in XML documents a sequence of integers to capture the structure information of documents.
• EXPERIMENTS:
In this section, we present an extensive experimental study of TreeMatch on real-life and synthetic data sets. Our results verify the effectiveness, in terms of accuracy and optimality, of the TreeMatch as holistic twig join algorithms for large XML data sets. These benefits become apparent in a comparison to previously four proposed algorithms TwigStack [3], TJFast [16], OrderedTJ [17], and TwigStack- ListNot [26]. The reason that we choose these algorithms for comparison is that 1) similar to TreeMatch, both TJFast and TwigStack are two holistic twig pattern matching algo- rithms. But they cannot process queries with order restriction or negative edges; and 2) OrderedTJ is a holistic twig algorithm which can handle order-based XML tree pattern, but is not appropriate for queries with negative edges; and finally 3) TwigStacklistNot is proposed for queries with negative edges, but it cannot work for ordered queries. Only TreeMatch algorithm can process queries with order restriction, negative edge, and wildcards
• CONCLUSIONS
We have introduced a notion of matching cross to address the problem of the suboptimality in holistic XML tree patten matching algorithms. We have identified a large optimal query classes for three kinds of queries, that is Q=;==; , Q=;==; ;< , and Q=;==; ;<;: , respectively. Based on these results, we have proposed a new holistic algorithm called Tree- Match to achieve such theoretical optimal query classes. Finally, extensive experiments demonstrate the advantage of our algorithms and verify the correctness of theoretical results