28-11-2012, 01:30 PM
Data Mining for XML Query-Answering Support
Data Mining for XML Query.pdf (Size: 1.55 MB / Downloads: 61)
Abstract
Extracting information from semistructured documents is a very hard task, and is going to become more and more critical
as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the data set returned as
answer to a query may be too big to convey interpretable knowledge. In this paper, we describe an approach based on Tree-Based
Association Rules (TARs): mined rules, which provide approximate, intensional information on both the structure and the contents of
Extensible Markup Language (XML) documents, and can be stored in XML format as well. This mined knowledge is later used to
provide: 1) a concise idea—the gist—of both the structure and the content of the XML document and 2) quick, approximate answers to
queries. In this paper, we focus on the second feature. A prototype system and experimental results demonstrate the effectiveness of
the approach.
INTRODUCTION
IN recent years, the database research field has concentrated
on the Extensible Markup Language (XML) [30] as
a flexible hierarchical model suitable to represent huge
amounts of data with no absolute and fixed schema, and a
possibly irregular and incomplete structure. There are two
main approaches to XML document access: keyword-based
search and query-answering. The first one comes from the
tradition of information retrieval [20], where most searches
are performed on the textual content of the document; this
means that no advantage is derived from the semantics
conveyed by the document structure.
As for query-answering, since query languages for semistructured
data rely on the document structure to convey its
semantics, in order for query formulation to be effective
users need to know this structure in advance, which is often
not the case. In fact, it is not mandatory for an XML
document to have a defined schema: 50 percent of the
documents on the web do not possess one [5]. When users
specify queries without knowing the document structure,
they may fail to retrieve information which was there, but
under a different structure. This limitation is a crucial
problem which did not emerge in the context of relational
database management systems.
Goal and Contributions
This paper provides a method for deriving intensional
knowledge from XML documents in the form of TARs, and
then storing these TARs as an alternative, synthetic data set
to be queried for providing quick and summarized answers.
Our procedure is characterized by the following key aspects:
1. It works directly on the XML documents, without
transforming the data into any intermediate format.
2. It looks for general association rules, without the
need to impose what should be contained in the
antecedent and consequent of the rule.
3. It stores association rules in XML format.
4. It translates the queries on the original data set into
queries on the TARs set.
The aim of our proposal is to provide a way to use
intensional knowledge as a substitute of the original
document during querying and not to improve the
execution time of the queries over the original XML data
set, like in [34].
Structure of the Paper
The paper is organized as follows. Section 2 defines treebased
association rules and introduces their usage, while
Section 3 presents how these rules are extracted from XML
documents. Section 4 presents the main interesting application
of TARs, that is, their use to provide intensional
answers to queries. Section 5 describes a prototype that
implements our proposal and the experimental results
obtained on real XML data sets. Section 6 discusses related
work and Section 7 states the possible follow-ups of this
work and draws the conclusions.
TREE-BASED ASSOCIATION RULES
Association rules [1] describe the co-occurrence of data
items in a large amount of collected data and are represented
as implications of the form X ) Y , where X and Y are two
arbitrary sets of data items, such that X \ Y ¼ ;. The quality
of an association rule is measured by means of support and
confidence. Support corresponds to the frequency of the set
X [ Y in the data set, while confidence corresponds to the
conditional probability of finding Y , having found X and is
given by suppðX [ Y Þ=suppðXÞ.
In this paper, we extend the notion of association rule
introduced in the context of relational databases to adapt it
to the hierarchical nature of XML documents. Following the
Infoset conventions, we represent an XML document as a
tree2 hN;E; r; ‘; ci, where N is the set of nodes, r 2 N is the
root of the tree, E is the set of edges, ‘ : N ! L is the label
function which returns the tag of nodes (with L the domain
of all tags) and c : N ! C [ f?g is the content function
which returns the content of nodes (with C the domain of all
contents). We consider the element-only Infoset content
model [28], where XML nonterminal tags include only
other elements and/or attributes, while the text is confined
to terminal elements.
TAR EXTRACTION
TAR mining is a process composed of two steps: 1) mining
frequent subtrees, that is, subtrees with a support above a userdefined
threshold, from the XML document; 2) computing
interesting rules, that is, rules with a confidence above a userdefined
threshold, from the frequent subtrees.
As will be discussed in more detail in Section 6, the
problem of finding frequent subtrees has been widely
treated in the literature [36], [3], [32], [35], [37], [1].
Algorithm 1 presents our extension to a generic frequentsubtree
mining algorithm in order to compute interesting
TARs. The inputs of Algorithm 1 are the XML document D,
the threshold for the support of the frequent subtrees
minsupp, and the threshold for the confidence of the rules,
minconf. Algorithm 1 finds frequent subtrees and then
hands each of them over to a function that computes all the
possible rules. Depending on the number of frequent
subtrees and their cardinality, the amount of rules
generated by a naive Compute-Rules function may be very
high. Given a subtree with n nodes, we could generate 2n
2 rules, making the algorithm exponential. This explosion
occurs in the relational context too, thus, based on similar
considerations [1], it is possible to state the following
property, that allows us to propose the optimized version of
Compute-Rules shown in Function 2.
COMPARISON WITH OTHER WORK
The problem of association rule mining was initially
proposed in [1] and many implementations of the algorithms,
downloadable from [12], were developed in the
database literature. More recently the problem has been
investigated in the XML context [6], [31], [24], [8], [10], [19],
[33]. In [31], Wan and Dobbie use XQuery [29] to extract
association rules from simple XML documents. They
propose a set of functions, written in XQuery, which
implement the Apriori algorithm [1]. In [31], Wan and
Dobbie show that their approach performs well on simple
XML documents but it is very difficult to apply to complex
XML documents with an irregular structure. This limitation
is overcome in [6], where Braga et al. introduce a proposal to
enrich XQuery with data mining and knowledge discovery
capabilities, by introducing XMINE RULE, an operator for
mining association rules for native XML documents. They
formalize the syntax and semantics for the operator and
propose some examples of complex association rules.
CONCLUSIONS AND FUTURE WORK
The main goals we have achieved in this paper are: 1) mine
all frequent association rules without imposing any a-priori
restriction on the structure and the content of the rules;
2) store mined information in XML format; 3) use the
extracted knowledge to gain information about the original
data sets. We have developed a C++ prototype that has
been used to test the effectiveness of our proposal. We have
not discussed the updatability of both the document storing
TARs and their index. As an ongoing work, we are studying
how to incrementally update mined TARs when the
original XML data sets change and how to further optimize
our mining algorithm; moreover, for the moment we deal
with a (substantial) fragment of XQuery; we would like to
find the exact fragment of XQuery which lends itself to
translation into intensional queries.