26-07-2012, 11:06 AM
Towards an Effective XML Keyword Search
Towards an Effective XML Keyword Search.pdf (Size: 459.52 KB / Downloads: 35)
Abstract
Inspired by the great success of information retrieval (IR) style keyword search on the web, keyword search on XML has
emerged recently. The difference between text database and XML database results in three new challenges: (1) Identify the user search
intention, i.e. identify the XML node types that user wants to search for and search via. (2) Resolve keyword ambiguity problems: a
keyword can appear as both a tag name and a text value of some node; a keyword can appear as the text values of different XML node
types and carry different meanings; a keyword can appear as the tag name of different XML node types with different meanings. (3)
As the search results are sub-trees of the XML document, new scoring function is needed to estimate its relevance to a given query.
However, existing methods cannot resolve these challenges, thus return low result quality in term of query relevance.
In this paper, we propose an IR-style approach which basically utilizes the statistics of underlying XML data to address these challenges.
We first propose specific guidelines that a search engine should meet in both search intention identification and relevance oriented
ranking for search results. Then based on these guidelines, we design novel formulae to identify the search for nodes and search via
nodes of a query, and present a novel XML TF*IDF ranking strategy to rank the individual matches of all possible search intentions. To
complement our result ranking framework, we also take the popularity into consideration for the results that have comparable relevance
scores. Lastly, extensive experiments have been conducted to show the effectiveness of our approach.
Index Terms—XML, Keyword Search, Ranking
1 INTRODUCTION
The extreme success of web search engines makes keyword
search the most popular search model for ordinary users. As
XML is becoming a standard in data representation, it is
desirable to support keyword search in XML database. It is
a user friendly way to query XML databases since it allows
users to pose queries without the knowledge of complex query
languages and the database schema.
Effectiveness in term of result relevance is the most crucial
part in keyword search, which can be summarized as the
following three issues in XML field.
Issue 1: It should be able to effectively identify the type of
target node(s) that a keyword query intends to search for. We
call such target node as a search for node.
Issue 2: It should be able to effectively infer the types of
condition nodes that a keyword query intends to search via.
We call such condition nodes as search via nodes.
Issue 3: It should be able to rank each query result in
consideration of the above two issues.
The first two issues address the search intention problem,
while the third one addresses the relevance based ranking
problem w.r.t. the search intention. Regarding to Issue 1 and
Issue 2, XML keyword queries usually have ambiguities in
interpreting the search for node(s) and search via node(s), due
to three reasons below.
² Ambiguity 1: A keyword can appear both as an XML
tag name and as a text value of some other nodes.
² Zhifeng Bao, Tok Wang Ling and Bo Chen are with the School of
Computing, National University of Singapore.
E-mail: fbaozhife, lingtw, chenbog[at]comp.nus.edu.sg
² Jiaheng Lu is with the School of Information and DEKE, MOE in Renmin
University of China. Contact Author.
E-mail: jiahenglu[at]ruc.edu.cn
² Ambiguity 2: A keyword can appear as the text values
of different types of XML nodes and carry different
meanings.
² Ambiguity 3: A keyword can appear as an XML tag
name in different contexts and carry different meanings.
For example see the XML document in Figure 1, keywords
customer and interest appear as both an XML tag name and a
text value (e.g. value of the title for book B1); art appears as
a text value of interest, address and name node; name appears
as the tag name of the name of both customer and publisher.
Regarding to Issue 3, the search intention for a keyword
query is not easy to determine and can be ambiguous, because
the search via condition is not unique; so how to measure the
confidence of each search intention candidate, and rank the
individual matches of all these candidates are challenging.
Although many research efforts have been conducted in
XML keyword search [8], [10], [29], [22], [12], [23], none of
them has addressed and resolved the above three issues yet.
For instance, one widely adopted approach so far is to find
the smallest lowest common ancestor (SLCA) of all keywords
[29]. Each SLCA result of a keyword query contains all
query keywords but has no subtree which also contains all
the keywords.
In particular, regarding to Issue 1 and 2, SLCA may introduce
answers that are either irrelevant to user search intention,
or answers that may not be meaningful or informative enough.
E.g. when a query “Jim Gray” that intends to find Jim Gray’s
publications on DBLP [17] is issued, SLCA returns only the
author elements containing both keywords. Besides, SLCA
also returns publications written by two authors where “Jim” is
a term in 1st author’s name and “Gray” is a term in 2nd author,
and publications with title containing both keywords. It is
reasonable to return such results because search intention may
not be unique; however they should be given a lower rank, as
they are not matches of the major search intention. Regarding
to Issue 3, no existing approach has studied the problem of
relevance oriented result ranking in depth yet. Moreover, they
don’t perform well on pure keyword query when the schema
information of XML data is not available [22]. The actual
reason is, none of them can solve the above keyword ambiguity
problems, as demonstrated by the following example.
Example 1: Consider a keyword query “customer interest
art” issued on the bookstore data in Figure 1, and most likely
it intends to find the customers who are interested in art.
If adopting SLCA, we will get 5 results, which include
the title of book B1 and the customer nodes with IDs from
C1 to C4 (as these four customer nodes contain “customer”,
“interest” and “art” in either the tag names or node values) in
Figure 1. Since SLCA cannot well address the search intention,
all these 5 results are returned without any ranking applied.
However, only C4 is desired which should be put as the top
ranked one, and C2 is less relevant, as his interest is “street
art” rather than “art”, while C1 and C3 are irrelevant. ¤
Inspired by the great success of IR approach on web search
(especially its distinguished ranking functionality), we aim to
achieve similar success on XML keyword search, to solve the
above three issues without using any schema knowledge.
The main challenge we are going to solve is how to extend
the keyword search techniques in text databases (IR) to XML
databases, because the two types of databases are different.
First, the basic data units in text databases are flat documents.
For a given query, IR systems compute a numeric score for
each document and rank the document by this score. In XML
databases, however, information is stored in hierarchical tree
structures. The logical unit of answers needed by users is not
limited to individual leaf nodes containing keywords, but a
subtree instead. Second, unlike text database, it is difficult
to identify the (major) user search intention in XML data,
especially when the keywords contain ambiguities mentioned
before. Third, effective ranking is a key factor for the success
of keyword search. There may be dozens of candidate answers
for an ordinary keyword query in a medium-sized database.
E.g. in Example 1, five subtrees can be the query answers,
but they are not equally useful to user. Due to the difference
in basic answer unit between document search and database
search, in XML database we need to assign a single ranking
score for each subtree of certain category with a fitting size,
in order to rank the answers effectively.
Statistics is a mathematical science pertaining to the collection,
analysis, interpretation or explanation of data; it can be
used to objectively model a pattern or draw inferences about
the underlying data being studied. Although keyword search is
a subjective problem that different people may have different
interpretations on the same keyword query, statistics provides
an objective way to distinguish the major search intention(s).
It motivates us to model the search engine as a domain
expert who automatically interprets user’s all possible search
intention(s) through analyzing the statistics knowledge of
underlying data. As a result, we propose a novel IR-style
approach which well captures XML’s hierarchical structure,
and works well on pure keyword query independent of any
schema information of XML data. A search engine prototype
called XReal is implemented to achieve effective identification
of user search intention and relevance oriented ranking for the
search results in the presence of keyword ambiguities.
Example 2: We use the query in Example 1 again to
explain how XReal infers user’s desired result and puts it
as a top-ranked answer. XReal interprets that user desires to
search for customer nodes, because all three keywords have
high frequency of occurrences in customer nodes. Similarly,
since keywords “interest” and “art” have high frequency of
occurrences in subtrees rooted at interest nodes, it is considered
with high confidence that this query wants to search
via interest nodes, and incorporate this confidence into our
ranking formula. Besides, customers interested in “art” should
be ranked before those interested in (say) “street art”. As a
result, C4 is ranked before C2, and further before customers
with address in “art street”(e.g. C1) or named “art” (e.g. C3).¤
To our best knowledge, we are the first that exploit the
statistics of underlying XML database to address search intention
identification, result retrieval and relevance oriented
ranking as a single problem for XML keyword search. The
major contributions are summarized as follows:
1) This is the first work that addresses the keyword ambiguity
problem. We also identify three crucial issues that
an effective XML keyword search engine should meet.
2) We define our own XML TF (term frequency) and XML
DF (document frequency), which are cornerstones of all
formulae proposed later.
3) We propose three important guidelines in identifying the
user desired search for node type, and design a formula
to compute the confidence level of a certain node type
to be a desired search for node based on the guidelines.
4) We design formulae to compute the confidence of each
candidate node type as the desired search via node to
model natural human intuitions, in which we take into
account the pattern of keywords co-occurrence in query.
5) We propose a novel relevance oriented ranking scheme
called XML TF*IDF similarity which can capture the
hierarchical structure of XML and resolve Ambiguity
1-3 in a heuristic way. Besides, the popularity of query
results is designed to distinguish the results with comparable
relevance scores.
6) We implement the proposed techniques in a keyword
search engine prototype called XReal. Extensive experiments
show its effectiveness, efficiency and scalability.
The rest of the paper is organized as follows. We present the
related work in Section 2, and data model in Section 3. Section
4 infers user search intention. Section 5 discusses the ranking
scheme. Section 7 presents the search algorithms. Experiment
is discussed in Section 8 and we conclude in Section 9.