10-08-2012, 10:55 AM
Effective Navigation on Query Results of iomedical Databases
BioNav ICDE09 CR.pdf (Size: 274.8 KB / Downloads: 19)
INTRODUCTION
The last decade has been marked by unprecedented growth
in both the production of biomedical data and the amount of
published literature discussing it. The MEDLINE database, on
which the PubMed search engine operates, contains over 18
million citations, and the database is currently growing at the
rate of 500,000 new citations each year [7]. Biologists,
chemists, medical and health scientists are used to searching
their domain literature – such as PubMed – using a keyword
search interface. Currently, in an exploratory scenario where
the user tries to find citations relevant to her line of research
and hence not known a priori, she submits an initially broad
keyword-based query that typically returns a large number of
results. Subsequently, the user iteratively refines the query, if
she has an idea of how to, by adding more keywords, and resubmits
it, until a relatively small number of results are
returned. This refinement process is problematic because after
a number of iterations the user is not aware if she has overspecified
the query, in which case relevant citations might be
excluded from the final query result.
Definition 1 (Concept Hierarchy):
A Concept Hierarchy
, , is a labeled tree consisting of a set of concept
nodes, a set of edges and is rooted at node . Each node
has a label
and a unique identifier id.
According to the semantics of the MeSH concept hierarchy
[13], the label of a child concept node is more specific than
the one of its parent. This also holds for most concept
hierarchies.
Once the user issues a keyword query, PubMed−BioNav
uses the Entrez Programming Utilities (eUtils) [4]−returns a
list of citations, each associated with several MeSH concepts.
BioNav constructs an Initial Navigation Tree by attaching to
each concept node of the MeSH concept hierarchy a list of its
associated citations. Formally, an Initial Navigation Tree
, , is a concept hierarchy, where every node (concept)
is additionally labelled with a results (citations) list
CONCLUSIONS AND FUTURE WORK
Information overload is a major problem when searching
biomedical databases like PubMed, where typically a large
number of citations are returned, of which only a small subset
is relevant to the user. In this paper, we presented the BioNav
system to address this problem. Our solution is to organize the
query results according to their associations to concepts of the
MeSH concept hierarchy, and provide a dynamic navigation
method that minimizes the information overload observed by
the user. In the future, we will develop heuristic algorithms to
improve the response time for expand operations in BioNav
and work with domain experts to refine the navigation cost
model.
ALGORITHMS FOR BEST EDGECUT
Given the cost equation described in Section III and [2], we
can compute the optimal cost by recursively enumerating all
possible sequences of valid EdgeCuts, starting from the root
and reaching every concept in the navigation tree, computing
the cost for each step and taking the minimum. However, this
algorithm is prohibitively expensive. Instead we propose an
alternative algorithm Opt-EdgeCut that makes use of the
dynamic programming technique to reduce the computation
cost. Opt-EdgeCut is still exponential.
The Opt-EdgeCut algorithm to compute the minimum
expected navigation cost (and the EdgeCut that achieves it)
traverses the navigation tree in post-order and computes the
navigation cost bottom-up starting from the leaves. For each
node , the algorithm enumerates and stores the list of
all possible EdgeCuts for the subtree rooted at , and the list
of all possible sets that node can be annotated
with. The algorithm then computes the minimum cost for each
subtree in given the EdgeCuts in and the already
computed minimum costs for the descendants of .
.