17-07-2012, 02:11 PM
Effective Navigation of Query Results Based on Concept Hierarchies
Effective Navigation of Query Results Based on Concept Hierarchies.pdf (Size: 963.34 KB / Downloads: 57)
Abstract
Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small
subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been proposed to
alleviate this information overload problem. Results categorization for biomedical databases is the focus of this work. A natural
way to organize biomedical citations is according to their MeSH annotations. MeSH is a comprehensive concept hierarchy used
by PubMed. In this paper, we present the BioNav system, a novel search interface that enables the user to navigate large
number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a
navigation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the
expected user navigation cost is minimized. In contrast, previous works expand the hierarchy in a predefined static manner,
without navigation cost modeling. We show that the problem of selecting the best concepts to reveal at each node expansion is
NP-complete and propose an efficient heuristic as well as a feasible optimal algorithm for relatively small trees. We show
experimentally that BioNav outperforms state-of-the-art categorization systems with respect to the user navigation cost. We
have implemented BioNav for the MEDLINE database at http://db.cse.buffalo.edu/bionav.
Index Terms—Interactive data exploration and discovery, Search process, Graphical user interfaces, Interaction styles.
1 INTRODUCTION
he last decade has been marked by unprecedented
growth in both the production of biomedical data and
the amount of published literature discussing it. The
MEDLINE database, on which the PubMed search engine
operates, contains over 18 million citations, and the database
is currently growing at the rate of 500,000 new citations
each year [19]. Other biological sources, such as Entrez
Gene [17] and OMIM [20], witness similar growth. As
claimed in previous work [25], the ability to rapidly survey
this literature constitutes a necessary step toward
both the design and the interpretation of any large scale
experiment. Biologists, chemists, medical and health
scientists are used to searching their domain literature –
such as PubMed– using a keyword search interface. Currently,
in an exploratory scenario where the user tries to
find citations relevant to her line of research and hence
not known a priori, she submits an initially broad keyword-
based query that typically returns a large number
of results. Subsequently, the user iteratively refines the
query, if she has an idea of how to, by adding more keywords,
and re-submits it, until a relatively small number
of results are returned. This refinement process is problematic
because after a number of iterations the user is not
aware if she has over-specified the query, in which case
relevant citations might be excluded from the final query
result.
As an example, a query on PubMed for “cancer” returns
more than 2 million citations. Even a more specific
query for “prothymosin”, a nucleoprotein gaining attention
for its putative role in cancer development, returns
313 citations. The size of the query result makes it difficult
for the user to find the citations that she is most interested
in, and a large amount of effort is expended searching for
these results. Many solutions have been proposed to address
this problem –commonly referred to as information
overload [1,2,3,9]. These approaches can be broadly classified
into two classes: ranking and categorization, which
can also be combined.
BioNav belongs primarily to the categorization class,
which is ideal for this domain given the rich concept hierarchies
(e.g., MeSH [18]) available for biomedical data.
We augment our categorization techniques with simple
ranking techniques. BioNav organizes the query results
into a dynamic hierarchy, the navigation tree. Each concept
(node) of the hierarchy has a descriptive label. The user
then navigates this tree structure, in a top-down fashion,
exploring the concepts of interest while ignoring the rest.
An intuitive way to categorize the results of a query on
PubMed is using the MeSH static concept hierarchy [18],
thus utilizing the initiative of the US National Library of
Medicine (NLM) to build and maintain such a comprehensive
structure. Each citation in MEDLINE is associated
with several MeSH concepts in two ways: (i) by being
explicitly annotated with them, and (ii) by mentioning
those in their text (see Section 7 for details). Since these
associations are provided by PubMed, a relatively
straightforward interface to navigate the query result
would first attach the citations to the corresponding
MeSH concept nodes and then let the user navigate the
xx-xx/0x/$xx.00 © 200x IEEE
————————————————
• Abhijith Kashyap is with the Department of Computer Science and Engineering,
University at Buffalo, SUNY, Buffalo, NY 14260. E-mail:
rk39[at]cse.buffalo.edu.
• Vagelis Hristridis is with the School of Computing and Information
Sciences, Florida International University, Miami FL 33199. E-mail: vagelis@
cis.fiu.edu.
• Michalis Petropoulos is with the Department of Computer Science and
Engineering, University at Buffalo, SUNY, Buffalo, NY 14260. E-mail:
mpetropo[at]cse.buffalo.edu.
• Sotiria Tavoulari is with the Department of Pharmacology, Yale University,
New Haven, CT 06520. E-mail: sotiria.tavoulari[at]yale.edu.
Manuscript received (insert date of submission if desired).
T
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.23, NO. 4, April 2011
2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID
navigation tree. Fig. 1 displays a snapshot of such an interface
where shown next to each node label is the count
of distinct citations in the subtree rooted at that node. A
typical navigation starts by revealing the children of the
root ranked by their citation count, and is continued by
the user expanding on or more of them, revealing their
ranked children and so on, until she clicks on a concept
and inspects the citations attached to it. A similar interface
and navigation method is used by e-commerce sites,
such as Amazon and eBay. For this example, we assume
that the user will navigate to the three indicated concepts
corresponding to three independent lines of research related
to prothymosin.
MESH (313)
Amino Acids, Peptides, and Proteins (310)
Proteins (307)
Nucleoproteins (40)
Biological Phenomena, Cell Phenomena, and Immunity (217)
Cell Physiology (161)
Cell Growth Processes (99)
Genetic Processes (193)
Gene Expression (92)
Transcription, Genetic (25)
95 more nodes
2 more nodes
45 more nodes
4 more nodes
3 more nodes
15 more nodes
10 more nodes
1 more node
Histones (15)
Fig. 1. Static Navigation on the MeSH Concept Hierarchy1
The above static same for every query result navigation
method is problematic when the MeSH hierarchy (or one
with similar properties) is used for categorization for the
following reasons:
• The massive size of the MeSH hierarchy (over
48,000 concept nodes) makes it challenging for the
users to effectively navigate to the desired concepts
and browse the associated records. Even if
we remove from the MeSH concept nodes with no
citations attached to them, the 313 distinct query
results for “prothymosin” are attached to 3,940
nodes, which is the actual size of the navigation
tree in Fig. 1. Combined with the fact that the
MeSH hierarchy is quite bushy on the upper levels,
this means that the user has to inspect, for
example, a total of 152 concept nodes before she
reaches the indicated concept “Histones”; a number
comparable to the distinct citation count in the
query result. A common practice [27] for hierarchy
navigation is to show only a subset of a node’s
children, which would be appropriate if only few
nodes contain many results. Unfortunately, this is
not the case for the MeSH navigation tree; most of
the 98 children of the root in Fig. 1 have many results
(the first three shown have 310, 217 and 193).