19-01-2013, 12:34 PM
XML retrieval
INTRODUCTION
Information retrieval systems are often contrasted with relational databases.
Traditionally, IR systems have retrieved information from unstructured text
– by which we mean “raw” text without markup. Databases are designed
for querying relational data: sets of records that have values for predefined
attributes such as employee number, title and salary. There are fundamental
differences between information retrieval and database systems in terms of
retrieval model, data structures, and query language, as shown in Table 10.1.
Some highly structured text search problems are most efficiently handled
by a relational database. One example is an employee table that contains an
attribute for short textual job descriptions, where you want to find all
employees who are involved with invoicing.
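The invoicing example can be sketched with a plain pattern match on the job description attribute. This is a minimal illustration, not a recommended schema: the table name `employee`, the column names, and the sample rows are all assumptions made up for the sketch, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database with a hypothetical employee table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "emp_no INTEGER PRIMARY KEY, title TEXT, salary REAL, job_desc TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Accountant", 52000, "Handles invoicing and payroll"),
        (2, "Engineer", 70000, "Maintains the build servers"),
        (3, "Clerk", 38000, "Invoicing support for EU customers"),
    ],
)

# LIKE does a substring match on the short text attribute; in SQLite it is
# case-insensitive for ASCII letters by default.
rows = conn.execute(
    "SELECT emp_no, title FROM employee WHERE job_desc LIKE ? ORDER BY emp_no",
    ("%invoicing%",),
).fetchall()

print(rows)
```

A substring match like this has no ranking and no notion of partial relevance, which is acceptable here only because the text field is short and the information need is precise.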
Basic XML concepts
An XML document is an ordered, labeled tree. Each node of the tree is an
XML element and is written with an opening and closing tag. An element can
have one or more XML attributes. In the XML document in Figure 10.1, the
scene element is enclosed by the two tags <scene ...> and </scene>. It
has an attribute number with value vii and two child elements, title and verse.
Figure 10.2 shows Figure 10.1 as a tree. The leaf nodes of the tree consist of
text, e.g., Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodes
encode either the structure of the document (title, act, and scene) or metadata
functions (author).
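To make the tree view concrete, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module and walks it in document order. The XML is a reconstruction from the description above, not a copy of Figure 10.1: the `number` value on `act` and the verse placeholder are assumptions.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the prose; act number and verse text are placeholders.
DOC = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>(verse text omitted)</verse>
    </scene>
  </act>
</play>"""

root = ET.fromstring(DOC)


def walk(elem, depth=0):
    """Yield (depth, tag, attributes, leaf text) for each element in document order."""
    text = (elem.text or "").strip()
    yield depth, elem.tag, dict(elem.attrib), text
    for child in elem:
        yield from walk(child, depth + 1)


# Print the tree with indentation that mirrors nesting depth.
for depth, tag, attrs, text in walk(root):
    print("  " * depth + tag, attrs or "", repr(text) if text else "")

# The scene element: one attribute and two child elements.
scene = root.find("./act/scene")
scene_number = scene.get("number")
scene_children = [child.tag for child in scene]
```

Here `scene_number` holds the attribute value and `scene_children` lists the child element names, matching the description of the scene element in the text.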
Challenges in XML retrieval
In this section, we discuss a number of challenges that make structured retrieval
more difficult than unstructured retrieval. Recall the
basic setting we assume in structured retrieval: the collection consists of
structured documents and queries are either structured (as in Figure 10.3)
or unstructured (e.g., summer holidays).
The first challenge in structured retrieval is that users want us to return
parts of documents (i.e., XML elements), not entire documents as IR systems
usually do in unstructured retrieval. If we query Shakespeare’s plays for
Macbeth’s castle, should we return the scene, the act or the entire play in Figure
10.2? In this case, the user is probably looking for the scene. On the other
hand, an otherwise unspecified search for Macbeth should return the play of
this name, not a subunit.
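The ambiguity about which unit to return can be seen by listing every element whose text contains the query. The sketch below is only an illustration of the problem, not a retrieval method from the text: `candidate_units` is a hypothetical helper, it uses crude lowercase substring matching, and the XML is a reconstruction with a placeholder verse and an assumed act number.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample document; act number and verse text are placeholders.
DOC = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>(verse text omitted)</verse>
    </scene>
  </act>
</play>"""


def candidate_units(root, query):
    """Return paths of all elements whose subtree text contains every query term."""
    terms = query.lower().split()
    hits = []

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        # itertext() yields the text of the element and all its descendants.
        text = " ".join(elem.itertext()).lower()
        if all(term in text for term in terms):
            hits.append(path)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return hits


root = ET.fromstring(DOC)
castle_hits = candidate_units(root, "Macbeth's castle")
macbeth_hits = candidate_units(root, "Macbeth")
print(castle_hits)
print(macbeth_hits)
```

Both queries match several nested elements at once, from the scene title up to the whole play, so matching alone does not say which level the user wants. Choosing among these candidates is the challenge the text describes.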