Scoring, term weighting and the vector space model

**seminar tips** · 05-10-2012, 05:27 PM

Scoring, term weighting and the vector space model

.pdf

Scoring, term weighting and the.pdf (Size: 298.69 KB / Downloads: 106)

Parametric and zone indexes

We have thus far viewed a document as a sequence of terms. In fact, most
documents have additional structure. Digital documents generally encode,
in machine-recognizable form, certain metadata associated METADATA with each document.
By metadata, we mean specific forms of data about a document, such
as its author(s), title and date of publication. This metadata would generally
FIELD include fields such as the date of creation and the format of the document, as
well the author and possibly the title of the document. The possible values
of a field should be thought of as finite – for instance, the set of all dates of
authorship.
Consider queries of the form“find documents authored byWilliamShakespeare
in 1601, containing the phrase alas poor Yorick”. Query processing then
consists as usual of postings intersections, except that we may merge postPARAMETRIC
INDEX ings from standard inverted as well as parametric indexes. There is one parametric
index for each field (say, date of creation); it allows us to select only
the documents matching a date specified in the query. Figure 6.1 illustrates
the user’s view of such a parametric search. Some of the fields may assume
ordered values, such as dates; in the example query above, the year 1601 is
one such field value. The search engine may support querying ranges on
such ordered values; to this end, a structure like a B-treemay be used for the
field’s dictionary.
ZONE Zones are similar to fields, except the contents of a zone can be arbitrary
free text. Whereas a field may take on a relatively small set of values, a zone
can be thought of as an arbitrary, unbounded amount of text. For instance,
document titles and abstracts are generally treated as zones. We may build a
separate inverted index for each zone of a document, to support queries such
as “find documents with merchant in the title and william in the author list and
the phrase gentle rain in the body”. This has the effect of building an index
that looks like Figure 6.2. Whereas the dictionary for a parametric index
comes from a fixed vocabulary (the set of languages, or the set of dates), the
dictionary for a zone index must structure whatever vocabulary stems from
the text of that zone.
In fact, we can reduce the size of the dictionary by encoding the zone in
which a term occurs in the postings. In Figure 6.3 for instance, we show how
occurrences of william in the title and author zones of various documents are
encoded. Such an encoding is useful when the size of the dictionary is a
concern (because we require the dictionary to fit in main memory). But there
is another important reason why the encoding of Figure 6.3 is useful: the
WEIGHTED ZONE efficient computation of scores using a technique we will call weighted zone
SCORING scoring.

Weighted zone scoring

Thus far in Section 6.1 we have focused on retrieving documents based on
Boolean queries on fields and zones. We now turn to a second application of
zones and fields.
Given a Boolean query q and a document d, weighted zone scoring assigns
to the pair (q, d) a score in the interval [0, 1], by computing a linear combination
of zone scores, where each zone of the document contributes a Boolean
value. More specifically, consider a set of documents each of which has ℓ
zones. Let g1, . . . , gℓ ∈ [0, 1] such that åℓ
i=1 gi = 1. For 1 ≤ i ≤ ℓ, let si be the
Boolean score denoting a match (or absence thereof) between q and the ith
zone. For instance, the Boolean score from a zone could be 1 if all the query
term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean
function that maps the presence of query terms in a zone to 0, 1.

Term frequency and weighting
Thus far, scoring has hinged on whether or not a query term is present in
a zone within a document. We take the next logical step: a document or
zone that mentions a query term more often has more to do with that query
and therefore should receive a higher score. To motivate this, we recall the
notion of a free text query introduced in Section 1.4: a query in which the
terms of the query are typed freeform into the search interface, without any
connecting search operators (such as Boolean operators). This query style,
which is extremely popular on the web, views the query as simply a set of
words. A plausible scoring mechanism then is to compute a score that is the
sum, over the query terms, of the match scores between each query term and
the document.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Design and Analysis Of Algorithms : Seminar Report and PPT	seminar projects maker	1	1,315	21-09-2017, 12:04 PM Last Post: jaseela123
	Software Test Factory (A proposal of a process model to create a Test Factory) Semi	seminar code	1	680	15-09-2017, 01:25 PM Last Post: jaseela123
	OSI Reference Model	seminar tips	1	748	11-09-2017, 03:38 PM Last Post: jaseela123
	Fuzzy Control Model Optimization for Behavior-Constant Traffic	seminar paper	1	2,243	06-09-2017, 02:46 PM Last Post: jaseela123
	OBJECT ORIENTED ANALYSIS AND DESIGN TWO MARK AND SIXTEEN MARK Q and A	seminar ideas	1	1,982	29-08-2017, 11:23 AM Last Post: jaseela123
	Short Term Load Forecasting with Fuzzy Logic Systems Report	seminar flower	1	963	28-08-2017, 05:00 PM Last Post: jaseela123
	Optical Free Space Communication	Computer Science Clay	0	10,364,880	25-08-2017, 09:32 PM Last Post: Computer Science Clay
	Free Space Optics	Computer Science Clay	0	12,463,887	25-08-2017, 09:32 PM Last Post: Computer Science Clay
	B-ISDN Reference Model	computer science crazy	0	16,847,383	25-08-2017, 09:32 PM Last Post: computer science crazy
	Pivot Vector Space Approach in Audio-Video Mixing	computer science crazy	0	11,011,943	25-08-2017, 09:32 PM Last Post: computer science crazy

Quick Reply
Message Type your reply to this message here. Disable Smilies	You have selected one or more posts to quote. Quote these posts now or deselect them.