INFORMATION RETRIEVAL ppt.

seminar class · 11-03-2011, 02:21 PM

ch19.ppt (Size: 258 KB / Downloads: 59)
Information Retrieval Systems
- Information retrieval (IR) systems use a simpler data model than database systems
  - Information organized as a collection of documents
  - Documents are unstructured, no schema
- Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents
  - e.g., find documents containing the words “database systems”
- Can be used even on textual descriptions provided with non-textual data such as images
- Web search engines are the most familiar example of IR systems
- Differences from database systems
  - IR systems don’t deal with transactional updates (including concurrency control and recovery)
  - Database systems deal with structured data, with schemas that define the data organization
  - IR systems deal with some querying issues not generally addressed by database systems
    - Approximate searching by keywords
    - Ranking of retrieved answers by estimated degree of relevance
Keyword Search
- In full text retrieval, all the words in each document are considered to be keywords.
  - We use the word “term” to refer to the words in a document
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
  - And is implicit: keywords listed without a connective are combined with and
- Ranking of documents on the basis of estimated relevance to a query is critical
  - Relevance ranking is based on factors such as
    - Term frequency
      - Frequency of occurrence of the query keyword in the document
    - Inverse document frequency
      - How many documents the query keyword occurs in
        - The fewer the documents, the more importance is given to the keyword
    - Hyperlinks to documents
      - The more links there are to a document, the more important the document is
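
The keyword queries described above can be sketched in a few lines of Python. This is only an illustration: it treats each document as the set of its terms and scans every document for each query. The `search` helper, its parameter names, and the sample documents are hypothetical and not part of the slides.

```python
def tokenize(text):
    """Split text into lowercase terms; every word counts as a keyword."""
    return set(text.lower().split())

def search(docs, all_of=(), any_of=(), none_of=()):
    """Return ids of documents matching a simple boolean keyword query.

    all_of  : terms joined by 'and' (the implicit default connective)
    any_of  : terms joined by 'or' (at least one must occur, if given)
    none_of : terms excluded with 'not'
    """
    hits = []
    for doc_id, text in docs.items():
        terms = tokenize(text)
        if not all(t in terms for t in all_of):
            continue
        if any_of and not any(t in terms for t in any_of):
            continue
        if any(t in terms for t in none_of):
            continue
        hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample collection
docs = {
    "d1": "database systems store structured data",
    "d2": "information retrieval systems rank documents",
    "d3": "database recovery and concurrency control",
}

print(search(docs, all_of=["database", "systems"]))            # ['d1']
print(search(docs, any_of=["retrieval", "recovery"]))          # ['d2', 'd3']
print(search(docs, all_of=["database"], none_of=["systems"]))  # ['d3']
```

A boolean query like this only decides which documents match. It says nothing about their order, which is why relevance ranking is needed on top of it.
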
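Relevance Ranking Using Terms
- TF-IDF (Term frequency/Inverse Document frequency) ranking:
  - Let n(d) = number of terms in the document d
  - n(d, t) = number of occurrences of term t in the document d
  - n(t) = number of documents that contain term t
  - Relevance of a document d to a term t:
    $TF(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$
    - The log factor is to avoid giving excessive weight to frequent terms
- Relevance of a document d to a query Q:
  $r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)}$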
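
As a rough sketch of this ranking model, the Python below assumes the usual textbook form of the formulas: TF(d, t) = log(1 + n(d, t) / n(d)), and r(d, Q) as the sum over the query terms of TF(d, t) / n(t), where n(t) is the number of documents containing t. The natural logarithm, the function names, and the sample documents are assumptions made for illustration.

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), using the natural log (an assumption)."""
    n_d = len(doc_terms)
    if n_d == 0:
        return 0.0
    return math.log(1 + doc_terms.count(term) / n_d)

def doc_count(docs, term):
    """n(t): number of documents that contain the term."""
    return sum(1 for terms in docs.values() if term in terms)

def relevance(docs, doc_id, query_terms):
    """r(d, Q): sum over the query terms of TF(d, t) / n(t)."""
    score = 0.0
    for t in query_terms:
        n_t = doc_count(docs, t)
        if n_t:  # skip terms that occur in no document
            score += tf(docs[doc_id], t) / n_t
    return score

# Hypothetical sample collection, already split into terms
docs = {
    "d1": "database systems and database design".split(),
    "d2": "information retrieval systems".split(),
    "d3": "operating systems".split(),
}
query = ["database", "systems"]

ranked = sorted(docs, key=lambda d: relevance(docs, d, query), reverse=True)
print(ranked)  # ['d1', 'd3', 'd2']
```

Here "systems" occurs in all three documents, so it contributes little to any score, while "database" occurs only in d1 and pushes that document to the top.
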
- Most systems add to the above model
  - Words that occur in the title, author list, section headings, etc. are given greater importance
  - Words whose first occurrence is late in the document are given lower importance
  - Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated
    - These are called stop words
  - Proximity: if the keywords in a query occur close together in the document, the document has higher importance than if they occur far apart
- Documents are returned in decreasing order of relevance score
  - Usually only the top few documents are returned, not all
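
Three of the refinements above lend themselves to a short Python sketch: removing stop words, measuring how close two query keywords are in a document, and returning only the top few results. The helper names and sample data are hypothetical, and how a proximity measure should be folded into the relevance score is left open here because the slides do not specify it.

```python
import heapq

STOP_WORDS = {"a", "an", "the", "it"}  # only the examples listed above

def remove_stop_words(terms):
    """Drop very common words before indexing or querying."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

def min_distance(terms, t1, t2):
    """Smallest gap in word positions between t1 and t2, or None if either is absent."""
    pos1 = [i for i, t in enumerate(terms) if t == t1]
    pos2 = [i for i, t in enumerate(terms) if t == t2]
    if not pos1 or not pos2:
        return None
    return min(abs(i - j) for i in pos1 for j in pos2)

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

terms = remove_stop_words("the database is a system and it stores the data".split())
print(terms)                                     # ['database', 'is', 'system', 'and', 'stores', 'data']
print(min_distance(terms, "database", "data"))   # 5

# Hypothetical relevance scores for three documents
print(top_k({"d1": 0.40, "d2": 0.10, "d3": 0.14}, k=2))  # [('d1', 0.4), ('d3', 0.14)]
```
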
Similarity Based Retrieval
- Similarity-based retrieval: retrieve documents similar to a given document
  - Similarity may be defined on the basis of common words
    - E.g., find the k terms t in the given document A with the highest TF(A, t) / n(t), and use these terms to find the relevance of other documents
- Relevance feedback: similarity can be used to refine the answer set of a keyword query
  - The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
- Vector space model: define an n-dimensional space, where n is the number of distinct words in the document set
  - The vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t), where t is the i-th word
  - The cosine of the angle between the vectors of two documents is used as a measure of their similarity
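
The vector space model can be sketched as follows. Each document becomes a sparse vector, stored as a dict, whose coordinate for term t is TF(d, t) / n(t). The sketch assumes TF(d, t) = log(1 + n(d, t) / n(d)) with a natural log, and n(t) as the number of documents containing t. The function names and sample documents are hypothetical.

```python
import math

def doc_vector(doc_terms, docs):
    """Map each term t of the document to TF(d, t) / n(t)."""
    vec = {}
    for t in set(doc_terms):
        tf = math.log(1 + doc_terms.count(t) / len(doc_terms))
        n_t = sum(1 for other in docs.values() if t in other)
        if n_t:  # guard against documents outside the collection
            vec[t] = tf / n_t
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample collection, already split into terms
docs = {
    "d1": "database systems design".split(),
    "d2": "database systems recovery".split(),
    "d3": "image retrieval ranking".split(),
}
vectors = {d: doc_vector(terms, docs) for d, terms in docs.items()}

print(round(cosine(vectors["d1"], vectors["d2"]), 3))  # 0.333
print(cosine(vectors["d1"], vectors["d3"]))            # 0.0
```

Documents d1 and d2 share two of their three terms and get a positive similarity, while d1 and d3 share none and get zero. For relevance feedback, the same `cosine` function could be used to compare the documents a user marked as relevant against the rest of the collection.
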
