14-06-2012, 02:28 PM
Implementation of Conflation algorithm.
Introduction to Information Retrieval
In today’s era of information explosion, the growing demand for quicker dissemination of
information from content stored in a variety of forms requires speedy search and timely
retrieval. The value of a document is measured by the information it contains, but a
document is useless until that stored information is brought out for readers to use. This
may be done either by subject analysis or by representing the terms through symbols. The
long-standing need of scholars, and the lingering concern of library organizers, to extract
content expeditiously and exhaustively is what brought forward the concept of information
retrieval.
Meaning & Definition:
Calvin Mooers coined the term information retrieval in 1950. In the context of library and
information science, it means getting back information that is, in a way, hidden from
normal sight. According to J.H. Shera, it is "the process of locating and selecting
data, relevant to a given requirement."
Calvin Mooers: "Searching and retrieval of information from storage, according to
specification by subject."
Functions:
The major functions that constitute an information retrieval system are: Acquisition,
Analysis, Representation of information, Organisation of the indexes, Matching, Retrieving,
Readjustment and Feedback.
Components of Information Retrieval System:
A study of the functions of an information retrieval system (IRS) brings out the essential
components the system needs in order to function properly. According to Lancaster, an
information retrieval system consists of six basic subsystems. They are as follows:
1. The document selection subsystem
2. The indexing subsystem
3. The vocabulary subsystem
4. The searching subsystem
5. The user-system interface
6. The matching subsystem
All the above subsystems may be grouped under two headings: subject/content analysis and
search strategy. Subject or content analysis includes the tasks of analysis, organisation and
storage of information. Search strategy includes analysis of user queries, creation of the
search formula and the actual searching.
Conflation Algorithm
Ultimately one would like to develop a text processing system which, by means of computable methods and with the minimum of human intervention, will generate from the input text (full text, abstract, or title) a document representative adequate for use in an automatic retrieval system. This is a tall order and can only be partially met. The output of a conflation algorithm is a set of classes, one for each stem detected; a document is indexed by a class name if one of its significant words occurs as a member of that class.
Such a system will usually consist of three parts:
(1) removal of high frequency words,
(2) suffix stripping,
(3) detecting equivalent stems.
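Parts (2) and (3) can be illustrated with a minimal Python sketch. The suffix list, the minimum stem length and the function names (strip_suffix, conflation_classes) are assumptions made for this example and do not come from the write-up; the sketch also takes the simplest view of equivalence, treating two stems as equivalent only when they are identical.

```python
from collections import defaultdict

# Hypothetical, deliberately small suffix list; longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "ed", "es", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first (longest) matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflation_classes(words):
    """Group words that share a stem; the stem serves as the class name."""
    classes = defaultdict(set)
    for word in words:
        lowered = word.lower()
        classes[strip_suffix(lowered)].add(lowered)
    return dict(classes)


classes = conflation_classes(["indexing", "indexes", "Indexed", "search"])
```

Here "indexing", "indexes" and "indexed" all fall into the class named "index", so a document containing any one of them would be indexed under that class name. A production stemmer would use a much fuller rule set than this.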
The removal of high frequency words, 'stop' words or 'fluff' words is one way of
implementing Luhn's upper cut-off. This is normally done by comparing the input text with a
'stop list' of words which are to be removed. The advantages of the process are not only that
non-significant words are removed and will therefore not interfere during retrieval, but also
that the size of the total document file can be reduced by between 30 and 50 per cent.
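The comparison against a stop list can be sketched as follows. The stop list shown is a tiny illustrative sample, and the names tokenize and remove_stop_words are assumptions for this example; in practice the list would be read from the stop word file.

```python
import re

# Illustrative stop list only; a real one is much longer and usually loaded from a file.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of text, in order, that are not on the stop list."""
    return [word for word in tokenize(text) if word not in stop_words]


kept = remove_stop_words("The removal of stop words is the first step")
# kept == ["removal", "stop", "words", "first", "step"]
```

Keeping the stop list in a set makes each lookup constant time, which matters when the whole document file is filtered.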
INPUT:
1. A text file containing stop words
2. A document to be searched and indexed according to the frequency of its words
OUTPUT:
A document containing the frequently appearing words, with stop words removed and
suffixes stripped (stemmed).
OPERATIONAL STEPS REQUIRED: The steps required for the conflation algorithm are as follows:
1) Removal of High Frequency words.
2) Suffix Stripping (Stemming).
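The two steps, applied to the inputs listed above, might be combined as in the following sketch. It is one possible arrangement, not the implementation from the attached write-up: the suffix list is a hypothetical sample, the stop word file is assumed to hold one word per line, and the function names are invented for the example.

```python
import re
from collections import Counter
from pathlib import Path

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical sample, far from complete


def load_stop_words(path):
    """Read stop words from a text file, assuming one word per line."""
    lines = Path(path).read_text().splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(text, stop_words):
    """Step 1: drop stop words. Step 2: stem the rest. Then count the stems."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [word for word in words if word not in stop_words]
    return Counter(stem(word) for word in kept)


counts = conflate(
    "The index is indexed and the indexes are searched",
    {"the", "is", "and", "are"},
)
# counts.most_common() == [("index", 3), ("search", 1)]
```

With real files, the call would look like conflate(Path("document.txt").read_text(), load_stop_words("stopwords.txt")), where both file names are placeholders. The most_common() method of the resulting counter then gives the stems ordered by frequency.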
ADVANTAGES:
The advantages of the algorithm are not only that non-significant words are removed and will therefore not interfere during retrieval, but also that the size of the total document file can be reduced by between 30 and 50 per cent.