ABSTRACT
Information retrieval has become an important research area in the field of computer science. Information retrieval (IR) is generally concerned with searching for and retrieving knowledge-based information from a database. In this paper, we present the Vector Space Model (VSM) for information retrieval. This review paper describes and compares different VSM techniques and indexing methods for reducing the search space when retrieving information. The paper covers the different VSM variants, such as the simplest VSM and improved vector placement using term frequency and inverse document frequency (TF-IDF) weighting, and also presents the motivation for the VSM, a literature review, and an analysis of existing VSM work.
INTRODUCTION
Information retrieval is generally considered a subfield of computer science that deals with the representation, storage, and access of information. It is concerned with the organization and retrieval of information from large document collections. Information retrieval is the process by which a collection of data is represented, stored, and searched for the purpose of knowledge discovery in response to a user request (query). This process involves several stages, beginning with representing the data and ending with returning relevant information to the user; intermediate stages include filtering, searching, matching, and ranking operations. The main goal of an information retrieval system (IRS) is to find relevant information, or documents, that satisfy the user's information need. To achieve this goal, IRSs usually implement the following processes:
1) In the indexing process, the documents are represented in a summarized content form.
2) In the filtering process, all stop words and common words are removed.
3) Searching is the core process of an IRS. There are various techniques for retrieving documents that match the user's need.
MOTIVATION OF VSM
The amount of text material is growing at an exponential rate, especially with the increasing use and applications of the Internet. Day by day it is becoming more difficult to retrieve the relevant information. Researchers have used various approaches to improve the relevance of retrieved results.
The Boolean model matches queries with precise semantics against the document collection using Boolean operations with the operators AND, OR, and NOT. It predicts only whether each document is relevant or non-relevant, which leads to the disadvantage of retrieving either too few or too many documents. The Boolean model is the simplest model; its inability to perform partial matching leads to poor retrieval performance. The vector space model, in contrast, uses non-binary weights for the index terms in queries and in documents.
INTRODUCTION TO THE VSM
Gerard Salton and his colleagues suggested a model based on Luhn's similarity criterion that has a strong theoretical motivation (Salton and McGill 1983). They considered the index representations and the query as vectors embedded in a high-dimensional Euclidean space, where each term is assigned a separate dimension. The vector space model can best be characterized by its attempt to rank documents by the similarity between the query and each document. In the Vector Space Model (VSM), documents and queries are represented as vectors, and the angle between two vectors is computed using the cosine similarity function.
The idea underlying this model is the following. First, each term ti in the dictionary V is represented by a basis vector in a |V|-dimensional orthonormal space. In other words, the vector for ti contains all 0s except for a single 1 in the dimension associated with ti: for instance, in a 10-dimensional vector space where the term "jazz" is mapped to the third dimension, we have tjazz = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0}.
Any query q and document dj ∈ D may be represented in the vector space as:
q = (w1q, w2q, …, w|V|q) and dj = (w1j, w2j, …, w|V|j)
where wiq and wij are the weights assigned to term ti for the query q and for each document dj, according to a chosen weighting scheme.
Document vectors and query vectors can be used to compute a degree of similarity between documents, or between documents and queries. Such a similarity can intuitively be represented as the projection of one vector onto another, expressed mathematically as the dot product; a query-document similarity score may therefore be computed as:
SC(dj, q) = sim(dj, q) = dj • q
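As an illustration, the following is a minimal Python sketch of this dot-product score; the dictionary size, weight values, and variable names are illustrative placeholders rather than values from the paper.

# Query-document score as a plain dot product over a chosen weighting scheme.
# The weight values below are illustrative placeholders for a five-term dictionary.
d_j = [0.0, 2.0, 1.0, 0.0, 3.0]   # w_ij for terms t1..t5 in document dj
q   = [0.0, 1.0, 0.0, 0.0, 1.0]   # w_iq for terms t1..t5 in the query q
sc_dj_q = sum(w_ij * w_iq for w_ij, w_iq in zip(d_j, q))   # SC(dj, q) = dj • q
print(sc_dj_q)   # 5.0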
1) Dimension Instantiation: Bag of Words (BOW)
It represents the vocabulary V = (w1, …, wN), which contains all the words occurring in the document collection.
2) Vector Placement: Bit Vector
In the simplest VSM, vector placement is represented by a bit vector: an entry is 1 if the corresponding vocabulary word is present in the query or document, and 0 otherwise.
xi, yi ∈ {0, 1}, where xi is the term weight for words in the query and yi is the term weight for words in the documents.
3) Similarity Instantiation: Dot Product
In the simplest VSM model, we use a simple dot product of the query vector and the document vector.
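As a concrete illustration, the following is a minimal Python sketch that ties the three components together (bag-of-words vocabulary, bit vectors, and dot-product scoring); it assumes plain whitespace tokenization, and all names are illustrative.

def tokenize(text):
    return text.lower().split()

def build_vocabulary(documents):
    # Bag of Words: the vocabulary collects every word in the collection.
    return sorted({word for doc in documents for word in tokenize(doc)})

def bit_vector(text, vocabulary):
    # Vector placement: 1 if the vocabulary word occurs in the text, 0 otherwise.
    words = set(tokenize(text))
    return [1 if term in words else 0 for term in vocabulary]

def dot_product(x, y):
    # Similarity instantiation: simple dot product of the two bit vectors.
    return sum(a * b for a, b in zip(x, y))

documents = ["information retrieval with the vector space model",
             "boolean model for document retrieval"]
vocabulary = build_vocabulary(documents)
query_vector = bit_vector("vector space model", vocabulary)
scores = [dot_product(bit_vector(doc, vocabulary), query_vector) for doc in documents]
# Documents can now be ranked by score; here scores == [3, 1].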
1) COSINE SIMILARITY
The Euclidean dot product in the orthonormal vector space defined by the terms ti ∈ V provides an example of such a metric, also called cosine similarity. Cosine similarity is a function of the size of the angle α formed by dj and q in the space:
sim(dj, q) = (dj • q) / (|dj| |q|) = cos α
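A minimal Python sketch of cosine similarity over two weight vectors follows; the function name and example vectors are illustrative.

import math

def cosine_similarity(d, q):
    # sim(dj, q) = (dj • q) / (|dj| |q|)
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

# Unlike the plain dot product, the score does not grow with document length.
print(cosine_similarity([2.0, 1.0, 0.0], [1.0, 1.0, 0.0]))   # about 0.949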
2) JACCARD SIMILARITY
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|
The similarity function in the VSM using Jaccard similarity treats dj and q as sets of terms:
sim(dj, q) = |dj ∩ q| / |dj ∪ q|
The problem with this similarity is that it does not consider term frequency (how many times a term occurs in a document). Rare terms in a collection are more informative than frequent terms, and Jaccard does not take this information into account.
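The following is a minimal Python sketch of Jaccard similarity over the term sets of a document and a query; names and example strings are illustrative.

def jaccard_similarity(doc_terms, query_terms):
    # |intersection| / |union| over the two term sets.
    d, q = set(doc_terms), set(query_terms)
    if not d and not q:
        return 0.0
    return len(d & q) / len(d | q)

print(jaccard_similarity("vector space model".split(),
                         "vector model for retrieval".split()))   # 2/5 = 0.4

Note that a term occurring ten times in a document contributes exactly as much as a term occurring once, which is the term-frequency limitation mentioned above.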
3) DICE SIMILARITY
To provide an improvement over the traditional IR system, Dice similarity has been used; the Dice coefficient is defined as twice the size of the intersection divided by the sum of the sizes of the two sets. Combined with dissociated crossover and point mutation, it gives the highest improvement over the traditional approach.
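For comparison, a minimal Python sketch of the Dice coefficient over term sets follows; names and example strings are illustrative, and the genetic operators mentioned above are outside the scope of this sketch.

def dice_similarity(doc_terms, query_terms):
    # 2 * |intersection| / (|A| + |B|) over the two term sets.
    d, q = set(doc_terms), set(query_terms)
    if not d and not q:
        return 0.0
    return 2 * len(d & q) / (len(d) + len(q))

print(dice_similarity("vector space model".split(),
                      "vector model for retrieval".split()))   # 4/7, about 0.571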
Limitations of the existing work on the simplest VSM:
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match".
3. Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
4. The order in which the terms appear in the document is lost in the vector space representation.
5. Theoretically assumes terms are statistically independent.
6. Weighting is intuitive but not very formal.