18-07-2014, 02:46 PM
Advanced Search Engine using LSI & PLSI with NLG
ABSTRACT
Traditional search engines have two main drawbacks. First, synonyms and terms similar to the
keywords are not taken into consideration when searching web pages; users may need to input
several similar keywords individually to complete a search. Second, traditional search engines
treat all keywords as equally important and cannot differentiate the importance of one keyword
from that of another. In this project, we develop a semantics-based search engine that
automatically retrieves web pages that contain synonyms of, or terms similar to, the keywords.
Our search engine not only gives the rankings/ratings of the web pages, but also creates a
summary of each of the top-ranked pages and displays it to the user. The user can simply go
through the summary and, if he wants more information about a web page, visit that particular
website. We have included another feature that extracts a combined summary of all the pages
about a keyword and displays it to the user, so that the user can learn about the keyword in
one go.
INTRODUCTION
In recent years, search engine technology has had to scale up dramatically in order to keep up
with the growing amount of information available on the web. We propose a new method for
further improving targeted web information retrieval (IR) by combining text analysis with link
analysis, and we compare it against existing methods.
Latent semantic indexing is an indexing and retrieval method that uses a mathematical technique
called singular value decomposition (SVD) to identify patterns in the relationships between the
terms and concepts contained in an unstructured collection of text. LSI is based on the principle
that words that are used in the same contexts tend to have similar meanings. A key feature of LSI
is its ability to extract the conceptual content of a body of text by establishing associations
between those terms that occur in similar contexts.
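As a rough sketch of this idea, the toy example below (the terms, documents, and the choice of rank k are all invented for illustration) applies SVD to a small term-document matrix and compares two documents in the reduced concept space:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms: "ship", "boat", "ocean", "tree", "wood".
A = np.array([
    [1, 0, 1, 0, 0],   # ship
    [0, 1, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0],   # ocean
    [0, 0, 0, 1, 1],   # tree
    [0, 0, 0, 1, 0],   # wood
], dtype=float)

# Singular value decomposition, then keep only the k largest singular
# values: a rank-k approximation of A (the "concept space").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Document coordinates in the k-dimensional concept space.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]   # shape (k, n_docs)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 1 ("boat", "ocean") and 2 ("ship" only) share no terms, so
# their raw cosine similarity is zero; in concept space they end up close
# together, because "ship" and "boat" both co-occur with "ocean".
sim = cosine(doc_vectors[:, 1], doc_vectors[:, 2])
```

This is the usual motivation for LSI: similarity is measured through latent concepts rather than literal term overlap.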
Probabilistic latent semantic indexing (PLSI) is a statistical technique for the analysis of
two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of
the observed variables in terms of their affinity to certain hidden variables, just as in latent
semantic indexing. PLSI evolved from latent semantic indexing. Compared to standard latent
semantic indexing, which stems from linear algebra and downsizes the occurrence tables (usually
via singular value decomposition), probabilistic latent semantic indexing is based on a mixture
decomposition derived from a latent class model.
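A minimal sketch of that mixture decomposition, P(d, w) = Σ_z P(z) P(d|z) P(w|z), fitted with expectation-maximization; the toy count matrix, the number of topics, and the iteration count are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# n[d, w]: observed term counts, documents x words (invented toy data).
n = np.array([
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
], dtype=float)
D, W = n.shape
K = 2  # number of latent classes (topics) z

# Random initialisation of the model parameters.
p_z = np.full(K, 1.0 / K)                                              # P(z)
p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)  # P(d|z)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)

for _ in range(50):
    # E-step: posterior P(z|d,w) proportional to P(z) P(d|z) P(w|z).
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)   # shape (K, D, W)

    # M-step: re-estimate each parameter from the expected counts.
    expected = n[None, :, :] * post
    p_w_z = expected.sum(axis=1)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = expected.sum(axis=2)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = expected.sum(axis=(1, 2))
    p_z /= p_z.sum()

# The fitted mixture model of the co-occurrence table.
p_dw = (p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]).sum(axis=0)
```

The low-dimensional representation here is the set of posterior topic affinities, playing the role that the truncated singular vectors play in LSI.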
LITERATURE SURVEY
With the tremendous growth in the volume of data and in the number of web pages, traditional
search engines are no longer adequate. The search engine is the most important tool for
discovering information on the World Wide Web. The semantic search engine grew out of the
traditional search engine to overcome this problem. The Semantic Web is an extension of the
current web in which information is given well-defined meaning. Semantic web technologies play
a crucial role in enhancing traditional web search, as they work to create machine-readable
data, but they will not replace the traditional search engine. In this paper we make a brief
survey of various promising features of some of the best semantic search engines developed so
far, and we discuss the various approaches to semantic search. We summarize the techniques and
advantages of some important semantic web search engines developed to date.
Search Engine
A search engine performs five basic operations: crawling, indexing, processing, calculating
relevancy, and retrieving.
There are basically three steps involved in the web crawling procedure. First, the search bot
starts by crawling the pages of your site. Then it continues by indexing the words and content
of the site, and finally it visits the links (web page addresses or URLs) found on your site.
When the spider doesn't find a page, that page will eventually be deleted from the index.
Processing is then applied to the indexed data so it can be converted into a readable/usable
format for the calculations that are to take place using LSI or PLSI. The calculations then
produce results showing the relevance ranking of the documents. Documents above a specific rank
are selected and their data is retrieved.
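The last two operations, calculating relevancy and retrieving documents above a cut-off, might be sketched as follows; the toy vectors and the threshold value are assumptions, not part of the original design:

```python
import numpy as np

# Toy weighted term vectors: rows = documents in the index, columns = terms.
index = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
])
query = np.array([1.0, 1.0, 0.0])

def rank_documents(index, query):
    """Cosine relevance of every indexed document to the query, plus the
    document ids sorted from most to least relevant."""
    norms = np.linalg.norm(index, axis=1) * np.linalg.norm(query)
    scores = index @ query / np.where(norms == 0.0, 1.0, norms)
    order = np.argsort(-scores)
    return order, scores

order, scores = rank_documents(index, query)

# Retrieve only the documents scoring above an assumed relevance cut-off.
threshold = 0.2
retrieved = [int(i) for i in order if scores[i] > threshold]
```

In the full system the rows of `index` would come from the LSI or PLSI representation rather than raw term counts.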
Latent Semantic Indexing
Latent semantic indexing is an indexing and retrieval method that uses a mathematical technique
called singular value decomposition (SVD) to identify patterns in the relationships between the
terms and concepts contained in an unstructured collection of text. LSI is based on the principle
that words that are used in the same contexts tend to have similar meanings. A key feature of LSI
is its ability to extract the conceptual content of a body of text by establishing associations
between terms that occur in similar contexts.
Term Document Matrix
LSI begins by constructing a term-document matrix, A, to identify the occurrences of
the unique terms within a collection of documents. In a term-document matrix, each term is
represented by a row, and each document is represented by a column, with each matrix cell, a_ij,
initially representing the number of times the associated term i appears in the indicated
document j. This matrix is usually very large and very sparse.
Once a term-document matrix is constructed, local and global weighting functions can be applied
to it to condition the data. The weighting functions transform each cell, a_ij, of A, to be the
product of a local term weight, l_ij, which describes the relative frequency of a term in a
document, and a global weight, g_i, which describes the relative frequency of the term within
the entire collection of documents.
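One common concrete choice for these weights in LSI systems is the log-entropy scheme: local weight l_ij = log(1 + tf_ij), and a global entropy weight g_i that downweights terms spread evenly over the collection. The toy counts below are invented for illustration:

```python
import numpy as np

# Raw term-document counts: rows = terms, columns = documents.
tf = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 1],
], dtype=float)
n_docs = tf.shape[1]

# Local weight l_ij = log(1 + tf_ij): damps the effect of raw frequency.
local = np.log1p(tf)

# Global (entropy) weight g_i: terms concentrated in few documents get
# weights near 1; terms spread evenly over all documents get weights near 0.
gf = tf.sum(axis=1, keepdims=True)   # global frequency of each term
p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy = 1.0 + plogp.sum(axis=1) / np.log(n_docs)

# Each cell of the conditioned matrix is the product of its local weight
# and its term's global weight: a_ij = l_ij * g_i.
A = local * entropy[:, None]
```

Note how the third term, which appears once in every document, receives a global weight of zero: it carries no information for discriminating between documents.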
Extraction Based Summarization
Various methods have been proposed to achieve extractive summarization. Most of them are
based on scoring of the sentences. Maximal Marginal Relevance scores the sentences according
to their relevance to the query, Mutual Reinforcement Principle for Summary generation uses
clustering of sentences to score them according to how close they are to the central theme. QR
decomposition method scores the sentences using column pivoting. The sentences can also be
scored by certain predefined features. These features may include linguistic features and
statistical features, such as location, rhetorical structure, presence or absence of certain syntactic
features and presence of proper names, and statistical measures of term prominence. Rough set
based extractive summarization has been proposed that aims at selecting important sentences
from a given text using rough sets, which has been traditionally used to discover patterns hidden
in data. Methods using similarity between sentences and measures of prominence of certain
semantic concepts and relationships to generate an extractive summary have also been proposed.
Some commercially available extractive summarizers like Copernic and Word summarizers use
certain statistical algorithms to create a list of important concepts and hence generate a summary.
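As a minimal illustration of sentence scoring (using plain word-frequency scores and a hand-picked stop-word list, not any of the specific methods above):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Score each sentence by the summed collection frequency of its
    content words, then return the top-scoring sentences in their
    original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    stop = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}
    freq = Counter(w for w in words if w not in stop)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in stop)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])   # restore document order
    return " ".join(sentences[i] for i in chosen)
```

The feature-based and rough-set methods surveyed above replace this frequency score with richer scoring functions, but the select-top-sentences skeleton stays the same.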
We propose to achieve extractive summarization as a three-step process:
Crawler
A web crawler is a program that starts with a list of URLs to visit, called the seeds. As
the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the
list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited
according to a set of policies. Typical search engines run thousands of instances of their web
crawling programs simultaneously, on multiple servers. When a web crawler visits one of the
pages, it loads the site's content into a database. Once a page has been fetched, the text of
that page is loaded into the search engine's index, which is a massive database of words and
where they occur on different web pages.
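The seed/frontier loop described above can be sketched as follows; the `fetch` callback stands in for an HTTP GET so the sketch stays self-contained, and the politeness and scheduling policies of a real crawler are omitted:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first starting from the seed URLs.

    `fetch(url)` must return the page's HTML (in a real crawler this would
    perform an HTTP GET). Discovered links form the crawl frontier.
    Returns a dict mapping each visited URL to its content.
    """
    frontier = deque(seeds)
    index = {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        if url in index:
            continue  # never fetch the same page twice
        html = fetch(url)
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return index
```

A production crawler would add per-host rate limits, robots.txt handling, and a persistent frontier, but the visit-extract-enqueue cycle is the same.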
CONCLUSION
This project gives an idea of how to build a semantics-based search engine that uses various
approaches to yield a new and useful search experience for users. Semantic search has the
power to enhance traditional web search, since it looks for the relationships shared by
documents and the words within them. Search results should also be displayed in a manner that
lets a user understand what content a web page contains just by looking at the results.
Finally, if a user wants information on any keyword, he should be shown an overall summary of
that keyword, extracted from all the relevant web pages.