Advanced Search Engine using LSI & PLSI with NLG
ABSTRACT
In this project, we develop a semantics-based search engine that automatically retrieves web pages containing the given keywords or terms similar to them (synonyms). Our search engine not only ranks the web pages but also creates a summary of each top-ranked page and displays it to the user. The user can simply skim the summary and, if he wants more information about a page, visit that website. We include a further feature that extracts a combined summary of all the pages about a keyword and displays it, so the user can learn about that keyword in one go.
First, a web crawler is built that crawls through the web pages and creates a list of keywords with their associated URLs. Unwanted keywords are removed by stemming and by applying a stop list. The URLs and keywords are then fed to the Latent Semantic Indexing (LSI) algorithm, which produces a ranking of the URLs. We also implement Probabilistic Latent Semantic Indexing (PLSI) to rank the pages, and we compare the results of the two algorithms. Finally, the top-ranked pages are read and a summary is generated automatically using Natural Language Generation algorithms, giving the user a fresh, convenient experience.
INTRODUCTION
In recent years, search engine technology has had to scale up dramatically to keep pace with the growing amount of information available on the web. We propose a new method for further improving targeted web information retrieval (IR) by combining text analysis with link analysis, and we compare it against existing methods.
Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.
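As a minimal sketch of the idea (the term-document matrix below is an assumed toy example), LSI builds a matrix of term counts, decomposes it with SVD, and keeps only the k largest singular values, so that each document is re-expressed in a small "concept" space:

import numpy as np

# Toy term-document matrix: rows are terms, columns are documents,
# each cell is the count of that term in that document.
A = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]], dtype=float)

# Singular value decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values; the rank-k approximation captures
# the dominant term-document associations (the latent "concepts").
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Each column is a document expressed in the k-dimensional concept space;
# documents that use related vocabulary end up close together.
doc_vectors = np.diag(S_k) @ Vt_k
print(doc_vectors)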
Probabilistic Latent Semantic Indexing (PLSI) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in LSI, from which PLSI evolved. Compared with standard LSI, which stems from linear algebra and downsizes the occurrence tables (usually via singular value decomposition), PLSI is based on a mixture decomposition derived from a latent class model.
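The latent class model, P(w|d) = sum over z of P(w|z) * P(z|d), is typically fitted by expectation-maximization. The sketch below is one compact, unoptimized way to write that EM loop (the dense responsibility tensor is fine for toy data but would not scale to a real corpus):

import numpy as np

def plsa(X, k, iters=100, seed=0):
    # X: term-document count matrix (n_terms x n_docs); k: number of
    # latent classes (topics). Returns P(w|z) and P(z|d).
    rng = np.random.default_rng(seed)
    n_w, n_d = X.shape
    p_w_z = rng.random((n_w, k))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((k, n_d))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E-step: P(z | w, d) is proportional to P(w|z) * P(z|d)
        p_z_wd = p_w_z[:, :, None] * p_z_d[None, :, :]   # (n_w, k, n_d)
        p_z_wd /= p_z_wd.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        expected = X[:, None, :] * p_z_wd
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d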
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style, and syntax. Search engines such as Google use summarization technology to generate result snippets; document summarization is another application.
Generally, there are two approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation of the text and then use natural language generation techniques to create a summary closer to what a human might write.
LITERATURE SURVEY
With the tremendous growth in the volume of data and the number of web pages, traditional search engines are no longer adequate. The search engine is the most important tool for discovering information on the World Wide Web, and the semantic search engine was born of the traditional search engine to overcome this problem. The Semantic Web is an extension of the current web in which information is given well-defined meaning. Semantic web technologies play a crucial role in enhancing traditional web search, since they work to make data machine-readable, but they will not replace the traditional search engine.
In this survey we briefly review the promising features of some of the best semantic search engines developed so far and discuss the various approaches to semantic search. We summarize the techniques and advantages of several important semantic web search engines. Most prominently, we show how semantic search engines differ from traditional ones by giving a sample query as input and comparing their results.
Search Engine
A search engine operates in five stages: crawling, indexing, processing, calculating relevancy, and retrieving.
There are basically three steps involved in the web crawling procedure. First, the search bot starts by crawling the pages of your site. Then it indexes the words and content of the site, and finally it visits the links (web page addresses, or URLs) found in your site. When the spider doesn't find a page, that page will eventually be deleted from the index. Processing is then applied to the indexed data to turn it into a format usable by the LSI or PLSI calculations that follow. Those calculations produce a relevance ranking of the documents; the documents above a specific rank are selected and their data is retrieved.
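To make the "calculating relevancy" stage concrete, the sketch below ranks documents against a query by cosine similarity in the reduced LSI concept space. The folding-in step q_k = diag(1/S_k) @ U_k.T @ q and the similarity cutoff are assumptions of this illustration; conventions for scaling the query and document vectors vary in the literature:

import numpy as np

def rank_documents(A, q, k=2):
    # A: term-document count matrix (terms x docs); q: query term-count
    # vector (terms,). Returns document indices ordered best-first,
    # together with their cosine similarities.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
    docs = Vt_k.T                               # one row per document
    q_k = np.diag(1.0 / S_k) @ U_k.T @ q        # fold query into concept space
    sims = docs @ q_k / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims), sims

# Usage: keep only documents whose similarity clears a chosen cutoff.
# order, sims = rank_documents(A, q)
# selected = [i for i in order if sims[i] > 0.5]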
Paice/Husk Stemmer
The Paice/Husk stemmer is an iterative algorithm with a single table containing about 120 rules indexed by the last letter of a suffix. On each iteration, it tries to find an applicable rule by the last character of the word. Each rule specifies either a deletion or a replacement of an ending. If there is no such rule, the algorithm terminates. It also terminates if a word starts with a vowel and only two letters remain, or if a word starts with a consonant and only three letters remain. Otherwise, the rule is applied and the process repeats.
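The sketch below shows the shape of such an iterative, rule-table stemmer. The three rules are illustrative stand-ins, not the actual ~120 Paice/Husk rules, and real rules also carry flags (such as "intact only" and "continue") that are omitted here:

VOWELS = set("aeiouy")

# Assumed sample rules, keyed by the last letter of the suffix; each maps
# an ending to its replacement (empty string = plain deletion).
RULES = {
    "s": [("ies", "y"), ("s", "")],   # ponies -> pony, cats -> cat
    "g": [("ing", "")],               # crawling -> crawl
    "d": [("ed", "")],                # indexed -> index
}

def acceptable(stem):
    # Refuse stems that are too short: at least two letters if the stem
    # starts with a vowel, at least three if it starts with a consonant.
    return bool(stem) and len(stem) >= (2 if stem[0] in VOWELS else 3)

def stem(word):
    # Repeatedly apply the first matching rule for the word's last letter;
    # terminate when no rule applies or the stem would become too short.
    while word:
        for ending, repl in RULES.get(word[-1], []):
            if word.endswith(ending):
                candidate = word[: len(word) - len(ending)] + repl
                if acceptable(candidate):
                    word = candidate
                    break
        else:
            break
    return word

print(stem("crawling"), stem("ponies"))  # crawl pony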
Extraction Based Summarization
Various methods have been proposed to achieve extractive summarization, most of them based on scoring the sentences. Maximal Marginal Relevance scores sentences according to their relevance to the query; the Mutual Reinforcement Principle for summary generation clusters sentences and scores them by how close they are to the central theme; the QR decomposition method scores sentences using column pivoting. Sentences can also be scored by predefined features, which may be linguistic (location, rhetorical structure, presence or absence of certain syntactic features, presence of proper names) or statistical (measures of term prominence). Rough-set-based extractive summarization has been proposed, which selects important sentences from a given text using rough sets, a tool traditionally used to discover patterns hidden in data. Methods that generate an extractive summary from the similarity between sentences and from measures of the prominence of certain semantic concepts and relationships have also been proposed. Some commercially available extractive summarizers, such as the Copernic and Word summarizers, use statistical algorithms to build a list of important concepts and generate a summary from it. We propose to achieve extractive summarization as a three-step process.
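As one concrete instance of the sentence-scoring family described above (not the three-step process itself), the sketch below scores each sentence by the average corpus frequency of its non-stop words and keeps the top n sentences in their original order. The stop list is a small illustrative sample:

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "that"}

def summarize(text, n=3):
    # Split into sentences, build word frequencies over the whole text,
    # then score each sentence by the mean frequency of its content words.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOP]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n])
    return " ".join(s for s in sentences if s in top)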
Crawler
A Web crawler is a program/software that starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called "the crawl frontier". URLs from the frontier are recursively visited according to a set of policies. Typical search engines run thousands of instances of their web crawling programs simultaneously, on multiple servers. When a web crawler visits one of the pages, it loads the site's content into a database. Once a page has been fetched, the text of that page is loaded into the search engine's index, which is a massive database of words and where they occur on different web pages.
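A minimal single-threaded sketch of that frontier loop, using only the Python standard library (real crawlers run many instances in parallel and also respect robots.txt, politeness delays, and content deduplication):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href attribute of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seeds, max_pages=50):
    # Breadth-first crawl: pop a URL from the frontier, store the page,
    # and push newly discovered links back onto the frontier.
    frontier, seen, store = deque(seeds), set(seeds), {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                       # unreachable page; skip it
        store[url] = html                  # stand-in for the index database
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store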
CONCLUSION
This project presents the idea of a semantics-based search engine that uses several approaches to deliver a new and useful search experience. Semantic search has the power to enhance traditional web search because it looks for the relationships shared by documents and the words within them. Search results should also be displayed in such a way that a user can tell what content a web page contains just by looking at the results. And when a user wants information on a keyword, he should be shown an overall summary of that keyword, built by extracting the necessary details from all the relevant web pages.