25-08-2017, 09:32 PM
Internet searching
ABSTRACT:
Introduction - steps in searching - crawling: Googlebot discussed in detail, Gopher, Archie, meta tags - indexing: weights for words, data structures in indexing (hashing) - search query processing: multiple words with Boolean operators, page rank, spelling correction - future searches: natural-language queries
INTRODUCTION:
On the World Wide Web, there are hundreds of millions of pages available, waiting to present information on an amazing variety of topics. When you need to know about a particular subject, you visit an Internet search engine.
Internet search engines are special sites on the Web designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
• They search the Internet -- or select pieces of the Internet -- based on important words.
• They keep an index of the words they find, and where they find them.
• They allow users to look for words or combinations of words found in that index.
WEB CRAWLER:
Early programs with names like Gopher and Archie kept indexes of files stored on servers connected to the Internet, reducing the amount of time required to find programs and documents.
To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
Google built its initial system to use multiple spiders, usually three at a time. Each spider could keep about 300 connections to Web pages open at once. At peak performance, using four spiders, the system could crawl roughly 100 pages per second, generating around 600 KB of data each second.
Keeping everything running quickly meant building a system to feed the necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider (ISP) for the Domain Name Server (DNS) that translates a server's name into an address, Google ran its own DNS to keep delays to a minimum.
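The core crawling step described above can be sketched in a few lines: given a page's HTML, a spider extracts the words to index and the outgoing links to crawl next. This is a minimal illustration using Python's standard library; the sample page and its URLs are hypothetical, and a real spider would also fetch pages over the network and manage a queue of URLs.

```python
# Minimal sketch of one step of a web crawler ("spider"):
# parse a page, extract its words and outgoing links.
from html.parser import HTMLParser

class SpiderParser(HTMLParser):
    """Collects visible words and href links from one HTML page."""
    def __init__(self):
        super().__init__()
        self.words = []   # words to hand to the indexer
        self.links = []   # URLs to add to the crawl frontier
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.lower().split())

# Hypothetical page content for illustration.
page = "<html><title>Fast cars</title><body>Cars go <a href='/fast'>fast</a></body></html>"
parser = SpiderParser()
parser.feed(page)
print(parser.words)
print(parser.links)
```

Each link found this way would be handed back to the URL server, which feeds it to a spider on a later pass.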
Metatags:
Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings -- the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct.
Defect: A careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. Prevention: To protect against this, spiders correlate meta tags with page content, rejecting meta tags that don't match the words on the page.
If webmasters wish to restrict the information on their site available to Googlebot, or another well-behaved spider, they can do so with the appropriate directives in a robots.txt file, or by adding the meta tag <meta name="Googlebot" content="nofollow" /> to the web page.
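A robots.txt file with such directives might look like the fragment below. The site layout (/private/) is hypothetical; the directives simply tell a well-behaved crawler which paths it may not fetch.

```
# Hypothetical robots.txt: keep Googlebot out of /private/,
# let all other crawlers index the whole site.
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
```

Well-behaved spiders check this file at the site root before crawling; the file cannot actually prevent access, it only requests it.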
Indexing:
Once the spiders have completed the task of finding information on Web pages, the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:
• The information stored with the data
• The method by which the information is indexed
An engine might store the number of times a word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index.
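The weighting idea above can be sketched with a tiny inverted index: each word maps (via a hash table, the data structure commonly used for this) to the pages it appears on, with a weight per page. The weight values and the Boolean-AND query helper here are illustrative assumptions, not any engine's real formula.

```python
# Sketch of an inverted index with simple per-word weights.
from collections import defaultdict

TITLE_WEIGHT = 3  # assumed: title words count more than body words
BODY_WEIGHT = 1

# word -> {page_url: accumulated weight}; a Python dict is a hash table.
index = defaultdict(dict)

def add_page(url, title, body):
    """Index one page, weighting title words more heavily."""
    for word in title.lower().split():
        index[word][url] = index[word].get(url, 0) + TITLE_WEIGHT
    for word in body.lower().split():
        index[word][url] = index[word].get(url, 0) + BODY_WEIGHT

def search_and(*terms):
    """Boolean AND: pages containing every term, ranked by total weight."""
    postings = [index.get(t.lower(), {}) for t in terms]
    common = set.intersection(*(set(p) for p in postings))
    return sorted(common, key=lambda url: -sum(p[url] for p in postings))

# Hypothetical pages for illustration.
add_page("a.com", "fast cars", "cars are fast and fun")
add_page("b.com", "slow boats", "boats are not cars")
print(search_and("fast", "cars"))  # only a.com contains both terms
```

A query with multiple words and the AND operator, as mentioned in the abstract, reduces to intersecting the posting sets of each word and ranking the survivors by accumulated weight.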
FUTURE SEARCHES:
Many groups are working to improve both results and performance of this type of search engine. Others have moved on to another area of research, called natural-language queries.
The idea behind natural-language queries is that you can type a question the same way you would ask it of a human sitting beside you -- no need to keep track of Boolean operators or complex query structures. The most popular natural-language query site today is AskJeeves.com, which parses the query for keywords that it then applies to the index of sites it has built. It only works with simple queries, but competition is heavy to develop a natural-language query engine that can accept a query of great complexity.