30-07-2013, 02:48 PM
Google Architecture
Google is one of the five most popular websites in the world.
Google is a web search engine that lets you find other sites on the web based on keyword searches.
Google also provides specialized searches through blogs, catalogs, videos, news items and more.
Google Key Data Components
Repository
-The repository contains the full HTML of every web page
-In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL
-No other data structure is needed to access the repository
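The layout described above can be sketched in a few lines of Python. This is only an illustration: the header format (a 4-byte docID, a 2-byte URL length, a 4-byte HTML length) and the names append_record and scan are assumptions made for the example, not Google's actual on-disk format.

```python
import struct

# Hypothetical fixed-size header: docID, URL length, HTML length (little-endian).
HEADER = struct.Struct("<IHI")

def append_record(buf: bytearray, doc_id: int, url: str, html: str) -> int:
    """Append one record to the repository and return its starting offset."""
    offset = len(buf)
    url_bytes = url.encode("utf-8")
    html_bytes = html.encode("utf-8")
    buf += HEADER.pack(doc_id, len(url_bytes), len(html_bytes))
    buf += url_bytes
    buf += html_bytes
    return offset

def scan(buf):
    """Walk the records front to back, yielding (doc_id, url, html)."""
    pos = 0
    while pos < len(buf):
        doc_id, url_len, html_len = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        url = bytes(buf[pos:pos + url_len]).decode("utf-8")
        pos += url_len
        html = bytes(buf[pos:pos + html_len]).decode("utf-8")
        pos += html_len
        yield doc_id, url, html

repo = bytearray()
append_record(repo, 1, "http://a.example/", "<html>A</html>")
append_record(repo, 2, "http://b.example/", "<html>B</html>")
for record in scan(repo):
    print(record)
```

Because every record carries its own lengths, a plain sequential scan recovers all documents without any auxiliary structure. The offset returned by append_record is the kind of value a separate index could store as a pointer into the repository.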
Document Index
-The document index keeps information about each document
-The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics
-To find the docID of a particular URL, the URL's checksum is computed and looked up by binary search in a file of checksums sorted alongside their docIDs
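A rough sketch of the URL-to-docID lookup follows. It assumes a list of (checksum, docID) pairs kept sorted by checksum and searched with a binary search. CRC-32 stands in for the checksum because the slides do not say which function is used, and the names url_checksum, build_checksum_table and find_doc_id are made up for the example. Checksum collisions are ignored here.

```python
import bisect
import zlib

def url_checksum(url: str) -> int:
    # CRC-32 is a stand-in; the real checksum function is not specified.
    return zlib.crc32(url.encode("utf-8"))

def build_checksum_table(urls_to_doc_ids):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in urls_to_doc_ids.items())

def find_doc_id(table, url):
    """Binary-search the sorted table; return the docID, or None if absent."""
    key = url_checksum(url)
    # (key,) sorts just before any (key, doc_id) pair, so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None

table = build_checksum_table({"http://a.example/": 7, "http://b.example/": 3})
print(find_doc_id(table, "http://a.example/"))
```

Sorting by checksum keeps the lookup logarithmic in the number of URLs without holding the URL strings themselves in the table.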
Google’s Query Evaluation
Parse the query
Convert words into WordIDs (using the lexicon)
Select the barrels that contain documents which match the WordIDs
Search through documents in the selected barrels until one is discovered that matches all the search terms
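The four steps above can be sketched as a small Python function. The data structures are assumptions chosen for brevity: the lexicon is a dict from word to WordID, and each barrel is a dict from WordID to a list of docIDs. The names parse_query and evaluate are hypothetical, and ranking is left out entirely.

```python
def parse_query(query: str):
    """Step 1: split the query into lowercase words."""
    return query.lower().split()

def evaluate(query, lexicon, barrels):
    """Return the lowest docID containing every query term, or None."""
    # Step 2: convert words into WordIDs using the lexicon.
    word_ids = []
    for word in parse_query(query):
        if word not in lexicon:
            return None  # an unknown word cannot match any document
        word_ids.append(lexicon[word])
    if not word_ids:
        return None

    # Step 3: collect, per WordID, the documents from barrels that hold it.
    doclists = []
    for word_id in word_ids:
        docs = set()
        for barrel in barrels:
            docs.update(barrel.get(word_id, ()))
        doclists.append(docs)

    # Step 4: scan candidates in docID order until one matches all terms.
    for doc_id in sorted(doclists[0]):
        if all(doc_id in docs for docs in doclists[1:]):
            return doc_id
    return None

lexicon = {"google": 1, "search": 2, "engine": 3}
barrels = [{1: [10, 20, 30], 2: [20, 30]}, {3: [30, 40]}]
print(evaluate("google search", lexicon, barrels))
```

The scan stops at the first document that contains all the terms, mirroring the last step of the list. A fuller engine would go on to score that document and continue scanning for further matches.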
Conclusions
The primary goal is to provide high quality search results over a rapidly growing World Wide Web.
Google employs a number of techniques to improve search quality, including PageRank and anchor text.
Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.