22-08-2013, 03:34 PM
How Search Engines Work
Purpose of Search Engines
Helping people find what they’re looking for
Start with an “information need”
Convert it to a query
Get results
In the materials available
Web pages
Other formats
Deep Web
Search is Not a Panacea
Search can’t find what’s not there
The content is hugely important
Information Architecture is vital
Usable sites have good navigation and structure
Search Looks Simple, But It's Not
Index ahead of time
Find files or records
Open each one and read it
Store each word in a searchable index
Provide search forms
Match the query terms with words in the index
Sort documents by relevance
Display results
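The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine works: the sample documents, the names `tokenize`, `build_index` and `search`, and the use of raw word counts as the relevance score are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample documents standing in for crawled files or records.
DOCS = {
    "doc1": "Search engines index pages ahead of time",
    "doc2": "A search index stores each word for later search",
    "doc3": "Good navigation helps people find pages",
}

def tokenize(text):
    """Extract lowercase words from a piece of text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Open each document, read it, and store each word in an inverted index."""
    index = defaultdict(Counter)  # word -> {doc_id: number of occurrences}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index

def search(index, query):
    """Match query terms with words in the index, then sort by a crude score."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(DOCS)
results = search(index, "search index")
print(results)
```

The index is built once, ahead of time, and each query only looks up its own terms, which is why the search itself stays fast even when the collection is large.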
Text Search vs. Database Query
Text search works for structured content
Keyword search vs. SQL queries
Approximate vs. exact match
Multiple sources of content
Response time and database resources
Relevance ranking is very important
Works in the real world (e.g. eBay)
Making a Searchable Index
Store text to search it later
Many ways to gather text
Crawl (spider) via HTTP
Read files on file servers
Access databases (HTTP or API)
Data silos via local APIs
Applications, CMSs, via Web Services
Security and Access Control
What the Index Needs
Basic information for document or record
File name / URL / record ID
Title or equivalent
Size, date, MIME type
Full text of item
More metadata
Product name, picture ID
Category, topic, or subject
Other attributes, for relevance ranking and display
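One way to picture such an index entry is as a simple record type. The sketch below is an assumption for illustration only: the class name `IndexRecord`, its field names, and the sample values are invented here and do not come from any specific search engine.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IndexRecord:
    # Basic information for the document or record
    locator: str        # file name, URL, or record ID
    title: str          # title or equivalent
    size_bytes: int
    modified: date
    mime_type: str
    full_text: str      # full text of the item
    # Extra metadata used for relevance ranking and display (free-form here)
    metadata: dict = field(default_factory=dict)

# Hypothetical example entry
record = IndexRecord(
    locator="https://example.com/products/42",
    title="Trail Running Shoe",
    size_bytes=18432,
    modified=date(2013, 8, 22),
    mime_type="text/html",
    full_text="Lightweight trail running shoe with a grippy sole.",
    metadata={
        "product_name": "Trail Runner",
        "picture_id": "img-0042",
        "category": "Footwear",
    },
)
print(record.title, record.metadata["category"])
```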
Search Query Processing
What happens after you click the search button and before retrieval starts.
Usually in this order
Handle character set, maybe language
Look for operators and organize the query
Look for field names or metadata
Extract words (just like the indexer)
Deal with letter casing
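The order above can be sketched as a small function. The query syntax used here (double quotes for phrases, a leading minus to exclude a word, `name:value` for fields) is an assumption for the example; real engines differ in the operators they accept. The name `process_query` is also hypothetical.

```python
import re
import unicodedata

FIELD_RE = re.compile(r"(\w+):(\S+)")        # assumed field syntax, e.g. title:bank
EXCLUDE_RE = re.compile(r"(?<!\S)-(\w+)")    # assumed exclusion syntax, e.g. -fees
PHRASE_RE = re.compile(r'"([^"]+)"')         # assumed phrase syntax, e.g. "river bank"

def process_query(raw):
    """Turn a raw query string into organized, lowercased parts."""
    # 1. Handle character set: normalize to one consistent Unicode form.
    text = unicodedata.normalize("NFKC", raw)

    # 2. Look for operators and organize the query.
    phrases = [p.lower() for p in PHRASE_RE.findall(text)]
    text = PHRASE_RE.sub(" ", text)
    excluded = [w.lower() for w in EXCLUDE_RE.findall(text)]
    text = EXCLUDE_RE.sub(" ", text)

    # 3. Look for field names or metadata.
    fields = {name.lower(): value.lower() for name, value in FIELD_RE.findall(text)}
    text = FIELD_RE.sub(" ", text)

    # 4 and 5. Extract words (as the indexer would) and deal with letter casing.
    terms = re.findall(r"\w+", text.lower())

    return {"terms": terms, "phrases": phrases, "excluded": excluded, "fields": fields}

parsed = process_query('title:Bank "river bank" Loans -fees')
print(parsed)
```

Whatever rules are used for extracting words and casing, they need to match what the indexer did, otherwise query terms will not line up with the words stored in the index.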
Relevance Ranking
Theory: sort the matching items, so the most relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better
Relevance Processing
Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
Like old library Citation Indexes
Ted Nelson: not only hypertext, but the links themselves
Google PageRank
Incoming links
Authority of linkers
Taxonomies and external metadata
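The link analysis idea (incoming links, weighted by the authority of the pages doing the linking) can be illustrated with a simplified PageRank-style iteration. This is a textbook sketch under stated assumptions, not Google's production algorithm: the tiny link graph, the damping factor of 0.85, and the fixed iteration count are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its authority on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical site: three pages link to "home", two link to "products".
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)
print(best)
```

In practice a link score like this would be only one of several criteria, combined with how well the words in the page match the query terms.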
Search Results Interface
What users see after they click the Search button
The most visible part of search
Elements of the results page
Page layout and navigation
Results header
List of results items
Results footer
Search Suggestions (aka Best Bets)
Human judgment beats algorithms
Great for frequent, ambiguous searches
Use search log to identify best candidates
Recommend good starting pages
Product information, FAQs, etc.
Requires human resources
That means money and time
More static than algorithmic search
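A rough sketch of how this might fit together: count queries in the search log to find the best candidates, then keep a hand-maintained table of recommended pages that is shown ahead of the algorithmic results. The log entries, the paths, and the names `top_queries`, `BEST_BETS` and `results_with_best_bet` are all hypothetical.

```python
from collections import Counter

# Hypothetical search log: one raw query per entry.
SEARCH_LOG = [
    "returns", "Returns ", "bank", "returns",
    "shipping", "bank", "returns policy",
]

def top_queries(log, n):
    """Find the most frequent queries, ignoring case and stray whitespace."""
    counts = Counter(q.strip().lower() for q in log)
    return [query for query, _ in counts.most_common(n)]

# Hand-picked starting pages, maintained by a person rather than an algorithm.
BEST_BETS = {
    "returns": "/help/returns-faq",
    "bank": "/about/payment-options",
}

def results_with_best_bet(query, algorithmic_results):
    """Put the human-chosen page, if there is one, ahead of the ranked results."""
    bet = BEST_BETS.get(query.strip().lower())
    return ([bet] if bet else []) + list(algorithmic_results)

candidates = top_queries(SEARCH_LOG, 2)
page = results_with_best_bet("Returns", ["/blog/returns-story", "/help/contact"])
print(candidates, page)
```

Because the table is static, someone has to review the log and update the entries regularly, which is where the time and money go.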
Search Will Never Be Perfect
Search engines can’t read minds
User queries are short and ambiguous
Some things will help
Design a usable interface
Show match words in context
Keep index current and complete
Adjust heuristic weighting
Maintain suggestions and synonyms
Consider faceted metadata search