01-06-2012, 02:50 PM
User-Centric Web Crawling
Web Crawling Optimization Problem
Not enough resources to (re)download every web document every day/hour
Must pick and choose which documents to refresh, which makes crawling an optimization problem
Others: objective function = average freshness or average age
Our goal: focus directly on impact on users
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each document/query (D/Q) pair maps to a numerical score in [0,1]
Combination of many factors, including:
Vector-space similarity (e.g., TF.IDF cosine metric)
Link-based factors (e.g., PageRank)
Anchortext of referring pages
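As a rough illustration of how such a score could be assembled, the sketch below blends a TF.IDF cosine similarity with a PageRank value. It is not the scoring function of any particular search engine: the function names, the linear blend, the weight `alpha`, the smoothed IDF formula, and the assumption that PageRank has already been normalized to [0,1] are all made up for the example. Anchor text is left out to keep it short.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, num_docs):
    """Build a sparse TF.IDF vector (term -> weight) from a token list.

    doc_freq maps a term to the number of documents containing it.
    The smoothed IDF used here is an assumption, chosen so weights stay positive.
    """
    counts = Counter(tokens)
    return {
        term: tf * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / (norm_u * norm_v))


def relevance(doc_tokens, query_tokens, pagerank, doc_freq, num_docs, alpha=0.7):
    """Hypothetical D/Q score in [0,1]: weighted blend of similarity and PageRank.

    pagerank is assumed to be pre-normalized to [0,1]; alpha is an arbitrary weight.
    """
    sim = cosine(
        tf_idf_vector(doc_tokens, doc_freq, num_docs),
        tf_idf_vector(query_tokens, doc_freq, num_docs),
    )
    return alpha * sim + (1.0 - alpha) * pagerank
```

Because both inputs lie in [0,1] and the weights sum to 1, the blended score also stays in [0,1], matching the range stated above.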
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
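To make the staleness metric and the shingling filter concrete, here is a minimal sketch. It treats a cached copy as out of date only when its word-shingle resemblance (Jaccard overlap) to the live copy falls below a threshold, so small edits are ignored. The shingle width `k`, the `threshold` value, and all function names are assumptions for illustration; the slides do not give the parameters that were actually used.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (as tuples) in text."""
    words = text.split()
    if len(words) < k:
        # Short texts become a single shingle; empty text has none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard overlap of the shingle sets of two texts, in [0,1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


def is_stale(cached, live, threshold=0.9, k=4):
    """A cached copy counts as stale only if the change is non-trivial.

    threshold is an assumed cutoff, not a value from the slides.
    """
    return resemblance(cached, live, k) < threshold


def staleness(pairs, threshold=0.9, k=4):
    """Fraction of (cached, live) pairs whose cached copy is out of date."""
    if not pairs:
        return 0.0
    stale = sum(is_stale(cached, live, threshold, k) for cached, live in pairs)
    return stale / len(pairs)
```

For example, a repository with one unchanged page and one completely rewritten page would have a staleness of 0.5 under this definition.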
Related Work
Focused/topic-specific crawling
[Chakrabarti, many others]
Select subset of pages that match user interests
Our work: given a set of pages, decide when to (re)download each one, based on predicted content shifts and user interests
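One simple way to picture this kind of scheduling is a greedy policy that spends a fixed download budget on the pages where a change is both likely and likely to matter to users. The sketch below is only that picture, not the policy from the presentation: the priority formula (change probability times a user-interest weight), the input tuple layout, and the function name are all hypothetical.

```python
import heapq


def pick_pages_to_refresh(pages, budget):
    """Choose up to `budget` pages to re-download, highest priority first.

    pages: iterable of (url, change_prob, user_weight) tuples, where
      change_prob is an assumed estimate that the page changed since the last
      crawl, and user_weight is an assumed measure of how much users care.
    The product of the two is a hypothetical priority, used for illustration.
    """
    scored = [
        (change_prob * user_weight, url)
        for url, change_prob, user_weight in pages
    ]
    top = heapq.nlargest(budget, scored)
    return [url for _, url in top]
```

With a budget of two downloads, a page that changes often but is rarely shown to users can rank below a page that changes less often but appears in many result lists.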