IRLBOT: DESIGN AND PERFORMANCE ANALYSIS OF A LARGE-SCALE WEB CRAWLER
ABSTRACT
This thesis shares our experience in designing web crawlers that scale to billions of pages and models their performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, breadth-first search (BFS) crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
INTRODUCTION
Over the last decade, the World Wide Web (WWW) has evolved from a handful of pages to billions of diverse objects. In order to harvest this enormous data repository, search engines download parts of the existing web and offer Internet users access to this database through keyword search. Search engines consist of two fundamental components: web crawlers, which find, download, and parse content in the WWW, and data miners, which extract keywords from pages, rank document importance, and answer user queries. This thesis does not deal with data miners, but instead focuses on the design of web crawlers that can scale to the size of the current and future web, while implementing consistent per-website and per-server rate-limiting policies and avoiding being trapped in spam farms and infinite webs. We next discuss our assumptions and explain why this is a challenging issue.
Scalability
With the constant growth of the web, discovery of user-created content by web crawlers faces an inherent tradeoff between scalability, performance, and resource usage. The first term refers to the number of pages N a crawler can handle without becoming "bogged down" by the various algorithms and data structures needed to support the crawl. The second term refers to the speed S at which the crawler discovers the web as a function of the number of pages already crawled.
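To make this tradeoff concrete (our own notation, not taken from the thesis): let S(n) denote the crawl speed, in pages/s, once n pages have been downloaded. The time to collect N pages is then

T(N) = \int_0^N dn / S(n)

so any data structure whose per-page cost grows with n drags S(n) down and inflates total crawl time superlinearly; scalability in the above sense means keeping S(n) roughly flat even as n reaches into the billions.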
Reputation and Spam
The web has changed significantly since the days of early crawlers [4], [23], [25], mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, an unlimited number of hostnames, and spam farms that measure billions of pages, the task of web crawling has changed from simply doing a BFS scan of the WWW [24] to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.
Our experience shows that BFS becomes trapped in spam after several billion
downloaded pages, which manifests itself in multiple ways: a) the queue of pending
URLs contains a non-negligible fraction of links from spam sites that threaten to eventually overtake legitimate URLs due to their high branching factor; b) the DNS
resolver succumbs to the rate at which new hostnames are dynamically created within
a single spam domain; and c) the crawler becomes vulnerable to the delay attack from
sites (often spam) that purposely introduce HTTP and DNS delays in all requests
originating from the crawler's IP address.
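A back-of-the-envelope model makes point a) concrete. The Python sketch below is our own illustration, not code from the thesis, and it assumes (purely for simplicity) that spam pages link only to spam and that branching factors are uniform: even if spam starts at 0.1% of the frontier, a tenfold higher branching factor lets it dominate within a few BFS levels.

# Hypothetical model of BFS frontier composition (not IRLbot code).
# Assumes spam links only to spam and uniform branching factors.
def spam_fraction_by_level(levels=8, b_legit=10.0, b_spam=100.0, spam0=0.001):
    """Spam share of the BFS frontier at each level."""
    legit, spam = 1.0 - spam0, spam0
    fractions = [spam]
    for _ in range(levels):
        legit, spam = legit * b_legit, spam * b_spam   # expand one level
        total = legit + spam
        legit, spam = legit / total, spam / total      # renormalize shares
        fractions.append(spam)
    return fractions

if __name__ == "__main__":
    for depth, share in enumerate(spam_fraction_by_level()):
        print(f"level {depth}: spam share {share:.4f}")

Under these assumptions the spam share grows from 0.1% to roughly 1% at level 1, 50% by level 3, and about 99% by level 5, which is exactly the overtaking behavior described in a).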
Politeness
Even today, webmasters become easily annoyed when web crawlers slow down their servers, consume too much Internet bandwidth, or simply visit pages with "too much" frequency. This leads to undesirable consequences including blocking of the crawler from accessing the site in question, various complaints to the Internet Service Provider (ISP) hosting the crawler, and even threats of legal action. Incorporating per-website and per-IP hit limits into a crawler is easy; however, preventing the crawler from "choking" when its entire RAM gets filled up with URLs pending for a small set of hosts is much more challenging. When N grows into the billions, the crawler eventually becomes bottlenecked by its own politeness and is then faced with a decision to suffer significant slowdown, ignore politeness considerations for certain URLs (at the risk of crashing target servers or wasting valuable bandwidth on huge spam farms), or discard a large fraction of backlogged URLs, none of which is particularly appealing.
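For concreteness, here is a minimal per-host politeness sketch in Python; it is our illustration of the mechanism, not IRLbot's actual scheduler, which keeps its queues on disk. Enforcing a fixed delay per host is indeed the easy part; the failure mode described above is that when billions of pending URLs map to a small set of hosts, nearly all of them sit blocked in these per-host queues and exhaust RAM.

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

CRAWL_DELAY = 10.0  # seconds between hits to one host (assumed policy)

class PoliteFrontier:
    """Toy frontier with per-host rate-limiting (expects absolute URLs)."""

    def __init__(self):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.next_ok = defaultdict(float)  # host -> earliest next-hit time

    def add(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def pop_ready(self):
        """Return a URL whose host may be hit now, or None if every host
        is still inside its delay window. With few hosts and many URLs,
        the backlog waits here -- the RAM bottleneck described above."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_ok[host]:
                self.next_ok[host] = now + CRAWL_DELAY
                return queue.popleft()
        return None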
Our Contributions
The first part of this thesis presents a set of web-crawler algorithms that address the issues raised above and the second part briefly examines their performance in an actual web crawl. Our design stems from three years of web crawling experience at Texas A&M University using an implementation we call IRLbot [17] and the various challenges posed in simultaneously: 1) sustaining a fixed crawling rate of several thousand pages/s; 2) downloading billions of pages; and 3) operating with the resources of a single server.
The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions, even the disk algorithms of [23], [27] no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second). To understand this problem, we analyze the URL-check methods proposed in the literature and show that all of them exhibit severe performance limitations when N becomes sufficiently large. We then introduce a new technique called Disk Repository with Update Management (DRUM) that can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort.
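To give a feel for the idea, below is a heavily simplified, in-memory Python sketch of a DRUM-style check+update, reconstructed from the description above; the thesis implementation keeps both the buckets and the repository on disk and is far more elaborate. Incoming key hashes are staged into buckets, each bucket is sorted (the bucket-sort step), and all staged keys are resolved against a sorted repository in one merge-like pass, replacing one random disk seek per URL with batched sequential work.

import bisect

NUM_BUCKETS = 4  # real DRUM uses many large disk-backed buckets

class MiniDrum:
    """Toy DRUM: batched check+update of hashed keys (an in-memory
    stand-in for DRUM's disk buckets and disk repository)."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]  # staged key hashes
        self.repo = []  # sorted repository of previously seen key hashes

    def check_update(self, key_hash):
        """Stage a check+update; answers arrive in batches via flush()."""
        self.buckets[key_hash % NUM_BUCKETS].append(key_hash)

    def flush(self):
        """Resolve all staged keys in one pass: return (key_hash, is_new)
        pairs and insert the new keys into the repository."""
        results = []
        for bucket in self.buckets:
            for h in sorted(bucket):  # the bucket-sort step
                i = bisect.bisect_left(self.repo, h)
                is_new = i == len(self.repo) or self.repo[i] != h
                if is_new:
                    self.repo.insert(i, h)
                results.append((h, is_new))
            bucket.clear()
        return results

As a usage sketch, staging the same hypothetical URL hash twice and flushing reports the second occurrence as a duplicate:

drum = MiniDrum()
for h in (hash(u) & 0xFFFFFFFF for u in ["a.com/1", "b.com/2", "a.com/1"]):
    drum.check_update(h)
print(drum.flush())  # the repeated "a.com/1" comes back with is_new=False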
RELATED WORK
Only a limited number of papers describe detailed web-crawler algorithms and report their experimental performance. First-generation designs [9], [22], [25], [26] were developed to crawl the infant web and commonly reported collecting less than 100,000 pages. Second-generation crawlers [2], [7], [15], [14], [23], [27] often reached several hundred million pages in their crawls and typically involved multiple agents in the crawling process. We discuss their design and scalability issues in the next section.
Another direction was undertaken by the Internet Archive [6], [16], which maintains a history of the Internet by downloading the same set of pages over and over. In the last 10 years, this database has collected over 85 billion pages, but only a small fraction of them are unique. Additional crawlers are [4], [8], [13], [20], [28], [29]; however, their focus usually does not include the large scale assumed in this thesis and their fundamental crawling algorithms are not presented in sufficient detail to be analyzed here.
The largest prior crawl using a fully-disclosed implementation appeared in [23], where Mercator obtained N = 473 million HTML pages in 17 days (we exclude non-HTML content since it has no effect on scalability). The fastest reported crawler was [13] with 816 pages/s, but the scope of that crawl was only N = 25 million. Finally, to our knowledge, the largest web dataset used in any paper was AltaVista's 2003 crawl with 1.4 billion pages [11].