23-03-2012, 02:20 PM
Improving Search in Peer-to-Peer Networks
p2psearch.pdf (Size: 329.27 KB / Downloads: 24)
Introduction
Peer-to-peer (P2P) systems are distributed systems in
which nodes of equal roles and capabilities exchange information
and services directly with each other. In recent
years, P2P has emerged as a popular way to share huge volumes
of data. For example, the Morpheus [8] multimedia
file-sharing system reported over 470,000 users sharing a
total of .36 petabytes of data as of October 26, 2001. Sharing
such large volumes of data is made possible by distributing
the main costs – disk space for storing the files
and bandwidth for transferring them – across the peers in
the network. In addition to the ability to pool together
and harness large amounts of resources,
RelatedWork
The originalmotivation for much P2P research was early
“loose” file-sharing P2P systems such as Gnutella [6], Napster
[9], Freenet [5], and Morpheus [8]. The performance
of search techniques used in Gnutella and Freenet are discussed
in Section 3.2. Napster is not a pure P2P system,
but rather a hybrid one containing some centralized
components; performance of hybrid P2P systems is explored
in [18]. Morpheus is a newer, very popular system
which has an architecture partway between Napster’s and
Gnutella’s. “Super-peer” nodes act as a centralized resource
for a small number of clients, but these super-peers connect
to each other to form a pure P2P network.
Problem Overview
The purpose of a data-sharing P2P system is to accept
queries from users, and locate and return data (or pointers
to the data). Each node owns a collection of files or records
to be shared with other nodes. The shared data usually consists
of files, but is not restricted to files (e.g., records stored
in a relational database).
Metrics
Cost. When a query message is propagated through the
network, nodes spend processing resources (i.e., cycles) to
forward the query, process it, etc, and bandwidth to send
and receive messages. The main cost of queries are therefore
bandwidth and processing. Since the cost of a query
is not incurred at any single node in the network, it makes
sense to discuss costs in aggregate (i.e., over all the nodes in
the network). Furthermore, we should not evaluate a policy
based on the performance of any single query, so instead
we measure the average aggregate cost incurred by a set
of queries Qrep, where Qrep is a representative set of real
queries submitted. Our two cost metrics are therefore:
Directed BFS
If minimizing response time is important to an application,
then iterative deepening may not be appropriate because
of the time taken by multiple iterations. A better
strategy would be to send queries immediately to a subset
of nodes that will return many results, and will do so
quickly. The Directed BFS (DBFS) technique implements
this strategy by having a source send query messages to just
a subset of its neighbors, thereby reducing cost, but selecting
neighbors through which nodes with many quality results
may be reached, thereby maintaining quality of results.