15-05-2013, 12:18 PM
BitTorrent Architecture and Protocol
BitTorrent.docx (Size: 26.99 KB / Downloads: 22)
Abstract
Bit Torrent is a new popular application layer network protocol used to distribute files. Bit Torrent is efficient at content delivery by maximizing the upload utilization and by preventing unfairness. This paper discusses the Bit Torrent architecture and protocol in great detail by discussing the tracker and all the messages sent between the peers and the tracker and also between peers and peers. Also this paper will describe how BitTorrent is efficient by looking at algorithms that decide which peers to upload to and which pieces of the file to download first.
Introduction
BitTorrent is an application layer network protocol used to distribute files. It uses a peer-to-peer (P2P) network architecture where many peers act as a client and a server by downloading from peers at the same time they are uploading to others. The serving capacity increases as the number of downloaders increases making the system self-scaling [2]. It also uses a client-server architecture where peers contact the server to find other peers that they may connect to. This paper will give an overview of the BitTorrent architecture and the protocols it uses.
Background
BitTorrent was created by Bram Cohen in 2002. He created it as a way to distribute the free Linux operating system [5]. It is very different from other P2P file sharing protocols because it does not have any search functionality in the protocol. This means that users cannot search for files to download using the BitTorrent protocol. The advantage of this lack of content localization is that BitTorrent focuses on its main goal of delivering content as efficiently as possible [1].BitTorrent does not allow users to share files directly from their computer like other P2P networks. Instead a torrent file needs to be created which represents a peer-to-peer transfer session, for a particular file or files. There is no way of searching for these torrent files by using the BitTorrent protocol. Instead most peers download these torrent files from a website, which usually hosts many torrent files and allows users to upload their own torrent files. Once a peer has the torrent file it may join the torrent session by opening this file in their BitTorrent application. A peer must be in one of two states. It is in the leecher state when it is still downloading the file while uploading pieces it has to other leechers. A peer is in the seed state if it has the complete file and is uploading to leechers. There needs to be at least one seeder in order for the torrent to be alive otherwise the leechers will not be able to finish. Although if the original seed has uploaded every piece once to only one participating peer, it is possible for the seed to leave and the leechers can finish the download by downloading all the other pieces off of other leechers. A file is split into equal size pieces, which are further divided into smaller blocks. Blocks are the transmission unit of the network, but the protocol keeps track of what pieces have been downloaded. Each peer must maintain a list of peers it is connected to, which is called the peer set. Also a peer can only upload to a subset of this peer set called the active peer set. Peers also need to know what pieces of the content each peer in its peer set has [1]. By knowing what peers a peer can upload to and by knowing what pieces all of its connected peers have, BitTorrent can use this information in order to deliver content efficiently. This results in less than a tenth percent of bandwidth overhead and it reliably utilizes all available upload capacity [3]. A P2P network is efficient if it maximizes its upload capacity, a good piece selection algorithm, and it is fair. By being fair, a peer
can not download too much more than it has uploaded. We will see that BitTorrent accomplishes these goals by using algorithms called choke algorithm and rarest first piece selection algorithm. Before the details of these algorithms are discussed, the architecture and the protocol of BitTorrent will be discussed next.
BitTorrent Architecture
BitTorrent is a hybrid network using both the client-server architecture and the peer-to peer architecture. The centralized server is called the tracker. The tracker’s responsibility is to help peers find other peers. A tracker consists of many torrent sessions with each session it keeps track of all of the peers participating in the particular torrent. The peer contacts the tracker and the tracker responds with a list of peers it may connect to. The tracker is not responsible for the actual distribution of the content at all. The bandwidth of the tracker is very low because it is a simple protocol, which peers only connect to when they start up and at defined time intervals of usually 30 minutes [3]. The peer knows the URL of the tracker because it is defined in the torrent file. The torrent file is a static ‘metainfo’ file that represents a session of content being distributed. The torrent file is created with the URL of the tracker and the actual file or files to be part of this torrent. The format of the torrent file is encoding. Encoding consists of nested dictionaries and lists. These dictionaries and lists can contain strings and integers [4]. The torrent file is a bencoded dictionary containing two keys, announce and info. The announce key is the URL of the tracker [4]. The info key maps to a dictionary described below. The info key is a dictionary with the following keys: name, piece length, pieces, and either length or files key. The name key is a string that is the suggested name to download the file as. It is suggested meaning that the downloader peer can choose the name that they desire. Piece length maps to the number of bytes each piece the file is split into. The file is split into equal size pieces except possibly the last piece. The pieces key maps to a string of the SHA1 hash of each piece. Each SHA1 hash is a string of 20 characters long so for example the hash value of piece 4 would be the substring of pieces at character 60 to character 79 assuming that the string begins at index 0. This is used so the downloader can verify the data by checking what they download with these hash values [3]. The next key can either be length or files depending if this torrent is a single file or a directory of files [4]. For a single file the key is length and it is simply the length of the file in bytes. If the torrent represents a directory the key files is used. This key maps to a files list and includes a list of dictionaries containing the following keys, length and path. The length is number of bytes of the file and path is a list of strings that correspond to subdirectory names. This is it for the contents of the torrent file [4]. The parts that are the most important thing to remember about this file is that it contains the tracker URL and splits the file or files into equal size pieces and includes the SHA1 hash values of all the pieces of the file. Now that a peer has a torrent file and connects to the tracker for a list of peers it may join the peer network. The peer must first allocate space for the file or files, which it gets from the torrent file. This is necessary because the peer will not download pieces of a file in order so the BitTorrent application will need to assemble these pieces in order while it receives them. The peer network is a peer-to-peer network where the leechers act as clients and servers and the seeders act just as servers. The peers distribute the file to each other by using the swarming technique [1]. Peers download pieces from multiple peers at the same time. It also uploads to either the same or different peers, pieces of the file it has already downloaded.
The Tracker Protocol
The tracker plays an essential part of BitTorrent because without it, there would be no way for peers to find other peers to download from. Trackers use a simple protocol layered on top of HTTP [3]. The tracker receives HTTP GET requests and it sends bencoded messages to the peer’s request. The tracker GET requests contain the following keys; info_hash, peer_id, ip, port, uploaded, downloaded, left, and event. The info_hash key is how the tracker determines which torrent session the client is a part of or is joining. This key is the 20 byte SHA1 hash of the info key from the torrent file. The peer_id is the id the client randomly generated at the start of the download. This is a string of 20 characters long. The next key, ip is optional, which is generally used for the origin if it is on the same computer as the tracker. The port key is the port number the peer is listening. The key uploaded, downloaded, and left correspond to the total amount uploaded so far, total amount downloaded so far, and number of bytes left to download respectively. These are used so the tracker can keep track of statistics. For instance it may now how many seeders and leechers is in torrent session or how many times the particular content has been downloaded. The next key event is optional which can have the possible values of started, completed, stopped, or empty. Started is used when the download just begins.
Completed is used when the downloader finished downloading. Stopped is used when the peer stops downloading. The string empty is used during the announcements the peer makes at regular intervals [4]. The responses the tracker sends to the peers are bencoded dictionaries. This dictionary must contain two keys, interval and peers. The interval key is the number of seconds the peer should wait between regular requests. The peers key maps to a list of dictionaries that represents a list of peers which contains the keys, peer id, ip, and port. The peer id is the peer id the peer sent to the tracker in tracker request. The ip and port is the IP address and port number of the peer [4]. Typically the tracker returns 50 random peers in this response [1]. Notice how it can keep track of statistics by recording all of the information it receives from the peer requests. The bandwidth of the tracker is very low since peers only connect to the tracker for a very short time in long time intervals (usually 30 minutes). The total amount of bandwidth used by the tracker is currently around a thousandth the total amount of bandwidth used [3].
The BitTorrent Peer Protocol
Peers communicate with each other by sending messages directly to each other using the BitTorrent Peer Protocol. This protocol operates over TCP. In order for two peers to send messages to each other, they must first connect to each other by sending a handshake message. This handshake message starts with the string, “19 BitTorrent Protocol”. The 19 is the length prefix. After this string, there are eight reserved bytes, which currently are not used, but these are added to allow the protocol to be extendable. The next 20 bytes is the SHA1 hash value of the info value of the torrent file. This is same value that is used in tracker requests. This value is needed because a peer may be participating in many torrent swarms and when a peer sends a handshake message, it needs to know what specific swarm it is joining. The next 20 bytes is the 20-byte peer id, which is the same value that is sent to the tracker in the requests [4]. This completes the connection and now the peers may start sending other types of messages to this and all of their connected peers in their peer set. For all of the connections a peer has it must also maintain specific information about them. All connections are either in the choked or the unchoked state. If a peer is choked then it is not allowed to download data in this connection. All connections must also be in the interested or not state. A
peer is interested in another peer, if that peer has pieces of the content it does not have. Only when a peer is interested in another peer and is unchoked it may download from the connection.
Conclusion
BitTorrent is a new successful protocol used to distribute files. It does not have any search capability like other P2P networks, which allows BitTorrent to focus on only the content delivery. It uses a hybrid architecture with the file distribution only occurring in the peer-to-peer architecture. There needs to be a centralized server called the tracker, which is used for peers to find other peers. The protocol overhead is very low and BitTorrent does a good job at maximizing the upload utilization and does a decent job at trying to prevent unfairness. We have seen that BitTorrent is successful from the use of the choke algorithm and the rarest first piece selection. A lot of studies have been done on improving BitTorrent and since its still a fairly new protocol the algorithms are not perfect and are still being improved [2].