The Google File System
ABSTRACT
We have designed and implemented the Google File System,
a scalable distributed file system for large distributed
data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers
high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed
file systems, our design has been driven by observations
of our application workloads and technological environment,
both current and anticipated, that reflect a marked
departure from some earlier file system assumptions. This
has led us to reexamine traditional choices and explore radically
different design points.
INTRODUCTION
We have designed and implemented the Google File System
(GFS) to meet the rapidly growing demands of Google’s
data processing needs. GFS shares many of the same goals
as previous distributed file systems such as performance,
scalability, reliability, and availability. However, its design
has been driven by key observations of our application workloads
and technological environment, both current and anticipated,
that reflect a marked departure from some earlier
file system design assumptions. We have reexamined traditional
choices and explored radically different points in the
design space.
First, component failures are the norm rather than the
exception. The file system consists of hundreds or even
thousands of storage machines built from inexpensive commodity
parts and is accessed by a comparable number of
client machines. The quantity and quality of the components
virtually guarantee that some are not functional at
any given time and some will not recover from their current
failures. We have seen problems caused by application
bugs, operating system bugs, human errors, and the failures
of disks, memory, connectors, networking, and power supplies.
Therefore, constant monitoring, error detection, fault
tolerance, and automatic recovery must be integral to the
system.
Interface
GFS provides a familiar file system interface, though it
does not implement a standard API such as POSIX. Files are
organized hierarchically in directories and identified by pathnames.
We support the usual operations to create, delete,
open, close, read, and write files.
Moreover, GFS has snapshot and record append operations.
Snapshot creates a copy of a file or a directory tree
at low cost. Record append allows multiple clients to append
data to the same file concurrently while guaranteeing
the atomicity of each individual client’s append. It is useful
for implementing multi-way merge results and producer-consumer
queues that many clients can simultaneously append
to without additional locking. We have found these
types of files to be invaluable in building large distributed
applications. Snapshot and record append are discussed further
in Sections 3.4 and 3.3 respectively.
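To make this interface concrete, the following is a minimal sketch in Go of what such a client-facing API might look like. The type and method names here are illustrative assumptions for this post, not the actual GFS client library; only the set of operations (create, delete, open, read, write, snapshot, and record append) comes from the text above.

// Hypothetical sketch of a GFS-style client API. Names and signatures
// are assumptions for illustration, not the real GFS client library.
package gfs

type Client interface {
	// Usual operations, addressed by hierarchical pathnames.
	Create(path string) error
	Delete(path string) error
	Open(path string) (File, error)

	// Snapshot copies a file or directory tree at low cost.
	Snapshot(srcPath, dstPath string) error
}

type File interface {
	// Reads and writes at an offset chosen by the caller.
	ReadAt(p []byte, off int64) (n int, err error)
	WriteAt(p []byte, off int64) (n int, err error)

	// RecordAppend appends data atomically at an offset chosen by the
	// system, so many clients can append to the same file concurrently
	// without additional locking. The chosen offset is returned.
	RecordAppend(p []byte) (off int64, err error)

	Close() error
}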
Architecture
A GFS cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients, as shown
in Figure 1. Each of these is typically a commodity Linux
machine running a user-level server process. It is easy to run
both a chunkserver and a client on the same machine, as long
as machine resources permit and the lower reliability caused
by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is
identified by an immutable and globally unique 64 bit chunk
handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and
read or write chunk data specified by a chunk handle and
byte range. For reliability, each chunk is replicated on multiple
chunkservers. By default, we store three replicas, though
users can designate different replication levels for different
regions of the file namespace.
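As a rough illustration of this arrangement, the sketch below shows a chunkserver storing each chunk replica as a plain local file keyed by its 64-bit handle and serving a read for a (handle, byte range) pair. The directory layout and file-naming scheme are assumptions made for this example; the paper only states that chunks are stored as ordinary Linux files.

// Sketch of a chunkserver keeping chunk replicas as plain local files.
// The path scheme and function names are illustrative assumptions.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type ChunkHandle uint64 // immutable, globally unique, assigned by the master

// chunkPath maps a chunk handle to a plain Linux file on local disk.
func chunkPath(dir string, h ChunkHandle) string {
	return filepath.Join(dir, fmt.Sprintf("%016x.chunk", uint64(h)))
}

// readChunk returns the requested byte range of one chunk replica.
func readChunk(dir string, h ChunkHandle, off int64, length int) ([]byte, error) {
	f, err := os.Open(chunkPath(dir, h))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, length)
	n, err := f.ReadAt(buf, off)
	return buf[:n], err
}

func main() {
	// Illustrative call: read 4 KB starting at offset 0 of chunk 0x2ef0.
	data, err := readChunk("/var/gfs/chunks", 0x2ef0, 0, 4096)
	fmt.Println(len(data), err)
}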
Chunk Size
Chunk size is one of the key design parameters. We have
chosen 64 MB, which is much larger than typical file system
block sizes. Each chunk replica is stored as a plain
Linux file on a chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal
fragmentation, perhaps the greatest objection against such
a large chunk size.
A large chunk size offers several important advantages.
First, it reduces clients’ need to interact with the master
because reads and writes on the same chunk require only
one initial request to the master for chunk location information.
The reduction is especially significant for our workloads
because applications mostly read and write large files
sequentially. Even for small random reads, the client can
comfortably cache all the chunk location information for a
multi-TB working set.
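A small worked sketch of the arithmetic behind these claims follows; the function names are illustrative, but the chunk size and the offset-to-index calculation follow directly from the text.

// Client-side offset-to-chunk translation with 64 MB chunks.
package main

import "fmt"

const ChunkSize = 64 << 20 // 64 MB, the chunk size chosen by GFS

// chunkIndex converts a byte offset within a file into the index of the
// chunk holding it; the client asks the master about (file, chunk index).
func chunkIndex(offset int64) int64 {
	return offset / ChunkSize
}

func main() {
	// A 1 TB file spans only 2^40 / 2^26 = 16,384 chunks, so even a
	// multi-TB working set needs relatively few chunk-location entries.
	fileSize := int64(1) << 40 // 1 TB
	fmt.Println("chunks in a 1 TB file:", fileSize/ChunkSize)

	// A read at byte offset 200,000,000 falls in chunk index 2.
	fmt.Println("chunk index for offset 200,000,000:", chunkIndex(200_000_000))
}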
Metadata
The master stores three major types of metadata: the file
and chunk namespaces, the mapping from files to chunks,
and the locations of each chunk’s replicas. All metadata is
kept in the master’s memory. The first two types (namespaces
and file-to-chunk mapping) are also kept persistent by
logging mutations to an operation log stored on the master’s
local disk and replicated on remote machines. Using
a log allows us to update the master state simply, reliably,
and without risking inconsistencies in the event of a master
crash. The master does not store chunklo cation information
persistently. Instead, it asks each chunkserver about its
chunks at master startup and whenever a chunkserver joins
the cluster.
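The sketch below summarizes this split between the three metadata types. The Go struct layout is an assumption made for illustration; what it preserves from the text is that namespaces and the file-to-chunk mapping are persisted via the operation log, while chunk locations are rebuilt from chunkserver reports at startup and whenever a chunkserver joins.

// Sketch of the master's in-memory metadata, simplified for illustration.
package main

import "fmt"

type ChunkHandle uint64

type masterState struct {
	// 1. File and chunk namespaces, and
	// 2. the mapping from each file to its chunks.
	// Mutations to these are persisted in the operation log.
	files map[string][]ChunkHandle // pathname -> ordered chunk handles

	// 3. Locations of each chunk's replicas. Not persisted: the master
	// rebuilds this by asking chunkservers at startup and whenever a
	// chunkserver joins the cluster.
	locations map[ChunkHandle][]string // chunk handle -> chunkserver addresses
}

// recordChunkReport refreshes replica locations from one chunkserver's report.
func (m *masterState) recordChunkReport(server string, chunks []ChunkHandle) {
	for _, h := range chunks {
		m.locations[h] = append(m.locations[h], server)
	}
}

func main() {
	m := &masterState{
		files:     map[string][]ChunkHandle{"/logs/web.0": {0x1, 0x2}},
		locations: map[ChunkHandle][]string{},
	}
	// A chunkserver reports the chunks it holds when it joins the cluster.
	m.recordChunkReport("cs-17:7000", []ChunkHandle{0x1, 0x2})
	fmt.Println(m.locations[0x1]) // [cs-17:7000]
}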