26-08-2014, 12:31 PM
The Zettabyte File System On Seminar Report
The Zettabyte File System.pdf (Size: 178.64 KB / Downloads: 10)
Abstract
In this paper we describe a new file system that
provides strong data integrity guarantees, simple
administration, and immense capacity. We show
that a few changes to the traditional high-level
file system architecture, including a redesign of
the interface between the file system and volume
manager, pooled storage, an object-based storage
model, checksumming of all on-disk blocks, and
transactional copy-on-write of all blocks, make
all three of these goals achievable.
Introduction
Upon hearing about our work on ZFS, some people
appear to be genuinely surprised and ask, “Aren’t
local file systems a solved problem?” From this
question, we can deduce that the speaker has
probably never lost important files, run out of
space on a partition, attempted to boot with a
damaged root file system, needed to repartition a
disk, struggled with a volume manager, spent a
weekend adding new storage to a file server, tried
to grow or shrink a file system, mistyped something
in /etc/fstab, experienced silent data corruption,
or waited for fsck to finish. Some people are lucky
enough to never encounter these problems because
they are handled behind the scenes by system
administrators. Others accept such inconveniences
as inevitable in any file system. While the last
few decades of file system research have resulted
in a great deal of progress in performance and
recoverability, much room for improvement remains
in the areas of data integrity, availability, ease of
administration, and scalability.
Design principles
In this section we describe the principles that
guided the design of ZFS, based on our goals of
strong data integrity, simple administration, and
immense capacity.
Dynamic file system size
If a file system can only use space from its partition,
the system administrator must then predict (i.e.,
guess) the maximum future size of each file system.
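The contrast with pooled storage can be made concrete with a small sketch. This is our own illustration, not ZFS code: two file systems either get fixed partitions sized by guesswork, or draw blocks from one shared pool.

```python
# Illustrative sketch (not ZFS code): why per-partition sizing forces guessing.
# Two file systems each get a fixed 100-block partition; with pooled storage
# they instead draw from one shared 200-block pool.

class Partitioned:
    def __init__(self, sizes):
        self.free = dict(sizes)          # fs name -> blocks left in its partition

    def write(self, fs, blocks):
        if self.free[fs] < blocks:
            raise OSError(f"ENOSPC on {fs}: partition full")
        self.free[fs] -= blocks

class Pooled:
    def __init__(self, total):
        self.free = total                # one shared supply of blocks

    def write(self, fs, blocks):
        if self.free < blocks:
            raise OSError("ENOSPC: pool exhausted")
        self.free -= blocks

part = Partitioned({"home": 100, "var": 100})
pool = Pooled(200)

# /var turns out to need 150 blocks and /home only 20.
pool.write("var", 150)
pool.write("home", 20)                   # fine: 30 blocks still free in the pool
try:
    part.write("var", 150)               # fails even though /home has 80 blocks idle
except OSError as e:
    print(e)
```

The administrator's "guess" shows up as the sizes passed to `Partitioned`; with the pool, no such guess is needed.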
Always consistent on-disk data
Most file systems today still allow the on-disk data
to be inconsistent in some way for varying periods
of time. If an unexpected crash or power cycle
happens while the on-disk state is inconsistent,
the file system will require some form of repair,
traditionally a lengthy fsck, before it can safely
be used again.
Error detection and correction
In the ideal world, disks never get corrupted,
hardware RAID never has bugs, and reads always
return the same data that was written. In the
real world, firmware has bugs too. Bugs in disk
controller firmware can result in a variety of errors,
including misdirected reads, misdirected writes, and
phantom writes [5]. In addition to hardware failures,
file system corruption can be caused by software
or administrative errors.
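ZFS's answer is to checksum every block and keep the checksum in the parent block pointer rather than next to the data. The sketch below is our illustration of that principle (the function names and layout are assumptions, not ZFS internals); storing the checksum separately means even a phantom write, where the disk reports success without writing anything, is caught on read.

```python
import hashlib

# Sketch: keep each block's checksum in its parent block pointer, not next
# to the data, so corruption of the data block alone is always detectable.

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

disk = {}   # block address -> data (simulated disk)

def write_block(addr, data: bytes) -> str:
    disk[addr] = data
    return checksum(data)            # the parent stores this, away from the data

def read_block(addr, expected: str) -> bytes:
    data = disk[addr]
    if checksum(data) != expected:
        raise IOError(f"checksum mismatch at block {addr}")
    return data

ptr = write_block(7, b"important data")
disk[7] = b"garbage"                 # simulate silent corruption on disk
try:
    read_block(7, ptr)
except IOError as e:
    print(e)                         # corruption is detected, not returned
```

A checksum stored alongside the data could be rewritten by the same misdirected write that damaged the data; the parent-pointer placement avoids that failure mode.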
Integration of the volume manager
The traditional way to add features like mirroring
is to write a volume manager that exports a logical
block device that looks exactly like a physical block
device. The benefit of this approach is that any
file system can use any volume manager and no file
system code has to be changed. However, emulating
a regular block device has serious drawbacks: the
block interface destroys all semantic information,
so the volume manager ends up managing on-disk
consistency much more conservatively than it needs
to since it doesn’t know what the dependencies
between blocks are. It also doesn’t know which
blocks are allocated and which are free, so it must
assume that all blocks are in use and need to be
kept consistent and up-to-date. In general, the
volume manager can’t make any optimizations
based on knowledge of higher-level semantics.
The Storage Pool Allocator
The Storage Pool Allocator (SPA) allocates blocks
from all the devices in a storage pool. One system
can have multiple storage pools, although most
systems will only need one pool. Unlike a volume
manager, the SPA does not present itself as a
logical block device. Instead, it presents itself as
an interface to allocate and free virtually addressed
blocks — basically, malloc() and free() for disk
space. We call the virtual addresses of disk blocks
data virtual addresses (DVAs). Using virtually
addressed blocks makes it easy to implement several
of our design principles. First, it allows dynamic
addition and removal of devices from the storage
pool without interrupting service. None of the
code above the SPA layer knows where a particular
block is physically located, so when a new device
is added, the SPA can immediately start allocating
new blocks from it without involving the rest of
the file system code.
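A toy model of this interface may help. The classes and the allocation policy below are our own assumptions, chosen only to show the shape of "malloc() and free() for disk space" over a device set that can grow at runtime:

```python
from dataclasses import dataclass

# Sketch of the SPA's allocation interface: virtually addressed blocks over
# a set of devices that can be extended while in service. Names (Spa, Dva)
# and the max-free-space policy are ours, not ZFS's.

@dataclass(frozen=True)
class Dva:
    vdev: int       # which device the block lives on
    offset: int     # block offset within that device

class Spa:
    def __init__(self):
        self.vdevs = []                  # free-block count per device

    def add_vdev(self, nblocks):
        self.vdevs.append(nblocks)       # new space is usable immediately

    def alloc(self) -> Dva:
        # simple policy: allocate from the device with the most free space
        vdev = max(range(len(self.vdevs)), key=lambda i: self.vdevs[i])
        if self.vdevs[vdev] == 0:
            raise OSError("ENOSPC")
        self.vdevs[vdev] -= 1
        return Dva(vdev, self.vdevs[vdev])

    def free(self, dva: Dva):
        self.vdevs[dva.vdev] += 1

spa = Spa()
spa.add_vdev(2)
a = spa.alloc()
b = spa.alloc()                          # the first device is now full
spa.add_vdev(100)                        # add a disk without interrupting service
c = spa.alloc()                          # lands on the new device right away
print(c.vdev)                            # -> 1
```

Because callers hold only DVAs, nothing above the allocator cares which physical device a block landed on, which is exactly what makes adding the second device transparent.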
ZFS in action
All of this high-level architectural stuff is great,
but what does ZFS actually look like in practice?
In this section, we’ll use a transcript (slightly
edited for two-column format) of ZFS in action to
demonstrate three of the benefits we claim for ZFS:
simplified administration, virtualization of storage,
and detection and correction of data corruption.
First, we’ll create a storage pool and several ZFS
file systems. Next, we’ll add more storage to the
pool dynamically and show that the file systems
start using the new space immediately. Then
we’ll deliberately scribble garbage on one side of a
mirror while it’s in active use, and show that ZFS
automatically detects and corrects the resulting
data corruption.
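The detect-and-repair step of that demonstration can be sketched in miniature. This is our simplification, not the actual ZFS mirror code: a read verifies one side of the mirror against the checksum held in the parent block pointer and, on mismatch, falls back to the other side and rewrites the damaged copy.

```python
import hashlib

# Mirror self-healing sketch (our simplification of the ZFS behavior).

def cksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

side_a = {0: b"hello"}
side_b = {0: b"hello"}
pointer = {0: cksum(b"hello")}       # checksum lives in the parent block pointer

def mirrored_read(addr):
    for primary, other in ((side_a, side_b), (side_b, side_a)):
        data = primary[addr]
        if cksum(data) == pointer[addr]:
            other[addr] = data       # copy good data back, repairing a bad sibling
            return data
    raise IOError("both mirror sides corrupt")

side_a[0] = b"garbage"               # scribble on one side of the live mirror
assert mirrored_read(0) == b"hello"  # the read still returns correct data
assert side_a[0] == b"hello"         # and the damaged copy was healed
```

A conventional mirror, lacking the checksum, would have no way to tell which side held the garbage.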
Future work
ZFS opens up a great number of possibilities.
Databases or NFS servers could use the DMU’s
transactional object interface directly. File systems
of all sorts are easily built on top of the DMU —
two of our team members implemented the first
usable prototype of the ZPL in only six weeks. The
zvol driver, a sparse volume emulator that uses a
DMU object as backing store, was implemented with
similarly little effort.
Conclusion
Current file systems still suffer from a variety of
problems: complicated administration, poor data
integrity guarantees, and limited capacity. The
key architectural elements of ZFS that solve these
problems are pooled storage, the movement of block
allocation out of the file system and into the storage
pool allocator, an object-based storage model,
checksumming of all on-disk blocks, and transactional
copy-on-write of all blocks.