01-08-2012, 04:33 PM
Efficient Deduplication Techniques for Modern Backup Operation
INTRODUCTION
Motivation
The recent introduction of digital TV, digital camcorders, and other communication technologies has rapidly accelerated the growth of data maintained in digital form. In 2007, for the first time ever, the total volume of digital content exceeded the global storage capacity, and it was estimated that by 2011 only half of all digital information would be stored [1]. Further, the volume of automatically generated information now exceeds the volume of human-generated digital information [1]. Compounding the problem of storage space, digitized information has a more fundamental weakness: it is more vulnerable to error than information in legacy media such as paper, books, and film. When data is stored in a computer storage system, a single storage error or power failure can put a large amount of information in danger. To protect against such problems, a number of technologies have been used to strengthen the availability and reliability of digital data, including mirroring, replication, and parity information. At the application layer, the administrator replicates the data onto additional copies called "backups" so that the original information can be restored in case of data loss.
Related Work
There are broadly three approaches for reducing the size of information: delta encoding, duplicate elimination, and compression. These techniques are used independently or in combination to improve space efficiency and network bandwidth utilization. Delta encoding stores only the differences between successive versions of the data. It is a common and efficient way to reduce redundancy when changes are small, and it is used in many applications including source control [2] and backup [3]. Kulkarni et al. [4] proposed redundancy elimination at the block level (REBL), which combines block suppression, delta encoding, and compression.
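
As a concrete illustration (not code from the paper), the sketch below stores a revised version as copy references into a base version plus the literal bytes that changed; the function names are invented for the example.

[code]
# A minimal sketch of delta encoding over two in-memory versions.
import difflib

def make_delta(base, revised):
    # Keep references into `base` for unchanged regions; store only the
    # literal content of regions that differ.
    delta = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, base, revised).get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))            # reference into the base
        else:
            delta.append(("insert", revised[j1:j2]))  # new/changed data only
    return delta

def apply_delta(base, delta):
    # Rebuild the revised version from the base plus the stored delta.
    return "".join(base[op[1]:op[2]] if op[0] == "copy" else op[1] for op in delta)

base = "the backup stores full copies of the data"
revised = "the backup stores incremental copies of the data"
delta = make_delta(base, revised)
assert apply_delta(base, delta) == revised  # small change, small delta
[/code]

When changes are small, the delta consists mostly of short copy references, which is why delta encoding works well in the version-control and backup settings cited above.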
SYSTEM OVERVIEW
System Organization
PRUNE (Prompt Redundancy Elimination) is designed for a distributed environment where backups are located at a remote site. We use the terms "client" and "server" for the location of the original data to be backed up and the location of the backup files, respectively. Deduplication consists of three components: chunking, fingerprint generation, and redundancy detection.
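
As a rough illustration of how these three components fit together, the Python sketch below chunks a byte stream, fingerprints each chunk, and stores only chunks whose fingerprints have not been seen before. It shows the general deduplication flow only; the 8 KB fixed chunk size and the SHA-1 fingerprint are assumptions for the example, not PRUNE's actual design.

[code]
# A minimal sketch of the deduplication flow, not PRUNE's protocol.
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size for the example

def backup(data, store):
    # Returns a "recipe" of fingerprints; only previously unseen chunks
    # are added to the store (and, in a real system, sent to the server).
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]    # 1. chunking
        fp = hashlib.sha1(chunk).hexdigest()  # 2. fingerprint generation
        if fp not in store:                   # 3. redundancy detection
            store[fp] = chunk
        recipe.append(fp)
    return recipe

def restore(recipe, store):
    return b"".join(store[fp] for fp in recipe)

store = {}
backup(b"A" * 20000, store)                        # stores 2 unique chunks
recipe = backup(b"A" * 20000 + b"B" * 100, store)  # adds only 1 new chunk
assert restore(recipe, store) == b"A" * 20000 + b"B" * 100
[/code]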
Chunking Module
Chunking is the operation of scanning a file and partitioning it into pieces. Each piece is called a chunk and is the unit of redundancy detection. There are two types of chunking: fixed-size chunking and variable-size chunking. With fixed-size chunking, a file is partitioned into fixed-size units, e.g., 8 KB blocks. Fixed-size chunking is conceptually simple and fast. However, it has an important drawback: when a small amount of data is inserted into or deleted from a file, an entirely different set of chunks is generated from the updated file, because every chunk boundary after the edit point shifts. To address this problem, variable-size chunking, also known as content-based chunking, has been proposed [8]; a sketch follows below.
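
The following is a minimal sketch of content-based chunking, using a simple polynomial rolling hash as a stand-in for the Rabin fingerprinting typically used for this purpose; the window size, minimum chunk size, and boundary mask are illustrative values, not parameters from the paper. Because a cut point depends only on the bytes inside the rolling window, inserting or deleting a few bytes shifts boundaries only near the edit, and the chunks further along the file are reproduced unchanged.

[code]
# A minimal sketch of variable-size (content-based) chunking using a
# simple polynomial rolling hash; all constants are illustrative.
import os

WINDOW = 48       # bytes in the rolling window
MIN_CHUNK = 2048  # suppress cut points that would be too close together
MASK = 0x1FFF     # cut when the low 13 hash bits are zero (~8 KB average)
PRIME = 1000003
MOD = (1 << 61) - 1
POW_OUT = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window

def chunks(data):
    h, start = 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_OUT) % MOD  # drop oldest byte
        h = (h * PRIME + b) % MOD                       # add newest byte
        # A boundary is declared purely from the window contents, so an
        # edit shifts cut points only near the edit itself.
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

data = os.urandom(200_000)
before = set(chunks(data))
after = set(chunks(data[:500] + b"X" + data[500:]))  # insert one byte
print(len(before & after), "of", len(before), "chunks unchanged after the edit")
[/code]

Under fixed-size chunking, the same one-byte insert would shift every chunk boundary after the edit point and defeat redundancy detection for the rest of the file.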