Analysis and Comparison of Two Important Compression Methods
Abstract
As the world evolves to rely more and more on computers and data storage, data compression becomes an increasingly useful and important tool. Most data used by humans contains a high amount of redundancy. Compression algorithms aim to eliminate this redundancy, while still preserving all the information contained in the original data. The end result is a reduction in storage requirements, in exchange for increased computation.
One of the most well-known and fundamental compression algorithms is the technique known as Huffman Coding, invented in 1952 by David Huffman. It forms the basis for innumerable theoretical and real-world compression schemes, and serves to prove several important properties of compression in general. Unfortunately, Huffman Coding has shortcomings. It always requires an integral number of bits to represent a given symbol; as a result, inefficiencies arise when the optimal number of bits would be fractional. A technique known as Arithmetic Coding has gained prominence as a replacement for Huffman Coding. It avoids the pitfalls of integral bit counts by representing the compressed data conceptually as a single fractional number in the interval [0, 1).
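To illustrate the fractional-bit problem concretely: information theory gives the optimal code length for a symbol of probability p as -log2 p bits. For a symbol with probability 0.9 (an illustrative value, not taken from the paper),

\[
\ell_{\text{opt}} = -\log_2 0.9 \approx 0.152 \text{ bits},
\]

yet Huffman Coding must still spend at least one whole bit on that symbol.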
This paper aims to present and analyze both algorithms.
Introduction
Overview of compression’s purpose and history
Compression is used in a wide-ranging variety of applications, from the transmission of science data collected on board NASA space probes, to the storage of digital music on personal computers. Almost as long as there has been digital data, there has been compression of that data.
The basic aim of compression is to sacrifice time and/or processing power in exchange for a reduction in storage requirements. This paper concerns itself in particular with lossless compression, which requires that the compressed data, when decompressed, restore an exact copy of the original.
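As a quick illustration of the lossless requirement, the following Python sketch shows a round trip that restores the input exactly (the standard-library zlib module and the sample input are chosen here purely for demonstration; the paper itself does not discuss zlib):

    import zlib

    original = b"abracadabra " * 20          # deliberately redundant input
    compressed = zlib.compress(original)
    restored = zlib.decompress(compressed)

    assert restored == original              # lossless: exact round trip
    print(len(original), "->", len(compressed), "bytes")

The assertion passing is the defining property of lossless compression; the size reduction comes from the redundancy in the input.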
Summary of significant principles in information theory
It is worth noting that there are very strict limits to lossless compression. In particular, no compressor is capable of compressing all data. If it compresses some input streams into smaller output streams, it is mathematically provable that there must exist other input streams which actually get larger when fed through the compressor. This follows from a simple counting (pigeonhole) argument: a lossless compressor must map distinct inputs to distinct outputs, and there are fewer short strings than long ones.
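In symbols, the counting goes as follows: the number of distinct binary strings strictly shorter than n bits is

\[
\sum_{k=0}^{n-1} 2^k = 2^n - 1 < 2^n,
\]

so the 2^n possible inputs of length n cannot all be mapped to distinct shorter outputs, and at least one input must grow.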
Thus lossless compression schemes attempt to compress “typical” data streams, a particular subset of all possible input streams. Just what constitutes “typical” depends on the application; hence the abundance of different compression methods.
Huffman Coding
Overview
Huffman Coding is a variable-length encoding which uses the frequency of symbols in an input string to determine the optimal number of bits for representing each symbol in the output. It requires an initial scan of the entire input stream to determine the symbol frequency distribution. From these frequencies it builds a tree representing an optimal prefix-free code, in which no codeword is a prefix of any other. The actual compression is then performed by simply applying the translation given by the prefix code tree.
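As a concrete sketch of this overview, here is a minimal Python implementation (the function name huffman_code, the heapq-based merging, and the sample input are illustrative choices, not taken from the paper):

    import heapq
    from collections import Counter

    def huffman_code(data):
        """Build a prefix code table {symbol: bitstring} for data.
        Minimal sketch; assumes at least two distinct symbols."""
        freq = Counter(data)
        # Heap entries are (weight, tiebreaker, tree); a tree is either
        # a leaf symbol or a (left, right) pair of subtrees.
        heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)     # pop the two lightest trees
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next_id, (left, right)))  # merge them
            next_id += 1
        _, _, tree = heap[0]
        code = {}
        def walk(node, prefix):
            if isinstance(node, tuple):           # internal node
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                 # leaf symbol
                code[node] = prefix
        walk(tree, "")
        return code

    print(huffman_code("abracadabra"))

Frequent symbols such as 'a' receive the shortest codewords, which is exactly the variable-length property described above.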
Precise algorithm description
There are two stages to the static Huffman Coding algorithm: the first is an analysis stage that determines the coding scheme; the second is the actual encoding stage.
The first step in determining the coding scheme is to make a single pass over the entire input stream, recording the frequency of each input symbol encountered. Each symbol is then formed into a trivial tree with a single node. The number of times a particular symbol occurs in the input stream will be referred to as its tree’s weight. One must then perform the following iterative algorithm: