Data compression
Introduction:-
Compression is used just about everywhere. All the images you get on the web are compressed, typically in the JPEG or GIF formats, most modems use compression, HDTV is compressed using MPEG-2, and several file systems automatically compress files when they are stored; the rest of us do it by hand. The neat thing about compression, as with many other algorithmic topics, is that the algorithms used in the real world make heavy use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs. Furthermore, algorithms with strong theoretical foundations play a critical role in real-world applications.
The task of compression consists of two components: an encoding algorithm that takes a message and generates a “compressed” representation (hopefully with fewer bits), and a decoding algorithm that reconstructs the original message, or some approximation of it, from the compressed representation. These two components are typically intricately tied together, since both have to understand the shared compressed representation.
We distinguish between lossless algorithms, which can reconstruct the original message exactly from the compressed message, and lossy algorithms, which can only reconstruct an approximation of the original message. Lossless algorithms are typically used for text, and lossy algorithms for images and sound, where a little loss in resolution is often undetectable or at least acceptable. “Lossy” is used in an abstract sense, however: it does not mean randomly lost pixels, but rather the loss of a quantity such as a frequency component, or perhaps the loss of noise. For example, one might think that lossy text compression would be unacceptable, imagining missing or switched characters. Consider instead a system that reworded sentences into a more standard form, or replaced words with synonyms, so that the file could be compressed better. Technically the compression would be lossy, since the text has changed, but the “meaning” and clarity of the message might be fully maintained, or even improved. In fact, Strunk and White might argue that good writing is the art of lossy text compression.
Is there a lossless algorithm that can compress all messages? There has been at least one patent application, “Methods for Data Compression”, which claimed to be able to compress all files; applied recursively, it claimed, a file could be reduced to almost nothing. With a little thought you should be able to convince yourself that this is not possible, at least if the source messages can contain any bit-sequence: by a simple counting (pigeonhole) argument, there are fewer bit-sequences shorter than n bits than there are bit-sequences of exactly n bits, so any encoder that shortens every message must map two different messages to the same output, and the decoder cannot then recover both.
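To make the lossless encoder/decoder pair concrete, here is a minimal Python sketch using the standard-library zlib module (a DEFLATE implementation, the same family of methods behind ZIP and gzip); the sample message is invented for the example.
```python
import zlib

# A made-up, highly repetitive message: redundancy is what lossless coders exploit.
message = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

# Encoding algorithm: produce a compressed representation (hopefully fewer bits).
compressed = zlib.compress(message, 9)

# Decoding algorithm: reconstruct the original message exactly (lossless).
restored = zlib.decompress(compressed)

assert restored == message  # lossless means bit-for-bit identical
print(f"original: {len(message)} bytes, compressed: {len(compressed)} bytes")
```
Running the same experiment on random or already-compressed bytes yields little or no saving, which is the practical face of the counting argument above.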
The data compression feature in the Microsoft® SQL Server® 2008 database software can help reduce the size of the database as well as improve the performance of I/O intensive workloads. However, extra CPU resources are required on the database server to compress and decompress the data, while data is exchanged with the application. Therefore, it is important to understand the workload characteristics when deciding which tables to compress. This white paper provides guidance on the following:
• How to decide which tables and indexes to compress
• How to estimate the resources required to compress a table
• How to reclaim space released by data compression
• The performance impacts of data compression on typical workloads
Data Compression:-
Data compression is particularly useful in communications because it enables devices to transmit or store the same amount of data in fewer bits. There are a variety of data compression techniques, but only a few have been standardized. The CCITT has defined a standard data compression technique for transmitting faxes and a compression standard for data communications through modems (CCITT V.42bis). In addition, there are file compression formats, such as ARC and ZIP. Data compression is also widely used in backup utilities, spreadsheet applications, and database management systems. Certain types of data, such as bit-mapped graphics, can be compressed to a small fraction of their normal size.
DCC (the Data Compression Conference) is held at the Cliff Lodge convention center in the beautiful Snowbird/Alta ski area, located a short ride from the Salt Lake City International Airport. It is an international forum for current work on data compression and related applications. The conference addresses not only compression methods for specific types of data (text, images, video, audio, medical, scientific, space, graphics, web content, etc.), but also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large data sets (including image and information mining, retrieval, archiving, backup, communications, and HCI). Both theoretical and experimental work are of interest.
About Data Compression:-
SQL Server 2008 provides two levels of data compression – row compression and page compression. Row compression helps store data more efficiently in a row by storing fixed-length data types in variable-length storage format. A compressed row uses 4 bits per compressed column to store the length of the data in the column. NULL and 0 values across all data types take no additional space other than these 4 bits.
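As a rough illustration of why variable-length storage of fixed-length types saves space, consider the conceptual Python sketch below. It is not SQL Server's actual row format, and the 4-byte integer column is a made-up example; it only shows the underlying idea.
```python
# Conceptual sketch only: compares fixed-length storage of 4-byte integers
# with a variable-length encoding that drops leading zero bytes.
# This is NOT SQL Server's on-disk row format, just the idea behind row compression.

values = [3, 42, 999, 70000, 0, None]  # a column where most values are small

def fixed_size(v):
    return 4  # an uncompressed int column always reserves 4 bytes per row

def variable_size(v):
    if v is None or v == 0:
        return 0  # NULL and 0 need no data bytes, only the per-column length bits
    return (v.bit_length() + 7) // 8  # store only the bytes actually needed

print("fixed-length bytes:   ", sum(fixed_size(v) for v in values))
print("variable-length bytes:", sum(variable_size(v) for v in values))
```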
Page compression is a superset of row compression. In addition to storing data efficiently inside a row, page compression optimizes the storage of multiple rows in a page by minimizing data redundancy. Page compression uses prefix compression and dictionary compression. Prefix compression looks for common patterns at the beginning of the column values of a given column across all rows on each page. Dictionary compression looks for exact value matches across all columns and rows on each page. Both prefix and dictionary compression are type-agnostic and treat every column value as a bag of bytes. For more information about the data compression feature, see SQL Server Books Online. The SQL Server Storage Engine blog is also a great resource for the internals of data compression. The data compression feature is available in the Enterprise and Developer editions of SQL Server 2008. Databases with compressed tables or indexes cannot be restored, attached, or otherwise used on other editions. To determine whether a database is using compression, query the dynamic management view (DMV) sys.
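The following Python sketch loosely mimics the prefix and dictionary steps described above on a made-up "page" of column values; the layout and encoding are invented for illustration and do not match the storage engine's real format.
```python
import os.path  # os.path.commonprefix finds the longest common prefix of strings

# Hypothetical "page": each inner list is one column's values across the rows on the page.
page = [
    ["aaabcd", "aaabcc", "aaacd", "aaabcd"],                   # column 1
    ["2012-04-01", "2012-04-01", "2012-04-02", "2012-04-01"],  # column 2
]

# Prefix compression (per column): factor out the longest common prefix,
# then keep only each value's suffix.
prefix_encoded = []
for column in page:
    prefix = os.path.commonprefix(column)
    prefix_encoded.append((prefix, [v[len(prefix):] for v in column]))

# Dictionary compression (whole page): store each distinct remaining value once
# and replace occurrences with a small index into the dictionary.
dictionary = []
encoded_page = []
for prefix, suffixes in prefix_encoded:
    ids = []
    for s in suffixes:
        if s not in dictionary:
            dictionary.append(s)
        ids.append(dictionary.index(s))
    encoded_page.append((prefix, ids))

print("dictionary:", dictionary)
print("encoded page:", encoded_page)
```
The point is simply that repeated prefixes are stored once per column and repeated values once per page, which is where the savings described above come from.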
Deciding What to Compress:-
SQL Server 2008 provides great flexibility in how data compression is used. Row and page compression can be configured at the table, index, indexed view, or partition level. Some examples of this flexibility, illustrated by the sketch after this list, are to:
• Row-compress some tables, page-compress some others, and don’t compress the rest.
• Page-compress a heap or clustered index, but have no compression on its nonclustered indexes.
• Row-compress one index, and have no compression on another index.
• Row-compress some partitions of a table, page-compress some others, and don’t compress the rest.
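To make the options above concrete, the sketch below issues the corresponding T-SQL from Python through pyodbc. The connection string, the table names dbo.Sales and dbo.Archive, and the index name IX_Sales_Date are hypothetical placeholders; the ALTER TABLE / ALTER INDEX ... REBUILD WITH (DATA_COMPRESSION = ...) statements are the standard SQL Server 2008 syntax for applying row or page compression at different granularities.
```python
import pyodbc

# Hypothetical connection string; replace with your own server and database.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

statements = [
    # Page-compress a clustered index (the table itself)...
    "ALTER TABLE dbo.Sales REBUILD WITH (DATA_COMPRESSION = PAGE)",
    # ...while row-compressing one of its nonclustered indexes.
    "ALTER INDEX IX_Sales_Date ON dbo.Sales REBUILD WITH (DATA_COMPRESSION = ROW)",
    # Compress only one partition of a partitioned table.
    "ALTER TABLE dbo.Archive REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE)",
]

for sql in statements:
    cursor.execute(sql)
conn.commit()
```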
With this flexibility comes the challenge of deciding what to compress. This section provides some guidelines to assist in that decision. Some of the factors that influence this decision are:
• Estimated space savings
• Application workload
Estimated Space Savings:-
The stored procedure sp_estimate_data_compression_savings estimates the amount of space saved by compressing a table and its indexes. It functions by taking a sample of the data and then compressing it in tempdb. Estimate the space savings for the largest tables and indexes in a database, and consider compressing only those tables and indexes that yield significant space savings.
The stored procedure sp_estimate_data_compression_savings estimates the space savings one table at a time. This can be wrapped in a script to estimate the space savings for all the tables and indexes in a database – as shown in these two blogs: Whole Database - Data Compression Procs and Procedure used for applying Database Compression to Microsoft SAP ERP system. Be aware that estimating data compression savings on an entire database may take a long time in a database with several thousand tables and indexes, such as an SAP ERP database.
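A hypothetical version of such a wrapper, written in Python with pyodbc (the connection string is a placeholder, table names are read from sys.tables, and the procedure's result rows are printed as returned rather than assuming particular column names), might look like this:
```python
import pyodbc

# Hypothetical connection string; point it at the database you want to analyze.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Enumerate every user table with its schema.
cursor.execute("""
    SELECT s.name AS schema_name, t.name AS table_name
    FROM sys.tables t
    JOIN sys.schemas s ON t.schema_id = s.schema_id
""")
tables = cursor.fetchall()

# Estimate PAGE compression savings one table at a time.
for schema_name, table_name in tables:
    cursor.execute(
        "EXEC sp_estimate_data_compression_savings "
        "@schema_name = ?, @object_name = ?, "
        "@index_id = NULL, @partition_number = NULL, "
        "@data_compression = 'PAGE'",
        schema_name, table_name,
    )
    for row in cursor.fetchall():
        print(schema_name, table_name, tuple(row))
```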
Data and Data Types:-
The amount of space saved by compressing a table depends on the “data” the table contains (after all, it is called “data” compression!). Some data compresses significantly, while other data doesn’t. Tables that contain the following patterns of data compress very well:
• Columns with numeric or fixed-length character data types where most values don’t require all the allocated bytes: For example, integers where most values are less than 1000
• Nullable columns where a significant number of the rows have a NULL value for the column
• Significant amounts of repeating data values or repeating prefix values in the data
Some patterns of data that do not benefit much from compression are:
• Columns with numeric or fixed-length character data types where most values require all the bytes allocated for the specific data type
• Not much repeating data
• Repeating data with non-repeating prefixes
• Data stored out of the row
• FILESTREAM data
Application Workload:-
Compressed pages are persisted as compressed on disk and stay compressed when read into memory. Data is decompressed (not the entire page, but only the data values of interest) when it meets one of the following conditions:
• It is read for filtering, sorting, or joining, or as part of a query response.
• It is updated by an application.
There is no in-memory, decompressed copy of the compressed page. Decompressing data consumes CPU. However, because compressed data uses fewer data pages, it also saves:
• Physical I/O: Because physical I/O is expensive from a workload perspective, reduced physical I/O often results in a bigger saving than the additional CPU cost to compress and decompress the data. Note that physical I/O is saved both because a smaller volume of data is read from or written to disk, and because more data can remain cached in buffer pool memory.
• Logical I/O (if data is in memory): Because logical I/O consumes CPU, reduced logical I/O can sometimes compensate for the CPU cost to compress and decompress the data.
The savings in logical and physical I/O are largest when tables or indexes are scanned. When singleton lookups (for read or write) are performed, the I/O savings from compression are smaller; they occur only if compression causes more requests to target the same page, which leads to reduced physical I/O.
The CPU overhead of row compression is usually minimal (generally less than or equal to 10 percent in our experience). If row compression results in space savings and the system can accommodate a 10 percent increase in CPU usage, all data should be row-compressed. For example, SAP ERP NetWeaver 7.00 Business Suite 7 and above use row compression on all tables.
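As a closing illustration, here is a hypothetical Python/pyodbc sketch of that "row-compress everything" approach; the connection string is a placeholder, and in practice you would first check the estimated space savings and CPU headroom discussed above.
```python
import pyodbc

# Hypothetical connection string; replace with your own.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Row-compress every user table (this rebuilds the heap or clustered index;
# nonclustered indexes would need separate ALTER INDEX statements).
cursor.execute(
    "SELECT s.name, t.name FROM sys.tables t "
    "JOIN sys.schemas s ON t.schema_id = s.schema_id"
)
for schema_name, table_name in cursor.fetchall():
    cursor.execute(
        f"ALTER TABLE [{schema_name}].[{table_name}] "
        "REBUILD WITH (DATA_COMPRESSION = ROW)"
    )
    conn.commit()  # commit each rebuild so a failure does not roll back earlier work
```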