Data compression
Introduction:-
Compression is used just about everywhere. All the images you get on the web are compressed, typically in the JPEG or GIF formats, most modems use compression, HDTV is compressed using MPEG-2, and several file systems automatically compress files when they are stored; the rest of us do it by hand. The neat thing about compression, as with many other algorithmic topics, is that the algorithms used in the real world make heavy use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs. Furthermore, algorithms with strong theoretical foundations play a critical role in real-world applications.
The task of compression consists of two components: an encoding algorithm that takes a message and generates a “compressed” representation (hopefully with fewer bits), and a decoding algorithm that reconstructs the original message, or some approximation of it, from the compressed representation. These two components are typically intricately tied together, since both have to understand the shared compressed representation.
We distinguish between lossless algorithms, which can reconstruct the original message exactly from the compressed message, and lossy algorithms, which can only reconstruct an approximation of the original message. Lossless algorithms are typically used for text, and lossy algorithms for images and sound, where a little loss in resolution is often undetectable or at least acceptable. “Lossy” is used in an abstract sense, however: it does not mean randomly lost pixels, but rather the loss of a quantity such as a frequency component, or perhaps the loss of noise. For example, one might think that lossy text compression would be unacceptable, imagining missing or switched characters. Consider instead a system that reworded sentences into a more standard form, or replaced words with synonyms, so that the file could be compressed better. Technically the compression would be lossy, since the text has changed, but the “meaning” and clarity of the message might be fully maintained, or even improved. In fact, Strunk and White might argue that good writing is the art of lossy text compression.
Is there a lossless algorithm that can compress all messages? There has been at least one patent application, “Methods for Data Compression”, which claimed to be able to compress all files; applied recursively, it claimed, a file could be reduced to almost nothing. With a little thought you should be able to convince yourself that this is not possible, at least if the source messages can contain any bit-sequence: by a simple counting (pigeonhole) argument, there are fewer bit-sequences shorter than n bits than there are bit-sequences of exactly n bits, so any encoder that shortens every message must map two different messages to the same output, and the decoder cannot then recover both.
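To make the lossless encoder/decoder pair concrete, here is a minimal Python sketch using the standard-library zlib module (a DEFLATE implementation, the same family of methods behind ZIP and gzip); the sample message is invented for the example.
```python
import zlib

# A made-up, highly repetitive message: redundancy is what lossless coders exploit.
message = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

# Encoding algorithm: produce a compressed representation (hopefully fewer bits).
compressed = zlib.compress(message, 9)

# Decoding algorithm: reconstruct the original message exactly (lossless).
restored = zlib.decompress(compressed)

assert restored == message  # lossless means bit-for-bit identical
print(f"original: {len(message)} bytes, compressed: {len(compressed)} bytes")
```
Running the same experiment on random or already-compressed bytes yields little or no saving, which is the practical face of the counting argument above.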
The data compression feature in the Microsoft® SQL Server® 2008 database software can help reduce the size of the database as well as improve the performance of I/O intensive workloads. However, extra CPU resources are required on the database server to compress and decompress the data, while data is exchanged with the application. Therefore, it is important to understand the workload characteristics when deciding which tables to compress. This white paper provides guidance on the following:
• How to decide which tables and indexes to compress
• How to estimate the resources required to compress a table
• How to reclaim space released by data compression
• The performance impacts of data compression on typical workloads
Data Compression:-
Data compression is particularly useful in communications because it enables devices to transmit or store the same amount of data in fewer bits. There are a variety of data compression techniques, but only a few have been standardized. The CCITT has defined a standard data compression technique for transmitting faxes and a compression standard for data communications through modems (CCITT V.42bis). In addition, there are file compression formats, such as ARC and ZIP. Data compression is also widely used in backup utilities, spreadsheet applications, and database management systems. Certain types of data, such as bit-mapped graphics, can be compressed to a small fraction of their normal size.
DCC (the Data Compression Conference) is held at the Cliff Lodge convention center in the beautiful Snowbird/Alta ski area, located a short ride from the Salt Lake City International Airport. It is an international forum for current work on data compression and related applications. The conference addresses not only compression methods for specific types of data (text, images, video, audio, medical, scientific, space, graphics, web content, etc.), but also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large data sets (including image and information mining, retrieval, archiving, backup, communications, and HCI). Both theoretical and experimental work are of interest.
About Data Compression:-
SQL Server 2008 provides two levels of data compression – row compression and page compression. Row compression helps store data more efficiently in a row by storing fixed-length data types in variable-length storage format. A compressed row uses 4 bits per compressed column to store the length of the data in the column. NULL and 0 values across all data types take no additional space other than these 4 bits.
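As a rough illustration of why variable-length storage of fixed-length types saves space, consider the conceptual Python sketch below. It is not SQL Server's actual row format, and the 4-byte integer column is a made-up example; it only shows the underlying idea.
```python
# Conceptual sketch only: compares fixed-length storage of 4-byte integers
# with a variable-length encoding that drops leading zero bytes.
# This is NOT SQL Server's on-disk row format, just the idea behind row compression.

values = [3, 42, 999, 70000, 0, None]  # a column where most values are small

def fixed_size(v):
    return 4  # an uncompressed int column always reserves 4 bytes per row

def variable_size(v):
    if v is None or v == 0:
        return 0  # NULL and 0 need no data bytes, only the per-column length bits
    return (v.bit_length() + 7) // 8  # store only the bytes actually needed

print("fixed-length bytes:   ", sum(fixed_size(v) for v in values))
print("variable-length bytes:", sum(variable_size(v) for v in values))
```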
Page compression is a superset of row compression. In addition to storing data efficiently inside a row, page compression optimizes the storage of multiple rows in a page by minimizing data redundancy. Page compression uses prefix compression and dictionary compression. Prefix compression looks for common patterns at the beginning of the column values of a given column across all rows on each page. Dictionary compression looks for exact value matches across all columns and rows on each page. Both prefix and dictionary compression are type-agnostic and treat every column value as a bag of bytes. For more information about the data compression feature, see SQL Server Books Online. The SQL Server Storage Engine blog is also a great resource for the internals of data compression. The data compression feature is available in the Enterprise and Developer editions of SQL Server 2008. Databases with compressed tables or indexes cannot be restored, attached, or otherwise used on other editions. To determine whether a database is using compression, query the dynamic management view (DMV) sys.
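The following Python sketch loosely mimics the prefix and dictionary steps described above on a made-up "page" of column values; the layout and encoding are invented for illustration and do not match the storage engine's real format.
```python
import os.path  # os.path.commonprefix finds the longest common prefix of strings

# Hypothetical "page": each inner list is one column's values across the rows on the page.
page = [
    ["aaabcd", "aaabcc", "aaacd", "aaabcd"],                   # column 1
    ["2012-04-01", "2012-04-01", "2012-04-02", "2012-04-01"],  # column 2
]

# Prefix compression (per column): factor out the longest common prefix,
# then keep only each value's suffix.
prefix_encoded = []
for column in page:
    prefix = os.path.commonprefix(column)
    prefix_encoded.append((prefix, [v[len(prefix):] for v in column]))

# Dictionary compression (whole page): store each distinct remaining value once
# and replace occurrences with a small index into the dictionary.
dictionary = []
encoded_page = []
for prefix, suffixes in prefix_encoded:
    ids = []
    for s in suffixes:
        if s not in dictionary:
            dictionary.append(s)
        ids.append(dictionary.index(s))
    encoded_page.append((prefix, ids))

print("dictionary:", dictionary)
print("encoded page:", encoded_page)
```
The point is simply that repeated prefixes are stored once per column and repeated values once per page, which is where the savings described above come from.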
Deciding What to Compress:-
SQL Server 2008 provides great flexibility in how data compression is used. Row and page compression can be configured at the table, index, indexed view, or partition level. Some examples of this flexibility, illustrated by the sketch after this list, are to:
• Row-compress some tables, page-compress some others, and don’t compress the rest.
• Page-compress a heap or clustered index, but have no compression on its nonclustered indexes.
• Row-compress one index, and have no compression on another index.
• Row-compress some partitions of a table, page-compress some others, and don’t compress the rest.
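To make the options above concrete, the sketch below issues the corresponding T-SQL from Python through pyodbc. The connection string, the table names dbo.Sales and dbo.Archive, and the index name IX_Sales_Date are hypothetical placeholders; the ALTER TABLE / ALTER INDEX ... REBUILD WITH (DATA_COMPRESSION = ...) statements are the standard SQL Server 2008 syntax for applying row or page compression at different granularities.
```python
import pyodbc

# Hypothetical connection string; replace with your own server and database.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

statements = [
    # Page-compress a clustered index (the table itself)...
    "ALTER TABLE dbo.Sales REBUILD WITH (DATA_COMPRESSION = PAGE)",
    # ...while row-compressing one of its nonclustered indexes.
    "ALTER INDEX IX_Sales_Date ON dbo.Sales REBUILD WITH (DATA_COMPRESSION = ROW)",
    # Compress only one partition of a partitioned table.
    "ALTER TABLE dbo.Archive REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE)",
]

for sql in statements:
    cursor.execute(sql)
conn.commit()
```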
With this flexibility comes the challenge of deciding what to compress. This section provides some guidelines to assist in that decision. Some of the factors that influence this decision are:
• Estimated space savings
• Application workload
Estimated Space Savings:-
The stored procedure sp_estimate_data_compression_savings estimates the amount of space saved by compressing a table and its indexes. It functions by taking a sample of the data and then compressing it in tempdb. Estimate the space savings for the largest tables and indexes in a database, and consider compressing only those tables and indexes that yield significant space savings.
The stored procedure sp_estimate_data_compression_savings estimates the space savings one table at a time. This can be wrapped in a script to estimate the space savings for all the tables and indexes in a database – as shown in these two blogs: Whole Database - Data Compression Procs and Procedure used for applying Database Compression to Microsoft SAP ERP system. Be aware that estimating data compression savings on an entire database may take a long time in a database with several thousand tables and indexes, such as an SAP ERP database.
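A hypothetical version of such a wrapper, written in Python with pyodbc (the connection string is a placeholder, table names are read from sys.tables, and the procedure's result rows are printed as returned rather than assuming particular column names), might look like this:
```python
import pyodbc

# Hypothetical connection string; point it at the database you want to analyze.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Enumerate every user table with its schema.
cursor.execute("""
    SELECT s.name AS schema_name, t.name AS table_name
    FROM sys.tables t
    JOIN sys.schemas s ON t.schema_id = s.schema_id
""")
tables = cursor.fetchall()

# Estimate PAGE compression savings one table at a time.
for schema_name, table_name in tables:
    cursor.execute(
        "EXEC sp_estimate_data_compression_savings "
        "@schema_name = ?, @object_name = ?, "
        "@index_id = NULL, @partition_number = NULL, "
        "@data_compression = 'PAGE'",
        schema_name, table_name,
    )
    for row in cursor.fetchall():
        print(schema_name, table_name, tuple(row))
```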
Data and Data Types:-
The amount of space saved by compressing a table depends on the “data” the table contains (after all, it is called “data” compression!). Some data compresses significantly, while other data doesn’t. Tables that contain the following patterns of data compress very well:
• Columns with numeric or fixed-length character data types where most values don’t require all the allocated bytes: For example, integers where most values are less than 1000
• Nullable columns where a significant number of the rows have a NULL value for the column
• Significant amounts of repeating data values or repeating prefix values in the data
Some patterns of data that do not benefit much from compression are:
• Columns with numeric or fixed-length character data types where most values require all the bytes allocated for the specific data type
• Not much repeating data
• Repeating data with non-repeating prefixes
• Data stored out of the row
• FILESTREAM data
Application Workload:-
Compressed pages are persisted as compressed on disk and stay compressed when read into memory. Data is decompressed (not the entire page, but only the data values of interest) when it meets one of the following conditions:
• It is read for filtering, sorting, or joining, or as part of a query response.
• It is updated by an application.
There is no in-memory, decompressed copy of the compressed page. Decompressing data consumes CPU. However, because compressed data uses fewer data pages, it also saves:
• Physical I/O: Because physical I/O is expensive from a workload perspective, reduced physical I/O often results in a bigger saving than the additional CPU cost to compress and decompress the data. Note that physical I/O is saved both because a smaller volume of data is read from or written to disk, and because more data can remain cached in buffer pool memory.
• Logical I/O (if data is in memory): Because logical I/O consumes CPU, reduced logical I/O can sometimes compensate for the CPU cost to compress and decompress the data.
The savings in logical and physical I/O are largest when tables or indexes are scanned. When singleton lookups (for read or write) are performed, the I/O savings from compression are smaller; they occur only if compression causes more requests to target the same page, which leads to reduced physical I/O.
The CPU overhead of row compression is usually minimal (generally less than or equal to 10 percent in our experience). If row compression results in space savings and the system can accommodate a 10 percent increase in CPU usage, all data should be row-compressed. For example, SAP ERP NetWeaver 7.00 Business Suite 7 and above use row compression on all tables.
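As a closing illustration, here is a hypothetical Python/pyodbc sketch of that "row-compress everything" approach; the connection string is a placeholder, and in practice you would first check the estimated space savings and CPU headroom discussed above.
```python
import pyodbc

# Hypothetical connection string; replace with your own.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Row-compress every user table (this rebuilds the heap or clustered index;
# nonclustered indexes would need separate ALTER INDEX statements).
cursor.execute(
    "SELECT s.name, t.name FROM sys.tables t "
    "JOIN sys.schemas s ON t.schema_id = s.schema_id"
)
for schema_name, table_name in cursor.fetchall():
    cursor.execute(
        f"ALTER TABLE [{schema_name}].[{table_name}] "
        "REBUILD WITH (DATA_COMPRESSION = ROW)"
    )
    conn.commit()  # commit each rebuild so a failure does not roll back earlier work
```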