22-10-2012, 03:56 PM
Compression of XML Data
Compression of XML.pdf (Size: 618.81 KB / Downloads: 24)
ABSTRACT
Most implementation of the Document Object Model (DOM), which are
used to represent XML data in the computer's memory, suer from extensive
memory requirements. Thus large volumes of data are hard to handle.
This work tries to analyse the reasons for this and designs a system that
is specically tailored for the need of large document. A DOM compliant
data structure called DDOM, that uses dictionary compression to reduce the
amount of memory being used, was designed, implemented and tested. Comparison
with other implementations show a saving in the order of 30% to over
80% in terms of memory usage. The in
uence of dierent types of data on
the compression ratio is shown. This report also identies the the conceptual
dierences between relational and irregular data sources with respect to the
possibilities of a compact representation.
INTRODUCTION: XML AND
COMPRESSION
XML, the eXtensible Markup Language, is a current key topic in nearly all
aspects of information technology. It is a way of describing semi-structured
data. Beside the data itself, its semantics are encoded as well. The idea
behind the development of XML through the world wide web consortium
(W3C) was to create a universal way of exchanging data across system and
vendor boundaries.
So far this design goal seems to be achievable as more and more applications
are developed using XML as format for data exchange. Although
the development is still in its early stages, a massive demand becomes obvious.
But every thing comes at a price. The price to pay for versatility and
exchangeability is the verbosity of XML. Data represented in XML format
usually requires more space than represented in another, purpose-designed,
typically binary-coded format. Since XML is primarily designed to exchange
data over networks, i.e. the Internet, this represents a waste of resources,
namely transmission bandwidth. It also creates some problems when working
with the data locally, in case the document becomes too big to t into
the computer's memory.
Document object model (DOM) representation
The document object model (DOM) represents a logical view of an XML
document. Most XML elements introduced above are represented by an
object. These objects have certain properties and methods. The denition of
the DOM [20] declares these objects, their properties and behaviors. It does
not imply a specic implementation though. However, there are some namebindings
included for Java, to avoid multiple implementation with dierent
names for identical functionality. These are given in the form of interfaces,
i.e. method signatures an implementing object has to use.
Of course the content of an XML document remains unchanged, even if it
is represented in a dierent way. The DOM structure makes the hierarchical
structure of XML obvious. Any document has exactly one element at its
highest level, the document element. This element can have an arbitrary
number of children, which in turn can have children again. That means that
any XML document can be represented as a tree.
Other representations
There are some other ways of representing XML documents, which will be of
little to no importance to this project. One emerging standard is the JDOM
API [9], developed by Jason Hunter and Brett McLaughlin. It denes an
API that is more consistent and logical from a Java programmers point of
view. It also tries to integrate the dierent APIs that already exist. The Java community process has dened JAXP [12], the Java APIs for XML
Parsing. It integrates SAX and DOM implementations into a framework that
allows parser independent development and denes standardised methods
for parsing. However, it does not dene its own representation of XML
documents.
COMPRESSED FORMS OF XML
Since the suitability of a compression method will depend on the actual representation
and the intended use of the data, it will be necessary to distinguish
dierent scenarios here. The rst section aims at storage, the second at
transmission of and the third at work with compressed forms of XML. Each
of these is associated with its specic requirements. Finally there is a brief
review of projects already existing in these areas.
Minimising storage space requirements
The requirements for storage are simple: Compress XML text les without
loss of information to the biggest extend possible.
Information expressed in form of textual XML will take considerably more
space than the same information stored in a purpose designed format. This
is due to two factors: The limitation to a textual representation increases
the storage requirements of non textual data. The string of characters representing
the number \127" takes 3 bytes in most encoding schemes, while
the value could easily be represented using only one byte using conventional
integer encoding. The second factor is the mark-up itself. This kind of metainformation
is usually suppressed in other le formats as it is known (usually
by denition) what kind of content appears at what position in the document. This is what makes XML more
exible, but also makes its les bigger.
Considering cases where the structure is relatively simple, for example data
from a relational database, the repetition of the same structure tags for each
row of data is a considerable waste, especially if the structure is already
known in advance, for example by denition in the DTD.
Minimising transmission bandwidth requirements
The requirements for transmission of XML are similar to those of storage
with some extensions. First of all the time needed for compression and
decompression is more crucial than in the rst case, as real-time processing is
likely to be needed. This also implies the need for on-the-
y compression. An
algorithm that has to read the entire document before it can be compressed
will generally be of limited use for a communication application. One also
needs to worry about standardisation and availability of resources across
system boundaries. If a DTD is used to improve the compression ratio as
done by XMill [11] described later, this DTD must be accessible at both ends
of the communication link. If it is not available locally at the receiving end,
it needs to be transferred before the transmission of the compressed data can
begin.
Review: Related work
There are many projects aimed at the compression of textual XML data. A
rst choice is to use an all purpose compression/decompression program such
as GZIP [5]. Most compressors will achieve quite signicant compression ratios
on XML data. The algorithms used internally are well known and based
directly upon Shannon's Information Theory [14]. However, since they only
compress a stream of tokens, the compression ratio is not optimal. Exploiting
properties specic to XML, the compression ratio can be improved even
further.
XMill [11] splits structure and content of an XML document, thus allowing
better compression rates using conventional compression algorithms on
both separate parts. Millau [6] does the same but also encodes tag and attribute
names using binary token, as suggested for the Wireless Application
Protocol (WAP) [18]. Depending on the structure of the document, slightly dierent algorithms are used. If a DTD is available, the compression of the
structure can be improved by just noting the distance from a standard form.
All these methods, although far developed, have a signicant disadvantage:
They only compress whole documents. In order to work with the
document, it must be decompressed rst. Compressed XML of this kind can
not be queried without prior decompression.