09-11-2012, 02:18 PM
An Analysis of XML Compression Efficiency
An Analysis ofXML.pdf (Size: 273.75 KB / Downloads: 21)
ABSTRACT
XML simplifies data exchange among heterogeneous computers,
but it is notoriously verbose and has spawned the development of
many XML-specific compressors and binary formats. We present
an XML test corpus and a combined efficiency metric integrating
compression ratio and execution speed. We use this corpus and
linear regression to assess 14 general-purpose and XML-specific
compressors relative to the proposed metric. We also identify key
factors when selecting a compressor. Our results show XMill or
WBXML may be useful in some instances, but a general-purpose
compressor is often the best choice.
INTRODUCTION
Statistical methods are often used for analyzing experimental
data; however, computer science experiments often only provide a
comparison of means. We describe how we used more robust
statistical methods, i.e., linear regression, to analyze the
performance of 14 compressors against a corpus of XML files we
assembled with respect to an efficiency metric proposed herein.
Our end application is minimizing transmission time of an XML
file between wireless devices, e.g., nodes in a distributed sensor
network (DSN), for example, an unmanned aerial vehicle (UAV)
swarm. Thus, we focus on compressed file sizes and execution
times, foregoing the assessment of decompression time or whether
a particular compressor supports XML queries.
This paper is authored by employees of the U.S. Government and is in the
public domain. This research is supported in part by the Air Force
Communications Agency. The views expressed in this paper are those of
the authors and do not reflect the official policy or position of the U.S. Air
Force, Department of Defense, or the U.S. Government.
XML Schemas
An XML schema specifies the structure and element types that
may appear in an XML file and can be specified in three ways:
implicitly by a raw XML file or explicitly via either a document
type definition (DTD) or an XML Schema Definition (XSD) file.
A schema for Figure 1 can be implicitly obtained via the path tree
defined by its tags. However, implicit schemas do not enable the
data validation possible by explicitly declaring a DTD or XSD.
The use of external schemas may result in smaller XML files and
also enables data validation. The least robust schema approach is
a DTD, whereas an XSD schema is itself an XML file. Certain
compressors require a schema file—we generated DTDs to enable
us to test these compressors (cf. Section 5.1). Further discussion
of XML is beyond the scope of this paper, however, we note two
parser models, SAX and DOM, are often used to read and write
XML files. Other standards, e.g., XPath, XQuery, and XSL,
provide robust mechanisms to access and query XML data.
COMPRESSORS
General-purpose compressors can be classified within two classes,
arithmetic or dictionary. Arithmetic compressors typically have
large memory and execution time requirements, but are useful to
estimate entropy and as control algorithms. The dictionary, or zip,
compressors enjoy widespread use and most have open formats.
Some proprietary formats are 7-zip, CAB, RK, and StuffIt [13]
and may yield smaller files than the zip compressors. We note the
primary criterion for inclusion of an XML compressor within this
study is whether a publicly accessible implementation is available.
Our primary objective was to test any available XML compressor,
as evidenced by our use of Internet archives and a Linux emulator
to enable us to test certain compressors (cf. Appendices B and C).