13-11-2012, 01:54 PM
A Framework for Learning Comprehensible Theories in XML Document Classification
Abstract
XML has become the universal data format for a wide
variety of information systems. The large number of
XML documents existing on the web and in other
information storage systems makes classification an
important task. As a typical type of semi-structured data,
XML documents have both structures and contents.
Traditional text learning techniques are poorly suited to
XML document classification because they ignore
structure. This paper presents a novel, complete
framework for XML document classification. We first
present a knowledge representation method for XML
documents which is based on a typed higher-order logic
formalism. With this representation method, an XML
document is represented as a higher-order logic term
that captures both its contents and its structure. We
then present a decision-tree learning algorithm driven by
the precision/recall breakeven point (PRDT) for the XML
classification problem, which can produce comprehensible
theories. Finally, we give a semi-supervised learning
algorithm based on the PRDT algorithm and the
co-training framework. Experimental results demonstrate
that our framework achieves good performance
in both supervised and semi-supervised learning, with
the added benefit of producing comprehensible
theories.
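
The abstract names the precision/recall breakeven point as the measure that drives PRDT but does not define it. The sketch below illustrates only the standard definition of that measure, not the PRDT algorithm itself, whose details are not given here. It assumes a classifier that assigns a score to each document; the function names `precision_recall` and `breakeven_point` and the sample data are hypothetical. When the number of documents predicted positive equals the number of truly positive documents, precision and recall share the same numerator and denominator, so they coincide at that cutoff.

```python
from typing import List, Sequence, Tuple


def precision_recall(labels: Sequence[bool],
                     predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels and predictions."""
    true_pos = sum(1 for y, p in zip(labels, predicted) if y and p)
    predicted_pos = sum(1 for p in predicted if p)
    actual_pos = sum(1 for y in labels if y)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def breakeven_point(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Precision (= recall) when exactly as many documents are predicted
    positive as are actually positive.

    Assumption: higher scores mean "more likely positive".
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        return 0.0
    # Rank document indices by descending score and keep the top n_pos.
    ranked = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_pos])
    predicted: List[bool] = [i in top for i in range(len(labels))]
    precision, recall = precision_recall(labels, predicted)
    # At this cutoff the two measures are equal by construction.
    assert abs(precision - recall) < 1e-12
    return precision


# Illustrative data only: five documents, three of them positive.
example_labels = [True, False, True, True, False]
example_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
example_bep = breakeven_point(example_labels, example_scores)
```

In this example two of the three top-ranked documents are positive, so both precision and recall are 2/3 at the breakeven cutoff.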