Decision Trees for Uncertain Data
Abstract
Traditional decision tree classifiers work with data
whose values are known and precise. We extend such classifiers
to handle data with uncertain information. Value uncertainty
arises in many applications during the data collection process. Example
sources of uncertainty include measurement/quantisation
errors, data staleness, and multiple repeated measurements.
With uncertainty, the value of a data item is often represented
not by one single value, but by multiple values forming a
probability distribution. Rather than abstracting uncertain data
by statistical derivatives (such as mean and median), we discover
that the accuracy of a decision tree classifier can be much
improved if the “complete information” of a data item (taking
into account the probability density function (pdf)) is utilised.
We extend classical decision tree building algorithms to handle
data tuples with uncertain values. Extensive experiments have
been conducted that show that the resulting classifiers are more
accurate than those using value averages. Since processing pdf’s
is computationally more costly than processing single values
(e.g., averages), decision tree construction on uncertain data is
more CPU demanding than that for certain data. To tackle this
problem, we propose a series of pruning techniques that can
greatly improve construction efficiency.
INTRODUCTION
Classification is a classical problem in machine learning
and data mining[1]. Given a set of training data tuples, each
having a class label and being represented by a feature vector,
the task is to algorithmically build a model that predicts
the class label of an unseen test tuple based on the tuple’s
feature vector. One of the most popular classification models
is the decision tree model. Decision trees are popular because
they are practical and easy to understand. Rules can also be
extracted from decision trees easily. Many algorithms, such
as ID3[2] and C4.5[3], have been devised for decision tree
construction. These algorithms are widely adopted and used
in a wide range of applications such as image recognition,
medical diagnosis[4], credit rating of loan applicants, scientific
tests, fraud detection, and target marketing.
RELATED WORKS
There has been significant research interest in uncertain
data management in recent years. Data uncertainty has been
broadly classified as existential uncertainty and value uncertainty.
Existential uncertainty appears when it is uncertain
whether an object or a data tuple exists. For example, a
data tuple in a relational database could be associated with
a probability that represents the confidence of its presence[8].
“Probabilistic databases” have been applied to semi-structured
data and XML[9], [10]. Value uncertainty, on the other hand,
appears when a tuple is known to exist, but its values are
not known precisely. A data item with value uncertainty is
usually represented by a pdf over a finite and bounded region
of possible values[11], [12]. One well-studied topic on value
uncertainty is “imprecise query processing”. The answer to
such a query is associated with a probabilistic guarantee on its
correctness. For example, indexing solutions for range queries
on uncertain data[13], solutions for aggregate queries[14] such
as nearest neighbour queries, and solutions for imprecise
location-dependent queries[11] have been proposed.
ALGORITHMS
In this section, we discuss two approaches for handling uncertain
data. The first approach, called “Averaging”, transforms
an uncertain dataset to a point-valued one by replacing each
pdf with its mean value. More specifically, for each tuple t_i and attribute A_j, we take the mean value v_{i,j} = ∫_{a_{i,j}}^{b_{i,j}} x f_{i,j}(x) dx as its representative value. The feature vector of t_i is thus transformed to (v_{i,1}, ..., v_{i,k}). A decision tree can then be built by applying a traditional tree construction algorithm. The second approach, called “Distribution-based” (UDT), instead keeps the complete pdf information and uses it throughout tree construction.
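As a concrete illustration of this transformation, the following minimal sketch computes the representative value v_{i,j} for one attribute, assuming the pdf is given in discretised form as sample points with associated probabilities (the implementation described later uses s sample points per pdf). The class and field names are illustrative and not taken from the paper's code.

// Minimal sketch: representative (mean) value of an uncertain attribute whose
// pdf is discretised into sample points with probabilities. Illustrative only.
public final class UncertainValue {
    private final double[] samples;        // sample points x_k within [a_ij, b_ij]
    private final double[] probabilities;  // weights p_k, assumed to sum to 1

    public UncertainValue(double[] samples, double[] probabilities) {
        this.samples = samples;
        this.probabilities = probabilities;
    }

    // Approximates v_ij = integral of x * f_ij(x) dx by a weighted sum.
    public double mean() {
        double v = 0.0;
        for (int k = 0; k < samples.length; k++) {
            v += samples[k] * probabilities[k];
        }
        return v;
    }
}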
Averaging
A straightforward way to deal with the uncertain information
is to replace each pdf with its expected value, thus
effectively converting the data tuples to point-valued tuples.
This reduces the problem back to that for point-valued data,
and hence traditional decision tree algorithms such as ID3
and C4.5[3] can be reused. We call this approach AVG (for Averaging). Our implementation of AVG is based on C4.5.
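At the heart of a C4.5-style builder is the choice, at each node, of the attribute and cut point that minimise the entropy of the resulting partitions (equivalently, maximise the information gain). The following is a minimal sketch of that split-selection step for one numerical attribute of the averaged, point-valued data; it is an illustrative reconstruction under that assumption, not the paper's actual code.

import java.util.Arrays;
import java.util.Comparator;

// Minimal sketch of C4.5-style binary split selection on point-valued data,
// i.e. the step AVG applies after each pdf has been replaced by its mean.
public final class EntropySplit {

    // Entropy (base 2) of a class-count histogram.
    static double entropy(int[] counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Returns the cut point on one attribute that minimises the weighted entropy
    // of the two partitions (NaN if the attribute cannot be split).
    static double bestSplit(double[] values, int[] labels, int numClasses) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        int[] left = new int[numClasses];
        int[] right = new int[numClasses];
        for (int lab : labels) right[lab]++;

        double bestCut = Double.NaN, bestH = Double.POSITIVE_INFINITY;
        for (int k = 0; k < order.length - 1; k++) {
            int i = order[k];
            left[labels[i]]++;                         // move tuple i to the left partition
            right[labels[i]]--;
            double lo = values[i], hi = values[order[k + 1]];
            if (lo == hi) continue;                    // no valid cut between equal values
            int nL = k + 1, nR = values.length - nL;
            double h = (nL * entropy(left, nL) + nR * entropy(right, nR)) / values.length;
            if (h < bestH) { bestH = h; bestCut = (lo + hi) / 2.0; }
        }
        return bestCut;
    }
}

In a full tree builder this search is repeated over all attributes at each node, and the node is split on the attribute whose best cut yields the lowest weighted entropy.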
Experiments on Accuracy
To explore the potential of achieving a higher classification
accuracy by considering data uncertainty, we have implemented
AVG and UDT and applied them to 10 real data
sets (see Table II) taken from the UCI Machine Learning
Repository[34]. These datasets are chosen because they contain
mostly numerical attributes obtained from measurements.
For the purpose of our experiments, classifiers are built on the
numerical attributes and their “class label” attributes. Some
data sets are already divided into “training” and “testing”
tuples. For those that are not, we use 10-fold cross validation
to measure the accuracy.
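For reference, a minimal sketch of such a 10-fold cross-validation loop is given below. The Classifier interface is a hypothetical placeholder into which either AVG or UDT could be plugged; it is not an interface from the paper's implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Minimal sketch of 10-fold cross validation over tuple indices.
public final class CrossValidation {

    // Hypothetical classifier interface; any tree builder could implement it.
    interface Classifier {
        void train(List<Integer> trainIndices);
        boolean correctlyClassifies(int testIndex);
    }

    static double tenFoldAccuracy(int numTuples, Classifier clf, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < numTuples; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));     // random fold assignment

        int folds = 10, correct = 0;
        for (int f = 0; f < folds; f++) {
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            for (int k = 0; k < numTuples; k++) {
                if (k % folds == f) test.add(idx.get(k));
                else train.add(idx.get(k));
            }
            clf.train(train);                           // build the tree on 9 folds
            for (int t : test) {
                if (clf.correctlyClassifies(t)) correct++;   // test on the held-out fold
            }
        }
        return (double) correct / numTuples;
    }
}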
EXPERIMENTS ON EFFICIENCY
The algorithms described above have been implemented in Java using JDK 1.6, and a series of experiments was performed on a PC with an Intel Core 2 Duo 2.66GHz CPU and 2GB of main memory, running Linux kernel 2.6.22 i686.
Experiments on the accuracy of our novel distribution-based UDT algorithm have already been presented in Section IV-B.
In this section, we focus on the pruning effectiveness of our
pruning algorithms and their run-time performance.
The data sets used are the same as those used in Section
IV-B. The same method is used to synthesise data uncertainty.
Only the Gaussian distribution is used for the experiments
below. We use the parameters s = 100 (no. of sample points
per pdf) and w = 10% (width of the pdf’s domain, as a
percentage of the width of the attribute’s domain) as the
baseline settings.
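To make these parameters concrete, the sketch below generates a discretised Gaussian pdf with s sample points over a window whose width is a fraction w of the attribute's domain, centred on the recorded value. The choice of standard deviation (one eighth of the window width) and the class name are assumptions made for illustration only; the paper's actual synthesis procedure is the one referred to in Section IV-B.

// Illustrative sketch of synthesising a discretised Gaussian pdf for one
// attribute value, with s sample points over a window of width w * (attribute
// domain). The sigma chosen here is an assumption, not taken from the paper.
public final class UncertaintySynthesiser {

    // Returns {samplePoints, probabilities}, with probabilities normalised to sum to 1.
    static double[][] gaussianPdf(double value, double attrMin, double attrMax,
                                  int s, double w) {
        double width = w * (attrMax - attrMin);      // e.g. w = 0.10 for the 10% baseline
        double a = value - width / 2.0;
        double b = value + width / 2.0;
        double sigma = width / 8.0;                  // assumed: +/- 4 sigma spans the window
        double[] x = new double[s];
        double[] p = new double[s];
        double sum = 0.0;
        for (int k = 0; k < s; k++) {
            x[k] = a + (b - a) * (k + 0.5) / s;      // midpoints of s equal-width bins
            double z = (x[k] - value) / sigma;
            p[k] = Math.exp(-0.5 * z * z);
            sum += p[k];
        }
        for (int k = 0; k < s; k++) p[k] /= sum;     // normalise the discretised pdf
        return new double[][] { x, p };
    }
}

With the baseline settings, this would be called with s = 100 and w = 0.10 for every value of every numerical attribute.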
DISCUSSIONS
The Uncertainty Model
In our discussion, the uncertainty models of the attributes have been assumed to be known by some external means. In practice, finding a good model is an application-dependent endeavour. For example, the manufacturers of some measuring instruments specify in their instruction manuals the measurement error of the devices, which can be used as a source of information for modelling error
distributions. In some other cases, repeated measurements
can be taken and the resulting histogram can be used to
approximate the pdf (as we have done in Section IV-C with
the “JapaneseVowel” dataset). In the case of random noise,
for example, one could fit a Gaussian distribution using
the sample mean and variance, thanks to the Central Limit
Theorem[35].
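As a small illustration of this last point, the sketch below estimates such a Gaussian error model from repeated measurements using the sample mean and the unbiased sample variance; the class and method names are illustrative.

// Illustrative sketch: fitting a Gaussian to repeated measurements of the
// same quantity via the sample mean and unbiased sample variance.
public final class GaussianFit {

    // Returns {mean, standardDeviation} estimated from the measurements.
    static double[] fit(double[] measurements) {
        int n = measurements.length;
        double mean = 0.0;
        for (double m : measurements) mean += m;
        mean /= n;

        double ss = 0.0;
        for (double m : measurements) ss += (m - mean) * (m - mean);
        double variance = (n > 1) ? ss / (n - 1) : 0.0;   // unbiased estimator
        return new double[] { mean, Math.sqrt(variance) };
    }
}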
CONCLUSIONS
We have extended the model of decision-tree classification
to accommodate data tuples having numerical attributes with
uncertainty described by arbitrary pdf’s. We have modified
classical decision tree building algorithms (based on the framework
of C4.5[3]) to build decision trees for classifying such
data. We have found empirically that when suitable pdf’s are
used, exploiting data uncertainty leads to decision trees with
remarkably higher accuracies. We therefore advocate that data
be collected and stored with the pdf information intact. Performance
is an issue, though, because of the increased amount of
information to be processed, as well as the more complicated
entropy computations involved. Therefore, we have devised
a series of pruning techniques to improve tree construction
efficiency. Our algorithms have been experimentally verified
to be highly effective. Their execution times are of the same order of magnitude as those of classical algorithms. Some of these pruning techniques are generalisations of analogous techniques for handling point-valued data. Others, namely pruning by bounding and end-point sampling, are novel. Although our novel techniques are primarily designed to handle uncertain data, they are also useful for building decision trees using classical algorithms when there are very large numbers of data tuples.