04-06-2013, 01:08 PM
Mining High-Speed Data Streams
Mining HighSpeed.pdf (Size: 134.03 KB / Downloads: 14)
ABSTRACT
Many organizations today have more than very large data-
bases; they have databases that grow without limit at a
rate of several million records per day. Mining these con-
tinuous data streams brings unique opportunities, but also
new challenges. This paper describes and evaluates VFDT,
an anytime system that builds decision trees using constant
memory and constant time per example. VFDT can incorporate
tens of thousands of examples per second using
off-the-shelf hardware. It uses Hoeffding bounds to guarantee
that its output is asymptotically nearly identical to
that of a conventional learner. We study VFDT's proper-
ties and demonstrate its utility through an extensive set of
experiments on synthetic data. We apply VFDT to mining
the continuous stream of Web access data from the whole
University of Washington main campus.
INTRODUCTION
Knowledge discovery systems are constrained by three main
limited resources: time, memory and sample size. In tradi-
tional applications of machine learning and statistics, sample
size tends to be the dominant limitation: the computational
resources for a massive search are available, but carrying out
such a search over the small samples available (typically less
than 10,000 examples) often leads to overfitting or "data
dredging" (e.g., [22, 16]). Thus overfitting avoidance becomes
the main concern, and only a fraction of the available
computational power is used [3].
HOEFFDING TREES
The classification problem is generally defined as follows. A
set of N training examples of the form (x, y) is given, where
y is a discrete class label and x is a vector of d attributes,
each of which may be symbolic or numeric. The goal is to
produce from these examples a model y = f(x) that will predict
the classes y of future examples x with high accuracy.
For example, x could be a description of a client's recent
purchases, and y the decision to send that customer a cat-
alog or not; or x could be a record of a cellular-telephone
call, and y the decision whether it is fraudulent or not. One
of the most effective and widely-used classification methods
is decision tree learning [1, 15]. Learners of this type in-
duce models in the form of decision trees, where each node
contains a test on an attribute, each branch from a node
corresponds to a possible outcome of the test, and each leaf
contains a class prediction.
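The structure just described can be made concrete with a short sketch (illustrative code, not the paper's implementation): a decision-tree node that routes an example to a leaf prediction, together with the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), which the paper uses to decide when a leaf has seen enough examples to commit to a split attribute.

```python
import math

def hoeffding_bound(R, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the true mean
    of a random variable with range R is within epsilon of the mean of
    n independent observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

class Node:
    """A decision-tree node: internal nodes test one attribute,
    leaves hold a class prediction."""
    def __init__(self, attribute=None, children=None, prediction=None):
        self.attribute = attribute      # index of the attribute tested here
        self.children = children or {}  # test outcome -> child Node
        self.prediction = prediction    # class label, if this is a leaf

    def predict(self, x):
        if self.attribute is None:      # leaf: return its stored class
            return self.prediction
        return self.children[x[self.attribute]].predict(x)
```

Note how the bound shrinks as n grows: with more examples at a leaf, the observed difference between the best and second-best split attributes becomes a reliable estimate of the true difference.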
THE VFDT SYSTEM
We have implemented a decision-tree learning system based
on the Hoeffding tree algorithm, which we call VFDT (Very
Fast Decision Tree learner). VFDT allows the use of either
information gain or the Gini index as the attribute evalu-
ation measure.
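Both evaluation measures are standard; a minimal sketch of each (illustrative, not VFDT's code) for a set of class labels and a candidate partition:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy of the parent node minus the weighted entropy of the
    partitions induced by a candidate split."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)
```

A split that separates the classes perfectly yields the maximum gain (the parent's entropy), while a split that leaves the class mix unchanged yields zero.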
EMPIRICAL STUDY
Synthetic data
A system like VFDT is only useful if it is able to learn
more accurate trees than a conventional system, given simi-
lar computational resources. In particular, it should be able
to take advantage of the examples that are beyond a conventional
system's ability to process. In this section we test
this empirically by comparing VFDT with C4.5 release 8
[15] on a series of synthetic datasets. Using these allows us
to freely vary the relevant parameters of the learning pro-
cess. In order to ensure a fair comparison, we restricted the
two systems to using the same amount of RAM. This was
done by setting VFDT's "available memory" parameter to
40MB, and giving C4.5 the maximum number of examples
that would fit in the same memory (100k examples). VFDT
used information gain as the G function. Fourteen concepts
were used for comparison, all with two classes and 100 bi-
nary attributes.
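As a rough illustration of this setup (not the paper's actual generator, which draws random decision trees as the target concepts), one can sketch a source of examples with binary attributes, a two-class label from a fixed rule, and optional class noise:

```python
import random

def make_examples(n, d=100, noise=0.0, seed=0):
    """Generate n examples with d binary attributes and a two-class label.
    The label comes from a simple fixed rule (majority vote of the first
    three attributes) with optional class noise -- a stand-in for the
    random-tree concepts used in the paper's experiments."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(d)]
        y = 1 if sum(x[:3]) >= 2 else 0   # true concept
        if rng.random() < noise:          # flip the label with prob. noise
            y = 1 - y
        data.append((x, y))
    return data
```

Streaming learners like VFDT consume such examples one at a time, so the generator can run indefinitely without materializing the full dataset.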
Lesion studies
We conducted a series of lesion studies to evaluate the effectiveness
of some of the components and parameters of the
VFDT system. Figure 6 shows the accuracy of the learners
on the (0.25, 0.00, 25209, 12605) data set. It also shows a
slight modication to the VFDT-boot algorithm, where the
tree produced by C4.5 is used without first over-pruning it.
All versions of VFDT were run with δ = 10^-7.