04-06-2013, 01:08 PM
Mining High-Speed Data Streams
Mining HighSpeed.pdf (Size: 134.03 KB / Downloads: 14)
ABSTRACT
Many organizations today have more than very large data-
bases; they have databases that grow without limit at a
rate of several million records per day. Mining these con-
tinuous data streams brings unique opportunities, but also
new challenges. This paper describes and evaluates VFDT,
an anytime system that builds decision trees using constant
memory and constant time per example. VFDT can incorporate
tens of thousands of examples per second using
off-the-shelf hardware. It uses Hoeffding bounds to guarantee
that its output is asymptotically nearly identical to
that of a conventional learner. We study VFDT's proper-
ties and demonstrate its utility through an extensive set of
experiments on synthetic data. We apply VFDT to mining
the continuous stream of Web access data from the whole
University of Washington main campus.
INTRODUCTION
Knowledge discovery systems are constrained by three main
limited resources: time, memory and sample size. In tradi-
tional applications of machine learning and statistics, sample
size tends to be the dominant limitation: the computational
resources for a massive search are available, but carrying out
such a search over the small samples available (typically less
than 10,000 examples) often leads to overfitting or "data
dredging" (e.g., [22, 16]). Thus overfitting avoidance becomes
the main concern, and only a fraction of the available
computational power is used [3].
HOEFFDING TREES
The classification problem is generally defined as follows. A
set of N training examples of the form (x, y) is given, where
y is a discrete class label and x is a vector of d attributes,
each of which may be symbolic or numeric. The goal is to
produce from these examples a model y = f(x) that will predict
the classes y of future examples x with high accuracy.
For example, x could be a description of a client's recent
purchases, and y the decision to send that customer a cat-
alog or not; or x could be a record of a cellular-telephone
call, and y the decision whether it is fraudulent or not. One
of the most effective and widely-used classification methods
is decision tree learning [1, 15]. Learners of this type in-
duce models in the form of decision trees, where each node
contains a test on an attribute, each branch from a node
corresponds to a possible outcome of the test, and each leaf
contains a class prediction.
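The structure just described can be made concrete with a short sketch (illustrative code, not the paper's implementation): a decision-tree node that routes an example to a leaf prediction, together with the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), which the paper uses to decide when a leaf has seen enough examples to commit to a split attribute.

```python
import math

def hoeffding_bound(R, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the true mean
    of a random variable with range R is within epsilon of the mean of
    n independent observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

class Node:
    """A decision-tree node: internal nodes test one attribute,
    leaves hold a class prediction."""
    def __init__(self, attribute=None, children=None, prediction=None):
        self.attribute = attribute      # index of the attribute tested here
        self.children = children or {}  # test outcome -> child Node
        self.prediction = prediction    # class label, if this is a leaf

    def predict(self, x):
        if self.attribute is None:      # leaf: return its stored class
            return self.prediction
        return self.children[x[self.attribute]].predict(x)
```

Note how the bound shrinks as n grows: with more examples at a leaf, the observed difference between the best and second-best split attributes becomes a reliable estimate of the true difference.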
THE VFDT SYSTEM
We have implemented a decision-tree learning system based
on the Hoeffding tree algorithm, which we call VFDT (Very
Fast Decision Tree learner). VFDT allows the use of either
information gain or the Gini index as the attribute evalu-
ation measure.
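Both evaluation measures are standard; a minimal sketch of each (illustrative, not VFDT's code) for a set of class labels and a candidate partition:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy of the parent node minus the weighted entropy of the
    partitions induced by a candidate split."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)
```

A split that separates the classes perfectly yields the maximum gain (the parent's entropy), while a split that leaves the class mix unchanged yields zero.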
EMPIRICAL STUDY
Synthetic data
A system like VFDT is only useful if it is able to learn
more accurate trees than a conventional system, given simi-
lar computational resources. In particular, it should be able
to take advantage of the examples that are beyond a conventional
system's ability to process. In this section we test
this empirically by comparing VFDT with C4.5 release 8
[15] on a series of synthetic datasets. Using these allows us
to freely vary the relevant parameters of the learning pro-
cess. In order to ensure a fair comparison, we restricted the
two systems to using the same amount of RAM. This was
done by setting VFDT's "available memory" parameter to
40MB, and giving C4.5 the maximum number of examples
that would fit in the same memory (100k examples). VFDT
used information gain as the G function. Fourteen concepts
were used for comparison, all with two classes and 100 bi-
nary attributes.
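As a rough illustration of this setup (not the paper's actual generator, which draws random decision trees as the target concepts), one can sketch a source of examples with binary attributes, a two-class label from a fixed rule, and optional class noise:

```python
import random

def make_examples(n, d=100, noise=0.0, seed=0):
    """Generate n examples with d binary attributes and a two-class label.
    The label comes from a simple fixed rule (majority vote of the first
    three attributes) with optional class noise -- a stand-in for the
    random-tree concepts used in the paper's experiments."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(d)]
        y = 1 if sum(x[:3]) >= 2 else 0   # true concept
        if rng.random() < noise:          # flip the label with prob. noise
            y = 1 - y
        data.append((x, y))
    return data
```

Streaming learners like VFDT consume such examples one at a time, so the generator can run indefinitely without materializing the full dataset.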
Lesion studies
We conducted a series of lesion studies to evaluate the effectiveness
of some of the components and parameters of the
VFDT system. Figure 6 shows the accuracy of the learners
on the (0.25, 0.00, 25209, 12605) data set. It also shows a
slight modication to the VFDT-boot algorithm, where the
tree produced by C4.5 is used without first over-pruning it.
All versions of VFDT were run with δ = 10^-7.