Principles of Data Mining
Introduction to Data Mining
Progress in digital data acquisition and storage technology has resulted in the growth of
huge databases. This has occurred in all areas of human endeavor, from the mundane
(such as supermarket transaction data, credit card usage records, telephone call details,
and government statistics) to the more exotic (such as images of astronomical bodies,
molecular databases, and medical records). Little wonder, then, that interest has grown
in the possibility of tapping these data, of extracting from them information that might be
of value to the owner of the database. The discipline concerned with this task has
become known as data mining.
Defining a scientific discipline is always a controversial task; researchers often disagree
about the precise range and limits of their field of study. Bearing this in mind, and
accepting that others might disagree about the details, we shall adopt as our working
definition of data mining:
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data owner.
The relationships and summaries derived through a data mining exercise are often
referred to as models or patterns. Examples include linear equations, rules, clusters,
graphs, tree structures, and recurrent patterns in time series.
The Nature of Data Sets
We begin by discussing at a high level the basic nature of data sets.
A data set is a set of measurements taken from some environment or process. In the
simplest case, we have a collection of objects, and for each object we have a set of the
same p measurements. In this case, we can think of the collection of the measurements
on n objects as a form of n × p data matrix. The n rows represent the n objects on which
measurements were taken (for example, medical patients, credit card customers, or
individual objects observed in the night sky, such as stars and galaxies). Such rows may
be referred to as individuals, entities, cases, objects, or records depending on the
context.
The other dimension of our data matrix contains the set of p measurements made on
each object. Typically we assume that the same p measurements are made on each
individual, although this need not be the case (for example, different medical tests could
be performed on different patients). The p columns of the data matrix may be referred to
as variables, features, attributes, or fields; again, the language depends on the research
context. In all situations the idea is the same: these names refer to the measurement that
is represented by each column. In chapter 2 we will discuss the notion of measurement
in much more detail.
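As a concrete (hypothetical) illustration of the n × p data matrix described above, the sketch below builds a tiny matrix of 4 patients measured on 3 made-up variables; the variable names and values are invented for the example:

```python
# A tiny hypothetical 4 x 3 data matrix: 4 patients (rows, the n objects)
# measured on 3 variables (columns, the p measurements).
columns = ["age", "systolic_bp", "cholesterol"]
data_matrix = [
    [34, 118, 182],   # patient 1
    [51, 135, 240],   # patient 2
    [47, 129, 205],   # patient 3
    [62, 142, 221],   # patient 4
]

n = len(data_matrix)      # number of objects (rows)
p = len(data_matrix[0])   # number of measurements per object (columns)
print(n, p)               # 4 3

# A column (variable) collects one measurement across all objects:
ages = [row[columns.index("age")] for row in data_matrix]
print(ages)               # [34, 51, 47, 62]
```

Here each row is one case (a patient) and each column is one variable; the same view applies whether the objects are customers, transactions, or stars.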
Types of Structure: Models and Patterns
The different kinds of representations sought during a data mining exercise may be
characterized in various ways. One such characterization is the distinction between a
global model and a local pattern.
A model structure, as defined here, is a global summary of a data set; it makes
statements about any point in the full measurement space. Geometrically, if we consider
the rows of the data matrix as corresponding to p-dimensional vectors (i.e., points in
p-dimensional space), the model can make a statement about any point in this space (and
hence, any object). For example, it can assign a point to a cluster or predict the value of
some other variable. Even when some of the measurements are missing (i.e., some of
the components of the p-dimensional vector are unknown), a model can typically make
some statement about the object represented by the (incomplete) vector.
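A minimal sketch of this global property, using invented centroids: a nearest-centroid cluster model can assign any point in the measurement space to a cluster, and can still make an assignment when some components of the vector are missing by comparing only the observed dimensions.

```python
import math

# Hypothetical cluster centroids in a 3-dimensional measurement space.
centroids = {"A": (0.0, 0.0, 0.0), "B": (5.0, 5.0, 5.0)}

def assign_cluster(point):
    """Assign a (possibly incomplete) point to its nearest centroid.

    Missing components are marked with None and simply skipped, so the
    model still makes a statement about the incomplete vector.
    """
    def partial_dist(centroid):
        observed = [(x, c) for x, c in zip(point, centroid) if x is not None]
        return math.sqrt(sum((x - c) ** 2 for x, c in observed))
    return min(centroids, key=lambda k: partial_dist(centroids[k]))

print(assign_cluster((4.5, 5.2, 4.8)))    # "B": a fully observed point
print(assign_cluster((0.3, None, None)))  # "A": assigned despite missing values
```

The point of the sketch is only that the model is defined everywhere in the measurement space, not just at the observed data points.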
A simple model might take the form Y = aX + c, where Y and X are variables and a and c
are parameters of the model (constants determined during the course of the data mining
exercise). Here we would say that the functional form of the model is linear, since Y is a
linear function of X. The conventional statistical use of the term is slightly different. In
statistics, a model is linear if it is a linear function of the parameters. We will try to be
clear in the text about which form of linearity we are assuming, but when we discuss the
structure of a model (as we are doing here) it makes sense to consider linearity as a
function of the variables of interest rather than the parameters. Thus, for example, the
model structure Y = aX² + bX + c is considered a linear model in classical statistical
terminology, but the functional form of the model relating Y and X is nonlinear (it is a
second-degree polynomial).
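The distinction can be made concrete with a short sketch (the parameter values below are made up): Y = aX² + bX + c is nonlinear in the variable X, yet for any fixed X the prediction is a dot product of the parameter vector (a, b, c) with the basis vector (X², X, 1), which is exactly what "linear in the parameters" means.

```python
def predict(params, x):
    """Evaluate y = a*x**2 + b*x + c directly (nonlinear in x)."""
    a, b, c = params
    return a * x**2 + b * x + c

def predict_via_basis(params, x):
    """Same prediction written as a linear function of the parameters:
    a dot product of (a, b, c) with the fixed basis (x**2, x, 1)."""
    basis = (x**2, x, 1.0)
    return sum(p * z for p, z in zip(params, basis))

params = (2.0, -1.0, 3.0)            # hypothetical fitted values
x = 4.0
print(predict(params, x))            # 31.0
print(predict_via_basis(params, x))  # 31.0 — identical, linear in the parameters
```

It is this linearity in the parameters that lets classical least squares machinery fit the quadratic model, even though its graph is a curve.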
Optimization and Search Methods
The score function is a measure of how well aspects of the data match proposed models
or patterns. Usually, these models or patterns are described in terms of a structure,
sometimes with unknown parameter values. The goal of optimization and search is to
determine the structure and the parameter values that achieve a minimum (or maximum,
depending on the context) value of the score function. The task of finding the "best"
values of parameters in models is typically cast as an optimization (or estimation)
problem. The task of finding interesting patterns (such as rules) from a large family of
potential patterns is typically cast as a combinatorial search problem, and is often
accomplished using heuristic search techniques. In linear regression, a prediction rule is
usually found by minimizing a least squares score function (the sum of squared errors
between the prediction from a model and the observed values of the predicted variable).
Such a score function is amenable to mathematical manipulation, and the model that
minimizes it can be found algebraically. In contrast, a score function such as
misclassification rate in supervised classification is difficult to minimize analytically:
because it is intrinsically discontinuous, the powerful tools of differential calculus
cannot be brought to bear.
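The tractability of least squares can be sketched in a few lines (the data below are invented): for the simple model y = ax + c, setting the derivatives of the summed squared errors to zero yields the familiar closed-form estimates, with no iterative search required.

```python
# Made-up observations roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form solution of the normal equations for y = a*x + c:
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
c = mean_y - a * mean_x

# The minimized score function: sum of squared errors.
sse = sum((y - (a * x + c)) ** 2 for x, y in zip(xs, ys))
print(round(a, 3), round(c, 3), round(sse, 4))
```

By contrast, a misclassification-rate score jumps discontinuously as a decision boundary crosses a data point, so no analogous derivative-based, closed-form minimizer exists, and heuristic or combinatorial search is used instead.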