10-08-2012, 03:30 PM
Data Mining
Data mining and machine learning
We are overwhelmed with data. The amount of data in the world, in our lives,
seems to go on and on increasing—and there’s no end in sight. Omnipresent
personal computers make it too easy to save things that previously we would
have trashed. Inexpensive multigigabyte disks make it too easy to postpone decisions
about what to do with all this stuff—we simply buy another disk and keep
it all. Ubiquitous electronics record our decisions, our choices in the supermarket,
our financial habits, our comings and goings. We swipe our way through
the world, every swipe a record in a database. The World Wide Web overwhelms
us with information; meanwhile, every choice we make is recorded. And all these
are just personal choices: they have countless counterparts in the world of commerce
and industry. We would all testify to the growing gap between the generation
of data and our understanding of it. As the volume of data increases,
inexorably, the proportion of it that people understand decreases, alarmingly.
Lying hidden in all this data is information, potentially useful information, that
is rarely made explicit or taken advantage of.
This book is about looking for patterns in data. There is nothing new about
this. People have been seeking patterns in data since human life began. Hunters
seek patterns in animal migration behavior, farmers seek patterns in crop
growth, politicians seek patterns in voter opinion, and lovers seek patterns in
their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data,
to discover the patterns that govern how the physical world works and encapsulate
them in theories that can be used for predicting what will happen in new
situations. The entrepreneur’s job is to identify opportunities, that is, patterns
in behavior that can be turned into a profitable business, and exploit them.
In data mining, the data is stored electronically and the search is automated—
or at least augmented—by computer. Even this is not particularly new. Economists,
statisticians, forecasters, and communication engineers have long worked
with the idea that patterns in data can be sought automatically, identified,
validated, and used for prediction. What is new is the staggering increase in
opportunities for finding patterns in data. The unbridled growth of databases
in recent years, databases on such everyday activities as customer choices, brings
data mining to the forefront of new business technologies. It has been estimated
that the amount of data stored in the world’s databases doubles every 20
months, and although it would surely be difficult to justify this figure in any
quantitative sense, we can all relate to the pace of growth qualitatively. As the
flood of data swells and machines that can undertake the searching become
commonplace, the opportunities for data mining increase. As the world grows
in complexity, overwhelming us with the data it generates, data mining becomes
our only hope for elucidating the patterns that underlie it. Intelligently analyzed
data is a valuable resource. It can lead to new insights and, in commercial settings,
to competitive advantages.
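To get a feel for what "doubles every 20 months" would imply if taken at face value, the compound growth can be worked out in a few lines. This is only an arithmetic sketch: the 20-month figure is the estimate quoted above, which the text itself treats as hard to justify, and the function name growth_factor is made up for illustration.

```python
# Doubling period quoted in the text; treated here as an assumption,
# not as a measured fact.
DOUBLING_MONTHS = 20


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the data volume grows over `months`,
    assuming it doubles once every `doubling_months`."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    for months in (12, 20, 60, 120):
        print(f"after {months:3d} months: x{growth_factor(months):.2f}")
```

Under that assumption the volume grows by roughly half again each year and by a factor of 64 over ten years.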
Data mining is about solving problems by analyzing data already present in
databases. Suppose, to take a well-worn example, the problem is fickle customer
loyalty in a highly competitive marketplace. A database of customer choices,
along with customer profiles, holds the key to this problem. Patterns of
behavior of former customers can be analyzed to identify distinguishing characteristics
of those likely to switch products and those likely to remain loyal. Once
such characteristics are found, they can be put to work to identify present customers
who are likely to jump ship. This group can be targeted for special treatment,
treatment too costly to apply to the customer base as a whole. More
positively, the same techniques can be used to identify customers who might be
attracted to another service the enterprise provides, one they are not presently
enjoying, to target them for special offers that promote this service. In today’s
highly competitive, customer-centered, service-oriented economy, data is the
raw material that fuels business growth—if only it can be mined.
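The customer loyalty example can be sketched in code. What follows is a deliberately crude illustration rather than a technique taken from the book: the records, the attribute names (contract, complaints), and the 0.5 threshold are all invented. It measures how often former customers with each attribute value switched, then flags present customers who share a value with a high switching rate.

```python
from collections import defaultdict

# Hypothetical records of former customers (invented for illustration).
HISTORY = [
    {"contract": "monthly", "complaints": "many", "switched": True},
    {"contract": "monthly", "complaints": "few", "switched": True},
    {"contract": "monthly", "complaints": "few", "switched": False},
    {"contract": "annual", "complaints": "many", "switched": True},
    {"contract": "annual", "complaints": "few", "switched": False},
    {"contract": "annual", "complaints": "few", "switched": False},
]

# Hypothetical present customers whose loyalty is still unknown.
CURRENT = [
    {"name": "A", "contract": "annual", "complaints": "few"},
    {"name": "B", "contract": "monthly", "complaints": "few"},
    {"name": "C", "contract": "annual", "complaints": "many"},
]


def churn_rate_by(records, attribute):
    """Map each value of `attribute` to the fraction of records
    with that value in which the customer switched."""
    totals = defaultdict(int)
    switched = defaultdict(int)
    for record in records:
        totals[record[attribute]] += 1
        if record["switched"]:
            switched[record[attribute]] += 1
    return {value: switched[value] / totals[value] for value in totals}


def likely_to_switch(customer, rates, threshold=0.5):
    """Flag a customer if any attribute value they have was associated
    with a switching rate above `threshold` among former customers."""
    return any(
        by_value.get(customer[attribute], 0.0) > threshold
        for attribute, by_value in rates.items()
    )


RATES = {attr: churn_rate_by(HISTORY, attr) for attr in ("contract", "complaints")}
FLAGGED = [c["name"] for c in CURRENT if likely_to_switch(c, RATES)]

if __name__ == "__main__":
    print(RATES)
    print("target for special treatment:", FLAGGED)
```

The flagged list plays the role of the group singled out for special treatment. A real application would use far more data and a proper learning method instead of a single-attribute threshold.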
Data mining is defined as the process of discovering patterns in data. The
process must be automatic or (more usually) semiautomatic. The patterns
discovered must be meaningful in that they lead to some advantage, usually
an economic advantage. The data is invariably present in substantial
quantities.
How are the patterns expressed? Useful patterns allow us to make nontrivial
predictions on new data. There are two extremes for the expression of a pattern:
as a black box whose innards are effectively incomprehensible and as a transparent
box whose construction reveals the structure of the pattern. Both, we are
assuming, make good predictions. The difference is whether or not the patterns
that are mined are represented in terms of a structure that can be examined,
reasoned about, and used to inform future decisions. Such patterns we call structural
because they capture the decision structure in an explicit way. In other
words, they help to explain something about the data.
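The contrast between the two extremes can be made concrete with a small sketch. Everything in it is hypothetical: the attributes, the single rule, and the numeric weights were chosen by hand so that both predictors agree, and nothing was learned from data. The point is only that the rule-based form can be printed and reasoned about, while the weighted score gives the same answers without showing why.

```python
# Transparent form: an ordered list of (conditions, prediction) rules
# plus a default. The structure itself can be read and examined.
RULES = [
    ({"contract": "monthly", "complaints": "many"}, "switch"),
]
DEFAULT = "stay"


def predict_with_rules(instance):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, outcome in RULES:
        if all(instance.get(key) == value for key, value in conditions.items()):
            return outcome
    return DEFAULT


def describe_rules():
    """Render the rule list as readable text, one rule per line."""
    lines = []
    for conditions, outcome in RULES:
        tests = " and ".join(f"{key} = {value}" for key, value in conditions.items())
        lines.append(f"if {tests} then {outcome}")
    lines.append(f"otherwise {DEFAULT}")
    return lines


def predict_black_box(instance):
    """Same predictions, but encoded as hand-picked weights (an assumption
    for illustration) that reveal little about the decision structure."""
    score = (
        0.9 * (instance["contract"] == "monthly")
        + 0.8 * (instance["complaints"] == "many")
        - 1.5
    )
    return "switch" if score > 0 else "stay"


if __name__ == "__main__":
    print("\n".join(describe_rules()))
    example = {"contract": "monthly", "complaints": "many"}
    print(predict_with_rules(example), predict_black_box(example))
```

Both functions make identical predictions on every combination of the two attributes, yet only the first exposes a structure that could inform future decisions.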
Now, finally, we can say what this book is about. It is about techniques for
finding and describing structural patterns in data. Most of the techniques that
we cover have developed within a field known as machine learning. But first let
us look at what structural patterns are.
Describing structural patterns
What is meant by structural patterns? How do you describe them? And what
form does the input take? We will answer these questions by way of illustration
rather than by attempting formal, and ultimately sterile, definitions. There will
be plenty of examples later in this chapter, but let’s examine one right now to
get a feeling for what we’re talking about.
Look at the contact lens data in Table 1.1. This gives the conditions under
which an optician might want to prescribe soft contact lenses, hard contact
lenses, or no contact lenses at all; we will say more about what the individual