01-10-2012, 03:56 PM
Introduction to Data Mining
Introduction
1. Discuss whether or not each of the following activities is a data mining
task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application
of a threshold. However, predicting the profitability of a new
customer would be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the
dice were not fair, and we needed to estimate the probabilities of
each outcome from the data, then this would be more like the problems
considered by data mining. However, in this specific case, solutions
to this problem were developed by mathematicians a long
time ago, and thus we wouldn’t consider it to be data mining.
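The distinction drawn in (e) can be sketched in a few lines of Python. This is only an illustration, not part of the original solution: the function names, the loaded-die weights, and the number of simulated rolls are all assumptions made for the example. The first function computes the distribution of the sum of two fair dice by pure enumeration (no data needed); the second estimates the distribution from observed rolls, which is what would be required if the dice were not fair.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import random


def exact_sum_distribution(sides=6):
    """Exact distribution of the sum of two fair dice, by enumerating all outcomes."""
    counts = Counter(a + b for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


def estimated_sum_distribution(rolls):
    """Relative frequency of each sum in a list of observed (a, b) rolls."""
    counts = Counter(a + b for a, b in rolls)
    n = len(rolls)
    return {s: counts[s] / n for s in sorted(counts)}


# Fair dice: a probability calculation, no observations involved.
exact = exact_sum_distribution()

# Hypothetical loaded dice (assumption for illustration): face 6 is five
# times as likely as any other face, so probabilities must come from data.
rng = random.Random(0)
faces = [1, 2, 3, 4, 5, 6]
weights = [1, 1, 1, 1, 1, 5]
rolls = [tuple(rng.choices(faces, weights=weights, k=2)) for _ in range(10000)]
estimated = estimated_sum_distribution(rolls)

print("P(sum = 7), fair dice, exact:", exact[7])
print("P(sum = 12), loaded dice, estimated:", estimated[12])
```

For the fair dice the answer follows from counting alone; for the loaded dice the estimate depends on the observed rolls and would change with a different sample.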
Data
1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2 and
3 are basically the same.” Can you tell from the three lines of sample data
that are shown why she says that?
Field 2 / Field 3 ≈ 7 for the values displayed. While it can be dangerous to draw conclusions
from such a small sample, the two fields seem to contain essentially
the same information.
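A check of this kind might be sketched as follows. The rows below are invented purely for illustration and are not the sample data from the book; the function names and the 5% tolerance are likewise assumptions. The idea is simply to compute the ratio of the two fields for each record and see whether it stays close to a constant.

```python
def field_ratios(rows, i, j):
    """Ratio of field i to field j for each record (field j must be nonzero)."""
    return [row[i] / row[j] for row in rows]


def roughly_proportional(rows, i, j, rel_tol=0.05):
    """True if every ratio lies within rel_tol of the mean ratio."""
    ratios = field_ratios(rows, i, j)
    mean = sum(ratios) / len(ratios)
    return all(abs(r - mean) <= rel_tol * abs(mean) for r in ratios)


# Invented records (id, field 2, field 3); NOT the values shown in the book.
sample_rows = [
    (1, 70.0, 10.0),
    (2, 140.7, 20.0),
    (3, 35.2, 5.0),
]

print(field_ratios(sample_rows, 1, 2))
print(roughly_proportional(sample_rows, 1, 2))
```

With only three records, a nearly constant ratio is suggestive rather than conclusive, which matches the caution in the answer above.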
2. Classify the following attributes as binary, discrete, or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative (interval or
ratio). Some cases may have more than one interpretation, so briefly indicate
your reasoning if you think there may be some ambiguity.
Exploring Data
1. Obtain one of the data sets available at the UCI Machine Learning Repository
and apply as many of the different visualization techniques described in the
chapter as possible. The bibliographic notes and book Web site provide
pointers to visualization software.
MATLAB and R have excellent facilities for visualization. Most of the figures
in this chapter were created using MATLAB. R is freely available from
http://www.r-project.
2. Identify at least two advantages and two disadvantages of using color to
visually represent information.
Advantages: Color makes it much easier to distinguish visual elements
from one another. For example, three clusters of two-dimensional
points are more readily distinguished if the markers representing the points
have different colors, rather than only different shapes. Also, figures with
color are more interesting to look at.
Disadvantages: Some people are color blind and may not be able to properly
interpret a color figure. Grayscale figures can show more detail in some cases.
Color can be hard to use properly. For example, a poor color scheme can be
garish or can focus attention on unimportant elements.