13-02-2013, 09:47 AM
DATA MINING WITH CLUSTERING AND CLASSIFICATION
Definition
Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
Clustering is “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Why clustering?
A few good reasons ...
Simplification
Pattern detection
Useful in constructing data concepts
Works as an unsupervised learning process
Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Marketing
Medical diagnostics
Which method should I use?
Type of attributes in the data
Scalability to large datasets
Ability to work with irregular data
Time cost
Complexity
Data order dependency
Result presentation
Measuring Similarity
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions usually differ for interval-scaled, boolean, categorical, ordinal, and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
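For interval-scaled variables, the weighted distance idea above can be sketched in Python. The function name and the choice of a weighted Euclidean metric are illustrative assumptions, not part of the original slides:

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance between two interval-scaled vectors.

    The weights w let different variables count more or less,
    depending on the application and data semantics."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

# With equal weights this reduces to the ordinary Euclidean distance.
d = weighted_euclidean([0, 0], [3, 4], [1, 1])  # 5.0
```

Setting a weight to zero effectively drops that variable from the comparison.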
General steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster.
Compute distances (similarities) between the new cluster and each of the old clusters.
Repeat steps 2 and 3 until the items are grouped into the desired number of clusters, K.
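The steps above can be sketched as a naive agglomerative procedure in Python. The single-linkage rule (cluster distance = distance between their closest members) is one common choice for step 3; the function names are my own:

```python
def hierarchical_cluster(items, dist, k):
    """Naive agglomerative clustering following Johnson's procedure:
    start with one cluster per item, then repeatedly merge the
    closest pair of clusters until only k clusters remain.
    Clusters are represented as lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Single linkage: distance between the closest members.
        return min(dist(items[i], items[j]) for i in a for j in b)

    while len(clusters) > k:
        # Step 2: find the closest (most similar) pair of clusters.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge them into a single cluster (step 3 distances are
        # recomputed lazily by linkage() on the next iteration).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [1.0, 1.2, 5.0, 5.1, 9.0]
print(hierarchical_cluster(points, lambda x, y: abs(x - y), 3))
# [[0, 1], [2, 3], [4]]
```

This recomputes pairwise linkages every pass, so it is O(N^3) — fine for illustrating the procedure, but a real implementation would maintain the N*N distance matrix incrementally.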
K-means algorithm
It accepts the number of clusters to group data into, and the dataset to cluster as input values.
It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset. Each of the 3 initial clusters formed will have just one row of data.
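A minimal sketch of this in Python, using the random-row initialization described above. The function name, the fixed iteration count, and representing points as tuples are illustrative assumptions:

```python
import random

def kmeans(data, k, iters=20, seed=0):
    """Minimal k-means: sample k rows as the initial centroids,
    then alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k random rows become the initial clusters
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in data:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Update step: each centroid moves to the mean of its group.
        for j, g in enumerate(groups):
            if g:
                centroids[j] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centroids, groups

data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9)]
centroids, groups = kmeans(data, 2)
```

Even if both random initial rows happen to come from the same region, the update step pulls the centroids apart over the iterations for well-separated data like this.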
Classification Examples
Teachers classify students’ grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals who are credit risks.
Speech recognition
Pattern recognition