02-01-2013, 11:54 AM
Cluster Analysis
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
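To make "grouping similar data objects into clusters" concrete, here is a minimal sketch using k-means. The slides do not name a specific algorithm at this point, so k-means, the function name kmeans, the sample points, and the choice to seed centroids with the first k points are all assumptions made for illustration. Note that no class labels are supplied, which is what "unsupervised" means here.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def kmeans(points, k, iterations=10):
    """Group points into k clusters (illustrative sketch, not from the slides)."""
    # Assumption: seed the centroids with the first k points so runs are repeatable.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical 2-D data: two visibly separate groups, interleaved in input order.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points that end up with the same label are close to one another and far from the points with the other label, which matches the definition of a cluster given above.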
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or support other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
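The points above about type-specific distance functions, per-variable weights, and a separate quality function can be sketched as follows. This is only one possible formulation under stated assumptions: the function names dissimilarity and cluster_quality, the normalization of each variable to the range 0 to 1, the weights, and the sample records are all hypothetical and do not come from the slides.

```python
from itertools import combinations

def dissimilarity(a, b, kinds, ranges, weights):
    """Weighted dissimilarity d(i, j) in [0, 1] over mixed-type variables."""
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "interval":
            # Interval-scaled: absolute difference divided by the value range.
            d = abs(x - y) / rng
        elif kind == "ordinal":
            # Ordinal: ranks 1..rng, so the largest possible gap is rng - 1.
            d = abs(x - y) / (rng - 1)
        else:
            # Boolean or categorical: simple mismatch (0 if equal, 1 otherwise).
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)

def cluster_quality(cluster, kinds, ranges, weights):
    """Mean pairwise dissimilarity inside one cluster (lower means tighter)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dissimilarity(a, b, kinds, ranges, weights) for a, b in pairs) / len(pairs)

# Hypothetical records: (age, area type, satisfaction rank from 1 to 3).
kinds = ["interval", "categorical", "ordinal"]
ranges = [50.0, None, 3]      # age spans 50 years; rank has 3 levels
weights = [2.0, 1.0, 1.0]     # assumption: age matters twice as much

a = (30.0, "urban", 1)
b = (40.0, "rural", 3)
d_ab = dissimilarity(a, b, kinds, ranges, weights)
q = cluster_quality([a, b, (32.0, "urban", 1)], kinds, ranges, weights)
print(d_ab, q)
```

Changing the weights changes which records count as "similar", which is one reason the notion of a good clustering depends on the application.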
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability