28-06-2013, 02:13 PM
Data Mining: Characterization
What is Concept Description?
Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms
Predictive mining: based on the data and its analysis, constructs models for the database and predicts the trends and properties of unknown data
Concept description:
Characterization: provides a concise and succinct summarization of the given collection of data
Comparison: provides descriptions comparing two or more collections of data
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data or to particular measures.
How is it done?
Collect the task-relevant data (the initial relation) using a relational database query
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
Interactive presentation with users.
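The generalization and aggregation steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the original algorithm's implementation: the sample rows, the CITY_TO_COUNTRY hierarchy, the age_band helper, and the function name attribute_oriented_induction are all assumptions made up for the example, and the initial relation is assumed to be already fetched as a list of dictionaries.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (assumption for illustration).
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}


def age_band(age):
    """Hypothetical generalization of a numeric age into a 10-year band."""
    low = age // 10 * 10
    return f"{low}-{low + 9}"


def attribute_oriented_induction(rows, generalizers, drop):
    """Generalize each tuple, then merge identical tuples and count them.

    rows:         initial relation as a list of dicts
    generalizers: attribute name -> function mapping a value to a higher-level concept
    drop:         attributes removed outright (attribute removal)
    """
    counts = Counter()
    for row in rows:
        generalized = tuple(
            (attr, generalizers.get(attr, lambda v: v)(value))
            for attr, value in sorted(row.items())
            if attr not in drop
        )
        counts[generalized] += 1  # identical generalized tuples accumulate a count
    return [dict(key, count=n) for key, n in counts.items()]


rows = [
    {"name": "Ann", "city": "Vancouver", "age": 22},
    {"name": "Bob", "city": "Toronto", "age": 25},
    {"name": "Cy", "city": "Seattle", "age": 41},
]
generalizers = {
    "city": lambda c: CITY_TO_COUNTRY.get(c, "other"),
    "age": age_band,
}
result = attribute_oriented_induction(rows, generalizers, drop={"name"})
for r in result:
    print(r)
```

Here "name" is removed because it has too many distinct values and no hierarchy to generalize along, while "city" and "age" are each climbed one level. The two Canadian tuples become identical after generalization and are merged with a count of 2.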
Mining Data Dispersion Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the interquartile range (IQR = Q3 - Q1)
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
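As a small illustration of the five-number summary and the IQR, here is a sketch using only the Python standard library. The function name five_number_summary and the sample data are assumptions for the example. Quartile conventions differ between textbooks and tools; this sketch uses the "inclusive" method of statistics.quantiles, so other software may report slightly different Q1 and Q3 on the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a non-empty numeric sample."""
    data = sorted(values)
    # n=4 yields the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # made-up data
minimum, q1, median, q3, maximum = five_number_summary(sample)
iqr = q3 - q1  # height of the box in a boxplot

print(minimum, q1, median, q3, maximum)  # 1 3.0 5.0 7.0 9
print(iqr)                               # 4.0
```

The box would span from 3 to 7 with the median line at 5, and the whiskers would extend to 1 and 9.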