24-11-2012, 02:44 PM
DECISION TREE INDUCTION
DECISION TREE.pptx (Size: 87.38 KB / Downloads: 28)
What is classification?
Classification:
predicts categorical class labels.
constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data. The input is a collection of records (the training set); each record contains a set of attributes, one of which is the class.
Classification is a two-step process:
1)Model construction:
describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules or a decision tree.
2)Model usage:
for classifying future or unknown objects.
Estimate the accuracy of the model on a labelled test set.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
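The two steps above can be sketched in Python. This is a minimal illustration, not the induction algorithm itself: the weather tuples are invented, and the "model" is just the majority class in the training set.

```python
from collections import Counter

# Toy training set: each tuple is (attributes, class label). The attribute
# names and values here are assumptions for illustration only.
training_set = [
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "rain",  "windy": True},  "no"),
    ({"outlook": "sunny", "windy": True},  "yes"),
]

# Step 1: model construction from tuples with known class labels.
# A trivial stand-in model: predict the most common class in the training set.
model = Counter(label for _, label in training_set).most_common(1)[0][0]

# Step 2: model usage -- first estimate accuracy on a labelled test set.
test_set = [({"outlook": "rain", "windy": False}, "yes")]
accuracy = sum(model == label for _, label in test_set) / len(test_set)

# If the accuracy is acceptable, classify a tuple whose class label is unknown.
unknown = {"outlook": "overcast", "windy": False}
prediction = model  # the majority-class model ignores the attribute values
```

A real classifier would of course use the attribute values; the point here is only the separation between building the model and applying it.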
Issues for classification and prediction
There are two main issues in classification and prediction:
1)Data preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes.
Data transformation
Generalize and/or normalize data.
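The data preparation steps above can be sketched as follows; the records, the mean-fill strategy for missing values, and the min-max normalisation are illustrative assumptions, not the only possible choices.

```python
# Invented records with one missing value.
records = [{"age": 25, "income": 30000},
           {"age": None, "income": 52000},   # missing value
           {"age": 47, "income": 90000}]

# Data cleaning: fill a missing value with the attribute's mean.
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# Data transformation: min-max normalise income into [0, 1].
lo = min(r["income"] for r in records)
hi = max(r["income"] for r in records)
for r in records:
    r["income"] = (r["income"] - lo) / (hi - lo)
```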
2)Evaluation
Accuracy
classifier accuracy: how well the model predicts categorical class labels
predictor accuracy: how well the model estimates the value of a predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency on large, disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
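Classifier accuracy, the first of the measures above, is simply the fraction of test tuples whose predicted class label matches the true one. A small sketch (the labels are made up):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test tuples classified correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Three of the four predictions match the true labels.
score = accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"])
```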
What is decision tree induction?
Decision tree
A flow-chart-like tree structure
Internal nodes denote a test on an attribute.
Branches represent outcomes of the test.
Leaf nodes represent class labels or class distributions.
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
Decision tree generation – two phases
Decision tree generation consists of two phases
1)Tree construction
At start, all the training examples are at the root.
Partition examples recursively based on selected attributes.
2)Tree pruning
Identify and remove branches that reflect noise or outliers.
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree.
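Classifying a sample this way can be sketched with a small hand-built tree: internal nodes test an attribute, branches are the test outcomes, and leaves are class labels. The weather tree below is an invented example, not one induced from data.

```python
# An internal node is (attribute, {outcome: subtree}); a leaf is a class label.
tree = ("outlook", {
    "sunny":    ("windy", {True: "no", False: "yes"}),
    "overcast": "yes",                    # leaf node: class label
    "rain":     "no",
})

def classify(node, sample):
    """Walk the tree, testing the sample's attribute values, until a leaf."""
    if not isinstance(node, tuple):       # reached a leaf node
        return node
    attribute, branches = node
    return classify(branches[sample[attribute]], sample)

result = classify(tree, {"outlook": "sunny", "windy": False})
```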
INFORMATION GAIN
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
First the entropy of the total dataset is calculated.
The dataset is then split on the different attributes.
The entropy of each branch is calculated, then added proportionally (weighted by branch size) to give the total entropy of the split. The information gain is the original entropy minus this weighted sum.
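The three steps above can be sketched directly in Python. The dataset is a toy one, where each row is (attribute value, class label) for a single candidate split attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum p * log2(p)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows):
    """Entropy of the whole dataset minus the weighted entropy of its branches."""
    labels = [label for _, label in rows]
    base = entropy(labels)                      # entropy of the total dataset
    gain = base
    for value in {v for v, _ in rows}:          # split on the attribute values
        branch = [label for v, label in rows if v == value]
        gain -= (len(branch) / len(rows)) * entropy(branch)  # proportional sum
    return gain

# The split on this attribute separates the classes perfectly,
# so the gain equals the full initial entropy of 1 bit.
rows = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
gain = information_gain(rows)
```

The attribute with the highest gain is chosen at each node, which is exactly how the recursive partitioning step of tree construction selects its splits.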