10-07-2012, 03:56 PM
Analysis of Data Mining Algorithms
Analysis of Data Mining Algorithms.pdf (Size: 237.48 KB / Downloads: 35)
1. Introduction
With an enormous amount of data stored in databases and data warehouses, it is increasingly important to develop powerful tools for analysis of such data and mining interesting knowledge from it. Data mining is a process of inferring knowledge from such huge data. Data Mining has three major components Clustering or Classification, Association Rules and Sequence Analysis.
By simple definition, in classification/clustering we analyze a set of data and generate a set of grouping rules which can be used to classify future data. For example, one may classify diseases and provide the symptoms which describe each class or subclass. This has much in common with traditional work in statistics and machine learning. However, there are important new issues which arise because of the sheer size of the data. One of the important problem in data mining is the Classification-rule learning which involves finding rules that partition given data into predefined classes. In the data mining domain where millions of records and a large number of attributes are involved, the execution time of existing algorithms can become prohibitive, particularly in interactive applications. This is discussed in detail in Chapter 2.
An association rule is a rule which implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules (in the form of ``A1 … A i B1 … B j'') at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them. Since finding interesting association rules in databases may disclose some useful patterns for decision support, selective marketing, financial forecast, medical diagnosis, and many other applications, it has attracted a lot of attention in recent data mining research . Mining association rules may require iterative scanning of large transaction or relational databases which is quite costly in processing. Therefore, efficient mining of association rules in transaction and/or relational databases has been studied substantially. This is discussed in detail in Chapter 3.
In sequential Analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction in the case of association).For e.g. : If a shopper buys item A in the first week of the month, then s/he buys item B in the second week etc. This is discussed in detail in Chapter 4.
There are many algorithms proposed that try to address the above aspects of data mining. Compiling a list of all algorithms suggested/used for these problems is an arduous task . I have thus limited the focus of this report to list only some of the algorithms that have had better success than the others. I have included a list of URLs in Appendix A which can be referred to for more information on data mining algorithms.
Data Mining Algorithms
4
2. Classification Algorithms
In Data classification one develops a description or model for each class in a database, based on the features present in a set of class-labeled training data. There have been many data classification methods studied, including decision-tree methods, such as C4.5, statistical methods, neural networks, rough sets, database-oriented methods etc.
2.1 Data Classification Methods
In this paper, I have discussed in detail some of the machine-learning algorithms that have been successfully applied in the initial stages of this field. The other methods listed above are just being applied to data mining and have not been very successful. This section briefly describes these other methods. Appendix A lists the URLs which can be referred to for more information on these various methods.
Statistical Algorithms Statistical analysis systems such as SAS and SPSS have been used by analysts to detect unusual patterns and explain patterns using statistical models such as linear models. Such systems have their place and will continue to be used.
Neural Networks Artificial neural networks mimic the pattern-finding capacity of the human brain and hence some researchers have suggested applying Neural Network algorithms to pattern-mapping. Neural networks have been applied successfully in a few applications that involve classification.
Genetic algorithms Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Nearest neighbor method A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
Rule induction The extraction of useful if-then rules from data based on statistical significance.
Data visualization The visual interpretation of complex relationships in multidimensional data.
2.2 Data Abstraction
Many existing algorithms suggest abstracting the test data before classifying it into various classes. There are several alternatives for doing abstraction before classification: A data set can be generalized to either a minimally generalized abstraction level, an intermediate abstraction level, or a rather high abstraction level. Too low an abstraction level may result in scattered classes, bushy classification trees, and difficulty at concise semantic interpretation; whereas too high a level may result in the loss of classification accuracy. The generalization-based multi-level classification process has been implemented in the DB- Miner system.[4]
Data Mining Algorithms
5
2.3 Classification-rule learning.
Classification-rule learning involves finding rules or decision trees that partition given data into predefined classes. For any realistic problem domain of the classification-rule learning, the set of possible decision trees is too large to be searched exhaustively. In fact, the computational complexity of finding an optimal classification decision tree is NP hard.
Most of the existing induction-based algorithms use Hunt's method as the basic algorithm.[2] Here is a recursive description of Hunt's method for constructing a decision tree from a set T of training cases with classes denoted {C1, C2, … ,Ck }.
Case 1 T contains one or more cases, all belonging to a single class Cj : The decision tree for T is a leaf identifying class Cj .
Case 2 T contains no cases: The decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
Case 3 T contains cases that belong to a mixture of classes: A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2 , .. ,On }. T is partitioned into subsets T1, T2, … ,Tn , where Ti contains all the cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree building machinery is applied recursively to each subset of training cases.