30-04-2012, 01:20 PM
Data Mining and Clustering Techniques
Introduction
Data mining, often treated as a synonym for “knowledge discovery in databases”, is a
process of analyzing data from different perspectives and summarizing it into useful
information. It allows users to understand the relationships within the data and
reveals patterns and trends that are hidden in it. It is often viewed
as a process of extracting valid, previously unknown, non-trivial and useful
information from large databases. Data mining systems can be classified according to
the kinds of databases mined, the kinds of knowledge mined, the techniques used or
the applications. Three important components of a data mining system are the database,
the data mining engine, and the pattern evaluation module.
Data Mining Techniques
Classification is one of the most important and frequently used techniques in data
mining. It is a process of finding a set of models that describe and distinguish data
classes or concepts. The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, or neural networks.
A decision tree is a flowchart-like tree structure in which each internal node denotes
a test on an attribute value, each branch represents an outcome of the test, and the
leaves represent classes. Decision trees can easily be converted to classification rules.
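
The conversion from a decision tree to IF-THEN rules can be sketched as follows. This is only an illustration: the toy tree, its attribute names (outlook, humidity, windy) and the helper names classify and to_rules are hypothetical and do not come from the original text.

```python
# Hypothetical toy tree: internal nodes are dicts that test one attribute,
# branches map an attribute value to a child, and leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def to_rules(node, conditions=()):
    """Return one IF-THEN rule for every root-to-leaf path of the tree."""
    if not isinstance(node, dict):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


for rule in to_rules(tree):
    print(rule)
```

Each leaf yields exactly one rule, so the rule set and the tree classify every record identically.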
Cluster Analysis
The concept of clustering has been around for a long time. It has several applications,
particularly in information retrieval and in organizing web resources. The main
purpose of clustering is to locate information and, in the present-day context, to
locate the most relevant electronic resources. Research in clustering eventually led
to automatic indexing, which is used both to index and to retrieve electronic records.
Clustering is a method of grouping objects that are similar in their characteristics.
Its ultimate aim is to provide a grouping of similar records. Clustering is often
confused with classification, but the two differ: classification assigns records to
classes that are defined in advance, whereas clustering discovers the groups from the
data itself.
Basic Clustering Steps
Preprocessing and feature selection
Most clustering models assume that every data item is represented by an n-dimensional
feature vector. This step therefore involves choosing appropriate features, and doing
appropriate preprocessing and feature extraction on data items to measure the values
of the chosen feature set. It will often be desirable to choose a subset of all the
features available, to reduce the dimensionality of the problem space. This step often
requires a good deal of domain knowledge and data analysis.
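
A minimal sketch of this step, assuming records arrive as dictionaries of numeric attributes. The field names (age, income, customer_id) and the helpers select_features and min_max_scale are hypothetical, and min-max scaling is just one common preprocessing choice that the text does not prescribe.

```python
def select_features(records, chosen):
    """Keep only the chosen attributes, in a fixed order, as numeric vectors."""
    return [[float(r[name]) for name in chosen] for r in records]


def min_max_scale(vectors):
    """Rescale every dimension to [0, 1] so no single feature dominates distances."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    scaled = []
    for v in vectors:
        scaled.append([
            # A constant feature carries no information, so map it to 0.0.
            (v[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
            for d in range(dims)
        ])
    return scaled


# Hypothetical records; customer_id is dropped because it says nothing about similarity.
records = [
    {"age": 25, "income": 30000, "customer_id": 101},
    {"age": 35, "income": 60000, "customer_id": 102},
    {"age": 45, "income": 90000, "customer_id": 103},
]
vectors = min_max_scale(select_features(records, ["age", "income"]))
print(vectors)
```

Dropping the identifier reduces the dimensionality from three to two, and scaling keeps the large income values from swamping age when distances are computed later.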
Similarity measure
The similarity measure plays an important role in clustering, where a set of objects
is grouped into several clusters so that similar objects fall in the same cluster and
dissimilar ones in different clusters. In clustering, an object is represented by its
features, and the similarity relationship between objects is measured by a similarity
function. This is a function that takes two data items (or two sets of data items) as
input and returns a measure of the similarity between them.
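
Two common similarity functions over feature vectors can be sketched as below. The text does not say which measure is intended, so both choices (a distance-based similarity and cosine similarity) and all function names are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_similarity(a, b):
    """Turn a distance into a similarity in (0, 1]; identical vectors give 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


print(distance_similarity([1.0, 2.0], [1.0, 2.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Cosine similarity is often preferred for document vectors in information retrieval because it is insensitive to document length, while distance-based measures suit features with comparable scales.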
Clustering algorithm
Clustering algorithms are general schemes that use particular similarity measures
as subroutines. The choice of clustering algorithm depends on the desired
properties of the final clustering, e.g. what is the relative importance of compactness,
parsimony, and inclusiveness? Other considerations include the usual time and space
complexity. A clustering algorithm attempts to find natural groups of components (or
data) based on some similarity. The algorithm also finds the centroid of each
group of data items. To determine cluster membership, most algorithms evaluate the
distance between a point and the cluster centroids. The output from a clustering
algorithm is basically a statistical description of the cluster centroids with the number
of components in each cluster (2).
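
The centroid-based scheme described above can be sketched with k-means. The text does not name a specific algorithm, so k-means is an assumption used only for illustration; the function names, the sample points and the hand-picked initial centroids are hypothetical. The sketch needs Python 3.8 or later for math.dist.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until the labels stop changing.

    Assumes max_iterations >= 1.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = centroid(members)
    sizes = [labels.count(k) for k in range(len(centroids))]
    return centroids, sizes, labels


points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
centroids, sizes, labels = k_means(points, initial_centroids=[[1, 1], [8, 8]])
print(centroids, sizes)
```

The returned centroids and cluster sizes correspond to the statistical description mentioned above. The result depends on the initial centroids, which is why practical implementations usually try several starting points.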
Result validation
Do the results make sense? If not, we may want to iterate back to some prior stage. It
may also be useful to do a test of clustering tendency, to try to guess if clusters are
present at all; note that any clustering algorithm will produce some clusters regardless
of whether or not natural clusters exist.
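
One way to check whether a clustering makes sense is an internal quality score. The sketch below uses the mean silhouette coefficient, a measure the text does not mention and which is assumed here purely as an example; the function name and sample data are hypothetical. It needs Python 3.8 or later for math.dist.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, in the range [-1, 1].

    Values near 1 suggest compact, well-separated clusters; values near 0
    or below suggest that the grouping may not reflect natural clusters.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
poor = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(good, poor)
```

A low score for every candidate clustering is a hint to go back to an earlier stage, for example to revisit the feature set or the similarity measure.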