06-09-2016, 04:04 PM
Performance comparison of hierarchical versus partitioning algorithms for mining text data with side information
ABSTRACT:
A large amount of side-information is available in many online application domains. This side-information can be of different kinds, such as the links in a document, user-access behavior from web logs, document provenance information, and other non-textual attributes embedded in the text document. These attributes may contain a great deal of information that is useful for clustering. However, such side-information can be difficult to estimate, especially when some of it is noisy, and it can be risky to integrate it into the mining process, because it can add noise to the process.
The existing approach combines partitioning algorithms with probabilistic models in order to create an effective clustering method. However, the k-means algorithm has problems that prevent it from improving the quality of clusters: it can lead to clusters of different sizes, to a wrong number of clusters, and to outliers and empty clusters.
The objective is to design an algorithm which combines iterative hierarchical algorithms with probabilistic models in order to create an effective clustering approach. The CURE algorithm is more efficient than the k-means algorithm for mining text data with the help of side information; CURE attempts to address the scalability problem and to improve the quality of clustering results.
1. INTRODUCTION
1.1. Data Mining
Over the past two decades a huge amount of data has become available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it. Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we are able to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
1.1.1. What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.
1.2. DATA MINING CLASSIFICATION:
Data Mining consists of two kinds of tasks:
1) Descriptive tasks
2) Predictive tasks
1.2.1. Descriptive tasks:
Descriptive tasks derive patterns that summarize the underlying relationships in the data; they deal with the general properties of the data in the database. Here is the list of descriptive functions.
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
1.2.1.1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. These descriptions can be derived in the following two ways −
• Data Characterization − This refers to summarizing the data of the class under study, which is called the target class.
• Data Discrimination − This refers to comparing the class under study with one or more predefined contrasting groups or classes.
1.2.1.2. Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here are the kinds of frequent patterns −
• Frequent Item Set − It refers to a set of items that frequently appear together, for example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occur frequently, such as purchasing a camera followed by a memory card.
• Frequent Sub Structure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
1.2.1.3. Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased together. Association mining refers to the process of uncovering relationships among data and determining association rules. For example, a retailer might generate an association rule showing that 70% of the time milk is sold with bread, while only 30% of the time biscuits are sold with bread.
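To make these rule statistics concrete, the short sketch below (plain Python over a small, made-up list of transactions; the data and function names are only illustrative) computes the support and confidence of a rule such as {bread} → {milk}:

# Support and confidence of an association rule over a tiny,
# hypothetical set of market-basket transactions.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "biscuits"}, {"milk", "bread"},
    {"bread"}, {"milk", "eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent | antecedent).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(confidence({"bread"}, {"milk"}, transactions))  # 3 of the 5 bread baskets contain milk -> 0.6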
1.2.1.4. Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, in order to determine whether they have a positive, negative or no effect on each other.
1.2.1.5. Mining of Clusters
Cluster refers to a group of objects of a similar kind. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
1.2.2. Predictive tasks:
Predictive tasks predict the value of an attribute based on the values of other attributes. The predictive functions can be presented in the following forms.
1.2.2.1. Classification:
Classification is a process used for dividing data into different classes according to some constraints. In other words, classification is the process of assigning each data instance to one of a set of predefined classes.
Widely used kinds of classification algorithms include:
• Decision Tree
• Neural Network Classifier
1.2.2.1.1. Decision Trees:
A decision tree is a classification model in the form of a tree structure that includes a root node, which is the topmost node in the tree; branch nodes, which denote the outcomes of tests; and leaf nodes, which hold the class labels. In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the most frequent case, each test considers a single attribute, so that the instance space is partitioned according to that attribute's value. In the case of numeric attributes, the test refers to a range. Each leaf node is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value.
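As a brief illustration (a sketch using scikit-learn's DecisionTreeClassifier on its bundled iris data; the parameter choices are only illustrative):

# Sketch: training and evaluating a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each internal node tests one attribute; leaves carry the class labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))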
1.2.2.2. Neural Network Classifier:
In data-rich environments neural networks are well suited and widely used for extracting embedded knowledge in the form of clustering, self-organization, feature evaluation, dimensionality reduction, classification and regression. Neural networks have many attractive features for data mining, including fault tolerance, learning and generalization ability, content addressability, robustness, self-organization and simplicity of the basic computations.
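A minimal sketch of such a classifier (scikit-learn's MLPClassifier on the iris data; the layer size and iteration count are arbitrary choices for illustration):

# Sketch: a small feed-forward neural network classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # neural networks train better on scaled inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))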
1.3. Data mining Process:
The process starts with the raw data and finishes with the extracted knowledge, which is acquired as a result of the following stages:
1.3.1. Data Integration: Initially data will be collected and integrated from all the sources.
1.3.2. Data Selection: Usually we do not require all the data collected in the earlier step, so we select only the data that is useful for data mining.
1.3.3. Data Cleaning: Data collected could contain errors, missing values, noisy or inconsistent data. So we need to apply various mechanisms to get rid of these irregularities.
1.3.4. Data Transformation: Even cleaned data is not ready for mining; it must first be transformed into forms suitable for mining. The mechanisms used to achieve this are smoothing, aggregation and normalization.
1.3.5. Data Mining: Finally, data mining techniques are applied to the transformed data to identify interesting patterns. Traditional techniques like clustering and association analysis are some of the popular mechanisms used for data mining.
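The sketch below illustrates the cleaning and transformation stages on a tiny, made-up table (pandas; the column names and fill strategy are only illustrative assumptions):

# Sketch: data cleaning (fill missing values) and data transformation
# (min-max normalization) on a small, hypothetical table.
import pandas as pd

raw = pd.DataFrame({"age": [23, None, 45, 31],
                    "income": [30e3, 52e3, None, 41e3]})

clean = raw.fillna(raw.mean())  # cleaning: replace missing values with column means
normalized = (clean - clean.min()) / (clean.max() - clean.min())  # transformation
print(normalized)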
1.4. Cluster:
A cluster is a group of objects that belong to the same class; in other words, similar objects are grouped in one cluster and dissimilar objects are grouped in other clusters. Equivalently, a cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
1.4.1. What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Main Points
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.
1.4.2. Cluster analysis
Cluster analysis is used to find similarities between data objects according to the characteristics found in the data, and to group similar data objects into clusters.
1.4.3. A good clustering
A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
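One common way to quantify this intra-/inter-cluster trade-off is the silhouette coefficient, sketched below (scikit-learn on synthetic blob data; the parameters are illustrative):

# Sketch: measuring clustering quality with the silhouette coefficient,
# which is high when clusters are compact and well separated.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))  # closer to 1 is better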
1.4.4. Requirements of Clustering in Data Mining
Here are the typical requirements of clustering in data mining:
• Scalability - We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes - Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical and binary data.
• Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small spherical clusters.
• High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible and usable.
1.4.5. Clustering general Approaches:
• Hierarchical Approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Partitioning Approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors. Typical methods: k-means,
k-medians, CLARANS
• Density-based Approach: Based on connectivity and density functions. Typical methods: DBSCAN, OPTICS
• Grid-based Approach: Based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE
• Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data. Typical methods: EM, SOM, COBWEB
• Frequent pattern-based: Based on the analysis of frequent patterns. Typical methods: pCluster
2. ISSUES
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we discuss the major issues regarding –
• Mining Methodology and User Interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Handling noisy or incomplete data
– Pattern evaluation
• Performance Issues
– Efficiency and scalability
– Parallel, distributed, and incremental mining algorithms
• Diverse Data Types Issues
– Handling of relational and complex types of data
– Mining information from heterogeneous databases and global information systems
3. MOTIVATIONS
Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes which rely on subjective or measurement judgments can give rise to errors, such that some examples may even be misclassified. Errors in either the values of attributes or class information are known as noise. Obviously, where possible it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules.
Noisy data, in the sense of being imprecise, is characteristic of all data collection and typically fits a regular statistical distribution such as a Gaussian, while wrong values are data-entry errors. Statistical methods can treat problems of noisy data and separate different types of noise.
4.1.1. Hierarchical clustering approaches:
A hierarchical clustering algorithm divides the given data set into smaller subsets in a hierarchical manner, grouping the data instances into a tree of clusters. Two major methods are available under this category: the agglomerative method, which forms clusters in a bottom-up fashion until all data instances belong to the same cluster, and the divisive method, which splits up the data set into smaller clusters in a top-down fashion until each cluster contains only one instance. Both divisive and agglomerative algorithms can be represented by dendrograms. Hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be joined. For agglomerative hierarchical techniques, the principle is typically to merge the “closest” pair of clusters, where “close” is defined by a specified measure of cluster closeness. There are three common definitions of the closeness between two clusters: single link, complete link and average link. The single-link similarity of two clusters is the similarity between their two most similar instances, the complete-link similarity is the similarity between their two least similar instances, and the average-link similarity is the average of the pairwise similarities between the clusters' instances. Single link is good for handling non-elliptical shapes but is sensitive to noise and outliers; complete link is less susceptible to noise and outliers, but tends to break large clusters and favors globular shapes.
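The linkage choices above can be tried directly with SciPy's hierarchical clustering routines; a minimal sketch on synthetic data (the data set and cut level are only illustrative):

# Sketch: agglomerative clustering with single, complete and average linkage.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # dendrogram encoded as a merge table
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, "->", len(set(labels)), "clusters")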
Advantages:
• It is very flexible with respect to the level of granularity
• It easily handles any form of distance or similarity measure
• Applicable to any attribute types
Disadvantages:
• Once a merge or split decision has been made it cannot be undone, so it is hard to reconstruct clusters in order to improve them
Some of the hierarchical clustering algorithms are: Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Clustering Using REpresentatives (CURE) and CHAMELEON.
4.1.1.1. CURE:
CURE is a hierarchical clustering algorithm that employs the features of both the centroid-based algorithms and the all-points algorithms. CURE obtains a data sample from the given database, divides the sample into groups, and identifies some representative points from each group. These representative points are shrunk toward the centroid of their cluster by a shrinking factor α, whose value may vary between 0 and 1. The utilization of the shrinking factor α allows CURE to overcome the limitations of the centroid-based and all-points approaches: as the representative points are moved through the clustering space, the ill effects of outliers are reduced to a great extent. Thus the feasibility of CURE is enhanced by the shrinking factor α. The worst-case time complexity of CURE is O(n^2 log n).
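A toy sketch of the representative-point step described above (not the full CURE algorithm; the point count c and the value of α are illustrative):

# Toy sketch of CURE representatives: choose c well-scattered points in a
# cluster, then shrink them toward the centroid by the factor alpha.
import numpy as np

def cure_representatives(points, c=4, alpha=0.5):
    centroid = points.mean(axis=0)
    # First scattered point: the one farthest from the centroid.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < c:
        # Next: the point farthest from all representatives chosen so far.
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)  # shrink toward the centroid

cluster = np.random.default_rng(0).normal(size=(50, 2))
print(cure_representatives(cluster, c=4, alpha=0.5))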
Advantages
• Attempts to address the scalability problem and improve the quality of clustering results
Disadvantages
• These methods do not scale well with the number of data objects
S. Guha, R. Rastogi, and K. Shim [7] present CURE, an efficient clustering algorithm for large databases that is more robust to outliers and identifies clusters having non-spherical shapes and wide variance in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning: a random sample drawn from the data set is first partitioned, and each partition is partially clustered; the partial clusters are then clustered in a second pass to yield the desired clusters. Their experimental results confirm that the quality of clusters produced by CURE is much better than that of clusters found by existing algorithms.
4.1.1.2. BIRCH:
The clustering algorithm BIRCH is a main-memory-based algorithm, i.e., the clustering process is carried out under a memory constraint. BIRCH's incremental clustering is based on the concepts of the clustering feature and the CF tree. A clustering feature is a triple CF = (N, LS, SS) that summarizes the information about a cluster, where N is the number of points, LS is their linear sum and SS is their squared sum.
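A minimal sketch of such a clustering feature (an illustrative class of my own, not BIRCH's CF-tree implementation), showing why it can be maintained incrementally:

# Sketch: BIRCH's clustering feature CF = (N, LS, SS), a constant-size
# summary of a cluster that supports O(1) incremental updates.
import numpy as np

class CF:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1            # N: number of points
        self.ls = p.copy()    # LS: linear sum of the points
        self.ss = p @ p       # SS: sum of squared norms

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p @ p

    def centroid(self):
        return self.ls / self.n

cf = CF([1.0, 2.0])
cf.add([3.0, 4.0])
print(cf.n, cf.centroid())  # 2 [2. 3.]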
Advantages
• Suitable for large-scale data sets, since clustering is performed within a main-memory budget
• Minimizes the number of scans of the data
• I/O costs are very low
Disadvantages
• Can identify only convex or spherical clusters of fairly uniform size
T. Zhang, R. Ramakrishnan and M. Livny [6] present BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), an efficient clustering method for very large databases, and demonstrate that it is especially suitable for such databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best-quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle “noise” (data points that are not part of the underlying pattern) effectively. The authors evaluate BIRCH's time/space efficiency, data-input-order sensitivity, and clustering quality through several experiments. They also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed for large data sets, and show that BIRCH is consistently superior.
4.1.1.3. CHAMELEON:
The major disadvantage of agglomerative hierarchical approaches is that they are based on a static, user-specified interconnectivity model, which either underestimates or overestimates the interconnectivity of objects and clusters. This limitation is overcome by the algorithm CHAMELEON. CHAMELEON makes use of a sparse graph, where the nodes represent data objects and the edge weights represent similarities between the data objects. CHAMELEON's sparse-graph implementation lets it scale to large databases in an effective manner. CHAMELEON implements its agglomerative hierarchical approach in two separate phases. In the first phase, dynamic modeling of the data objects is done by clustering these objects into subclusters. In the second phase, a dynamic modeling framework is employed to merge the subclusters in a hierarchical manner to obtain good-quality clusters. The dynamic framework model can be implemented by two different methods. In the first method, it is checked whether the relative inter-connectivity RI(Ci, Cj) and the relative closeness RC(Ci, Cj) between a pair of clusters Ci and Cj both cross user-specified threshold values; that is, a merge requires RI(Ci, Cj) >= T_RI and RC(Ci, Cj) >= T_RC for thresholds T_RI and T_RC.
Advantages
• Merging decisions adapt dynamically to the clusters being merged, so CHAMELEON can discover natural, homogeneous clusters of different shapes and sizes
Disadvantages
• The high quality of hierarchical clustering comes at the cost of lower efficiency
• CHAMELEON strongly relies on graph partitioning as implemented in the library hMETIS
• Has issues scaling to large data sets that cannot fit in main memory
George Karypis, Eui-Hong (Sam) Han and Vipin Kumar [8] present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are high relative to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The merging process using the dynamic model facilitates the discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed. The authors demonstrate the effectiveness of CHAMELEON on a number of data sets that contain points in 2D space with clusters of different shapes, densities, sizes, noise and artifacts. Experimental results on these data sets show that CHAMELEON can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.
4.1.2. Partitioning methods:
Partitioning methods are divided into two subcategories: centroid algorithms and medoid algorithms. Centroid algorithms represent each cluster by the centre of gravity of its instances, while medoid algorithms represent each cluster by the instance closest to the centre of gravity. The best-known centroid algorithm is k-means, and the best-known medoid algorithm is k-medoids.
4.1.2.1. K-means Clustering:
The k-means method partitions the data set into k subsets such that all points in a given subset are closest to the same centre. It begins by randomly selecting k of the instances to represent the clusters; a minimal sketch of the full loop follows the list below.
Partitioned clustering approach
a) Each cluster is associated with a centroid (center point)
b) Each point is assigned to the cluster with the closest centroid
c) Number of clusters, K, must be specified
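A minimal sketch of the k-means loop in NumPy (an illustrative implementation of the standard Lloyd iteration, not tuned for production use):

# Minimal sketch of the k-means loop (Lloyd's algorithm).
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        # Update step: each centroid moves to the mean of its points.
        # (Empty clusters are left unhandled here; see the disadvantages below.)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

X = np.random.default_rng(1).normal(size=(200, 2))
centers, labels = kmeans(X, k=3)
print(centers)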
Advantages
• K-means is relatively scalable and efficient in processing large data sets
Disadvantages
• Clusters of different sizes
• Clusters of different densities
• Non-globular clusters
• Wrong number of clusters
• Outliers and empty clusters
The authors Tapas Kanungo, Angela Y. Wu et al. [11] note that a popular heuristic for k-means clustering is Lloyd's algorithm. They present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which they call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. The authors establish the practical efficiency of the filtering algorithm in two ways. First, they present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, they present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
Shi Na, Liu Xumin and Guan Yong [12] discuss the standard k-means clustering algorithm and analyze its shortcomings: the k-means algorithm has to calculate the distance between each data object and all cluster centers in every iteration, which makes the clustering inefficient. The paper proposes an improved k-means algorithm to solve this problem, requiring only a simple data structure to store some information in each iteration, which is then used in the next iteration. The improved method avoids repeatedly computing the distance from each data object to every cluster center, saving running time. Experimental results show that the improved method can effectively improve the speed and accuracy of clustering, reducing the computational complexity of k-means.
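The sketch below illustrates the general caching idea just described (a loose reconstruction under my own assumptions, not the paper's exact data structure): each point remembers its previous label and distance, and a full scan of all centers is performed only when the cached assignment no longer looks safe.

# Loose sketch: k-means with a per-point cache of (label, distance).
import numpy as np

def cached_kmeans(X, centers, iters=20):
    centers = np.asarray(centers, dtype=float).copy()
    # Initial full assignment: distance from every point to every center.
    all_d = np.linalg.norm(X[:, None] - centers, axis=2)
    labels, dists = all_d.argmin(axis=1), all_d.min(axis=1)
    for _ in range(iters):
        for j in range(len(centers)):        # update step
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
        for i, x in enumerate(X):            # assignment step with caching
            d_own = np.linalg.norm(x - centers[labels[i]])
            if d_own <= dists[i]:
                dists[i] = d_own             # heuristic: keep the old assignment
            else:
                d = np.linalg.norm(centers - x, axis=1)
                labels[i], dists[i] = d.argmin(), d.min()
    return labels, centers

X = np.random.default_rng(2).normal(size=(100, 2))
labels, centers = cached_kmeans(X, X[:3])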
4.1.2.2. k-Medoids:
In the k-medoids algorithm, rather than calculating the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. The medoid of each cluster is found as the object i within the cluster that minimizes the sum of d(i, j) over all objects j in Ci, where Ci is the cluster containing object i and d(i, j) is the distance between objects i and j. Given a precomputed array of distances, the closest medoid for an object can be found by scanning through this array rather than recomputing the distance to every medoid.
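A small sketch of the medoid computation described above (illustrative NumPy code, not a full k-medoids implementation such as PAM):

# Sketch: the medoid of a cluster is the member that minimizes the
# sum of distances to all other members.
import numpy as np

def medoid(points):
    # Pairwise distance matrix; each row sum is that candidate's total cost.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    return points[d.sum(axis=1).argmin()]

cluster = np.random.default_rng(0).normal(size=(10, 2))
print(medoid(cluster))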