A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data
Abstract
Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of all four types of classifiers.
Existing System
Relief
Relief weights each feature according to its ability to discriminate instances under different targets, based on a distance-based criterion function. However, Relief is ineffective at removing redundant features: two predictive but highly correlated features are likely both to be highly weighted.
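To make the weighting concrete, here is a minimal sketch of Relief's nearest-hit/nearest-miss update, assuming numeric features scaled to [0, 1]; the function name relief_weights and the parameter defaults are illustrative, not taken from the paper.

    import numpy as np

    def relief_weights(X, y, n_iter=100, seed=0):
        # X: (n_samples, n_features) float array scaled to [0, 1]; y: class labels.
        rng = np.random.default_rng(seed)
        n, m = X.shape
        w = np.zeros(m)
        for _ in range(n_iter):
            i = rng.integers(n)
            dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all instances
            dist[i] = np.inf                      # never match the instance with itself
            hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class
            miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class
            # reward features that differ on the nearest miss,
            # penalize features that differ on the nearest hit
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / n_iter

Note that two identical predictive features receive identical weights under this update, which illustrates why Relief cannot discard redundant features.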
Relief-F
It extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multi-class problems; however, it still cannot identify redundant features.
Hierarchical clustering
Hierarchical clustering has also been used to select features from spectral data. Van Dijck and Van Hulle proposed a hybrid filter/wrapper feature subset selection algorithm for regression. Krier et al. presented a methodology combining hierarchical constrained clustering of spectral variables with selection of clusters by mutual information. Their feature clustering method is similar to that of Van Dijck and Van Hulle, except that the former forces every cluster to contain only consecutive features. Both methods employ agglomerative hierarchical clustering to remove redundant features.
Proposed System
In this paper we propose a new fast clustering-based feature subset selection algorithm, FAST, for high dimensional data. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features.
It involves three steps, sketched in code below:
1. the construction of a minimum spanning tree (MST) from a weighted complete graph;
2. the partitioning of the MST into a forest, with each tree representing a cluster;
3. the selection of a representative feature from each cluster.
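A minimal sketch of these three steps in Python, assuming a symmetric-uncertainty routine su (one is sketched under "Irrelevant Feature Removal" below) and using networkx for the MST; the function name fast_select is ours. One detail is also ours: we weight edges by 1 - SU so that the minimum spanning tree connects the most strongly correlated features, whereas the paper describes weighting the edges by SU directly.

    import networkx as nx

    def fast_select(X, t_rel, su):
        # X: (n_samples, n_features) array of the target-relevant features.
        # t_rel: T-Relevance of each feature with the class.
        # su: symmetric-uncertainty routine, used here as the F-Correlation.
        m = X.shape[1]

        # Step 1: MST of the weighted complete graph over the features.
        g = nx.Graph()
        g.add_nodes_from(range(m))
        for i in range(m):
            for j in range(i + 1, m):
                s = su(X[:, i], X[:, j])
                # minimizing 1 - SU keeps the highest-correlation edges
                g.add_edge(i, j, weight=1.0 - s, su=s)
        mst = nx.minimum_spanning_tree(g)

        # Step 2: partition the MST into a forest by deleting each edge
        # whose F-Correlation is below the T-Relevance of both endpoints.
        for i, j, d in list(mst.edges(data=True)):
            if d["su"] < t_rel[i] and d["su"] < t_rel[j]:
                mst.remove_edge(i, j)

        # Step 3: from each tree (cluster) keep the feature that is most
        # strongly related to the class.
        return sorted(max(c, key=lambda f: t_rel[f])
                      for c in nx.connected_components(mst))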
Modules Description
Irrelevant Feature Removal
Irrelevant feature removal is straightforward once the right relevance measure is defined or selected, while redundant feature elimination is somewhat more sophisticated.
In this module we remove the irrelevant features from the dataset. Given a data set with its features and class attribute, we first compute the T-Relevance value of each feature. The features whose T-Relevance values are greater than a predefined threshold comprise the target-relevant feature subset.
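Both T-Relevance (feature vs. class) and the F-Correlation used in the next module (feature vs. feature) are instances of symmetric uncertainty, SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)). A minimal sketch for discrete-valued variables follows; continuous features would need to be discretized first, which this sketch does not do.

    import numpy as np

    def entropy(values):
        # Shannon entropy of a discrete variable, in bits.
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def symmetric_uncertainty(a, b):
        # SU(a, b) = 2 * IG(a|b) / (H(a) + H(b)), normalized to [0, 1].
        h_a, h_b = entropy(a), entropy(b)
        # conditional entropy H(a|b): average entropy of a within each value of b
        h_a_given_b = sum(np.mean(b == v) * entropy(a[b == v])
                          for v in np.unique(b))
        gain = h_a - h_a_given_b            # information gain IG(a|b)
        denom = h_a + h_b
        return 0.0 if denom == 0 else 2.0 * gain / denom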
Tree Partition and Representative Feature Selection
In this module we remove from the MST each edge whose weight (the F-Correlation between its two endpoint features) is smaller than the T-Relevance of both of those features. Each deletion splits a tree into two disconnected trees T1 and T2. After all unnecessary edges are removed, a forest is obtained; each tree T in the forest represents a cluster, denoted V(T), which is also the vertex set of T. As illustrated above, the features within each cluster are redundant, so from each cluster we choose the representative feature whose T-Relevance is the greatest. These representative features together comprise the final feature subset.
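A hypothetical end-to-end run wiring the two sketches above together on synthetic data; the 0.01 threshold and all variable names are ours.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 8))     # 8 discrete features
    y = (X[:, 0] + X[:, 1] > 2).astype(int)   # the class depends on F0 and F1
    X[:, 2] = X[:, 0]                         # F2 duplicates F0, i.e. is redundant

    t_rel = np.array([symmetric_uncertainty(X[:, i], y) for i in range(X.shape[1])])
    relevant = np.where(t_rel > 0.01)[0]      # irrelevant-feature-removal step
    subset = fast_select(X[:, relevant], t_rel[relevant], symmetric_uncertainty)
    print([int(relevant[i]) for i in subset])
    # F2 should land in F0's cluster, so only F1 and one of F0/F2 survive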
After partitioning the MST and selecting a representative feature from each cluster, every cluster is treated as a single feature, and dimensionality is thus drastically reduced. In the experiments, the proposed algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy, which confirms the conclusions above. In summary, we have presented a clustering-based feature subset selection algorithm for high dimensional data.