04-09-2014, 11:51 AM
Purpose of Study
INTRODUCTION
Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits, and to "drill down" from summary information into detailed transactional data. Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.
• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
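The association case above can be made concrete with a minimal sketch: given a list of transactions, compute the support and confidence of the classic {diapers} → {beer} rule. The transactions below are invented for illustration, not from any real study.

```python
# Compute support and confidence for an association rule A -> B
# over a small set of made-up market-basket transactions.
def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                         # fraction containing both sides
    confidence = both / ante if ante else 0.0  # P(consequent | antecedent)
    return support, confidence

transactions = [
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]
print(rule_metrics(transactions, {"diapers"}, {"beer"}))  # (0.5, 0.666...)
```

Here diapers appear in three of four baskets and beer accompanies them in two, so the rule has support 0.5 and confidence 2/3.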
Feature subset selection is an effective way to reduce dimensionality, remove irrelevant data, increase learning accuracy, and improve the comprehensibility of results. This process is improved by the cluster-based FAST algorithm, which identifies and removes irrelevant features.
The algorithm works in two steps: features are clustered using a graph-theoretic clustering method, and then a representative feature is selected from each cluster. Feature subset selection research has focused on searching for relevant features.
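The irrelevant-feature removal that precedes clustering is typically done by scoring each feature's relevance to the class. A minimal sketch using symmetric uncertainty, SU(X, C) = 2·IG(X; C) / (H(X) + H(C)), as the relevance measure; the tiny data set is illustrative:

```python
# Score features against the class with symmetric uncertainty (SU);
# features with SU near 0 carry no information about the class and
# would be discarded as irrelevant.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, c):
    # Information gain IG(X; C) = H(C) - H(C | X)
    hx, hc = entropy(x), entropy(c)
    n = len(x)
    hc_given_x = 0.0
    for xv, cnt in Counter(x).items():
        subset = [cv for xi, cv in zip(x, c) if xi == xv]
        hc_given_x += cnt / n * entropy(subset)
    ig = hc - hc_given_x
    return 2 * ig / (hx + hc) if hx + hc else 0.0

labels      = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # perfectly predicts the class
irrelevant  = [0, 1, 0, 1]   # independent of the class
print(symmetric_uncertainty(informative, labels))  # 1.0
print(symmetric_uncertainty(irrelevant, labels))   # 0.0
```

SU is symmetric and normalized to [0, 1], which is why it also serves well later as an edge weight between pairs of features.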
Cluster Analysis: Clustering is the process of grouping similar objects into one class. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
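As a toy illustration of this definition, a minimal one-dimensional k-means sketch; the points and initial centers are made up:

```python
# Group 1-D points into k clusters by repeatedly assigning each point
# to its nearest center and recomputing centers as cluster means.
def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assign each point to its nearest center
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [sum(c) / len(c) if c else m
                   for c, m in zip(clusters, centers)]
    return clusters

print(kmeans_1d([1, 2, 10, 11], centers=[1.0, 11.0]))  # [[1, 2], [10, 11]]
```

The result matches the definition above: 1 and 2 are close to each other and far from 10 and 11, so they land in separate clusters.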
Document clustering (Text clustering): Document clustering is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.
Data preprocessing is used to improve the efficiency and ease of the mining process. Data extracted from the data warehouse may be incomplete, inconsistent, or noisy, because the warehouse collects and stores data from various external sources.
Data preprocessing techniques are:
• Data cleaning: Attempts to fill in missing values, smooth out noise, and correct inconsistencies in the data. Common cleaning techniques include binning and regression.
• Data integration and transformation: Integration merges data from multiple sources, which may include multiple databases, data cubes, or flat files. Transformation consolidates the data into another form; it includes aggregation, generalization, normalization, and attribute construction.
• Data reduction: Techniques that obtain a reduced version of the data set, much smaller in volume yet maintaining the integrity of the original data. Strategies include data cube aggregation, attribute subset selection, and dimensionality reduction.
• Concept hierarchy generation: Concept hierarchies condense the data by collecting low-level concepts and replacing them with high-level concepts.
• Distributed clustering: Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words (Pereira et al.) or on the distribution of class labels associated with each word (Baker and McCallum). Because distributional clustering of words is agglomerative in nature, producing suboptimal word clusters at high computational cost, a new information-theoretic divisive algorithm for word clustering was later proposed and applied to text classification.
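Three of the preprocessing steps above can be sketched as one small pipeline: cleaning by mean imputation, transformation by min-max normalization, and concept-hierarchy generalization. The attribute values and the city-to-country hierarchy are hypothetical:

```python
# Chain basic preprocessing steps on one made-up attribute.
def impute_mean(values):
    # Cleaning: fill missing entries (None) with the attribute mean.
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    # Transformation: rescale the attribute into [new_min, new_max].
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def generalize(values, hierarchy):
    # Concept hierarchy: replace low-level concepts (cities)
    # with high-level concepts (countries).
    return [hierarchy.get(v, v) for v in values]

ages = [20, None, 40, 60]
print(min_max(impute_mean(ages)))  # ~ [0.0, 0.5, 0.5, 1.0]

cities = ["Chennai", "Paris", "Mumbai"]
print(generalize(cities, {"Chennai": "India", "Mumbai": "India",
                          "Paris": "France"}))
```

The missing age becomes the mean (40), all ages are squeezed into [0, 1], and each city rolls up to its country.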
2.1. Objective
The main aim of feature selection (FS) is to determine a minimal feature subset from a problem domain while retaining a suitably high accuracy in representing the original features. In real-world problems FS is essential because of the abundance of noisy, irrelevant, or misleading features; removing them allows learning-from-data techniques to benefit greatly. Feature selection is the process of identifying the smallest set of features that produces results compatible with those of the entire original set. Feature extraction is a more general form of dimensionality reduction, of which feature selection is a subfield. Feature selection algorithms are judged on two basic criteria: time requirement and quality of the selected subset.
A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. Efficiency concerns the time required to find a subset of features, while effectiveness concerns the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper.
3. Literature survey:
A literature survey is a very important step in the software development process. Before developing the tool it is necessary to determine the time factor, economy, and company strength. Once these requirements are satisfied, the next step is to determine which operating system and language should be used for developing the tool. Once the programmers start building the tool they need a lot of external support, which can be obtained from senior programmers, from books, or from websites. These considerations were taken into account before building the proposed system.
Feature selection is a data preprocessing technique. It is an approach for identifying the subset of features that is most relevant to the target model; it is also known as attribute subset selection. Its purpose is to remove irrelevant and redundant features, increase accuracy, reduce dimensionality, shorten training time, and enhance generalization by reducing overfitting. Feature selection techniques are a subset of the more general field of feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).
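The contrast between the two approaches can be shown on a tiny invented table: selection keeps some of the original columns unchanged, while extraction derives new columns as functions of them (here simply a row sum, standing in for more elaborate transforms such as projections).

```python
# Contrast feature selection (keep original columns) with feature
# extraction (compute new derived columns) on a 3x3 toy table.
rows = [[1, 9, 2], [3, 8, 4], [5, 7, 6]]

def select(rows, keep):
    # Selection: return only the chosen original columns, untouched.
    return [[r[i] for i in keep] for r in rows]

def extract(rows):
    # Extraction: build one new feature as a function of all columns.
    return [[sum(r)] for r in rows]

print(select(rows, keep=[0, 2]))  # [[1, 2], [3, 4], [5, 6]]
print(extract(rows))              # [[12], [15], [18]]
```

Selected features remain interpretable as the original attributes; extracted features generally do not.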
According to D.W. Aha, “Feature Weighting for Lazy Learning Algorithms,”
Learning algorithms differ in the degree to which they process their inputs prior to their use in performance tasks. Many algorithms eagerly compile input samples and use only the compilations to make decisions. Others are lazy: they perform less precompilation and use the input samples themselves to guide decision making. The performance of many lazy learners degrades significantly when samples are described by features containing little or misleading information.
7. CONCLUSION
This work motivates several directions for future research. In this paper, we have presented a novel clustering-based feature subset selection algorithm for high-dimensional data. The algorithm involves
• removing irrelevant features,
• constructing a minimum spanning tree (MST) from the relevant ones, and
• partitioning the MST and selecting representative features.
In the proposed algorithm, a cluster consists of features. Each cluster is treated as a single feature, so dimensionality is drastically reduced. Overall, the proposed algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy for Naive Bayes and RIPPER, and the second best classification accuracy for IB1. The Win/Draw/Loss records confirmed these conclusions.
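The MST construction and partitioning steps can be sketched as follows. The pairwise distances below are invented for illustration; in FAST the edge weights would come from a correlation measure between features, so that cutting heavy edges separates weakly related groups of features.

```python
# Build an MST over "features" with Prim's algorithm, then cut its
# heaviest edge to split the features into two clusters.
def prim_mst(weights):
    n = len(weights)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # pick the cheapest edge leaving the tree
        u, v = min(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree),
                   key=lambda e: weights[e[0]][e[1]])
        edges.append((u, v, weights[u][v]))
        in_tree.add(v)
    return edges

# Pairwise "distance" between 4 features (smaller = more redundant).
W = [[0, 1, 9, 9],
     [1, 0, 9, 8],
     [9, 9, 0, 2],
     [9, 8, 2, 0]]
mst = prim_mst(W)
mst.remove(max(mst, key=lambda e: e[2]))  # cut heaviest edge -> 2 clusters
print(mst)  # [(0, 1, 1), (3, 2, 2)]
```

Removing the heaviest edge leaves two components, {0, 1} and {2, 3}; picking one representative feature from each component completes the final step of the algorithm.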