07-09-2012, 10:42 AM
Study Search-Based Semi-Supervised Clustering Algorithms
INTRODUCTION
Knowledge Discovery Process
In statistical and data analysis, it is essential that the business analyst knows what the variables are before analysis can start. The business analyst needs knowledge discovery technology and tools.
Knowledge discovery has its roots in artificial intelligence and machine learning. Several definitions are in use. Knowledge discovery may be defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
It may also be defined as a data search process that, without a hypothesis or question stated in advance, still finds unexpected or interesting information in the relationships and patterns among data elements, or important business rules in the full data searched and analyzed.
It may also mean uncovering previously unknown business facts in the gigabytes of data held in a data warehouse or data mart.
Business managers and analysts are always seeking new and additional business insights so that crucial business decisions, which have significant impact on the health of a business, can be improved. Using the traditional techniques of business queries and data analysis requires asking the right questions.
Text Mining
Text mining is the process of extracting important information and knowledge from unstructured text. It lies at the intersection of many other research fields, including, but not limited to, Data Mining (DM), Knowledge Discovery in Databases (KDD), Information Extraction (IE), Information Retrieval (IR), and Databases.
The main difference between text mining and data mining is that data mining tools are designed to deal with structured data from databases or XML-based models, whereas text mining deals with unstructured or semi-structured data such as text documents, HTML files, and emails. Text mining is therefore a more general solution for text, where large volumes of different types of information must be managed and merged.
Text mining attempts to discover new, previously unknown information by applying techniques from natural language processing and data mining.
SYSTEM ANALYSIS
Existing System
Existing approaches to clustering data are based on metric similarities, i.e., measures that are nonnegative, symmetric, and satisfy the triangle inequality. For text mining tasks, the majority of state-of-the-art frameworks employ the vector space model (VSM), which treats a document as a bag of words and uses plain language words as features. This model can represent text mining problems easily and directly. However, as the data set grows, the vector space becomes high dimensional and sparse, and the computational complexity grows exponentially.
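The bag-of-words representation can be illustrated with a minimal sketch. The three sample documents, the whitespace tokenizer, and the helper names bag_of_words, to_vector, and cosine_similarity are assumptions made for illustration only; they are not taken from the system described here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def to_vector(counts, vocabulary):
    """Project term counts onto a fixed, ordered vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents.
docs = [
    "text mining extracts knowledge from text",
    "data mining extracts patterns from databases",
    "clustering groups similar documents",
]

bags = [bag_of_words(d) for d in docs]
# One dimension per distinct word across the whole collection.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

# Fraction of zero entries: a rough measure of how sparse the vectors are.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocabulary))
```

Even with three short documents, most entries of each vector are zero, because every new word adds a dimension that most documents never use. This is the sparsity and dimensionality growth noted above.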
Supervised Learning
Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function. The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way. The parallel task in human and animal psychology is often referred to as concept learning.
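As a minimal sketch of inferring a function from labeled pairs, the following trains a nearest-centroid classifier. The choice of nearest-centroid, the two-dimensional feature vectors, and the labels "sports" and "politics" are assumptions for illustration; the text above does not prescribe a particular learning algorithm.

```python
def train_nearest_centroid(examples):
    """Infer a classifier from (input_vector, label) training pairs.

    The inferred function assigns a new input to the label whose
    training examples have the closest mean (centroid).
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1

    centroids = {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

    def classify(features):
        def sq_dist(centroid):
            return sum((x - c) ** 2 for x, c in zip(features, centroid))

        return min(centroids, key=lambda label: sq_dist(centroids[label]))

    return classify


# Hypothetical training set: each example pairs an input with its desired output.
training = [
    ([1.0, 1.0], "sports"),
    ([1.2, 0.8], "sports"),
    ([5.0, 5.0], "politics"),
    ([4.8, 5.2], "politics"),
]

classify = train_nearest_centroid(training)
prediction = classify([0.9, 1.1])  # an input not seen during training
```

The returned classify function is the inferred function. Calling it on an input that was not in the training set is the generalization step described above.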
Proposed System
Semi-supervised learning has captured a great deal of attention. It is a machine learning paradigm in which the model is constructed using both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.
Semi-supervised learning falls between unsupervised learning and supervised learning. In the clustering process, the small amount of labeled data is used to guide the grouping of the large amount of unlabeled data. Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent to manually classify training examples. The cost associated with the labeling process may thus render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value.
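One way a few labeled points can guide clustering is to use them as seeds for k-means. The sketch below is an assumption about how this might look, not the algorithm of the proposed system: the labeled points only initialize the centroids, and every point, including the seeds, is reassigned in later iterations. The function name seeded_kmeans and the sample data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def seeded_kmeans(labeled, unlabeled, max_iter=100):
    """k-means whose initial centroids come from labeled seed points.

    labeled:   list of (point, cluster_label) pairs
    unlabeled: list of points
    Returns (centroids, assignment); assignment follows the order
    labeled points first, then unlabeled points.
    """
    # Group the seeds by label and start each centroid at its seeds' mean.
    seeds = {}
    for point, label in labeled:
        seeds.setdefault(label, []).append(point)
    centroids = {label: mean(pts) for label, pts in seeds.items()}

    points = [p for p, _ in labeled] + list(unlabeled)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for label in centroids:
            members = [p for p, a in zip(points, assignment) if a == label]
            if members:
                centroids[label] = mean(members)
    return centroids, assignment


# Hypothetical data: two labeled seeds and five unlabeled points.
labeled = [([1.0, 1.0], 0), ([9.0, 9.0], 1)]
unlabeled = [[1.5, 0.5], [0.5, 1.5], [8.5, 9.5], [9.5, 8.5], [2.0, 2.0]]

centroids, assignment = seeded_kmeans(labeled, unlabeled)
```

Only two points carry labels here, yet they fix both the number of clusters and the starting centroids, which unsupervised k-means would otherwise have to choose at random.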
JAVASCRIPT
JavaScript is an interpreted programming language with object-oriented capabilities. The general-purpose core of the language has been embedded in Netscape Navigator, Internet Explorer, and other web browsers. The client-side version of JavaScript allows executable content to be included in web pages. This means that a web page need no longer be static HTML but can include programs that interact with the user, control the browser, and dynamically create HTML content. JavaScript is an untyped language, which means that variables do not need to have a type specified. JavaScript resembles C, C++, and Java in its programming constructs, such as the if statement and the while loop. JavaScript was originally called LiveScript; the name was changed to JavaScript purely as a marketing strategy.
Conclusion
The main goal of the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to cluster documents. It addresses three questions: how supervision can be provided to clustering in the form of labeled data points or pairwise constraints; how informative constraints can be selected in an active learning framework for the pairwise constrained semi-supervised clustering model; and how search-based and similarity-based techniques can be unified in semi-supervised clustering. The work so far has mainly focused on generative clustering models, e.g., KMeans and EM, with experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets. The thesis will also study other aspects of semi-supervised clustering, such as the effect of noisy, probabilistic, or incomplete supervision; model selection techniques for automatically selecting the number of clusters; and ensemble semi-supervised clustering. Future work will study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering frameworks, and the effectiveness of semi-supervised clustering algorithms in other domains, e.g., web search engines, astronomy, and bioinformatics.
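Pairwise constraints, the second form of supervision mentioned above, can be sketched as must-link and cannot-link pairs of point indices. The function name count_violations and the sample assignment are hypothetical; this only shows how a clustering can be checked against such constraints, not how the thesis uses them during the search.

```python
def count_violations(assignment, must_link, cannot_link):
    """Count pairwise constraints that a cluster assignment breaks.

    assignment:  list where assignment[i] is the cluster of point i
    must_link:   pairs (i, j) that should share a cluster
    cannot_link: pairs (i, j) that should be in different clusters
    Returns (violated_must_link, violated_cannot_link).
    """
    violated_ml = sum(
        1 for i, j in must_link if assignment[i] != assignment[j]
    )
    violated_cl = sum(
        1 for i, j in cannot_link if assignment[i] == assignment[j]
    )
    return violated_ml, violated_cl


# Hypothetical clustering of five points and four constraints.
assignment = [0, 0, 1, 1, 0]
must_link = [(0, 1), (2, 4)]
cannot_link = [(0, 2), (3, 2)]

violations = count_violations(assignment, must_link, cannot_link)
```

A search-based method could add a penalty proportional to these counts to its clustering objective, so that assignments respecting the supervision are preferred.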