03-01-2013, 11:49 AM
Ontology Based Fuzzy Document Clustering Scheme
Abstract
Document clustering groups documents according to their similarity. It is
widely used in web mining and digital library environments. Documents are represented in the vector space model:
each document is a vector in the word space, and each element of the vector indicates the frequency of the
corresponding word in the document. Documents are therefore high-dimensional data elements, which makes them
difficult to cluster with the K-means algorithm. Subspace clustering schemes can be
adopted to cluster documents instead. The document clustering uses term weights for the similarity measure, and
the subspace model uses only the relevant attributes for the similarity estimation. Fuzzy logic is used to cluster
the documents. The fuzzy document clustering scheme is enhanced with a semantic analysis mechanism. Semantic
analysis is carried out with the support of an ontology, which maintains term relationships.
Term relationships are represented using synonym, meronym and hypernym factors. The ontology is manually
collected by the users. A domain-based ontology is used for the document clustering process; the system uses a
data mining domain ontology for the semantic analysis. Semantic weights are used in the similarity
measure. The fuzzy text document clustering scheme applies stop word filtering and stemming during
document preprocessing. Both term clustering and semantic clustering operations are performed in the system.
Introduction
Document clustering has been studied intensively because of its wide applicability in areas such as web mining
and information retrieval. In document clustering, unlabeled documents are typically represented in vector space
model (VSM), where each document is a vector in the word space and each element of the vector indicates the
frequency of the corresponding word (also called term or feature) in the document. Generally, the data are
very high dimensional and sparse, which poses a big challenge to conventional clustering algorithms such as
k-means (S.B. Kotsiantis and P.E. Pintelas, 2004). In high-dimensional data, clusters often exist in subspaces
rather than in the entire space (L. Jing, M.K. Ng, and J.Z. Huang, 2007). For example, in document clustering,
clusters of documents on different topics are characterized by different subsets of keywords. Moreover, the
keywords for one cluster may not occur in the documents of other clusters. One solution to this problem is text
subspace clustering (L. Jing, M.K. Ng, J. Xu, and J.Z. Huang, 2005), which aims to discover the document clusters
in different subspaces of the original word space. In the past few years, soft subspace clustering algorithms have
been developed and successfully applied to clustering large document collections. Examples include LAC
(C. Domeniconi, D. Gunopulos, and S. Ma, 2006), FWKM (L. Jing, M.K. Ng, J. Xu, and J.Z. Huang, 2005), [5] and
EWKM (L. Jing, M.K. Ng, and J.Z. Huang, 2007). In these algorithms, each term is assigned a set
of weighting values that distinguish its different contributions to the document categories. Since the weighting
values range between 0 and 1, the subspaces discovered by these algorithms are soft. As k-means type
methods (L. Jing, M.K. Ng, J. Xu, and J.Z. Huang, 2005), (C. Domeniconi, D. Gunopulos, and S. Ma, 2006), (L. Jing,
M.K. Ng, and J.Z. Huang, 2007), (J.Z. Huang, M.K. Ng, H. Rong, and Z. Li, 2005), these algorithms iteratively
group the documents into hard partitions.
The R-FPC Algorithm
Given a vector space model, the document vectors may be represented by x1, x2, ..., xn, where
xi = (xi1, xi2, ..., xid), d is the number of unique words in the model, n is the total number of documents, and
xij is the normalized word frequency of the jth term in the ith document. We also call xi a data point in the
d-dimensional space. Let {C1, C2, ..., CK} be the K document clusters, where each Ck is one part of a partition of
the document collection. The membership of xi in Ck is denoted by uki.
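
The notation above can be illustrated with a small sketch that builds the vectors xi from raw text. The text does not say how the word frequencies are normalized, so dividing each count by the document length is an assumption here, and the function name build_vsm and the sample documents are hypothetical.

```python
from collections import Counter


def build_vsm(docs):
    """Return (vocabulary, X), where X[i][j] is the normalized
    frequency of term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary holds the d unique words, in a fixed (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    X = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        # Assumption: normalize by document length.
        X.append([counts[term] / total for term in vocabulary])
    return vocabulary, X


docs = ["fuzzy clustering of documents", "ontology ontology terms"]
vocabulary, X = build_vsm(docs)
```

Here n = 2 and d = 6; most entries of each row are zero, which is the sparsity the introduction refers to.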
In text subspace clustering, each category of documents is characterized by a subset of terms in the vocabulary
that corresponds to a subset of dimensions in the data space. In this sense, we say that a cluster of documents is
situated in a subspace of the original space. Clearly, a term need not be equally important to all the
clusters. To measure this cluster-specific relevance, an individual weighting value wkj in the range [0,1] is
assigned to the jth term (j = 1, 2, ..., d) of cluster Ck (k = 1, 2, ..., K), indicating how relevant the term is to
the cluster: the more relevant the term, the larger the weight.
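
To make the roles of wkj and uki concrete, the sketch below computes fuzzy memberships from a term-weighted squared distance. It uses the standard fuzzy c-means membership formula as a stand-in; the R-FPC update rules are not given in this text, so this is not the R-FPC algorithm itself. The cluster centers, the fuzzifier m, the small constant eps and the function names are all assumptions made for illustration.

```python
def weighted_sq_distance(x, center, weights):
    """Squared distance in which dimension j is scaled by the weight w_kj."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center))


def fuzzy_memberships(x, centers, W, m=2.0, eps=1e-12):
    """Memberships u_ki of one document x in every cluster.

    centers[k] is a hypothetical center of cluster k and W[k] its
    term-weight vector; m > 1 is the usual fuzzifier (an assumption).
    """
    # eps keeps the division defined when x coincides with a center.
    dists = [weighted_sq_distance(x, c, w) + eps for c, w in zip(centers, W)]
    exponent = 1.0 / (m - 1.0)
    inverse = [(1.0 / d) ** exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


x = [0.8, 0.2]
centers = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.9, 0.1], [0.1, 0.9]]  # each cluster emphasizes a different term
u = fuzzy_memberships(x, centers, W)
```

The memberships sum to 1 across clusters, and the document belongs more strongly to the first cluster, whose heavily weighted term it matches closely.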
Proposed System
The proposed system performs document clustering using a semantic analysis mechanism. An
ontology is used for the semantic analysis, and fuzzy logic is used for the clustering process. A fitness
analysis is performed to verify cluster accuracy. The subspace clustering scheme is used in the system: the
document attributes are collected and grouped by relevance, and the similarity measure is estimated on the
subspace model. The subspace similarity model reduces the computational complexity and processing time and
increases the accuracy.
The clustering system is developed as a standalone tool that handles both the document preprocessing and the
clustering operations. The system uses text documents for the clustering process. The text documents are
collected from the benchmark datasets provided in the UCI machine learning repository. The system is divided into
four major modules: document preprocessing, term cluster, semantic cluster and performance analysis.
The document preprocessing module converts the documents into a structured data set format. The
term cluster module performs the document clustering using the term weights. The semantic cluster
module clusters the documents using semantic weights. The performance analysis module
analyzes the cluster accuracy and processing time. The system uses the Oracle relational database system
as its back end.
Document Preprocess
The documents are maintained in text file format. The contents of the documents are parsed and converted into
the vector space model. Stop word elimination and stemming are used to reduce the vector size. The
system maintains a stop word repository, and the stop words in the documents are removed using this repository.
The stemming process analyzes the suffix of each term and extracts its base form.
The Porter stemming algorithm is used in the system. The document details and the term list are then
stored in the database.
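
The preprocessing steps can be sketched as follows. The stop word list is a tiny hypothetical stand-in for the system's repository, and crude_stem is a deliberately simplified suffix stripper used only to keep the example self-contained; it is not the Porter algorithm that the system uses. Database storage is omitted.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stop word repository.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

# Simplified suffix list; NOT the Porter stemming rules.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


counts = preprocess("The documents are clustered and the clustering is evaluated")
```

In this example "clustered" and "clustering" collapse into a single term, which shows how stemming shrinks the vector.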
Term Cluster
The system performs two types of clustering operations: term clustering and semantic clustering.
The term clustering task is performed using the term weights. The term frequency is estimated and stored in
the database, and the term frequency and inverse document frequency are calculated for each term. The resulting
term weights are used for the similarity measurement and comparison process. The fuzzy clustering scheme is
applied to the subspace of the term collection. The term cluster requires a large
vector size for the clustering process.
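
A minimal sketch of the term weighting step is given below. The text names term frequency and inverse document frequency but not the exact formula, so the common form tf * log(N / df) is assumed; the function name tf_idf and the sample counts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_term_counts):
    """Return one {term: weight} dict per document.

    doc_term_counts is a list of {term: raw frequency} dicts.
    Assumed formula: weight = tf * log(N / df).
    """
    n_docs = len(doc_term_counts)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in doc_term_counts:
        df.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for counts in doc_term_counts
    ]


docs = [{"cluster": 2, "fuzzy": 1}, {"cluster": 1, "ontology": 3}]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "cluster" here, receives weight 0 under this formula, so it does not help to separate the documents.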
Semantic Cluster
Semantic clustering is performed with a comparison based on term relationships. The term cluster does not
consider term relationships, whereas the semantic cluster uses them for the clustering process. The
ontology maintains the relationships for the term collection in a domain. The terms are maintained with
synonym, meronym and hypernym relationships and are analyzed against the ontology collections. The term
category is used for the weight estimation process: a semantic weight is estimated for each concept, and the
clustering process uses these semantic weights.
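
One possible way to estimate a semantic weight per concept is sketched below. The text does not give the weighting formula, so everything here is an assumption: the ontology layout, the relation factors in RELATION_FACTORS, and the rule that a concept's weight is its own frequency plus the factor-scaled frequencies of its related terms.

```python
# Hypothetical contribution of each relationship type.
RELATION_FACTORS = {"synonym": 1.0, "hypernym": 0.5, "meronym": 0.5}

# Hypothetical fragment of a data mining domain ontology.
ONTOLOGY = {
    "clustering": {
        "synonym": ["grouping"],
        "hypernym": ["mining"],
        "meronym": ["centroid"],
    },
}


def semantic_weights(term_counts, ontology, factors=RELATION_FACTORS):
    """Fold the frequencies of related terms into one weight per concept."""
    weights = {}
    for concept, relations in ontology.items():
        weight = float(term_counts.get(concept, 0))
        for relation, related_terms in relations.items():
            for term in related_terms:
                weight += factors[relation] * term_counts.get(term, 0)
        weights[concept] = weight
    return weights


doc = {"clustering": 2, "grouping": 1, "centroid": 4, "oracle": 1}
weights = semantic_weights(doc, ONTOLOGY)
```

Four terms in the document reduce to one concept weight, and terms outside the ontology are dropped, so the semantic vector is shorter than the term vector.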
Performance Analysis
The performance analysis module analyzes the performance of the term clustering and semantic
clustering techniques (Figure 1). Memory, processing time and accuracy metrics are used for the performance
analysis. The memory requirement of each clustering technique is analyzed, and the accuracy is estimated using
the fitness function.
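
The time and memory measurements could be collected with a helper like the one below, which uses Python's standard time and tracemalloc modules. The helper name measure is hypothetical, and the fitness function is left out because the text does not define it.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak


# Any clustering routine could be passed in; sorted() is only a placeholder.
result, seconds, peak_bytes = measure(sorted, [3, 1, 2])
```

Running the term clustering and the semantic clustering through the same helper gives figures that can be compared directly.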
Experiments and Performance Results
Text documents are treated as unstructured data, which makes them difficult to group. Document
clustering therefore requires a preprocessing task that converts the unstructured data values into a structured form.
The documents are high-dimensional data elements, and the dimension is reduced using stop word elimination
and stemming. Ontology-based fuzzy document clustering extracts the frequent and popular
contents of the text document collection. The document grouping tasks require content relationship factors.
Semantic analysis is the technique that uses a term together with its relationships to a collection of terms. The
relationships are represented as synonym, meronym and hypernym. The system is implemented to perform fuzzy
text document grouping with the support of semantic analysis. Table 1 shows the analysis of term cube versus
semantic cube. A benchmark document collection is selected as the testing environment for the system.
The system is tested with a benchmark document collection from the 20 Newsgroups dataset. Initially the documents
are stored in the database together with their preprocessed information. Stop word elimination and stemming
with the Porter stemming algorithm are performed in the preprocessor. All the document analysis operations are
carried out on the database information. Table 1 shows the memory usage analysis (3 clusters, K-means vs. fuzzy),
Table 2 shows the processing time analysis (3 clusters, K-means vs. fuzzy) and Table 3
shows the fitness point analysis (3 clusters, K-means vs. fuzzy).
Conclusion
Text clustering is about discovering novel, interesting and useful patterns from textual data. In this paper we
have discussed how to introduce the method of building ontologies into unsupervised text learning in order to
consider the text semantics from a linguistic point of view. The fuzzy document clustering uses the
subspace clustering model, in which only the relevant attributes are used for the comparison process. The semantic
analysis is used to reduce the vector size, and it also improves the relevancy. The system can be
enhanced with a multi-domain ontology to analyze documents from any domain. The approach can also be applied to
distributed clustering of web documents and XML documents. Future work will consider the fuzzy clustering scheme
under the direction of ontologies, since most documents belong to more than one
category simultaneously. Furthermore, the method of calculating the term mutual information in this paper can be
used to create ontologies in different fields.