15-01-2013, 02:15 PM
Correlation Based Multi-Document Summarization for Scientific Articles and News Group
ABSTRACT
Automated information retrieval systems are used to reduce the overload of document retrieval. A high-quality summary is needed so that the user can quickly locate the desired information. This paper proposes a new summarization technique that treats correlated concepts, i.e. terms and their related terms, as the concepts for concept-based document summarization. Related documents are grouped into the same cluster by the Bisecting k-means clustering algorithm. From each cluster, important sentences are extracted by concept matching and by sentence feature score. We also adopt a modified redundancy elimination technique that is based purely on concepts rather than terms. Experiments are carried out to compare the performance of the proposed work with the existing term-based and synonym- and hypernym-based summarization techniques, using scientific articles and news tracks as the data set. The analysis indicates that the proposed technique gives better results for documents related to scientific terms.
INTRODUCTION
Nowadays online submission of documents has increased widely, which means that large numbers of documents accumulate dynamically for a particular domain. Information retrieval [1] is the process of searching for information within documents. An information retrieval process begins when a user enters a query; queries are formal statements of information needs, for example search strings in a web search engine. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. Hence the user has to visit each and every page to find the required information, which is a time-consuming process.
RELATED WORK
Aditi Sharan et al. [5] proposed semantic-based document clustering using the WordNet ontology. The main aim is to replace words with possible concepts. The technique takes the nouns from all the documents to form a master noun list. The depth of each word is calculated by weighing the words. Then all possible pairs of words are created, and the pairs below the threshold are deleted from the pair list. A semantic similarity measure is used to find the maximum similarity so that each term can be replaced with a concept, and the documents are clustered based on the extracted concepts. However, the experimental results show that it does not consider all possible conditions.
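The pair-and-threshold idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `PARENT` taxonomy stands in for WordNet, the Wu-Palmer style depth ratio stands in for the unspecified similarity measure, and the threshold value and the helper names (`similarity`, `map_to_concepts`) are hypothetical, not taken from [5].

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); a real system would query WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}

def path_to_root(word):
    """Return the chain of nodes from the word up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path

def depth(word):
    """Depth of a node in the taxonomy; the root has depth 1."""
    return len(path_to_root(word))

def common_ancestor(a, b):
    """Lowest node that lies on both root paths."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node

def similarity(a, b):
    """Wu-Palmer style score (assumed stand-in for the paper's measure)."""
    return 2 * depth(common_ancestor(a, b)) / (depth(a) + depth(b))

def map_to_concepts(nouns, threshold=0.5):
    """Drop pairs below the threshold, then map each noun to the
    shared concept of its most similar remaining partner."""
    best = {}
    for a, b in combinations(nouns, 2):
        score = similarity(a, b)
        if score < threshold:
            continue  # pair deleted from the pair list
        concept = common_ancestor(a, b)
        for word in (a, b):
            if word not in best or score > best[word][0]:
                best[word] = (score, concept)
    # Nouns with no surviving pair keep their original term.
    return {w: best[w][1] if w in best else w for w in nouns}

master_nouns = ["dog", "cat", "car", "truck"]
concepts = map_to_concepts(master_nouns)
```

In this toy run, nouns that share a close ancestor collapse onto it, so documents mentioning different but related nouns end up described by the same concept before clustering.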
Anna Huang et al. [6] proposed a document clustering technique based on concept extraction using semantic relations. This work computes a similarity measure between the terms instead of considering the overlap between the terms as in the previous work. The process has three steps: identifying candidate phrases in the document and mapping them to anchor text in Wikipedia; disambiguating anchors that relate to multiple concepts; and pruning the list of concepts to filter out those that do not relate to the document's central thread.
Clustering Algorithm
The extracted concepts are clustered by the induced Bisecting K-means algorithm [13]. The basic Bisecting K-means algorithm [14] starts by selecting the two elements with the largest distance between them as seed clusters, and the other items are assigned to the closest seed. The center of each of these two seeds is then calculated as a weighted sum of the items assigned to it, and these centers are used to find the new seeds. This process is repeated until the two seeds meet the predefined precision. If the size of a seed cluster is larger than the predefined threshold, the entire process is repeated on it, and this forms a binary tree.
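The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from [13] or [14]: items are plain numeric tuples, distance is Euclidean, the weighted sum uses equal weights (a plain mean), "predefined precision" is read as the largest movement of either center between iterations, and the names `bisect`, `build_tree`, `max_size` and `precision` are hypothetical.

```python
import math
from itertools import combinations

def mean(points):
    """Equal-weight center of a group (assumed form of the weighted sum)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def bisect(points, precision=1e-6, max_iter=100):
    """Split points in two, seeding with the pair that is farthest apart."""
    seeds = max(combinations(points, 2), key=lambda pair: math.dist(*pair))
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearest = 0 if math.dist(p, seeds[0]) <= math.dist(p, seeds[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        centers = (mean(groups[0]), mean(groups[1]))
        shift = max(math.dist(centers[0], seeds[0]),
                    math.dist(centers[1], seeds[1]))
        seeds = centers
        if shift < precision:
            break  # centers have stopped moving
    return groups

def build_tree(points, max_size=3):
    """Bisect recursively until every leaf holds at most max_size items."""
    if len(points) <= max_size:
        return {"items": points}
    left, right = bisect(points)
    if not left or not right:
        return {"items": points}
    return {"children": [build_tree(left, max_size),
                         build_tree(right, max_size)]}

def leaves(node):
    """Collect the leaf clusters of the binary tree, left to right."""
    if "items" in node:
        return [node["items"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Hypothetical 2-D concept vectors forming two well-separated groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = leaves(build_tree(vectors, max_size=3))
```

Choosing the farthest pair as seeds makes the first split deterministic, at the cost of a quadratic scan over all pairs; for large collections a sampled or approximate choice of seeds may be preferable.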