18-06-2013, 01:00 PM
A Novel Method of Significant Words Identification in Text Summarization
Abstract
Text summarization is a process that reduces the
size of a text document by extracting its significant
sentences. We present a novel technique for text
summarization. The originality of the technique lies in
exploiting local and global properties of words to
identify significant words. The local property of a word
is the sum of two weighted terms: its normalized term
frequency and the normalized number of sentences
containing it, each multiplied by its own weight. If the
local score of a word is less than the local score
threshold, we remove that word. The global property is the
maximum semantic similarity between a word and the title
words. We also introduce an iterative algorithm to identify
significant words. This algorithm converges to a fixed
number of significant words after some iterations, and the
number of iterations depends strongly on the text document.
We used a two-layer backpropagation neural network
with three neurons in the hidden layer to calculate the
weights. The results show that this technique performs
better than the MS Word 2007, baseline, and GistSumm
summarizers.
INTRODUCTION
As the amount of information grows rapidly, text
summarization is becoming more important. Text
summarization is a tool that saves time and helps readers
decide whether to read a document. It is a very
complicated task: it must process a huge quantity of words
and produce a cohesive summary. The main goal in text
summarization is extracting the most important concepts
of a text document. There are two kinds of text
summarization: extractive and abstractive. The extractive
method selects a subset of sentences that contain the main
concepts of the text. In contrast, the abstractive method
derives the main concepts of the text and builds the
summary using Natural Language Processing. Our technique
is based on the extractive method. Several techniques are
used for the extractive method. Some researchers have
applied statistical criteria, such as TF/IDF (Term
Frequency-Inverse Document Frequency) [1].
TEXT SUMMARIZATION APPROACHES
Automatic text summarization dates back to the 1950s. In
1958, Luhn [6] created a text summarization system based
on weighting the sentences of a text. He used word
frequency to identify the topic of the text document. Some
methods consider statistical criteria. Edmundson [7]
used the cue method (cue words such as "introduction",
"conclusion", and "result"), the title method, and the
location method for determining the weight of sentences.
Statistical methods suffer from not considering the
cohesion of the text.
PROPOSED TECHNIQUE
The goal in extractive text summarization is selecting
the most relevant sentences of the text. One of the most
important phases in the text summarization process is
identifying the significant words of the text. Significant
words play an important role in selecting the best
sentences for the summary. There are several methods to
identify the significant words of a text. Some methods use
statistical techniques, such as term frequency (TF), and
others apply semantic relations between the words of the
text, such as similarity to title words. Each method has
its own advantages and disadvantages. In our work, a
combination of these methods is used to improve the
performance of the text summarization system. We use both
statistical criteria and semantic relations between words
to identify the significant words of the text. Our
technique has five steps: preprocessing, calculating word
scores, identifying significant words, calculating
sentence scores, and selecting sentences.
Preprocessing
The first step in text summarization involves preparing
the text document to be analyzed by the text summarization
algorithm. First, we perform sentence segmentation
to separate the text document into sentences. Then word
tokenization is applied to separate each sentence into
individual words. Some words, such as stop words ("a",
"an", "the"), play no role in selecting relevant sentences
for the summary. To filter them out, we use part-of-speech
tagging to identify the type of each word. Finally, we
extract the nouns of the text document. Our technique
works on the nouns of the text. In the rest of the
article, "word" refers to a noun.
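
The preprocessing steps described above can be sketched roughly as
follows. This is a minimal illustration, not the authors'
implementation: the regular expressions, the small STOP_WORDS set,
and the is_noun predicate (a stand-in for a real part-of-speech
tagger, which the text does not name) are all assumptions.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "it"}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def preprocess(text, is_noun):
    """Return one list of nouns per sentence.

    is_noun is a caller-supplied predicate standing in for a
    part-of-speech tagger (hypothetical; not specified in the article).
    """
    result = []
    for sentence in split_sentences(text):
        tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
        result.append([t for t in tokens if is_noun(t)])
    return result


# Usage with a toy noun lexicon in place of a real tagger.
NOUNS = {"summarization", "text", "document", "sentences", "size"}
doc = "Text summarization reduces the size of a document. It extracts sentences."
print(preprocess(doc, lambda w: w in NOUNS))
```

Passing the tagger in as a predicate keeps the sketch free of
external dependencies; any part-of-speech tagger could be wrapped
to fill that role.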
Calculating Word Scores
After preparing the input text for the summarization
process, we determine the word scores to be used in
later steps. In this step we combine statistical criteria
and lexical cohesion to calculate the scores of the words
in the text. Finding semantic relations between words
is a complicated and time-consuming process, so we first
remove unimportant words. To do so, we calculate the local
score of each word. If the local score of a word is
less than word_local_score_threshold, we remove
that word. Word_local_score_threshold is the average of
the scores of all words in the text multiplied by a
pruning factor PF (a number in the range (0, 1)).
Increasing PF removes more words from the text
document. In this way, the number of words decreases
and the algorithm gets faster. We calculate the global
score of the remaining words based on the reiteration
category of lexical cohesion. Finally, we calculate word
scores from the local and global scores of the words. This
step is described in detail in the next three sections.
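
As a rough illustration of the local score, the pruning threshold,
and the global score described above, consider the sketch below.
Several details are assumptions rather than the authors' method:
the weights w_tf and w_sf are fixed placeholders (the article
obtains its weights from a neural network), normalization divides
by the maximum count, and the similarity function is a
caller-supplied stand-in for a semantic similarity measure.

```python
from collections import Counter


def local_scores(sentences, w_tf=0.5, w_sf=0.5):
    """Local score = w_tf * normalized TF + w_sf * normalized sentence count.

    sentences: list of word lists, one list per sentence.
    w_tf, w_sf: placeholder weights (assumption).
    Normalizing by the maximum count is also an assumption.
    """
    tf = Counter(w for s in sentences for w in s)        # term frequency
    sf = Counter(w for s in sentences for w in set(s))   # sentences containing the word
    if not tf:
        return {}
    max_tf = max(tf.values())
    max_sf = max(sf.values())
    return {w: w_tf * tf[w] / max_tf + w_sf * sf[w] / max_sf for w in tf}


def prune(scores, pf=0.5):
    """Remove words whose local score is below mean(scores) * PF."""
    if not scores:
        return {}
    threshold = pf * sum(scores.values()) / len(scores)
    return {w: s for w, s in scores.items() if s >= threshold}


def global_score(word, title_words, similarity):
    """Maximum similarity between a word and any title word."""
    return max((similarity(word, t) for t in title_words), default=0.0)


# Usage on a toy document of three preprocessed sentences.
sentences = [
    ["text", "summarization", "text"],
    ["text", "document"],
    ["summarization", "size"],
]
scores = local_scores(sentences)
kept = prune(scores, pf=0.9)
print(sorted(kept))

# Exact match used as a trivial similarity function (assumption).
exact = lambda a, b: 1.0 if a == b else 0.0
print(global_score("text", ["text", "summarization"], exact))
```

With PF = 0.9 the threshold is 0.9 times the mean local score, so
the low-scoring words "document" and "size" are removed; a smaller
PF keeps more words, in line with the description above.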
CONCLUSION AND FUTURE WORK
In this article, we proposed a new technique to
summarize text documents. We introduced a new
approach to calculate word scores and identify the
significant words of the text. A neural network was used
to model the style of a human reader, that is, which
words and sentences the reader deems important in a
text. The evaluation results show better
performance than the MS Word 2007, GistSumm, and
baseline summarizers. In future work, we intend to use
other features, such as font-based and cue-phrase
features, in the local score of words and to calculate
word scores based on them. The sentence local score and
global score can also be changed to reflect the reader's
needs.