Supervised and Traditional Term Weighting Methods for Automatic Text Categorization
INTRODUCTION
TEXT categorization (TC) is the task of automatically
classifying unlabelled natural language documents into
a predefined set of semantic categories. As the first and a
vital step, text representation converts the content of a
textual document into a compact format so that the
document can be recognized and classified by a computer
or a classifier. In the vector space model (VSM), the content
of a document is represented as a vector in the term space,
i.e., d = (w1, ..., wk), where k is the size of the term (feature) set.
Terms can be at various levels, such as syllables, words,
phrases, or any other complicated semantic and/or syntactic
indexing units used to identify the contents of a text.
Different terms have different importance in a text; thus, a
weight wi (usually between 0 and 1) indicates how much
the term ti contributes to the semantics of document d.
Term weighting is therefore an important step for improving
the effectiveness of TC by assigning appropriate weights
to terms.
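To make the VSM representation concrete, here is a minimal Python sketch (not from the paper); the toy documents and the simple relative-frequency weighting are illustrative assumptions:

# Minimal vector space model sketch: map each document to a vector
# d = (w1, ..., wk) over a fixed term (feature) set of size k.
# The toy corpus and relative-frequency weights are assumptions
# for illustration only.
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = sorted({t for d in docs for t in d.split()})  # term set, size k

def to_vector(text):
    counts = Counter(text.split())
    total = sum(counts.values()) or 1
    # weight w_i in [0, 1]: relative frequency of term t_i in the document
    return [counts[t] / total for t in vocabulary]

vectors = [to_vector(d) for d in docs]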
TERM WEIGHTING METHODS: A BRIEF REVIEW
In text representation, terms are words, phrases, or any
other indexing units used to identify the contents of a text.
However, no matter which indexing unit is in use, each
term in a document vector must be associated with a value
(weight), which measures the importance of this term and
denotes how much this term contributes to the categorization
task of the document. In this section, we review a
number of traditional term weighting methods and the
state-of-the-art supervised term weighting methods.
Three Factors for Term Weighting Assignment
Salton and Buckley [11] discussed three considerations for
the assignment of appropriate weights to single terms in the
IR field. First, term occurrences (tf) in one document appear to
closely represent the content of the document. Second, the
term frequency alone may not have the discriminating
power to separate the relevant documents from the
irrelevant ones. Therefore, an idf factor has been
proposed to increase the term's discriminating power for IR
purposes. In general, the two factors, tf and idf, are
combined by a multiplication operation and are thought
to improve both recall and precision measures. Third, to take
the effect of document length into consideration, a cosine
normalization factor is incorporated to equalize the length
of the documents.
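The three factors can be combined in a short Python sketch of tf.idf weighting with cosine normalization; the toy corpus and whitespace tokenization are assumptions made for illustration:

# Sketch of the three weighting factors from Salton and Buckley [11]:
# tf (within-document frequency), idf = log(N / n_i), and cosine
# normalization so that document length does not dominate.
import math
from collections import Counter

corpus = [doc.split() for doc in
          ["the cat sat", "the dog ran", "a cat and a dog"]]
N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc))  # n_i for each term

def tf_idf_cosine(doc):
    tf = Counter(doc)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf}  # tf x idf
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}        # cosine-normalized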
Traditional Term Weighting Methods
As mentioned before, the traditional term weighting
methods for TC are usually borrowed from IR and belong to
the unsupervised term weighting methods. The simplest
one is the binary representation. The most popular one is tf.idf,
proposed by Jones (first appearing in [15] and reprinted in
[16]). Note that the tf factor here also has various variants, such as
raw term frequency, log(tf), log(tf + 1), or log(tf) + 1.
In addition, the idf factor (usually computed as log(N/ni))
also has a number of variants, such as log(N/ni + 1),
log(N/ni) + 1, and log(N/ni - 1) (i.e., idf prob, equivalent to
log((N - ni)/ni)), etc.
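The listed tf and idf variants transcribe directly into code; the following Python sketch is only a side-by-side reference, not an implementation from the paper:

# The tf and idf variants listed above, side by side.
import math

def tf_variants(tf):
    return {
        "raw": tf,
        "log(tf)": math.log(tf) if tf > 0 else 0.0,
        "log(tf + 1)": math.log(tf + 1),
        "log(tf) + 1": math.log(tf) + 1 if tf > 0 else 0.0,
    }

def idf_variants(N, n_i):
    return {
        "log(N / n_i)": math.log(N / n_i),
        "log(N / n_i + 1)": math.log(N / n_i + 1),
        "log(N / n_i) + 1": math.log(N / n_i) + 1,
        # idf prob; undefined when n_i = N (the argument becomes 0)
        "idf prob: log(N / n_i - 1)": math.log(N / n_i - 1),
    }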
Interestingly, tf.idf has a variant known as BM25 (see
[17]) in the IR literature. To understand BM25, let
us first take a look at the Robertson and Sparck Jones (RSJ)
weight. RSJ (see [18]) is also known as the relevance weighting
derived from the classical probabilistic model for IR.
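For readers unfamiliar with these weights, below is a hedged Python sketch of BM25 and RSJ in their common textbook forms; the 0.5 smoothing constants and the defaults k1 = 1.2 and b = 0.75 are conventional choices, not values taken from [17] or [18]:

# Textbook-form BM25 term weight and RSJ relevance weight.
import math

def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    # dl: document length, avgdl: average document length in the collection
    idf = math.log((N - df + 0.5) / (df + 0.5))  # BM25's idf component
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

def rsj_weight(r, R, n, N):
    # r: relevant docs containing the term, R: relevant docs,
    # n: docs containing the term, N: collection size (0.5 smoothing)
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))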
A New Supervised Term Weighting
Scheme: tf.rf
The basic idea behind our intuition is quite simple:
the more a high-frequency term is concentrated in the
positive category rather than the negative category, the
greater its contribution to selecting the positive samples
from the negative samples.
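A minimal Python sketch of the tf.rf weight, assuming the paper's definition rf = log(2 + a / max(1, c)), where a and c count the positive- and negative-category documents containing the term (the base-2 logarithm is an assumption here):

# tf.rf sketch: the relevance frequency rf rewards terms that are
# concentrated in the positive category.
import math

def rf(a, c):
    # a: positive-category documents containing the term
    # c: negative-category documents containing it; max(1, c) avoids
    # division by zero when the term never occurs in the negative category
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    return tf * rf(a, c)

# Example: a term in 50 positive and 2 negative documents gets
# rf = log2(2 + 25), much larger than a term spread evenly across both.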
Although the above supervised term weighting factors
take the document distribution into account, they are not
always consistent with this intuition. Specifically,
several supervised term weighting factors discussed in the
previous section are symmetric with respect to the positive
and negative categories.