Supervised and Traditional Term Weighting Methods for Automatic Text Categorization
INTRODUCTION
TEXT categorization (TC) is the task of automatically
classifying unlabelled natural language documents into
a predefined set of semantic categories. As the first and a
vital step, text representation converts the content of a
textual document into a compact format so that the
document can be recognized and classified by a computer
or a classifier. In the vector space model (VSM), the content
of a document is represented as a vector in the term space,
i.e., d = (w1, ..., wk), where k is the size of the term (feature) set.
Terms can be at various levels, such as syllables, words,
phrases, or any other complicated semantic and/or syntactic
indexing units used to identify the contents of a text.
Different terms have different importance in a text; thus, a
weight wi (usually between 0 and 1) indicates how much
the term ti contributes to the semantics of document d.
Term weighting is therefore an important step for improving
the effectiveness of TC by assigning appropriate weights
to terms.
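To make the VSM representation concrete, here is a minimal Python sketch (not from the paper); the toy documents and the simple relative-frequency weighting are illustrative assumptions:

# Minimal vector space model sketch: map each document to a vector
# d = (w1, ..., wk) over a fixed term (feature) set of size k.
# The toy corpus and relative-frequency weights are assumptions
# for illustration only.
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = sorted({t for d in docs for t in d.split()})  # term set, size k

def to_vector(text):
    counts = Counter(text.split())
    total = sum(counts.values()) or 1
    # weight w_i in [0, 1]: relative frequency of term t_i in the document
    return [counts[t] / total for t in vocabulary]

vectors = [to_vector(d) for d in docs]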
TERM WEIGHTING METHODS: A BRIEF REVIEW
In text representation, terms are words, phrases, or any
other indexing units used to identify the contents of a text.
However, no matter which indexing unit is in use, each
term in a document vector must be associated with a value
(weight), which measures the importance of this term and
denotes how much this term contributes to the categorization
task of the document. In this section, we review a
number of traditional term weighting methods and the
state-of-the-art supervised term weighting methods.
Three Factors for Term Weighting Assignment
Salton and Buckley [11] discussed three considerations for
the assignment of appropriate weights to single terms in the
IR field. First, term occurrences (tf) in one document appear to
closely represent the content of the document. Second, the
term frequency alone may not have the discriminating
power to separate the relevant documents from the
irrelevant ones. Therefore, an idf factor has been
proposed to increase the term's discriminating power for IR
purposes. In general, the two factors, tf and idf, are
combined by a multiplication operation and are thought
to improve both recall and precision measures. Third, to take
the effect of document length into consideration, a cosine
normalization factor is incorporated to equalize the length
of the documents.
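The three factors can be combined in a short Python sketch of tf.idf weighting with cosine normalization; the toy corpus and whitespace tokenization are assumptions made for illustration:

# Sketch of the three weighting factors from Salton and Buckley [11]:
# tf (within-document frequency), idf = log(N / n_i), and cosine
# normalization so that document length does not dominate.
import math
from collections import Counter

corpus = [doc.split() for doc in
          ["the cat sat", "the dog ran", "a cat and a dog"]]
N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc))  # n_i for each term

def tf_idf_cosine(doc):
    tf = Counter(doc)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf}  # tf x idf
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}        # cosine-normalized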
Traditional Term Weighting Methods
As mentioned before, the traditional term weighting
methods for TC are usually borrowed from IR and belong to
the unsupervised term weighting methods. The simplest
one is the binary representation. The most popular one is tf.idf,
proposed by Jones (first appearing in [15] and reprinted in
[16]). Note that the tf factor here also has various variants, such as
raw term frequency, log(tf), log(tf + 1), or log(tf) + 1.
In addition, the idf factor (usually computed as log(N/ni))
also has a number of variants, such as log(N/ni + 1),
log(N/ni) + 1, and log(N/ni - 1) (i.e., idf prob, equivalent to
log((N - ni)/ni)), etc.
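The listed tf and idf variants transcribe directly into code; the following Python sketch is only a side-by-side reference, not an implementation from the paper:

# The tf and idf variants listed above, side by side.
import math

def tf_variants(tf):
    return {
        "raw": tf,
        "log(tf)": math.log(tf) if tf > 0 else 0.0,
        "log(tf + 1)": math.log(tf + 1),
        "log(tf) + 1": math.log(tf) + 1 if tf > 0 else 0.0,
    }

def idf_variants(N, n_i):
    return {
        "log(N / n_i)": math.log(N / n_i),
        "log(N / n_i + 1)": math.log(N / n_i + 1),
        "log(N / n_i) + 1": math.log(N / n_i) + 1,
        # idf prob; undefined when n_i = N (the argument becomes 0)
        "idf prob: log(N / n_i - 1)": math.log(N / n_i - 1),
    }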
Interestingly, tf.idf has a variant known as BM25 (see
[17]) in the IR literature. To understand BM25, let
us first take a look at the Robertson and Sparck Jones (RSJ)
weight. RSJ (see [18]) is also known as the relevance weighting
derived from the classical probabilistic model for IR.
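For readers unfamiliar with these weights, below is a hedged Python sketch of BM25 and RSJ in their common textbook forms; the 0.5 smoothing constants and the defaults k1 = 1.2 and b = 0.75 are conventional choices, not values taken from [17] or [18]:

# Textbook-form BM25 term weight and RSJ relevance weight.
import math

def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    # dl: document length, avgdl: average document length in the collection
    idf = math.log((N - df + 0.5) / (df + 0.5))  # BM25's idf component
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

def rsj_weight(r, R, n, N):
    # r: relevant docs containing the term, R: relevant docs,
    # n: docs containing the term, N: collection size (0.5 smoothing)
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))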
A New Supervised Term Weighting
Scheme: tf.rf
The basic idea behind our intuition is quite simple:
the more a high-frequency term is concentrated in the
positive category rather than the negative category, the
greater its contribution to selecting the positive samples
from the negative samples.
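A minimal Python sketch of the tf.rf weight, assuming the paper's definition rf = log(2 + a / max(1, c)), where a and c count the positive- and negative-category documents containing the term (the base-2 logarithm is an assumption here):

# tf.rf sketch: the relevance frequency rf rewards terms that are
# concentrated in the positive category.
import math

def rf(a, c):
    # a: positive-category documents containing the term
    # c: negative-category documents containing it; max(1, c) avoids
    # division by zero when the term never occurs in the negative category
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    return tf * rf(a, c)

# Example: a term in 50 positive and 2 negative documents gets
# rf = log2(2 + 25), much larger than a term spread evenly across both.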
Although the above supervised term weighting factors
take the document distribution into account, they are not
always consistent with this intuition. Specifically,
several supervised term weighting factors discussed in the
previous section are symmetric with respect to the positive
and negative categories.