19-10-2012, 10:45 AM
Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme
Cosdes.pdf (Size: 1.43 MB / Downloads: 40)
Abstract
E-mail communication is indispensable nowadays, but the e-mail spam problem continues growing drastically. In recent
years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary
idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block
subsequent near-duplicate spams. On purpose of achieving efficient similarity matching and reducing storage utilization, prior works
mainly represent each e-mail by a succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot
fully catch the evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this paper, we propose a
novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a procedure to generate the
e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can more effectively capture the near-duplicate
phenomenon of spams. Moreover, we design a complete spam detection system Cosdes (standing for COllaborative Spam DEtection
System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update
scheme enables system Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live
data set collected from a real e-mail server and show that our system outperforms the prior approaches in detection results and is
applicable to the real world.
INTRODUCTION
E-MAIL communication is prevalent and indispensable
nowadays. However, the threat of unsolicited junk emails,
also known as spams, becomes more and more
serious. According to a survey by the website TopTenREVIEWS
[11], 40 percent of e-mails were considered as spams
in 2006. The statistics collected by MessageLabs1 show that
recently the spam rate is over 70 percent and persistently
remains high. The primary challenge of spam detection
problem lies in the fact that spammers will always find new
ways to attack spam filters owing to the economic benefits of
sending spams. Note that existing filters generally perform
well when dealing with clumsy spams, which have
duplicate content with suspicious keywords or are sent
from an identical notorious server. Therefore, the next stage
of spam detection research should focus on coping with
cunning spams which evolve naturally and continuously.
Related Works
Since the e-mail spam problem is increasingly serious
nowadays, various techniques have been explored to relieve
the problem. Based on what features of e-mails are being
used, previous works on spam detection can be generally
classified into three categories: 1) content-based methods,
2) noncontent-based methods, and 3) others. Initially,
researchers analyze e-mail content text and model this
problem as a binary text classification task. Representatives
of this category are Naive Bayes [14], [20] and Support
Vector Machines (SVMs) [1], [10], [15], [27] methods. In
general, Naive Bayes methods train a probability model
using classified e-mails, and each word in e-mails will be
given a probability of being a suspicious spam keyword. As
for SVMs, it is a supervised learning method, which
possesses outstanding performance on text classification
tasks. Traditional SVMs [10] and improved SVMs [1], [15],
[27] have been investigated. While above conventional
machine learning techniques have reported excellent results
with static data sets, one major disadvantage is that it is
cost-prohibitive for large-scale applications to constantly
retrain these methods with the latest information to adapt to
the rapid evolving nature of spams. The spam detection of
these methods on the e-mail corpus with various language
has been less studied yet. In addition, other classification
techniques, including markov random field model [3],
neural network [6] and logic regression [2], and certain
specific features, such as URLs [26] and images [19], [29]
have also been taken into account for spam detection.
Design of SpTable and SpTrees
One major focus of this work is to design the innovative
data structure to facilitate the process of near-duplicate
matching. SpTable and SpTrees (sp stands for spam) are
proposed to store large amounts of the e-mail abstractions
of reported spams. As shown in Fig. 4, several SpTrees are
the kernel of the database, and the e-mail abstractions of
collected spams are maintained in the corresponding
SpTrees. According to Definition 3, two e-mail abstractions
are possible to be near-duplicate only when the numbers of
their tags are identical. Thus, if we distribute e-mail
abstractions with different tag lengths into diverse SpTrees,
the quantity of spams required to be matched will decrease.
However, if each SpTree is only mapped to one single tag
length, it is too much of a burden for a server to maintain
such thousands of SpTrees. In view of this concern, each
SpTree is designed to take charge of e-mail abstractions
within a range of tag lengths. As can be seen in Fig. 4,
SpTable is created to record overall information of SpTrees.
PERFORMANCE EVALUATION
To assess the feasibility of system Cosdes,we conduct several
experiments to explore its efficiency and detection results.
The real spam data sets used in the experiments are from the
e-mail servers of Computer Center in National Taiwan
University, which has over 30,000 students. Since the ground
truth of real e-mail streams is unavailable, spams are
extracted from the well-known existing system, SpamAssassin.
3 Concerning hams, we not only include public data sets
(around 4,000 e-mails) provided by SpamAssassin,4 but also
obtain from volunteers. There are about 60,000 spams per day
and a set of 7,000 or so hams in the data set. Note that
numerous related works have evaluated the proposed
methods with static databases. However, to access the
performance of spam detection system with near-duplicate
matching scheme, real e-mail streams are more appropriate
than static data sets. Therefore, in this paper, we use
university-scale e-mail streams as the experimental data sets
to better simulate the e-mail environment. On the other hand,
three representative approaches [7], [24], [30] of nearduplicate
spam detection are employed for comparison.
The authors of [8], [17] also adopt the same e-mail
representation approach as in [7] but with different sharing
mechanisms. For ease of presentation, Damiani’s work is
abbreviated as Digest.
CONCLUSION
In the field of collaborative spam filtering by near-duplicate
detection, a superior e-mail abstraction scheme is required to
more certainly catch the evolving nature of spams. Compared
to the existing methods in prior research, in this paper,
we explore a more sophisticated and robust e-mail abstraction
scheme, which considers e-mail layout structure to
represent e-mails. The specific procedure SAG is proposed to
generate the e-mail abstraction using HTML content in
e-mail, and this newly-devised abstraction can more
effectively capture the near-duplicate phenomenon of
spams.