26-11-2012, 05:24 PM
A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME
ABSTRACT
E-mail communication is indispensable nowadays, but the e-mail spam problem continues to grow drastically. In recent years, the notion of collaborative spam filtering with a near-duplicate similarity matching scheme has been widely discussed.
The primary idea of the similarity matching scheme for spam detection is to maintain a database of known spam, built from user feedback, and use it to block subsequent near-duplicate spam.
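The idea can be illustrated with a minimal Python sketch. It is not the scheme from any particular prior work: the `text_abstraction` function, the `KnownSpamDatabase` class, and the choice to hash only the lower-cased alphabetic words are all assumptions made for illustration. The point is that e-mails differing only in digits, case, or punctuation map to the same stored abstraction.

```python
import hashlib
import re


def text_abstraction(body: str) -> str:
    """Hypothetical content-text abstraction (an assumption, not a published scheme).

    Keeps only lower-cased alphabetic words, so digits, case, and
    punctuation do not affect the resulting digest.
    """
    words = re.findall(r"[a-z]+", body.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


class KnownSpamDatabase:
    """Stores abstractions of e-mails that users have reported as spam."""

    def __init__(self):
        self._abstractions = set()

    def report_spam(self, body: str) -> None:
        # User feedback: remember the abstraction, not the full e-mail.
        self._abstractions.add(text_abstraction(body))

    def is_spam(self, body: str) -> bool:
        # An incoming e-mail is blocked if its abstraction is already known.
        return text_abstraction(body) in self._abstractions


db = KnownSpamDatabase()
db.report_spam("Buy cheap pills NOW!!! Code 1234")
print(db.is_spam("buy cheap pills now... code 9981"))
print(db.is_spam("Meeting moved to Friday"))
```

Storing only a digest keeps the database small, but a spammer who changes even one word defeats this particular abstraction.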
To achieve efficient similarity matching and reduce storage utilization, prior works mainly represent each e-mail by a succinct Abstraction derived from the e-mail content text. However, these Abstractions cannot fully capture the evolving nature of spam, and are thus not effective enough in near-duplicate detection.
In this paper, we propose a novel e-mail Abstraction scheme, which uses the e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail Abstraction from the HTML content of an e-mail, and this newly devised Abstraction can more effectively capture the near-duplicate phenomenon of spam.
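As a rough illustration of a layout-based Abstraction, the sketch below reduces an HTML body to its sequence of opening tag names and ignores text and attribute values. This is only an assumed simplification, not the procedure defined in the paper; `layout_abstraction` and `_TagCollector` are hypothetical names, and the parser is Python's standard `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text content are deliberately ignored.
        self.tags.append(tag)


def layout_abstraction(html_body: str) -> tuple:
    """Hypothetical layout abstraction: the sequence of opening tag names."""
    collector = _TagCollector()
    collector.feed(html_body)
    collector.close()
    return tuple(collector.tags)


first = layout_abstraction(
    '<html><body><p>Cheap watches</p>'
    '<a href="http://a.example/1">click</a></body></html>'
)
second = layout_abstraction(
    '<html><body><p>Ch3ap w@tches!!</p>'
    '<a href="http://b.example/2">cl1ck</a></body></html>'
)
print(first == second)
```

Two messages with obfuscated wording and different link targets still share the same tag sequence, which is the kind of near-duplicate a text-based Abstraction may miss.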
Moreover, we design a complete spam detection system, Cosdes (COllaborative Spam DEtection System), which provides an efficient near-duplicate matching scheme and a progressive update scheme.
The progressive update scheme enables Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set collected from a real e-mail server and show that our system outperforms prior approaches in detection results and is practical for real-world use.
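The abstract does not say how the progressive update scheme works. One simple way to keep a spam database current is a sliding time window, sketched below purely as an assumption: `ProgressiveSpamStore`, `max_age_seconds`, and the eviction rule are hypothetical and should not be read as the Cosdes design. The sketch also assumes that the `now` timestamps passed in never decrease.

```python
from collections import OrderedDict


class ProgressiveSpamStore:
    """Hypothetical store that forgets abstractions not reported recently."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        # Maps abstraction -> time of latest report, oldest entry first.
        self._last_reported = OrderedDict()

    def report(self, abstraction, now):
        # Re-inserting moves the entry to the newest position.
        self._last_reported.pop(abstraction, None)
        self._last_reported[abstraction] = now
        self._expire(now)

    def contains(self, abstraction, now):
        self._expire(now)
        return abstraction in self._last_reported

    def _expire(self, now):
        # Drop entries from the old end until one is within the window.
        while self._last_reported:
            oldest, reported_at = next(iter(self._last_reported.items()))
            if now - reported_at <= self.max_age:
                break
            del self._last_reported[oldest]


store = ProgressiveSpamStore(max_age_seconds=10)
store.report("layout-A", now=0)
store.report("layout-B", now=5)
print(store.contains("layout-A", now=11))
print(store.contains("layout-B", now=11))
```

Bounding the age of entries keeps the store small and stops stale spam campaigns from influencing matching, at the cost of forgetting a campaign that resumes after the window has passed.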