20-08-2012, 03:24 PM
Optimized Near-Duplicate Matching Scheme for E-mail Spam Detection
Abstract
Spam e-mail is one of the major problems that e-mail users face today. In recent years, many schemes have been developed to detect spam e-mails. The primary idea of the similarity matching scheme for spam detection is to maintain a database of known spam, built from user feedback, and to use it to block subsequent near-duplicate spams. We propose a novel e-mail abstraction scheme that represents e-mails by their layout structure. We present a procedure to generate the e-mail abstraction from the HTML content of an e-mail; this newly devised abstraction can capture the near-duplicate phenomenon of spams more effectively. Moreover, we design a complete spam detection system, Cosdes (Collaborative Spam Detection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. To detect duplicate and near-duplicate spam mails quickly in Cosdes, we propose a new approach, SimHash.
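The abstract names SimHash without describing it. As a rough illustration only, SimHash is commonly understood as a fingerprinting technique in which similar inputs yield fingerprints that differ in only a few bits, so near-duplicates can be found by comparing Hamming distances. The sketch below is a generic version and not the scheme used in Cosdes: the function names, the 64-bit width, the use of MD5 as the per-token hash, and the idea of feeding it a list of string tokens are all assumptions made for this example.

```python
import hashlib

def simhash(tokens, bits=64):
    """Return a `bits`-wide SimHash fingerprint for a list of string tokens.

    `bits` is assumed to be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    counts = [0] * bits
    for tok in tokens:
        # Hash each token to a stable integer (MD5 is used only for illustration).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the positive votes outnumber the negative ones.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two mails would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below some threshold; the threshold is a tuning choice that the text above does not specify.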
Introduction
The Internet is very widely used, and e-mail is among its most widely used services. E-mail plays a major role in communication between people. Users cannot easily identify duplicate and near-duplicate documents, and on the web such documents create further problems for search engines: they increase the space required to store the index, slow down the delivery of search results, and annoy users. Much of the data available on the Internet consists of short texts, such as mobile phone short messages, instant messages, chat logs, and BBS titles.
Preliminaries
Near Duplicate
Near-duplicate spam detection exploits reported spams to subsequently block those with similar content. The definition of similarity between two e-mails differs for different forms of e-mail representation. Instead of representing e-mails based mainly on content text, we represent each e-mail using an HTML tag sequence, which depicts the layout structure of the e-mail, and expect this to capture the near-duplicate phenomenon of spams more effectively.
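To make the idea of an HTML tag sequence concrete, here is a minimal sketch using Python's standard `html.parser` module. It simply records opening and closing tags in document order and discards the text. This is a simplification for illustration: the full abstraction procedure is not described in this section, and the names `TagSequenceParser` and `tag_sequence` are hypothetical.

```python
from html.parser import HTMLParser

class TagSequenceParser(HTMLParser):
    """Collect the opening and closing tags of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; only the tag name contributes to the layout.
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")

def tag_sequence(html):
    """Return the list of tags in `html`, ignoring all text content."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags

# Two mails with different wording but the same layout give the same sequence.
mail_a = "<html><body><p>Buy cheap pills</p><a href='http://a.example'>here</a></body></html>"
mail_b = "<html><body><p>Best pill offers</p><a href='http://b.example'>click</a></body></html>"
same_layout = tag_sequence(mail_a) == tag_sequence(mail_b)
```

Because the text is ignored, rewording the message or changing the link target does not alter the sequence, which is the property the layout-based representation relies on.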
Related Works
Since the e-mail spam problem is increasingly serious, various techniques have been explored to solve it. They can be grouped into three categories: 1) content-based methods, 2) non-content-based methods, and 3) others. In the first category, researchers analyze e-mail content text and model the problem as a binary text classification task. Representative solutions in this category are Naive Bayes and Support Vector Machine (SVM) methods. Naive Bayes methods train a probability model using classified e-mails, and each word in an e-mail is given a probability of being a suspicious spam keyword. SVMs are a supervised learning method with outstanding performance on text classification tasks. Markov random field models, neural networks, and logistic regression, as well as certain specific features such as URLs and images, have also been taken into account for spam detection. The second category attempts to exploit non-content information, such as e-mail headers, e-mail social networks, and e-mail traffic, to filter spams. For example, notorious and innocent sender addresses (or IP addresses) are collected from e-mail headers to create a blocked list and an allowed list.
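As an illustration of the Naive Bayes idea described above, the sketch below trains per-class word counts on labelled messages and scores a new message by summing per-word log-probability ratios. It is a toy version under stated assumptions: whitespace tokenization, add-one (Laplace) smoothing, and the function names `train` and `spam_log_odds` are choices made for this example, not details taken from any of the cited methods.

```python
import math
from collections import Counter

def train(messages):
    """Build word and message counts from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts

def spam_log_odds(text, word_counts, class_counts):
    """Return log P(spam | text) - log P(ham | text); positive means spam-like."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    # Class prior, smoothed so that an empty class does not cause a division by zero.
    score = math.log((class_counts[True] + 1) / (class_counts[False] + 1))
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_ham = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Tiny made-up training set, for illustration only.
training = [
    ("cheap pills buy now", True),
    ("win cheap prize", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
word_counts, class_counts = train(training)
```

With this toy data, `spam_log_odds("cheap pills", word_counts, class_counts)` is positive and `spam_log_odds("monday meeting", word_counts, class_counts)` is negative, matching the intuition that each word contributes evidence for or against the spam class.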
Challenges To Detect Spam E-Mails
In this day and age, spammers are becoming more and more sophisticated. They are finding ways to trick people into thinking their unsolicited junk messages are worth the time spent reading them. While many users are savvy enough to tell what is real from what is bogus in their electronic correspondence, many others take what they receive at face value and open it. That is understandable, because the senders of electronic junk mail are sometimes clever enough to pull the wool over our eyes. It is in the best interests of your computer's health and your sanity to learn how to tell whether an e-mail is spam or genuine. We researched this topic extensively and generated a list of the top five ways to tell if an e-mail is spam. These rules can help you when spam slips through the protection of your spam filter.