03-09-2012, 02:18 PM
Email Classification Using Data Reduction Method
Abstract
Correctly classifying user emails in the face of spam
penetration is an important issue for anti-spam researchers.
This paper presents an effective and efficient email
classification technique based on a data filtering method. In our
testing we introduce an innovative filtering technique that
uses an instance selection method (ISM) to remove pointless
data instances from the training model before classifying the
test data. The objective of ISM is to identify which instances
(examples, patterns) in email corpora should be selected as
representatives of the entire dataset, without significant loss of
information. We used the WEKA interface in our integrated
classification model and tested diverse classification algorithms.
Our empirical studies show strong performance in terms of
classification accuracy, with a reduction in false positive instances.
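To make the idea of instance selection concrete, here is a minimal sketch in Python. It is not the authors' ISM, whose details are not given in this excerpt. It swaps in a well-known textbook technique, Hart's condensed nearest neighbor rule, which keeps only the training instances needed for a 1-nearest-neighbor classifier to label the whole training set correctly. The two features and the toy data are hypothetical.

```python
from math import dist

def condensed_nearest_neighbor(points, labels):
    """Return indices of a reduced training set (Hart's CNN rule).

    An instance is added to the store only if the current store
    misclassifies it with 1-nearest-neighbor.
    """
    keep = [0]  # seed the store with the first instance
    changed = True
    while changed:
        changed = False
        for i, (p, y) in enumerate(zip(points, labels)):
            if i in keep:
                continue
            # Find the stored instance closest to p.
            nearest = min(keep, key=lambda j: dist(p, points[j]))
            if labels[nearest] != y:
                keep.append(i)  # misclassified, so it carries information
                changed = True
    return sorted(keep)

# Hypothetical features: (fraction of capital letters, number of links).
points = [(0.1, 0), (0.2, 1), (0.15, 0), (0.9, 8), (0.8, 9), (0.85, 7)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

selected = condensed_nearest_neighbor(points, labels)
print(selected)  # -> [0, 3]
```

On this toy set, one representative per class is enough to reproduce every training label, so four of the six instances are dropped.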
INTRODUCTION
The Internet is becoming an integral part of our everyday
life, and email has become a powerful tool for exchanging
ideas and information, as well as for users' commercial
and social lives. Due to the increasing volume of unwanted
email, known as spam, users as well as Internet Service
Providers (ISPs) are facing a multitude of problems. Email
spam also poses a major threat to the security of networked
systems. Email classification can help control the problem
in a variety of ways. Detecting spam and keeping it out of
the email delivery system allows end users to regain a
useful means of communication.
Much of the research on content-based email classification
has centered on the more sophisticated classifier-related
issues [10]. Currently, machine learning for email
classification is an important research issue. The success of
machine learning techniques in text categorization has led
researchers to explore learning algorithms in spam filtering
[1, 2, 3, 4, 10, 11, 13, 14]. However, it is striking that,
despite the continuing development of anti-spam services
and technologies, the number of spam messages continues to
increase rapidly.
RELATED WORKS
In recent years, many researchers have turned their
attention to the classification of spam using many different
approaches. According to the literature, classification
is considered one of the standard and commonly accepted
methods to stop spam [10]. This method is effective for the
currently encountered types of spam. The philosophy behind
it is to separate spam from legitimate emails. Anti-spam
approaches can be broadly separated into two categories:
those based on non-classification algorithms and those
based on classification algorithms.
Non-Classification algorithms
Non-classification based methods include heuristic or
rule-based methods, white-listing, black-listing, hash-based
lists and distributed black-lists. Non-classification based
solutions work well because of their simplicity and
relatively short processing time [15]. Another key attraction
is that they do not require a training period. However, in the
context of new filtering technologies and in the light of
current spamming techniques, they have several drawbacks.
Since these methods are based on standard rule sets
CONCLUSION AND FUTURE WORK
This paper presents an effective email classification
technique based on an innovative data filtering technique
applied to the training model. In our data filtering process, we
used a cluster classifier technique to remove insignificant
instances from our training model. After investigating
different classification algorithms, we chose five
classifiers based on our simulation performance and
used a meta-learning technique (AdaBoost) on top of every
classifier. Our empirical results show that we
achieved overall classification accuracy above 97%, which is
significant. In future work, we plan to extract
features from dynamic information in regular incoming
emails and pass them to our classification method to achieve
better performance.
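
For readers unfamiliar with the meta-learning step, the following is a minimal sketch of AdaBoost in Python. It is not the authors' WEKA setup: the base learner here is assumed to be a one-feature decision stump rather than any of the five classifiers the paper chose, and the single feature and toy labels (+1 for spam, -1 for ham) are hypothetical. It only illustrates how boosting reweights misclassified instances so that weak rules combine into a stronger one.

```python
import math

def train_stump(X, y, w):
    """Pick the one-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] > t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(stump, row):
    _, f, t, sign = stump
    return sign if row[f] > t else -sign

def adaboost(X, y, rounds=3):
    """Train an ensemble of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error to avoid division by zero or log(0).
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified instances, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1

# Hypothetical one-feature data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])  # -> [1, 1, -1, -1, 1, 1]
```

The first stump on its own misclassifies two of the six instances; after three boosting rounds the weighted vote classifies all six training instances correctly.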