28-08-2014, 04:22 PM
Filtering Spam E-Mail with Generalized Additive
Neural Networks
Abstract—Some of the major security risks associated with
spam e-mail are the spreading of computer viruses and the
facilitation of phishing exercises. Spam is therefore regarded as
one of the prominent security threats in modern organizations.
Security controls, such as spam filtering techniques, have become
increasingly important to protect information and information
assets. In this paper the performance of a Generalized Additive Neural Network on a publicly available e-mail corpus
is investigated in the context of statistical spam filtering. The
neural network is compared to a Naive Bayesian classifier and a
Memory-based technique. Generalized Additive Neural Networks
have a number of advantages compared to neural networks in
general. An automated construction algorithm performs feature
and model selection simultaneously and produces results which
can be interpreted by a graphical method.
I. INTRODUCTION
The Internet has promoted new channels of communication
which allow an e-mail to be sent to people thousands of
kilometers away [1]. Sending messages by e-mail has a
number of advantages including high reliability, relatively low
transmission costs, generally fast delivery and the ability to
be automated [2]. These strengths enable almost free mass e-mailing which can reach hundreds of thousands of users
within seconds. Unfortunately, this freedom of communication
can be exploited. Over the last number of years a situation
has been reached where users’ mailboxes have been flooded
with unwanted messages. This deluge of spam has escalated
to the point where the viability of communication via e-mail
is threatened.
II. GENERALIZED ADDITIVE NEURAL NETWORKS
To arrive at a spam filter, a decision function f must be
obtained that classifies a given e-mail message m as spam (S)
or legitimate mail (L) [2]. If the set of all e-mail messages is
denoted by M, a search for the function f : M → {S, L} is
performed by supervised learning.
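The search for f can be sketched as a standard supervised-learning setup: learn from labelled messages, then apply the learned decision function to unseen mail. The toy corpus, the whitespace tokenizer, and the Naive-Bayes-style scoring below are illustrative assumptions, not the paper's actual pipeline:

```python
import math
from collections import Counter

def tokenize(message):
    return message.lower().split()

def train(corpus):
    """corpus: list of (message, label) pairs with label in {'S', 'L'}."""
    counts = {'S': Counter(), 'L': Counter()}
    priors = Counter()
    for message, label in corpus:
        counts[label].update(tokenize(message))
        priors[label] += 1
    return counts, priors

def f(message, model):
    """The learned decision function f: M -> {S, L}."""
    counts, priors = model
    vocab = len(set(counts['S']) | set(counts['L']))
    best, best_score = None, float('-inf')
    for label in ('S', 'L'):
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for token in tokenize(message):
            # Laplace smoothing keeps unseen tokens from zeroing the score
            score += math.log((counts[label][token] + 1) / (total + vocab))
        if score > best_score:
            best, best_score = label, score
    return best

corpus = [("buy cheap pills now", 'S'),
          ("free money click here", 'S'),
          ("meeting agenda for monday", 'L'),
          ("draft of the linguistics paper", 'L')]
model = train(corpus)
print(f("cheap pills free", model))  # -> S
```

Any learner that maps labelled messages to such a decision function (Naive Bayes, memory-based methods, or a neural network) fits this framework; only the form of f changes.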
III. LING-SPAM CORPUS COLLECTION AND
PREPROCESSING
The Ling-Spam collection used in this paper was compiled
by [10] and is a combination of messages obtained from
the Linguist list and spam e-mail messages. The former is
a moderated mailing list about the science and profession
of linguistics. The corpus contains 2893 messages, partitioned into 2412 Linguist messages and 481 spam
messages. Cost-sensitive evaluation metrics were introduced
by [10] to get an objective picture of the performance of the
Naive Bayesian algorithm. These metrics are also utilized in
this paper since the cost of misclassification differs for the two
classes (spam and legitimate).
V. MEMORY-BASED CLASSIFICATION
Memory-based (instance-based) methods [24] store all training instances in a memory structure and use them directly
for classification. The multi-dimensional space defined by
the attributes in the instance vectors constitutes the simplest
form of memory structure. Each training instance vector is
represented in this space by a point. A variant of the simple
k-nearest-neighbour (k-nn) algorithm is normally employed by
the classification procedure. This algorithm considers the k
training instances (its k-neighbourhood) closest to the unseen
instance and assigns the majority class among these instances
to the new unseen instance.
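The classification step described above can be sketched as follows. The toy two-dimensional instance vectors and the Euclidean distance are illustrative assumptions; TiMBL supports several other distance metrics:

```python
from collections import Counter

def knn_classify(training, unseen, k):
    """training: list of (vector, label) pairs; unseen: feature vector."""
    def dist(a, b):  # Euclidean distance in the attribute space
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # the k training instances closest to the unseen instance
    neighbours = sorted(training, key=lambda vl: dist(vl[0], unseen))[:k]
    # assign the majority class among the k-neighbourhood
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((0.0, 0.1), 'L'), ((0.2, 0.0), 'L'),
            ((0.9, 1.0), 'S'), ((1.0, 0.8), 'S')]
print(knn_classify(training, (0.95, 0.9), k=3))  # -> S
```

Note that this simple version truncates to exactly k neighbours; variants that keep all instances tied at the k-th distance can produce the very large neighbourhoods discussed in the experimental results.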
The memory-based classification algorithm implemented in
the Tilburg Memory-Based Learner (TiMBL) open source
software package was utilized by [10].
VI. CLASSIFICATION PERFORMANCE MEASURES
Classification performance is frequently measured in terms
of accuracy (Acc) or error rate (Err = 1 − Acc) [10].
Suppose NL and NS denote the total number of legitimate
and spam messages respectively to be classified by the filter
and nA→B the number of messages belonging to category
A that the filter classifies as belonging to category B, with
(A, B) ∈ {L, S}. Accuracy and error rate can then be defined
as
Acc = (nL→L + nS→S) / (NL + NS) and Err = (nL→S + nS→L) / (NL + NS).
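In code, both measures follow directly from the four confusion counts; the counts below are made-up illustrative numbers, not results from the paper:

```python
# Accuracy and error rate from the four confusion counts n_{A->B}.
# The counts used below are illustrative, not results from the paper.
def acc_err(n_LL, n_LS, n_SL, n_SS):
    N_L, N_S = n_LL + n_LS, n_SL + n_SS   # totals per true class
    acc = (n_LL + n_SS) / (N_L + N_S)
    return acc, 1 - acc                   # (Acc, Err)

acc, err = acc_err(n_LL=2400, n_LS=12, n_SL=30, n_SS=451)
print(round(acc, 4), round(err, 4))       # -> 0.9855 0.0145
```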
The measurements for accuracy and error rate assign equal
weights to the two types of error (L → S and S → L), but
L → S errors are assumed to be λ times more costly than S → L errors. For evaluation
purposes, each legitimate message is handled as if it were
λ messages to make the accuracy and error rate sensitive to
the difference in cost. Therefore, when a legitimate message
goes successfully through the filter it counts as λ successes
and when it is blocked, it counts as λ errors. As a result,
the following definitions of weighted accuracy (W Acc) and
weighted error rate (WErr = 1 − W Acc) can be given:
W Acc = (λ · nL→L + nS→S) / (λ · NL + NS) and WErr = (λ · nL→S + nS→L) / (λ · NL + NS).
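The λ-weighting described above is straightforward to express in code: each legitimate message counts as λ messages in both the numerator and the denominator. The confusion counts below are made-up illustrative numbers, not results from the paper:

```python
# Weighted accuracy: each legitimate message counts as lam messages.
def weighted_acc(n_LL, n_LS, n_SL, n_SS, lam):
    N_L, N_S = n_LL + n_LS, n_SL + n_SS
    wacc = (lam * n_LL + n_SS) / (lam * N_L + N_S)
    return wacc, 1 - wacc  # (WAcc, WErr)

for lam in (1, 9, 999):   # the three scenarios considered in the paper
    wacc, _ = weighted_acc(n_LL=2400, n_LS=12, n_SL=30, n_SS=451, lam=lam)
    print(lam, round(wacc, 4))  # -> 1 0.9855, 9 0.9938, 999 0.995
```

As λ grows, WAcc is increasingly dominated by the legitimate messages, which is why a filterless baseline that retains all legitimate mail becomes harder to beat in the later scenarios.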
VII. EXPERIMENTAL RESULTS
As in [10], three experiments were performed by the AutoGANN system. These experiments correspond to the three
scenarios for the λ parameter described in Section IV. The
number of selected attributes ranged from 25 to 100 in steps of
25 for each scenario. The AutoGANN system also performed
feature selection in addition to that performed as part of
the preprocessing step.
A. Scenario 1: Labelling spam messages (λ = 1)
In the first scenario the misclassification cost is the same
for both error types. Figures 1 and 2 show the corresponding
results. All three learning techniques improve significantly on
the baseline (TCR = 1.0) and produce very accurate results.
AutoGANN outperforms the other two techniques by a large
margin over the interval [50, 100] (Figure 2).
On an individual basis, the techniques performed as follows.
The Naive Bayesian classifier achieves the best results for
100 attributes, while TiMBL performs best with a smaller
attribute set size of 50 (Figure 1). AutoGANN does best for
100 attributes. Three different values of k (1, 2, and 10) were
chosen to evaluate TiMBL’s performance. From Figure 1 it
seems the method performs best for small values of k. For
k = 10 the method improves only slightly on the base case.
This can be ascribed to the large number of ties for each of the
k = 10 distances which leads to a very large neighbourhood
taken into consideration (> 500 neighbours). For these cases
the filter approximates the default rule which classifies all
messages according to the majority class (legitimate in the case
of the Ling-Spam corpus). This also explains the insensitivity
of the method to the number of attributes for k = 10.
Compared to the other three methods, Outlook’s keyword
patterns perform very poorly.
B. Scenario 2: Notifying senders about blocked messages
(λ = 9)
For the second scenario the cost of misclassifying legitimate messages is increased by setting λ = 9. Figures 3
and 4 show the corresponding results. The most important
difference compared to the first scenario is less improvement
of the learning techniques over the baseline. As λ increases
the performance of the learning techniques relative to the
baseline decreases. This is due to the fact that without a
filter all legitimate messages are retained which becomes more
beneficial as λ increases. Consequently, it becomes harder to
“beat” the baseline. As in the first scenario, AutoGANN is
clearly superior compared to the other two classifiers (Figure
4). From Figure 3 it can be observed that Outlook’s patterns
perform below the base case (TCR < 1.0). This suggests one
is better off not utilizing the Outlook filter.
C. Scenario 3: Removing blocked messages (λ = 999)
For the last scenario a large λ value is used. From Figures
5 and 6 it can be seen that the performance of the learning
techniques decreases to such a level that any improvement on
the baseline is very hard to achieve. Accordingly, the choice
to use any filter at all becomes doubtful. Note that this high
λ value was proposed by [23]. TiMBL with k = 10 is one
exception, performing consistently above the base case
by a small margin. This is again influenced by the very large
neighbourhood, which classifies most messages as legitimate,
due to the large value of λ. The Naive Bayesian classifier
delivers better performance with 300 attributes, but this is
the only point where it outperforms the baseline. Locating
the optimal attribute size exactly is infeasible in practical
applications and therefore TiMBL for k = 2 is the preferred
choice. Abrupt fluctuations in the performance of the learning
methods can be ascribed to the misclassification of a legitimate
message which causes a very large slump in TCR. This effect
can be observed, for example, with TiMBL for k = 2 and 550 attributes.
VIII. CONCLUSIONS
Spam has become one of the major security threats in
modern organisations. To protect enterprises against risks
such as computer viruses, phishing, mailbox overloading and
unnecessary costs, efficient security controls, such as spam
filters, are necessary.