02-01-2013, 10:38 AM
Privacy-Aware Collaborative Spam Filtering
Abstract
While the concept of collaboration provides a natural defense against massive spam e-mails directed at large numbers of
recipients, designing effective collaborative anti-spam systems raises several important research challenges. First and foremost, since
e-mails may contain confidential information, any collaborative anti-spam approach has to guarantee strong privacy protection to the
participating entities. Second, the continuously evolving nature of spam demands that collaborative techniques be resilient to various
kinds of camouflage attacks. Third, the collaboration has to be lightweight, efficient, and scalable. Toward addressing these
challenges, this paper presents ALPACAS—a privacy-aware framework for collaborative spam filtering. In designing the ALPACAS
framework, we make two unique contributions. The first is a feature-preserving message transformation technique that is highly
resilient against the latest kinds of spam attacks. The second is a privacy-preserving protocol that provides enhanced privacy
guarantees to the participating entities. Our experimental results on a real e-mail data set show that the proposed
framework provides a 10-fold improvement in the false negative rate over the Bayesian-based Bogofilter when faced with one of the
recent kinds of spam attacks. Further, the privacy breaches are extremely rare. This demonstrates the strong privacy protection
provided by the ALPACAS system.
INTRODUCTION
Statistical filtering (especially Bayesian filtering) has long
been a popular anti-spam approach, but spam continues
to be a serious problem for the Internet community. Recent spam
attacks pose strong challenges to statistical filters,
which highlights the need for a new anti-spam approach.
The economics of spam dictates that the spammer has to
target several recipients with identical or similar e-mail
messages. This makes collaborative spam filtering a natural
defense paradigm, wherein a set of e-mail clients share their
knowledge about recently received spam e-mails, providing a
highly effective defense against a substantial fraction of spam
attacks. Also, knowledge sharing can significantly alleviate
the burden of frequently training stand-alone spam filters.
However, any large-scale collaborative anti-spam approach
is faced with a fundamental and important challenge,
namely ensuring the privacy of the e-mails among untrusted e-mail
entities. Unlike e-mail service providers such as
Gmail or Yahoo Mail, which use spam or ham (non-spam)
classifications from all of their users to classify new messages,
cross-enterprise collaboration makes privacy a major concern,
especially at a large scale. The idea of collaboration implies
that the participating users and e-mail servers have to share
and exchange information about the e-mails (including the
classification result).
MOTIVATION AND PRIOR WORK
Researchers have proposed many spam resistance approaches
including white and black lists [5], statistical
filtering [6], network analysis [7], [8], and sender authentication
[9]. A single commercial product often employs
many of these approaches concurrently.
Limitations of Statistical Filtering Techniques
Statistical filtering is currently the predominant anti-spam
approach. The central idea of all statistical filters is to assign
each word (more generally, each token) a spam likelihood
value and a ham likelihood value and classify e-mails based
on the likelihood values of the words appearing in them.
The naive Bayesian classifier, which is a popular machine
learning-based statistical filter, generates the spam and ham
likelihood values of the tokens based on the statistics of
their appearances in a set of training data. For each newly
arriving message, this technique calculates a score based on
the spam and ham likelihood values of its tokens, which is
then used for classifying the message.
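The scoring step described above can be illustrated with a minimal sketch. It is not the implementation of Bogofilter or of any specific filter: the whitespace tokenizer, the Laplace smoothing, the log-odds score, and the four training messages are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many training messages contain each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        totals[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, totals

def spam_score(text, counts, totals):
    """Log-odds that a message is spam; a positive value favors spam."""
    score = math.log(totals["spam"] / totals["ham"])  # class prior
    for token in set(text.lower().split()):
        # Laplace smoothing (an assumption) keeps unseen tokens from
        # producing a zero likelihood.
        p_spam = (counts["spam"][token] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][token] + 1) / (totals["ham"] + 2)
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data, for illustration only.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
counts, totals = train(training)
print(spam_score("buy cheap pills", counts, totals) > 0)
print(spam_score("monday meeting agenda", counts, totals) > 0)
```

With this toy data the first message scores above zero and the second below zero, so a threshold at zero would separate them.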
With a significant amount of research effort devoted to
improving their accuracy, statistical filters have been reasonably
successful in filtering traditional types of spam
messages when they are trained with sufficient data.
However, these stand-alone statistical filters suffer from
two major limitations. First, statistical filters are highly
vulnerable to a class of attacks that are intended to
confuse them by appending ham-like material or reducing
the spam words in the e-mails. For example, in the good
word attack, the spammer appends large numbers of good
words (those that appear mostly in ham messages) to the
end of spam e-mails, thereby misleading the statistical
filters to classify them as ham. Similarly, Picospams are
extremely small e-mail messages, and they hardly contain
any word that can be used by statistical filters for
classification.
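A small sketch may help show why both attacks work against a token-based score. The per-token counts below are invented for illustration, and the smoothed log-odds score is an assumed stand-in for a real statistical filter.

```python
import math
from collections import Counter

# Hypothetical statistics: in how many of 100 spam and 100 ham training
# messages each token appeared. These numbers are made up.
SPAM_COUNTS = Counter({"viagra": 60, "cheap": 50, "meeting": 2,
                       "agenda": 1, "thanks": 5, "regards": 4})
HAM_COUNTS = Counter({"viagra": 0, "cheap": 3, "meeting": 40,
                      "agenda": 30, "thanks": 55, "regards": 50})
N_SPAM = N_HAM = 100

def spam_score(tokens):
    """Sum of smoothed per-token log-odds; positive favors spam."""
    score = 0.0
    for token in set(tokens):
        p_spam = (SPAM_COUNTS[token] + 1) / (N_SPAM + 2)
        p_ham = (HAM_COUNTS[token] + 1) / (N_HAM + 2)
        score += math.log(p_spam / p_ham)
    return score

spam = ["cheap", "viagra"]
good_words = ["meeting", "agenda", "thanks", "regards"]
padded = spam + good_words               # good word attack
picospam = ["http://example.invalid/x"]  # only an unseen token

print(spam_score(spam) > 0)        # plain spam is flagged
print(spam_score(padded) < 0)      # appended good words flip the score
print(spam_score(picospam) == 0)   # no known token, so no evidence
```

In this toy setting the four appended ham-like tokens outweigh the two spam tokens, and the picospam contributes no evidence at all because none of its tokens appear in the training statistics.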
Privacy-Aware Data Management
Recently, there has been considerable research on privacy
and trust issues in data management [22], [23], [24], [25],
[26]. Data perturbation [22] and data anonymization [27],
[23], [24] are the two basic approaches for ensuring privacy
of relational data. Researchers have also proposed various
privacy-aware schemes for sharing information among
independent databases [28]. Further, the problems of
privacy-preserving query computation and data mining
have also received considerable research attention [29], [30].
However, most of these schemes cannot be used for the
collaborative spam filtering application, as the underlying
data is essentially textual in nature.
Privacy-Preserving Collaboration Protocol
A feature-preserving fingerprint is just one level of privacy
protection; the amount of information exchanged during
collaboration can be further controlled for stronger privacy
protection. In particular, we design the collaborative anti-spam
system with a privacy-aware message exchange
protocol based on the following spam/ham dichotomy:
revealing the contents of a spam e-mail does not affect the
privacy or confidentiality of the participants, whereas revealing
information about a ham e-mail constitutes a privacy breach.
Our protocol works as follows: When an agent EAj
receives a message Ma, EAj computes its TFSet,
TFSet(Ma). It then sends a query message to other e-mail
agents in the system to check whether they can provide
any information related to Ma. However, instead of
sending the entire TFSet(Ma) as the query message to all
agents, EAj sends a small subset of TFSet(Ma) to a few
other e-mail agents (the e-mail agents to which the query is
sent are determined on the basis of the underlying structure;
see Section 3.3). The subsets of TFSet(Ma) included in the
queries sent to various other e-mail agents need not be the
same (our architecture optimizes the communication
costs by sending nonoverlapping subsets to carefully
chosen e-mail agents).
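The query step can be sketched as follows. This is only an illustration under stated assumptions: `plan_queries`, `subset_size`, and `fanout` are hypothetical names, the fingerprint values are made up, and the agents are picked at random as a placeholder because the text says the real choice depends on the system structure.

```python
import random

def plan_queries(tfset, agents, subset_size, fanout, seed=None):
    """Assign nonoverlapping subsets of TFSet(Ma) to a few agents.

    tfset: fingerprint values of the message (contents are hypothetical).
    agents: candidate e-mail agents; random choice is a placeholder for
            the structure-based selection mentioned in the text.
    """
    rng = random.Random(seed)
    values = sorted(tfset)
    rng.shuffle(values)
    chosen = rng.sample(list(agents), min(fanout, len(agents)))
    plan = {}
    for i, agent in enumerate(chosen):
        # Consecutive slices of the shuffled list never overlap.
        subset = values[i * subset_size:(i + 1) * subset_size]
        if subset:
            plan[agent] = subset
    return plan

# Made-up fingerprint values and agent names.
tfset = {0x1A2B, 0x3C4D, 0x5E6F, 0x7081, 0x92A3, 0xB4C5, 0xD6E7, 0xF809}
agents = ["EA1", "EA2", "EA3", "EA4", "EA5"]
plan = plan_queries(tfset, agents, subset_size=2, fanout=3, seed=7)
for agent, subset in plan.items():
    print(agent, [hex(v) for v in subset])
```

Here no single agent sees more than two of the eight values, and two values are never sent at all, which is the kind of exposure limit the protocol aims for.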
Distributed System Structure
As in many other distributed systems, the underlying
architecture has a strong influence on the efficiency,
scalability, and performance of the ALPACAS system.
However, an aspect that is unique to this problem is that
the underlying architecture can also have a significant
impact on the privacy of the participating e-mail agents. For
example, sending queries to too many e-mail agents
increases the risk of inference-based privacy breaches.
A naive approach for designing the ALPACAS system is
to use a flat and unstructured organization [35] in which
every agent maintains its knowledge base of the spam
information that it has received. In this case, an agent would
need to query all other collaborative agents to classify each
incoming message. Thus, the flat structure is inefficient and
unscalable. It also carries a high risk of privacy breaches,
as the FEs of a message are delivered to virtually every
participant in the system.
Resilience to the Compromises
It is possible for an attacker to intrude into the ALPACAS system
and diminish its spam-filtering capability by deliberately
sending deceitful responses to queries or by simply not
responding to queries at all. ALPACAS must
remain resilient to these attacks, within a certain range,
even if some of the participants are compromised.
In this section, we study two scenarios of such attacks:
1) Quiescent response, where the intruder compromises
participating entities and refuses to answer the queries
from peers and 2) Adverse response, where the intruder
compromises participating entities and deliberately sends
misleading records back to the peers (i.e., sends ham records
in response to a query for spam records, or spam records in
response to a query for ham records).
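The difference between the two scenarios can be made concrete with a toy model. The excerpt does not say how an agent combines the answers it receives, so the majority vote below is purely an assumption, as is the simplification that every honest agent holds a matching record.

```python
def classify(truth, n_agents, n_compromised, mode):
    """Combine peer answers by majority vote (an assumed rule).

    truth: the real class of the queried message, "spam" or "ham".
    mode: "quiescent" (compromised agents stay silent) or
          "adverse" (compromised agents return the opposite class).
    """
    votes = []
    for i in range(n_agents):
        if i < n_compromised:
            if mode == "quiescent":
                continue  # no response at all
            votes.append("ham" if truth == "spam" else "spam")
        else:
            votes.append(truth)  # honest agents answer correctly
    spam_votes = votes.count("spam")
    ham_votes = len(votes) - spam_votes
    if spam_votes == ham_votes:
        return "unknown"  # no answers, or a tie
    return "spam" if spam_votes > ham_votes else "ham"

print(classify("spam", 10, 4, "quiescent"))   # silent agents only shrink the vote
print(classify("spam", 10, 4, "adverse"))     # a minority of liars is outvoted
print(classify("spam", 10, 6, "adverse"))     # a majority of liars flips the result
print(classify("spam", 10, 10, "quiescent"))  # nobody answers
```

Under this assumed rule, quiescent agents only reduce the amount of available evidence, while adverse agents actively cancel honest answers, so the same number of compromised participants does more damage in the adverse case.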
CONCLUSION
In this paper, we have presented the design and evaluation
of ALPACAS, a privacy-aware collaborative spam filtering
framework that provides strong privacy guarantees to the
participating e-mail recipients. Our system has two novel
features: 1) a feature-preserving transformation technique that
encodes the important characteristics of the e-mail into a set
of hash values such that it is computationally impossible to
reverse engineer the original e-mail and 2) a privacy-preserving
protocol that enables the participating entities to
share information about spam/ham messages while protecting
them from inference-based privacy breaches. Our
initial experiments show that the ALPACAS approach is
very effective in filtering spam, has high resilience toward
various attacks, and provides strong privacy protection to
the participating entities.