08-11-2012, 02:04 PM
Detection of Phishing Attacks: A Machine Learning Approach
Introduction
Phishing is a form of identity theft that occurs when a malicious Web site
impersonates a legitimate one in order to acquire sensitive information such as
passwords, account details, or credit card numbers. Although several anti-phishing
tools and techniques exist for detecting potential phishing attempts in emails
and phishing content on websites, phishers keep devising new and hybrid
techniques to circumvent them.
Phishing is a deception technique that combines social engineering and
technology to gather sensitive and personal information, such as passwords and
credit card details, by masquerading as a trustworthy person or business in an
electronic communication. Phishing makes use of spoofed emails that are made to
look authentic and purport to come from legitimate sources, such as financial
institutions and e-commerce sites, to lure users to visit fraudulent websites through
links provided in the phishing email. The fraudulent websites are designed to mimic
the look of a real company's web page.
Phishing attackers trick users by employing various social engineering
tactics, such as threatening to suspend user accounts unless they complete an
account update process or provide other information to validate their accounts, or
by giving some other reason to get users to visit their spoofed web pages.
Data Used
To implement and test our approach, we used two publicly available datasets:
the ham corpora from the SpamAssassin project as legitimate emails and the
emails from PhishingCorpus as phishing emails (Phishing 2006, Spam 2006). The
total number of emails used in our approach is 4000, of which 973 are
phishing emails and 3027 are legitimate (ham) emails. The dataset is divided
into two parts for training and testing: 2000 emails are used as training
samples and the remaining 2000 are used for testing.
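The split described above can be sketched as follows. This is only an illustration: the source does not say whether the split was random or stratified, so the seeded random shuffle, the label convention, and the placeholder email strings are all assumptions.

```python
import random

def split_dataset(phishing, ham, n_train=2000, seed=0):
    """Shuffle labeled emails and split them into training and test sets."""
    # Label convention (an assumption): 1 = phishing, 0 = legitimate (ham).
    labeled = [(email, 1) for email in phishing] + [(email, 0) for email in ham]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labeled)
    return labeled[:n_train], labeled[n_train:]

# Placeholder strings stand in for the real email bodies.
phishing = [f"phish-{i}" for i in range(973)]
ham = [f"ham-{i}" for i in range(3027)]

train, test = split_dataset(phishing, ham)
print(len(train), len(test))  # 2000 2000
```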
Experiments
To evaluate our implementation, we applied several machine learning methods and a
clustering technique to our phishing dataset. We used Support Vector Machines
(SVM, Biased SVM & Leave One Model Out), Neural Networks, Self-Organizing
Maps (SOMs) and K-Means on the dataset described in Section 3.
4.1 Model Selection of Support Vector Machines (SVMs)
In any predictive learning task, such as classification, both a model and a parameter
estimation method should be selected in order to achieve a high level of performance
of the learning machine. Recent approaches allow a wide class of models of varying
complexity to be chosen. The task of learning then amounts to selecting the
sought-after model of optimal complexity and estimating its parameters from
training data (Chapelle 1999, Cherkassy 2002, Lee 2000).
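As a rough illustration of model selection, the sketch below trains a plain-Python linear SVM for several values of the penalty parameter C and keeps the value with the best hold-out accuracy. It is not LIBSVM and not the procedure used in the paper: the subgradient-descent trainer, the grid of C values, and the toy 2-D data are all assumptions made for the example.

```python
def train_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum(hinge losses)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0  # gradient of the 0.5*||w||^2 term
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this sample
                gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
                gb -= C * yi
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def accuracy(model, X, y):
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
             for xi in X]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

def select_c(X_tr, y_tr, X_val, y_val, grid):
    """Return the C with the highest validation accuracy, plus all scores."""
    scores = {C: accuracy(train_linear_svm(X_tr, y_tr, C), X_val, y_val)
              for C in grid}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): +1 = phishing, -1 = legitimate.
X_tr = [(2, 2), (3, 1), (1, 3), (-2, -2), (-3, -1), (-1, -3)]
y_tr = [1, 1, 1, -1, -1, -1]
X_val = [(2, 1), (1, 2), (-2, -1), (-1, -2)]
y_val = [1, 1, -1, -1]

grid = [0.01, 0.1, 1.0, 10.0]
best_c, scores = select_c(X_tr, y_tr, X_val, y_val, grid)
print(best_c, scores)
```

With a kernel SVM the same loop would also range over the kernel parameters, and cross-validation would normally replace the single hold-out set.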
Neural Networks
An artificial neural network (ANN) consists of a collection of highly
interconnected processing elements that transform a set of inputs into a set of
desired outputs. The result of the transformation is determined by the
characteristics of the elements and the weights associated with the
interconnections among them. A neural network analyzes the information it is
given and provides a probability estimate that it matches the data the network
has been trained to recognize. The network initially gains experience by being
trained with both the inputs and the desired outputs of the problem, and its
configuration is refined until satisfactory results are obtained. Since a
multi-layer feedforward ANN is capable of making multi-class classifications, a
single ANN (trained with Scaled Conjugate Gradient) is employed for
classification, using the same training and testing sets.
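To make the role of the weights concrete, here is a minimal sketch of the forward pass of a feedforward network with one hidden layer. The layer sizes, weight values, and feature vector are hypothetical, and the training step (Scaled Conjugate Gradient in the paper) is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer of sigmoid units; returns a score in (0, 1)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Hypothetical weights for 3 input features and 2 hidden units.
hidden_w = [[0.8, -0.4, 0.3], [-0.5, 0.9, 0.2]]
hidden_b = [0.1, -0.2]
out_w = [1.5, -1.1]
out_b = 0.05

features = [1.0, 0.0, 1.0]  # hypothetical email feature vector
score = forward(features, hidden_w, hidden_b, out_w, out_b)
label = "phishing" if score >= 0.5 else "legitimate"
print(round(score, 3), label)
```

Training would adjust `hidden_w`, `hidden_b`, `out_w`, and `out_b` so that the score approaches 1 for phishing emails and 0 for legitimate ones.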
K-Means
K-means clustering is an unsupervised, non-hierarchical clustering method. It
iteratively improves the estimate of the mean of each cluster and reassigns each
sample to the cluster with the nearest mean. Practical approaches to clustering use
an iterative procedure that converges to one of numerous local minima. These
iterative techniques are sensitive to initial starting conditions; a refined initial
starting condition allows the iterative algorithm to converge to a “better” local
minimum. This procedure is used in the k-means clustering algorithm, which can be
applied to both discrete and continuous data points. Consider n example feature
vectors x1, x2, ..., xn, all from the same class, that are known to fall into k
compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the
clusters are well separated, we can use a minimum-distance classifier to separate
them: x is in cluster i if || x - mi || is the minimum of all the k distances
(Witten 2005).
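The iteration and the minimum-distance rule can be sketched in a few lines of Python. The sample points and the initial means below are made up for illustration, and Euclidean distance is assumed for || x - mi ||.

```python
import math

def nearest(x, means):
    """Minimum-distance rule: index i that minimizes || x - m_i ||."""
    return min(range(len(means)), key=lambda i: math.dist(x, means[i]))

def k_means(points, means, max_iter=100):
    """Alternate between assigning points and recomputing cluster means."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in points:
            clusters[nearest(x, means)].append(x)
        # An empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else m
            for members, m in zip(clusters, means)
        ]
        if new_means == means:  # assignments no longer change
            break
        means = new_means
    return means, clusters

# Hypothetical 2-D feature vectors forming two compact groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, clusters = k_means(points, means=[(1.0, 1.0), (8.0, 8.0)])

print(means)
print(nearest((2.0, 2.0), means))  # 0
```

Because the result depends on the initial means, a different starting point can converge to a different local minimum, which is the sensitivity noted above.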
Summary and Future Work
Although the six machine learning methods used performed comparably, we found
that the Support Vector Machine (LIBSVM) consistently achieved the best
results. The Biased Support Vector Machine (BSVM) and Artificial Neural
Networks gave the same accuracy of 97.99%.
We have added new features to those published in the literature. The
classifiers used in this paper showed comparable, and in some cases better,
performance than those reported in the literature using the same datasets. Our
results demonstrate the potential of using learning machines in detecting and
classifying phishing emails. As future work, we plan to use more machine learning
algorithms to compare accuracy rates. We also plan to perform a thorough feature
ranking and selection on the same dataset to identify the set of features that
produces the best accuracy consistently across all the classifiers.