31-10-2016, 03:27 PM
Abstract - Spam refers to irrelevant messages sent mainly through the medium of email. Spam mails have evolved in content over the years in order to escape being classified as spam. These mails often contain links to phishing websites that attempt to acquire sensitive user information. To evade the spam filters employed by mailing services, many changes are made to the content of the spam, such as introducing deliberately misspelled words so that they are not recognised as words usually present in a spam email. The problem with spam detection filters is their inaccuracy in classifying spam emails, owing to their inability to identify such changes in content. To determine whether an email should be classified as spam or not, we use the Levenshtein distance to specifically target the misspellings introduced purposefully in these mails. The counts of a special set of words mainly used in spam email are taken as attributes, and ensemble techniques are used to classify whether a mail is spam or ham.
Keywords - phishing, Levenshtein distance, attributes, ensemble, spam, ham
Introduction
Spam is a word commonly used to describe unsolicited email sent in bulk, as opposed to regular email, called ham. Spam emails are usually used to distribute viruses or to direct users to bogus websites in order to obtain their personal information. Spam accounts for an estimated 14.5 to 94 billion messages globally per day, and each year the spamming industry makes about $200 million off its unwanted messages. When spam filters classify an email as spam, it ends up in the spam folder, which is rarely visited by the user. In recent years spam content has evolved to escape such filters and land in the user's inbox instead. This poses a serious security problem, as users may be directed to phishing websites, putting their personal information in jeopardy. One of the main changes in spam content is the misspelling of special keywords in such a way that their original meaning is still preserved and can be perceived by the user. The spam filters of various email service providers depend on these special keywords to classify an email as spam or ham, so the introduction of misspellings makes these filters highly inaccurate. Some of these keywords are: 'address', 'credit', 'money', 'free', 'business', etc. They can easily be misspelled without compromising their original meaning, as: 'ad6ress', 'cr3dit', 'm0ney', 'fr3e', 'BUSINE5S' and so on.
Spam detection techniques are classified as either blacklisting or content filtering. In this paper we propose a content-filtering method that enables filters to identify purposefully misspelled words in an email. A word is first taken and converted to its root form. Then an approximate string matching algorithm, the Levenshtein distance, is used to find the degree of misspelling of the word in comparison to a special keyword, based on which it is classified either as a spam keyword or as a normal word. Once the data set is developed, an ensemble learning algorithm called Random Forests is applied to it. This algorithm is highly effective owing to its robustness over other ensemble learning algorithms such as AdaBoost and its ability to resist overfitting.
Proposed Work:
Email is one of the most important modes of communication today, playing a vital role not only for professionals but for everyone in this generation. Taking advantage of this, many spam emails are produced, so dealing with them has become an important task. The problem can be addressed in three ways: prevention, detection, and action on spam emails. Since preventing the circulation of spam emails has become very difficult, what we can do is detect spam and take action on it. Here we analyse the content of emails and perform the necessary operations to detect whether an email is spam or ham. We consider different keywords frequently used in spam emails as attributes, with the frequencies of these keywords as their attribute values, where each tuple corresponds to one email. We then train the classifier once a sufficient number of tuples has been formed. To test whether an email is spam or ham, we first pre-process the email content and extract the various attribute values from it. This tuple is then fed to the classifier to obtain the class.
Dataset Description
The dataset consists of the frequencies of the different words, special characters, and strings of capital letters that occur in spam emails. There are 58 attributes: the last one is the class, and the remaining 57 are used for training. The 57 attributes used in the dataset are as follows:
• The first 48 attributes take continuous real values in the range 0-100.
Attribute name: word_freq_WORD, where WORD is a keyword found in spam emails.
Attribute value: 100 × (number of times the WORD appears in the email) / (total number of words in the email)
• The next 6 attributes (49th to 54th) take continuous real values in the range 0-100 and measure the frequency of special characters appearing in emails.
Attribute name: char_freq_CHAR, where CHAR is a special character found in spam emails.
Attribute value: 100 × (number of times the CHAR appears in the email) / (total number of characters in the email)
• The next single attribute (55th) measures the average length of strings of capital letters.
Attribute name: capital_run_length_average
Attribute value: average length of uninterrupted sequences of capital letters
• The next single attribute (56th) measures the longest length among all strings of capital letters.
Attribute name: capital_run_length_longest
Attribute value: length of the longest uninterrupted sequence of capital letters
• The next single attribute (57th) measures the total number of capital letters in the email.
Attribute name: capital_run_length_total
Attribute value: sum of the lengths of all uninterrupted sequences of capital letters
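As a rough illustration, the attribute values described above can be computed as follows. The keyword and character lists here are small hypothetical samples for the sketch, not the actual 48 words and 6 characters of the dataset:

```python
import re

# Hypothetical samples; the real dataset uses 48 keywords and 6 characters.
SPAM_WORDS = ["money", "free", "credit", "business", "address"]
SPAM_CHARS = ["!", "$", "#"]

def extract_features(email: str) -> dict:
    words = re.findall(r"[a-z0-9]+", email.lower())
    total_words = max(len(words), 1)   # avoid division by zero
    total_chars = max(len(email), 1)

    feats = {}
    # word_freq_WORD = 100 * occurrences / total words in the email
    for w in SPAM_WORDS:
        feats[f"word_freq_{w}"] = 100.0 * words.count(w) / total_words
    # char_freq_CHAR = 100 * occurrences / total characters in the email
    for c in SPAM_CHARS:
        feats[f"char_freq_{c}"] = 100.0 * email.count(c) / total_chars

    # capital run statistics: lengths of uninterrupted capital sequences
    runs = [len(r) for r in re.findall(r"[A-Z]+", email)]
    feats["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    feats["capital_run_length_longest"] = max(runs) if runs else 0
    feats["capital_run_length_total"] = sum(runs)
    return feats
```

For example, in "FREE money!! Get FREE credit now" the word "free" occurs 2 times out of 6 words, so word_freq_free ≈ 33.3, and the capital runs are FREE, G, and FREE.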
Proposed Work
When an email is received, its content is pre-processed. The pre-processing stage consists of the following two steps:
Tokenization: The email is split into word tokens, and common stop words such as 'is', 'as', 'to' and so on are eliminated.
Lemmatization: The remaining tokens are reduced to their root form. For example, 'addressing' is reduced to 'address'.
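The two pre-processing steps can be sketched as follows. The stop-word list is an illustrative subset, and the naive suffix stripper merely stands in for a real lemmatizer (such as NLTK's WordNetLemmatizer, which a production system would use):

```python
import re

STOP_WORDS = {"is", "as", "to", "the", "a", "of", "and"}  # illustrative subset

def tokenize(email: str) -> list:
    """Split the email into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", email.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(word: str) -> str:
    """Naive suffix stripper standing in for a real lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these stand-ins, tokenize("This is addressing to you") drops 'is' and 'to', and lemmatize("addressing") returns "address".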
Once a word is lemmatized, the resulting word is compared with the original to see whether any change has been made. If no change is detected, i.e. both have the same length, we can infer that the word was already in root form or contains a misspelling; otherwise we can infer that the original word was not in root form and has been converted to its root. The lemmatized word is then checked against the word of each attribute, these being the words typically found in spam emails. This check uses the Levenshtein distance, which measures the degree of difference between two words. The Levenshtein distance is an approximate string matching algorithm that calculates the edit distance between two words, i.e. how many operations are required to convert one word into the other. These operations can be:
Substitution: substituting one letter for another. Example: cat → mat ('m' is substituted for 'c')
Insertion: inserting a letter into the word. Example: temple → temples (inserting 's' at the end)
Deletion: deleting a letter from the word. Example: brick → rick (deleting 'b')
If the Levenshtein distance between the two words is 1, the word can be considered a misspelling, and if the distance is 0, both words are the same; in either case the word is accepted. On the other hand, if the distance is more than 1, the word is rejected. If a word is accepted, the value of the corresponding attribute is incremented; otherwise it remains the same. Once the count for an attribute word has been found, its percentage in the email is computed by the formula discussed above. The same is repeated for each and every attribute word. In this way the data tuple is built, which is then fed to the classifier to classify the mail as spam or ham.
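The matching step above can be sketched as a standard dynamic-programming edit distance plus the acceptance rule; the function names are our own for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of substitutions, insertions,
    and deletions needed to turn a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_spam_keyword(word: str, keyword: str) -> bool:
    """Accept when the distance is 0 (exact match) or 1 (misspelling)."""
    return levenshtein(word, keyword) <= 1
```

For instance, 'm0ney' is at distance 1 from 'money' and is accepted, while an unrelated word at distance greater than 1 is rejected.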
Turning to the classifier, we use the ensemble method called Random Forests to classify the emails. Random forests, also known as random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. The method combines bagging with random selection of features in order to construct a collection of decision trees.
The training algorithm applies the general technique of bagging to decision trees. The process is repeated according to the number of estimators. If we choose n estimators, we carry out the following:
1. Sample, with replacement, m training examples from the dataset.
2. Train a decision tree on this subset.
These steps are carried out n times to create n trees.
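The two-step loop above can be sketched as follows. The tree trainer here is a deliberate stub that just memorises the majority class of its bootstrap sample; a real forest would fit an actual decision tree at that point (e.g. scikit-learn's DecisionTreeClassifier):

```python
import random
from collections import Counter

def bootstrap_sample(dataset, m):
    """Step 1: sample m training examples with replacement."""
    return [random.choice(dataset) for _ in range(m)]

def train_stub_tree(sample):
    """Stand-in for step 2 (decision-tree training); this stub
    only memorises the majority class of its bootstrap sample."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def train_forest(dataset, n_estimators):
    """Repeat the two steps n times to obtain n trees."""
    m = len(dataset)
    return [train_stub_tree(bootstrap_sample(dataset, m))
            for _ in range(n_estimators)]

def predict(forest, x):
    """Majority vote over the individual trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]
```

The prediction step is the majority vote described below: each tree casts one vote and the most voted class wins.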
Random forests form each random subset of the dataset by selecting random tuples with replacement together with a random subset of the attributes. If the dataset contains a few features that are strong predictors, they tend to be selected in many of these subsets, causing the resulting decision trees to become correlated; the random selection of attributes counters this effect.
After that we pass the unknown tuple to the random forest classifier to predict its class. The class is predicted by the majority vote of the decision trees.
Advantages of Random Forests:
1. It runs efficiently on large datasets.
2. It avoids both high bias and high variance, staying at an intermediate point that gives better results.
3. Random decision forests correct for decision trees' habit of overfitting.
4. It maintains accuracy even when a large proportion of the data is missing.
5. As the different trees are constructed from different sub-samples, none of the important parameters of the dataset will be missed.
Results and Analysis:
Here we concentrate on increasing the precision and recall with the help of Random Forests.
//confusion matrix of random forest
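Since the confusion matrix itself is not reproduced in this draft, the metrics can be computed from its entries as follows; the counts used in the example below are placeholders, not the paper's actual results:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN),
    where TP, FP, FN come from the confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Placeholder counts for illustration only:
p, r = precision_recall(tp=80, fp=20, fn=10)
```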
CONCLUSION
The content of spam keeps undergoing changes in order to escape detection by spam filters. This paper deals in particular with the problem of purposeful misspelling intended to evade the spam filters employed by email service providers. We have proposed a method to prevent such emails from being classified as ham. We employ the Levenshtein distance algorithm to detect the use of purposeful misspelling, and we then opt for an ensemble technique with better accuracy than most ensemble techniques. Random Forest creates various random subsets of the dataset, based on both data tuples and features/attributes, and uses them to build decision trees. The dataset may contain a few features that are very strong predictors of the response variable (target output); these features get selected into various data subsets used to create the decision trees, causing the trees to become correlated. The decision trees classify a test tuple and give a class; each class receives a vote whenever a decision tree classifies the tuple as belonging to it, and the class with the highest vote is taken as the class of the tuple. Hence the proposed method can be used to identify purposeful misspellings in emails and determine whether they are spam or not.