05-09-2016, 11:05 AM
ABSTRACT: This paper surveys the data mining techniques currently used to detect intrusions in computer networks. Its basic objective is to review the data mining methodologies applied in Intrusion Detection Systems, focusing on clustering, classification and other data mining rules. The results suggest that classification is the most widely used methodology for intrusion-related problems and that the Support Vector Machine (SVM) remains popular among researchers in this area. Similarly, among clustering techniques, statistical and conditional-probability approaches, i.e. Bayesian clustering and its variants, are used to separate attacks from non-attacks. Although these methodologies score well in intrusion detection, the hybrid models that have been introduced perform well in lowering false alarms.
KEYWORDS: Intrusion Detection System (IDS), Data Mining, Support Vector Machine (SVM), Clustering, Classification, Decision tree, C4.5
I. INTRODUCTION
The use of computers and their applications has undeniably increased in the past few decades. The growing reliance of government, military and commercial organizations on computers and their applications has equally increased the threats to computing systems. As a result, the security of our computer systems and data is at continual risk. Due to the extensive growth of the Internet and the increasing availability of tools and tricks for intruding into and attacking networks, new challenges arise in combating external attacks. Such attacks, external to these bodies, are deliberate actions against data, software or hardware and can destroy, degrade, disrupt or deny access to a networked computer system. Intrusions are such deliberate attacks.
In an attempt to guard against unknown intrusions, much effort has gone into researching and developing Intrusion Detection Systems (IDS), which try to filter such attacks out of the network traffic. IDS are software tools meant specifically for strengthening the security of information and communication systems. An IDS dynamically monitors logs and network traffic and applies detection algorithms to identify intrusions in a network. Intrusion detection is based on the assumption that intrusive activities are noticeably different from normal system activities and are thus identifiable. Intrusion detection is not meant to replace prevention-based techniques such as authentication and access control; instead, it is intended to complement existing security measures. An intrusion detection system is therefore considered a second line of defense for computer network systems, detecting actions that bypass the security monitoring and control component of the system. Intrusion Detection Systems monitor and analyze the events occurring in a computer and/or network system in order to detect signs of security problems and raise alarms. The basic approaches used are known pattern templates, threatening behavior templates, traffic analysis, statistical-anomaly detection and state-based detection.
Data Mining in Intrusion Detection
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. In recent years, data mining techniques employed in intrusion detection have proved to be successful. Data mining is the search for valuable information within large volumes of data by systematically exploring underlying patterns, trends, and relationships hidden in available data.
Data mining is commonly employed in intrusion detection to find the hidden patterns of intrusions and the relationships among them. Data mining techniques for intrusion detection include clustering, frequent pattern mining, classification and mining data streams. Data mining can learn from traffic data either with a supervised approach, by learning precise models from past intrusions, or with an unsupervised approach, by identifying suspicious activities.
II. LITERATURE REVIEW
G. V. Nadiammai and M. Hemalatha, in their paper "Effective approach toward Intrusion Detection System using data mining techniques", considered four issues, namely Classification of Data, High Level of Human Interaction, Lack of Labeled Data, and Effectiveness of Distributed Denial of Service Attack, and addressed them using the proposed EDADT algorithm, Hybrid IDS model, Semi-Supervised Approach and Varying HOPERAA algorithm respectively. To solve the data classification problem, an enhanced data adapted decision tree algorithm is implemented, which effectively classifies the data into normal and attack without any misclassification. To reduce the high level of human interaction, and thus the workload of the network administrator, a Hybrid IDS combining SNORT and anomaly-based approaches is used; it automatically classifies the data based on its pre-defined rules. The issue of labeling the unlabeled data is solved using a Semi-Supervised Approach, in which a small amount of labeled data is used to label a large amount of unlabeled data. The last problem, related to Distributed Denial of Service attacks, is addressed by using varying clock drift. This varying clock drift in network-based applications makes it difficult for the intruder to access the port that is being used by the legitimate client.
W. Feng et al., in their paper "Mining network data for intrusion detection through combining SVMs with Ant Colony Networks", achieved both higher detection accuracy and faster running time by combining two existing machine learning methods (SVM and CSOACN). Their proposed work is based on five main interactive modules. The main contributions of the paper include modifications to the supervised learning SVM and the unsupervised learning CSOACN so that they can be used together interactively and efficiently. The work also combines the modified SVM and CSOACN to minimize the training data set while allowing new data points to be added to the training set dynamically.
Muamer N. Mohammad et al., in their paper "A Novel Intrusion Detection System by using Intelligent Data Mining in Weka Environment", use the Weka software, which is a collection of machine learning algorithms for data mining tasks.
Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It consists of the Explorer, Experimenter, Knowledge Flow and Simple Command Line Interface. The proposed system has four main steps to execute. The first step is to collect network data and pre-process them into network connection data with particular attributes such as protocol type, destination IP address and flag bit. The next step is to apply an association analysis data mining algorithm to the connection data to obtain association rules, and thereby the normal behavior patterns that can be used for detecting abnormal, intrusive behavior. Finally, the proposed work uses a classification algorithm to carry out rule mining that further distinguishes normal behavior from intrusion behavior.
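The association-analysis step can be illustrated with the two standard rule measures, support and confidence. The following is only a minimal Python sketch, not the Weka implementation used in the paper; the record layout and the attribute values ("proto=tcp", "flag=SF") are made up for illustration.

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the records.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return support(records, both) / support(records, antecedent)


# Toy connection records; attribute names and values are hypothetical.
records = [
    frozenset({"proto=tcp", "flag=SF"}),
    frozenset({"proto=tcp", "flag=SF"}),
    frozenset({"proto=tcp", "flag=REJ"}),
    frozenset({"proto=udp", "flag=SF"}),
]

# Rule "proto=tcp -> flag=SF" as a candidate normal-behavior pattern.
rule_support = support(records, {"proto=tcp", "flag=SF"})
rule_confidence = confidence(records, {"proto=tcp"}, {"flag=SF"})
```

Rules whose support and confidence exceed chosen minimums would form the normal-behavior profile; the minimums themselves are a tuning choice not given in the text.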
The work proposed by Karim Al-Saedi et al. in their paper "Research Proposal: An Intrusion Detection system Alert Reduction and assessment Framework Based on Data Mining" is an IDS Alert Reduction and Assessment framework based on Data Mining (ARADMF), which contains three systems: a traffic data retrieval and collection mechanism system, an IDS alert reduction process system and an IDS alert threat score process system. The traffic data retrieval and collection mechanism system saves IDS alerts, extracts the standard features in an intrusion detection exchange format and saves them in a DB file (CSV type). It uses the Intrusion Detection Message Exchange Format (IDMEF) for alert procurement, and field reduction is used as data standardization to make the alert format as standard as possible [4].
The alert reduction algorithm consists of three phases. The first phase removes redundant alerts; the second phase reduces false alerts based on a threshold time value; and the last phase reduces false alerts based on rules with a threshold Common Vulnerabilities and Exposures (CVE) value. The threat score process of the IDS alert system assigns scores based on automated classification of alerts. The expected outcome is a reduction in the number of false positive alerts at an expected rate of 90% and an increase in accuracy compared with other approaches.
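The three reduction phases can be pictured as a simple filter pipeline. The Python sketch below is an interpretation under stated assumptions, not the ARADMF implementation: the `Alert` fields, the meaning of the time threshold (suppress a repeat of the same signature and endpoints within `window` seconds) and the use of a single numeric `cve_score` per alert are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    signature: str
    src: str
    dst: str
    timestamp: float   # seconds
    cve_score: float   # hypothetical severity value derived from CVE data


def remove_redundant(alerts):
    """Phase 1: drop exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for a in alerts:
        if a not in seen:
            seen.add(a)
            unique.append(a)
    return unique


def filter_by_time(alerts, window):
    """Phase 2 (assumed rule): suppress repeats within `window` seconds."""
    last_kept, kept = {}, []
    for a in sorted(alerts, key=lambda a: a.timestamp):
        key = (a.signature, a.src, a.dst)
        if key not in last_kept or a.timestamp - last_kept[key] >= window:
            kept.append(a)
            last_kept[key] = a.timestamp
    return kept


def filter_by_cve(alerts, min_score):
    """Phase 3 (assumed rule): keep alerts at or above a CVE threshold."""
    return [a for a in alerts if a.cve_score >= min_score]


def reduce_alerts(alerts, window=5.0, min_score=4.0):
    """Run the three phases in order; default thresholds are arbitrary."""
    return filter_by_cve(filter_by_time(remove_redundant(alerts), window), min_score)
```

Running the cheap de-duplication first keeps the later, rule-based phases working on a smaller set of alerts.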
Basant Agarwal and Namita Mittal, in their paper "Hybrid Approach for Detection of Anomaly Network Traffic using Data Mining Techniques", proposed a hybrid approach that exploits the benefits of two techniques, one entropy based and the other support vector machine based. The hybrid anomaly detection system learns the behavior of network traffic from the normalized entropy values of different network features. Entropy-based techniques have the advantage of better representing the properties of the network traffic, while the support vector machine is good at classification.
The normalized entropies are sent to the SVM model, which learns the behavior of the network. The trained SVM model can then classify network traffic as attack traffic or legitimate traffic. In the purely entropy-based anomaly detection system, the normalized entropy of the network traffic features is first calculated every 60 seconds. A threshold value, fixed for each feature on the basis of experiments, is used to identify anomalies. A voting system over the features then decides whether there is an attack or not. This method produces good results in detecting attack traffic, but it also produces many false alarms, because the entropy values can deviate from the expected range, or move towards 0 or 1, for legitimate traffic as well [5].
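The entropy-and-voting stage described above can be sketched as follows. This Python snippet is a minimal illustration, assuming that normalization divides the Shannon entropy by the logarithm of the number of distinct values in the window; the feature names, threshold ranges and vote count are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of `values` divided by log2(number of distinct values).

    Returns a number in [0, 1]; 0.0 when there are fewer than two distinct values.
    """
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts))


def vote_attack(window, thresholds, min_votes):
    """Each feature votes 'attack' when its entropy leaves its allowed range."""
    votes = 0
    for feature, (low, high) in thresholds.items():
        h = normalized_entropy(window[feature])
        if h < low or h > high:
            votes += 1
    return votes >= min_votes


# One 60-second window of hypothetical per-packet feature values.
window = {"src_ip": ["a", "a", "a", "a"], "dst_port": [1, 2, 3, 4]}
thresholds = {"src_ip": (0.2, 0.9), "dst_port": (0.2, 0.98)}
is_attack = vote_attack(window, thresholds, min_votes=2)
```

In the hybrid system the same normalized entropies would be passed to an SVM as a feature vector instead of being compared with fixed thresholds.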
Augustin Orfila, Javier Carbo and Arturo Ribagorda, in their paper "Autonomous decision on Intrusion Detection with trained BDI agents", proposed a multiagent system that improves overall IDS effectiveness through autonomous adaptation. The system is composed of several cooperative agents, each playing one of the following roles: sensor, evaluator or manager. Each sensor agent applies a specific detection algorithm to infer a prediction about the intrusive nature of an event. The predictions, typically binary statements indicating intrusive or non-intrusive behavior, are sent to the evaluator agents. The evaluator agents combine them to produce a final conclusion, which is sent to the manager agent. Evaluator agents apply one of two criteria to reach that conclusion. Threshold: the evaluator agent considers an event an intrusion if the number of sensor agents that report the event as intrusive is greater than a prefixed threshold [6]. Weighted sum: the evaluator agent weighs each sensor before the comparison with the predefined threshold is done. The weights are updated after an event takes place according to the historical effectiveness of each sensor [6].
The manager agent may act in one of two modes of behavior: evaluation mode or operating mode. In the former, it is assumed that the manager agent knows the real nature of events, so it is able to inform the evaluator agents about their effectiveness. The evaluator agents in turn use this information to update the weights of the sensor agents for future predictions. The operating mode consists of planning a response to intrusions, according to the beliefs previously acquired about the environment where events are taking place and about the results provided in the evaluation mode.
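The two evaluator criteria can be written down in a few lines. The Python sketch below is illustrative only: the function names are made up, and the multiplicative weight update with rate `rate` is an assumed rule, since the text says only that weights follow each sensor's historical effectiveness.

```python
def threshold_decision(predictions, threshold):
    """Intrusion if the number of sensors voting 1 exceeds `threshold`."""
    return sum(predictions) > threshold


def weighted_decision(predictions, weights, threshold):
    """Intrusion if the weighted sum of sensor votes exceeds `threshold`."""
    return sum(w * p for w, p in zip(weights, predictions)) > threshold


def update_weights(weights, predictions, truth, rate=0.1):
    """Assumed update: reward sensors that matched `truth`, then renormalize."""
    updated = [
        w * (1 + rate) if p == truth else w * (1 - rate)
        for w, p in zip(weights, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated]


# Three sensors report on one event (1 = intrusive, 0 = non-intrusive).
predictions = [1, 0, 0]
weights = [0.6, 0.2, 0.2]
by_count = threshold_decision(predictions, threshold=1)
by_weight = weighted_decision(predictions, weights, threshold=0.5)
```

Here the count-based criterion rejects the event while the weighted criterion accepts it, because the single sensor that fired carries most of the weight. In evaluation mode the manager would supply `truth` so that `update_weights` can be applied after each event.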
Tadeusz Pietraszek et al., in their paper "Data Mining and machine learning-Towards reducing false positives in intrusion detection", proposed two complementary approaches, CLARAty and ALAC, to be used together in a two-stage alert filtering and classification system. The proposed system uses CLARAty in the first stage to periodically mine raw alerts and discover their root causes, after which the root causes are either removed or alert filters are installed. The output of CLARAty is then forwarded to ALAC, which interacts with an operator. The major benefit of this approach is that the alert filters from CLARAty remove the most prevalent and uninteresting false positives, which effectively improves the class distribution in favour of true positives in the alerts passed on to the second stage [7]. ALAC, an adaptive alert classifier based on the feedback of an intrusion detection analyst and on machine learning techniques, thus receives fewer alerts to process.
Experiments with real-world data sets have shown that as few as a few dozen generalized alerts cover over 90% of the raw alerts [7].
According to the paper by Cheng Xiang, Png Chin Yong and Lim Swee Meng, "Design of multiple-level hybrid classifier for intrusion detection system using Bayesian clustering and decision trees", the detection rate can be increased by implementing a new multiple-level hybrid intrusion classifier. The hybrid classifier uses a model with 4 stages of classification. The first stage categorizes the test data into 3 categories (DOS, Probe, Others); U2R, R2L and Normal connections are classified as "Others" in this stage. The second stage splits "Others" into Attack and Normal categories, while the third stage separates the Attack class from stage 2 into U2R and R2L. The fourth stage further classifies the attacks into more specific attack types.
This classification is only effective for known attacks, as it requires labeled training data for each particular attack type to be present [8].
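The control flow of the four-stage cascade can be sketched in Python as below. Only the routing between stages follows the description; the per-stage classifiers are passed in as functions, and the toy rule-based stages and record fields (`syn_rate`, `ports_scanned`, `failed_logins`, `root_shell`, `signature`) are hypothetical stand-ins for the Bayesian clustering and decision tree models of the paper.

```python
def classify_connection(record, stage1, stage2, stage3, stage4):
    """Route one record through the 4-stage cascade.

    Returns (category, specific_attack_type); the type is None for Normal.
    """
    label = stage1(record)              # "DOS", "Probe" or "Others"
    if label == "Others":
        if stage2(record) == "Normal":  # "Attack" or "Normal"
            return ("Normal", None)
        label = stage3(record)          # "U2R" or "R2L"
    return (label, stage4(label, record))


# Toy stand-in stages using made-up features and cut-offs.
def toy_stage1(r):
    if r["syn_rate"] > 100:
        return "DOS"
    if r["ports_scanned"] > 20:
        return "Probe"
    return "Others"


def toy_stage2(r):
    return "Attack" if r["failed_logins"] > 3 or r["root_shell"] else "Normal"


def toy_stage3(r):
    return "U2R" if r["root_shell"] else "R2L"


def toy_stage4(category, r):
    return r.get("signature", "unknown")


record = {"syn_rate": 1, "ports_scanned": 0, "failed_logins": 9, "root_shell": False}
result = classify_connection(record, toy_stage1, toy_stage2, toy_stage3, toy_stage4)
```

Because each stage is a separate model, every stage needs its own labeled examples, which is why the cascade handles only attack types seen in training.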
Xiao-Bai Li, in his paper "A scalable decision tree system and its application in pattern recognition and intrusion detection", proposed a new decision tree algorithm, named SURPASS (for Scaling Up Recursive Partitioning with Sufficient Statistics), that is highly efficient in handling large data. It is based on an efficient gathering of sufficient statistics. The algorithm effectively solves the problem of mining large numeric data for classification when the data size is beyond the capacity of the main memory [9]. The algorithm is based on univariate or multivariate splits and is specialized in dealing with numeric data. When categorical data are present, categorical values need to be processed with binary (0-1) coding. When there are many categorical attributes, each having a large number of categories, the coding process creates a large number of additional binary attributes. This could cause computational problems, so a more natural approach to mixed data types is to handle numeric and categorical data using different sufficient statistics. For numeric data, the summation statistics could be used. The quality of a split can be evaluated using the same impurity measure, such as entropy, regardless of whether the split is based on numeric or categorical attributes [9]. The results indicate that the proposed algorithm produces decision trees of very high quality in terms of classification accuracy and that it scales well to large data sets, with computing time approximately linear in the number of records in the data.
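The remark that one impurity measure serves both numeric and categorical splits can be illustrated with class counts gathered in a single pass over the data. The Python sketch below is not the SURPASS algorithm; it shows only a weighted-entropy split score computed from per-branch class counts, with a made-up row layout of `(features, label)` and non-empty input assumed.

```python
import math


def entropy(class_counts):
    """Entropy in bits of a class-count vector; 0.0 for an empty branch."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def counts_for_split(rows, goes_left, n_classes):
    """One pass over the rows: class counts on each side of a split."""
    left, right = [0] * n_classes, [0] * n_classes
    for features, label in rows:
        (left if goes_left(features) else right)[label] += 1
    return left, right


def split_impurity(left, right):
    """Size-weighted average entropy of the two branches (lower is better)."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * entropy(left) + (n_right / n) * entropy(right)


# Rows are ((duration, protocol), class_label); values are illustrative.
rows = [((1.0, "tcp"), 0), ((2.0, "tcp"), 0), ((8.0, "udp"), 1), ((9.0, "udp"), 1)]
numeric_score = split_impurity(*counts_for_split(rows, lambda f: f[0] < 5.0, 2))
categorical_score = split_impurity(*counts_for_split(rows, lambda f: f[1] == "tcp", 2))
```

Only the small count vectors need to stay in memory, which is the property that lets a split be scored on data larger than main memory.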
A Lightweight Network Intrusion Detection System (LNID) for detecting attacks on Telnet traffic is proposed by Chi-Mei Chen et al. in their paper "An efficient network intrusion detection". In their proposed work, normal traffic behavior is modeled and the anomaly score of a packet is computed from its deviation from that normal behavior. Instead of processing all traffic packets, an efficient filtering scheme is used to reduce the system workload. The filtering scheme consists of 2 packet filters: the Tcpdump filter and the LNID filter. The former performs initial packet filtering with the tcpdump tool, extracting TCP packets sent to the Telnet servers of the internal local area network [10]. The module "LNID Filter" fetches the first packets of each Telnet connection. A TCP connection is established after a 3-way handshake consisting of SYN, SYN-ACK and ACK. The module "Anomaly Scoring" computes the anomaly score based on each attribute characterizing the normal behavior. Because attack packets are sent right after a connection is established, the proposed method uses the first 48 bytes of packet data for anomaly detection [10]. The proposed method uses the module "Post Process" to remove duplicate alerts, since multiple alerts would otherwise be propagated when multiple anomalous packets arrive. The results show that LNID has a simple and efficient anomaly score function that detects 86.4% of U2R and R2L attacks in Telnet connections. LNID is restricted to Telnet and has its highest detection rate on R2L and U2R.
Jaehak Yu et al., in their paper "An in-depth analysis on traffic flooding attacks detection and system using data mining techniques", proposed, designed and implemented a system that detects traffic flooding attacks and classifies them by attack type, using the SNMP (Simple Network Management Protocol) MIB (Management Information Base) and the C4.5 algorithm.
The proposed system is composed of three parts: for online processing, the SNMP MIB generator module, the MIB update detection and MIB data store module, and the attack detection and classification module; for offline processing, the C4.5 training module and the association rule mining module; and, for the system administrator, a system management module. The SNMP MIB generator module generates MIB information from the network traffic data. The MIB update detection module stores only the MIB information from the target system that is determined in the C4.5 training module. The collected information is transferred to the attack detection and classification module, where it is used to judge the occurrence of attacks and the attack type in real time. The C4.5 training module randomly generates various traffic attacks to execute C4.5-based learning. The association rule module conducts an in-depth semantic interpretation that extracts and analyzes, in the form of rules, the characteristics of the data stored in the MIB data store module. The system management module detects traffic flooding attacks in real time and monitors detailed information about the classification type [11].
In the paper "An active learning based TCM-KNN algorithm for supervised network intrusion detection", Yang Li and Li Guo propose a novel supervised network intrusion detection method based on the TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) machine learning algorithm and an active learning based method for selecting training data. It can effectively detect anomalies with a high detection rate and few false positives while using far fewer selected data and selected features for training than traditional supervised intrusion detection methods [12]. A series of experimental results on the well-known KDD Cup 1999 data set demonstrate that the proposed method is more robust and effective than the state-of-the-art intrusion detection methods.
The TCM-KNN algorithm is a commonly used machine learning and data mining method that is effective in fraud detection, pattern recognition and outlier detection. This is the first time that the TCM-KNN algorithm has been applied to intrusion detection. Comparative experimental results demonstrate that it has better detection performance (high detection rate and few false positives) than the state-of-the-art intrusion detection techniques even when provided with a "small" data set for training [12]. The authors further optimize it for intrusion detection in two respects: (a) an active learning method is introduced to select far fewer, good-quality data for training than traditional random sampling, which alleviates the labeling workload for domain experts, reduces the scale of the training data set and consequently reduces the computational cost of TCM-KNN; and (b) a feature selection method is proposed to select the most necessary and important features for TCM-KNN.
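The core of a TCM-KNN style detector is a strangeness score and a p-value computed against the training set. The Python sketch below is a simplified illustration under stated assumptions, not the authors' implementation: strangeness is taken as the sum of distances to the k nearest normal training points, the p-value is the fraction of strangeness values (training points plus the new point) that are at least as large as the new one, and the cut-off `tau` is arbitrary. It uses `math.dist`, available from Python 3.8.

```python
import math


def strangeness(point, others, k):
    """Sum of Euclidean distances from `point` to its k nearest `others`."""
    return sum(sorted(math.dist(point, o) for o in others)[:k])


def p_value(new_point, normal_train, k=3):
    """Fraction of strangeness values that are >= the new point's value."""
    train_alphas = [
        strangeness(x, normal_train[:i] + normal_train[i + 1:], k)
        for i, x in enumerate(normal_train)
    ]
    alpha_new = strangeness(new_point, normal_train, k)
    # The +1 in numerator and denominator counts the new point itself.
    at_least = sum(1 for a in train_alphas if a >= alpha_new) + 1
    return at_least / (len(normal_train) + 1)


def is_anomaly(new_point, normal_train, k=3, tau=0.05):
    """Flag the point when its p-value falls below the cut-off `tau`."""
    return p_value(new_point, normal_train, k) < tau


# 25 hypothetical normal points on a small grid near the origin.
normal_train = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
far_point_flagged = is_anomaly((10.0, 10.0), normal_train)
```

Recomputing every training strangeness for each query costs time quadratic in the training set size, which is one reason the reduction of the training set through active learning matters for this method.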
In the paper "Data-mining based SQL injection attack detection using internal query trees", Mi-Yeon Kim and Dong Hoon Lee proposed a framework to detect SQL injection attacks (SQLIAs) at the database level by using SVM classification and various kernel functions. Detecting SQLIAs is becoming increasingly important for database-driven websites. Most studies on SQLIA detection have focused on the structured query language (SQL) structure at the application level, and this approach inevitably fails to detect attacks that use stored procedures and data already within the database system. The prime issue for an SQLIA detection framework is how to represent the internal query tree collected from the database log in a form suitable for the SVM classification algorithm, so as to achieve good performance in detecting SQLIAs. To solve this issue, the authors first propose a novel method that converts the query tree into an n-dimensional feature vector by using a multi-dimensional sequence as an intermediate representation. Direct conversion of the query tree into an n-dimensional feature vector is difficult because of the complexity and variability of the query tree structure. Secondly, they propose a method to extract both the syntactic and the semantic features when generating the feature vector. Thirdly, they propose a method to transform string feature values into numeric feature values by combining multiple statistical models. The combined model maps one string value to one numeric value that captures the multiple characteristics of each string value [13].
To demonstrate the feasibility of the proposals in practical environments, they implemented the SQLIA detection system on PostgreSQL, a popular open source database system. The experimental results using the internal query trees of PostgreSQL validate that the proposal is effective in detecting SQLIAs, with a probability of at least 99.6% that the probability of a malicious query being correctly predicted as an SQLIA is greater than the probability of a normal query being incorrectly predicted as an SQLIA.
In the study "Application of SVM and ANN for intrusion detection", Wun-Hwa Chen, Sheng-Hsun Hsu and Hwang-Pin Shen examine the feasibility of applying an Artificial Neural Network (ANN) and a Support Vector Machine (SVM) to predict attacks based on frequency-based encoding techniques. The goal of using ANN and SVM for attack detection is to develop a generalization capability from limited training data. In addition to comparing the ANN and SVM performances, they evaluated other encoding methods for predicting attacks. The test bed used is the 1998 DARPA data from MIT's Lincoln Labs. Results indicated that SVM performance was superior to that of ANN and that the alternative encoding method performed better than the simple frequency-based method. The superior performance of SVMs over ANNs is due to the following reasons: (1) SVMs implement the structural risk minimization principle, which minimizes an upper bound on the generalization error rather than minimizing the training error.