Accelerated PSO Swarm Search Feature Selection for Data Stream Mining Big Data
Simon Fong, Raymond Wong, and Athanasios V. Vasilakos, Senior Member, IEEE
Abstract—Although Big Data is hyped, it raises many technical challenges that confront both academic research communities
and commercial IT deployment; the root sources of Big Data are data streams and the curse of dimensionality. It is
generally known that data sourced from data streams accumulate continuously, making traditional batch-based model
induction algorithms infeasible for real-time data mining. Feature selection has been popularly used to lighten the processing load of
inducing a data mining model. However, when it comes to mining over high-dimensional data, the search space from which an optimal
feature subset is derived grows exponentially in size, leading to an intractable demand in computation. In order to tackle this problem,
which stems mainly from the high dimensionality and streaming format of data feeds in Big Data, a novel lightweight feature selection is
proposed. The feature selection is designed particularly for mining streaming data on the fly, by using an accelerated particle swarm
optimization (APSO) type of swarm search that achieves enhanced analytical accuracy within reasonable processing time. In this
paper, a collection of Big Data sets with an exceptionally large degree of dimensionality are put under test of our new feature selection
algorithm for performance evaluation.
Index Terms—Feature selection, metaheuristics, swarm intelligence, classification, big data, particle swarm optimization
1 INTRODUCTION
Recently, a lot of news in the media has advocated the hype of
Big Data, which is manifested in three problematic issues.
They are the 3V challenges: the Velocity problem, which
gives rise to a huge amount of data to be handled at an esca-
lating speed; the Variety problem, which makes data process-
ing and integration difficult because the data come from
various sources and are formatted differently; and the Vol-
ume problem, which makes storing, processing, and analyzing
them challenging in both computation and archiving.
In view of these 3V challenges, traditional data
mining approaches, which are based on full batch-
mode learning, may fall short of meeting the demand for
analytic efficiency. That is simply because traditional
data mining model construction techniques require load-
ing in the full set of data, after which the data are parti-
tioned according to some divide-and-conquer strategy;
two classical algorithms are the Classification And Regres-
sion Tree (CART) algorithm for decision tree induction
[1] and Rough-set discrimination [2]. Each time
fresh data arrive, which is typical in the data collection
process that inflates big data into bigger data,
the traditional induction method needs to re-run, and the
model that was built must be rebuilt with the
inclusion of the new data.
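The cost of this full-batch pattern can be sketched in a few lines. The snippet below is illustrative only, not CART or Rough-set discrimination: `induce_model` is a hypothetical stand-in for any batch learner, and the point is that every arrival forces a re-run over the entire accumulated history.

```python
# Sketch of the full-batch retraining pattern described above: whenever
# fresh records arrive, the model is rebuilt from scratch over ALL data
# seen so far, so each rebuild reprocesses the entire (growing) history.
# `induce_model` is a toy stand-in for any batch learner (e.g. CART).

def induce_model(dataset):
    """Toy batch learner: predict the majority class of the dataset."""
    labels = [label for _features, label in dataset]
    return max(set(labels), key=labels.count)

history = []
rebuild_costs = []              # records reprocessed at each rebuild
for batch in [[([1], "a")], [([2], "a")], [([3], "b")]]:
    history.extend(batch)       # big data inflating to bigger data
    model = induce_model(history)    # re-run over the FULL set every time
    rebuild_costs.append(len(history))

print(rebuild_costs)  # -> [1, 2, 3]: cumulative work grows quadratically
print(model)          # -> "a" (majority class after all arrivals)
```

With n arrivals the learner reprocesses 1 + 2 + ... + n records in total, which is the inefficiency that stream mining methods avoid.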
• S. Fong is with the Department of Computer and Information Science,
University of Macau, Taipa, Macau SAR. E-mail: ccfong[at]umac.mo.
• R. Wong is with the School of Computer Science and Engineering, Univer-
sity of New South Wales, Kensington, NSW 2052, Australia.
E-mail: wong[at]cse.unsw.edu.au.
• A. V. Vasilakos is with the Department of Computer Science, Electrical
and Space Engineering, Lulea University of Technology, 97187 Lulea,
Sweden. E-mail: th.vasilakos[at]gmail.com.
Manuscript received 31 Oct. 2014; revised 21 Apr. 2015; accepted 9 May
2015. Date of publication 31 May 2015; date of current version 10 Feb. 2016.
For information on obtaining reprints of this article, please send e-mail to:
reprints[at]ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TSC.2015.2439695
In contrast, the new breed of algorithms known as data
stream mining methods [3] are able to subdue these 3V prob-
lems of big data, since the 3V challenges are mainly
characteristics of data streams. A data stream algorithm is not
stymied by huge volume or high-speed data collection.
The algorithm is capable of inducing a classification or pre-
diction model in a bottom-up fashion: each pass of data
from the data stream triggers the model to incrementally
update itself, without the need to reload any previously
seen data. This type of algorithm can potentially handle
data streams that amount to infinity, and it can run in
memory, analyzing and mining data streams on the fly. It is
regarded as a killer method for the big data hype and its related
analytics problems. Lately, researchers concur that data stream
mining algorithms are the solutions for tackling big
data now and in the years to come [4], [5].
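The incremental, bottom-up update described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: a hypothetical centroid classifier that refines one running mean per class as each record passes, keeping O(1) memory per class and never reloading previously seen data.

```python
# Minimal sketch of stream mining's incremental update: each arriving
# record updates the model in place; no past data are revisited.
# (Illustrative only -- class labels and features are hypothetical.)

class StreamingCentroidClassifier:
    """Keeps one running mean vector per class; O(1) memory per class."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.counts = {}   # class label -> number of samples seen
        self.means = {}    # class label -> running mean of each feature

    def partial_fit(self, x, label):
        """Incremental update: the new sample shifts its class mean."""
        if label not in self.counts:
            self.counts[label] = 0
            self.means[label] = [0.0] * self.n_features
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i in range(self.n_features):
            mean[i] += (x[i] - mean[i]) / n   # Welford-style running mean

    def predict(self, x):
        """Assign the class whose centroid is nearest (squared Euclidean)."""
        def sq_dist(label):
            return sum((xi - mi) ** 2
                       for xi, mi in zip(x, self.means[label]))
        return min(self.means, key=sq_dist)

# Each pass of the stream triggers an in-place model update:
clf = StreamingCentroidClassifier(n_features=2)
stream = [([0.0, 0.1], "a"), ([0.2, 0.0], "a"),
          ([1.0, 0.9], "b"), ([0.9, 1.1], "b")]
for x, y in stream:
    clf.partial_fit(x, y)
print(clf.predict([0.1, 0.1]))  # -> "a"
```

Because each `partial_fit` call touches only the new record and the running statistics, the memory footprint stays constant even as the stream grows toward infinity.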
In both families of data mining algorithms, stream-based
and batch-based, classification has been widely adopted for
supporting decision inference from big data. In supervised
learning, a classification model, or classifier, is trained by
inducing the relationships between the attributes of his-
torical records and their class labels, which are respectively
the predictor features of all the data and their predicted classes.
Subsequently, the classifier is used to predict the
appropriate classes of unseen samples.
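The train-then-predict workflow just described can be made concrete with a deliberately simple example. The records and the 1-nearest-neighbour rule below are illustrative assumptions, not the classifiers studied in this paper: historical records pair predictor features with class labels, and the trained model then assigns a class to an unseen sample.

```python
# Hedged sketch of the supervised workflow: induce a classifier from
# labelled historical records, then predict the class of unseen samples.
# 1-NN is used purely for brevity; labels/features are hypothetical.

def train_1nn(records):
    """'Training' for 1-NN is just storing the labelled records."""
    return list(records)

def predict_1nn(model, sample):
    """Predict the class label of the single nearest stored record."""
    def sq_dist(rec):
        features, _label = rec
        return sum((f - s) ** 2 for f, s in zip(features, sample))
    _, label = min(model, key=sq_dist)
    return label

# Historical records: (predictor features, class label)
history = [([1.0, 1.0], "spam"), ([0.0, 0.2], "ham"), ([0.1, 0.0], "ham")]
model = train_1nn(history)
print(predict_1nn(model, [0.9, 1.1]))  # -> "spam"
```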
In classifier applications, feature selection (FS) attempts to
select a subset of the most influential features by excluding
irrelevant and redundant features, in order to enhance accu-
racy and speed up model training time for the classifier. In
the past, many computer science researchers have studied
using heuristics to tackle feature selection problems.
However, it was recently reported [6] that many proposed
methods are limited by one or more of the following con-
straints in their designs. (1) The size of the resultant feature
set is assumed fixed: users are required to explicitly specify
the maximum dimension of the feature subset.