Accelerated PSO Swarm Search Feature Selection for Data Stream Mining Big Data
Simon Fong, Raymond Wong, and Athanasios V. Vasilakos, Senior Member, IEEE
Abstract—Although Big Data is hyped, it raises many technical challenges that confront both academic research communities
and commercial IT deployment; the root sources of Big Data are data streams and the curse of dimensionality. It is
generally known that data sourced from data streams accumulate continuously, making traditional batch-based model
induction algorithms infeasible for real-time data mining. Feature selection has been popularly used to lighten the processing load of
inducing a data mining model. However, when it comes to mining over high-dimensional data, the search space from which an optimal
feature subset is derived grows exponentially in size, leading to an intractable demand in computation. In order to tackle this problem,
which stems mainly from the high dimensionality and streaming format of data feeds in Big Data, a novel lightweight feature selection is
proposed. The feature selection is designed particularly for mining streaming data on the fly, by using an accelerated particle swarm
optimization (APSO) type of swarm search that achieves enhanced analytical accuracy within reasonable processing time. In this
paper, a collection of Big Data sets with an exceptionally large degree of dimensionality are put under test of our new feature selection
algorithm for performance evaluation.
Index Terms—Feature selection, metaheuristics, swarm intelligence, classification, big data, particle swarm optimization
1 INTRODUCTION
Recently, a lot of news in the media has advocated the hype of
Big Data, which is manifested in three problematic issues.
They are the 3V challenges: the Velocity problem, which
gives rise to a huge amount of data to be handled at an esca-
lating speed; the Variety problem, which makes data process-
ing and integration difficult because the data come from
various sources and are formatted differently; and the Vol-
ume problem, which makes storing, processing, and analyzing
them challenging in both computation and archiving.
In view of these 3V challenges, traditional data
mining approaches, which are based on full batch-
mode learning, may fall short of meeting the demand for
analytic efficiency. That is simply because traditional
data mining model construction techniques require load-
ing in the full set of data, after which the data are parti-
tioned according to some divide-and-conquer strategy;
two classical algorithms are the Classification And Regres-
sion Tree (CART) algorithm for decision tree induction
[1] and Rough-set discrimination [2]. Each time
fresh data arrive, which is typical in the data collection
process that inflates big data into bigger data,
the traditional induction method needs to re-run, and the
model that was built must be rebuilt with the
inclusion of the new data.
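The cost of this full-batch pattern can be sketched in a few lines. The snippet below is illustrative only, not CART or Rough-set discrimination: `induce_model` is a hypothetical stand-in for any batch learner, and the point is that every arrival forces a re-run over the entire accumulated history.

```python
# Sketch of the full-batch retraining pattern described above: whenever
# fresh records arrive, the model is rebuilt from scratch over ALL data
# seen so far, so each rebuild reprocesses the entire (growing) history.
# `induce_model` is a toy stand-in for any batch learner (e.g. CART).

def induce_model(dataset):
    """Toy batch learner: predict the majority class of the dataset."""
    labels = [label for _features, label in dataset]
    return max(set(labels), key=labels.count)

history = []
rebuild_costs = []              # records reprocessed at each rebuild
for batch in [[([1], "a")], [([2], "a")], [([3], "b")]]:
    history.extend(batch)       # big data inflating to bigger data
    model = induce_model(history)    # re-run over the FULL set every time
    rebuild_costs.append(len(history))

print(rebuild_costs)  # -> [1, 2, 3]: cumulative work grows quadratically
print(model)          # -> "a" (majority class after all arrivals)
```

With n arrivals the learner reprocesses 1 + 2 + ... + n records in total, which is the inefficiency that stream mining methods avoid.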
• S. Fong is with the Department of Computer and Information Science,
University of Macau, Taipa, Macau SAR. E-mail: ccfong[at]umac.mo.
• R. Wong is with the School of Computer Science and Engineering, Univer-
sity of New South Wales, Kensington, NSW 2052, Australia.
E-mail: wong[at]cse.unsw.edu.au.
• A. V. Vasilakos is with the Department of Computer Science, Electrical
and Space Engineering, Lulea University of Technology, 97187 Lulea,
Sweden. E-mail: th.vasilakos[at]gmail.com.
Manuscript received 31 Oct. 2014; revised 21 Apr. 2015; accepted 9 May
2015. Date of publication 31 May 2015; date of current version 10 Feb. 2016.
For information on obtaining reprints of this article, please send e-mail to:
reprints[at]ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TSC.2015.2439695
In contrast, the new breed of algorithms known as data
stream mining methods [3] are able to subdue these 3V prob-
lems of big data, since the 3V challenges are mainly
characteristics of data streams. A data stream algorithm is not
stymied by huge volume or high-speed data collection.
The algorithm is capable of inducing a classification or pre-
diction model in a bottom-up fashion: each pass of data
from the data stream triggers the model to incrementally
update itself, without the need to reload any previously
seen data. This type of algorithm can potentially handle
data streams that amount to infinity, and it can run in
memory, analyzing and mining data streams on the fly. It is
regarded as a killer method for the big data hype and its related
analytics problems. Lately, researchers concur that data stream
mining algorithms are the solutions for tackling big
data now and in the years to come [4], [5].
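The incremental, bottom-up update described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: a hypothetical centroid classifier that refines one running mean per class as each record passes, keeping O(1) memory per class and never reloading previously seen data.

```python
# Minimal sketch of stream mining's incremental update: each arriving
# record updates the model in place; no past data are revisited.
# (Illustrative only -- class labels and features are hypothetical.)

class StreamingCentroidClassifier:
    """Keeps one running mean vector per class; O(1) memory per class."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.counts = {}   # class label -> number of samples seen
        self.means = {}    # class label -> running mean of each feature

    def partial_fit(self, x, label):
        """Incremental update: the new sample shifts its class mean."""
        if label not in self.counts:
            self.counts[label] = 0
            self.means[label] = [0.0] * self.n_features
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i in range(self.n_features):
            mean[i] += (x[i] - mean[i]) / n   # Welford-style running mean

    def predict(self, x):
        """Assign the class whose centroid is nearest (squared Euclidean)."""
        def sq_dist(label):
            return sum((xi - mi) ** 2
                       for xi, mi in zip(x, self.means[label]))
        return min(self.means, key=sq_dist)

# Each pass of the stream triggers an in-place model update:
clf = StreamingCentroidClassifier(n_features=2)
stream = [([0.0, 0.1], "a"), ([0.2, 0.0], "a"),
          ([1.0, 0.9], "b"), ([0.9, 1.1], "b")]
for x, y in stream:
    clf.partial_fit(x, y)
print(clf.predict([0.1, 0.1]))  # -> "a"
```

Because each `partial_fit` call touches only the new record and the running statistics, the memory footprint stays constant even as the stream grows toward infinity.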
In both families of data mining algorithms, stream-based
and batch-based, classification has been widely adopted for
supporting decision inference from big data. In supervised
learning, a classification model, or classifier, is trained by
inducing the relationships between the attributes of his-
torical records and their class labels, which are respectively
the predictor features of all the data and their predicted classes.
Subsequently, the classifier is used to predict the
appropriate classes of unseen samples.
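The train-then-predict workflow just described can be made concrete with a deliberately simple example. The records and the 1-nearest-neighbour rule below are illustrative assumptions, not the classifiers studied in this paper: historical records pair predictor features with class labels, and the trained model then assigns a class to an unseen sample.

```python
# Hedged sketch of the supervised workflow: induce a classifier from
# labelled historical records, then predict the class of unseen samples.
# 1-NN is used purely for brevity; labels/features are hypothetical.

def train_1nn(records):
    """'Training' for 1-NN is just storing the labelled records."""
    return list(records)

def predict_1nn(model, sample):
    """Predict the class label of the single nearest stored record."""
    def sq_dist(rec):
        features, _label = rec
        return sum((f - s) ** 2 for f, s in zip(features, sample))
    _, label = min(model, key=sq_dist)
    return label

# Historical records: (predictor features, class label)
history = [([1.0, 1.0], "spam"), ([0.0, 0.2], "ham"), ([0.1, 0.0], "ham")]
model = train_1nn(history)
print(predict_1nn(model, [0.9, 1.1]))  # -> "spam"
```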
In classifier applications, feature selection (FS) attempts to
select a subset of the most influential features by excluding
irrelevant and redundant features, in order to enhance accu-
racy and speed up model training time for the classifier. In
the past, many computer science researchers have studied
using heuristics to tackle feature selection problems.
However, it was recently reported [6] that many proposed
methods are limited by one or more of the following con-
straints in their designs. (1) The size of the resultant feature
set is assumed fixed: users are required to explicitly specify
the maximum dimension of the feature subset.