Opinion Formation Inspired Search Strategies for Feature Selection
Introduction
For a long time, there has been a gap between the natural sciences and the humanities. A particular
example is the difference between the technical and the social sciences. Although it may seem quite
difficult to imagine common elements shared by technical and social sciences, cross influences
and inspirations have recently become more and more frequent. One relatively wide area of
these connections is the use of computational intelligence methods in the social sciences [1, 2].
Some of these sociological simulations were based on models of individual actors realized simply
by cellular automata. For example, Schelling's checkerboard model [3] is very similar to cellular
automata models. Among others, it is regarded as one of the examples of a good explanation in
the social sciences and as one of the predecessors of agent-based computer models.
This thesis is oriented differently. Instead of applying computer methods to the social
sciences, it takes a model from social psychology, adapts it, and uses it in the area of
parameter optimization. It is an attempt to use simulated people to make decisions about
solutions of an optimization problem. The simulation is based on simple opinion formation
(OF) models widely used in computational psychology and commonly analyzed with the tools of
statistical physics. In the thesis, we create a novel population-based optimization method
in which the candidate solutions influence each other and try to converge to a "good"
consensus. The novel method, called the social impact theory based optimizer (SITO), is applied
here to the feature subset selection problem known from pattern recognition.
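The precise SITO dynamics are defined later, in Chapter 6. Purely as an illustration of the opinion-formation idea, the sketch below updates a population of binary feature masks so that each bit is pulled toward the value held by fitter and "closer" individuals, loosely in the spirit of the Nowak-Szamrej-Latane social impact model; every name and parameter here (social_impact_step, flip_noise, the distance weighting) is our own illustrative choice, not the thesis's definition.

# Illustrative sketch only: a simplified opinion-formation update on binary
# feature masks, loosely following Nowak-Szamrej-Latane social impact models.
# All names and parameters are ours, not the thesis's definition of SITO.
import numpy as np

rng = np.random.default_rng(0)

def social_impact_step(pop, fitness, flip_noise=0.05):
    """One update of a population of binary opinions (feature masks).

    pop     : (n, d) 0/1 array, one row per individual
    fitness : (n,) array, higher = better; used as persuasive 'strength'
    """
    n, d = pop.shape
    # Immediacy: individuals that agree on many bits are "closer" (+1 avoids
    # division by zero for the self-distance).
    dist = np.array([[np.sum(a != b) for b in pop] for a in pop]) + 1.0
    new_pop = pop.copy()
    for i in range(n):
        for k in range(d):
            same = pop[:, k] == pop[i, k]
            # Supportive vs. persuasive impact, weighted by strength/distance^2.
            support = np.sum(fitness[same] / dist[i, same] ** 2)
            oppose = np.sum(fitness[~same] / dist[i, ~same] ** 2)
            if oppose > support or rng.random() < flip_noise:
                new_pop[i, k] = 1 - pop[i, k]
    return new_pop

Iterating such a step drives the population toward a local consensus, while the noise term preserves some diversity.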
In many aspects, the algorithm is similar to particle swarm optimization (PSO). The
PSO algorithm [4] and its modifications are based on the well-known phenomenon of swarm
intelligence. Swarm intelligence is the property of a system whereby the collective behaviors of
unsophisticated agents interacting locally with their environment cause coherent functional
global patterns to emerge. In other words, swarm intelligence refers to the problem-solving
behavior that emerges from the interaction of simple agents. Individuals within
the group interact to solve a global objective by exchanging locally available information,
which in the end propagates through the entire group such that the problem is solved more
efficiently than it could be by a single individual.
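For concreteness, here is a minimal, textbook-style sketch of Kennedy and Eberhart's binary PSO operating directly on 0/1 feature masks via the sigmoid velocity rule; the parameter values (w, c1, c2, population size) are illustrative defaults, not the settings used in the thesis.

# Minimal sketch of binary PSO (Kennedy & Eberhart) for subset selection;
# parameter values are illustrative defaults, not those used in the thesis.
import numpy as np

def binary_pso(fitness, d, n=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n, d))           # positions = feature masks
    v = np.zeros((n, d))                          # velocities
    pbest = x.copy()                              # personal best masks
    pbest_f = np.array([fitness(xi) for xi in x])
    g = pbest[np.argmax(pbest_f)].copy()          # global best mask
    for _ in range(iters):
        r1, r2 = rng.random((n, d)), rng.random((n, d))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        # Sigmoid rule: bit is 1 with probability sigmoid(velocity).
        x = (rng.random((n, d)) < 1 / (1 + np.exp(-v))).astype(int)
        f = np.array([fitness(xi) for xi in x])
        better = f > pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmax(pbest_f)].copy()
    return g

Here fitness would typically be a wrapper criterion, i.e., the estimated accuracy of a classifier trained on the selected features.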
Motivation and goals of the thesis
This chapter describes the main purpose of writing this thesis, defines the aims of the thesis,
and gives an overview of its structure. The accomplishment of the aims is assessed at the end of
the thesis in Section 8.2.
Motivation
Our main topic is the development of a novel socially inspired optimizer and its application
to feature subset selection. Generally, randomized search heuristics such as evolutionary
computation methods or swarm intelligence methods are recognized as powerful optimization
algorithms and play an important role in modern algorithmics [6]. These methods
have been successfully used to solve difficult and complex real-world problems. An interesting
point is that most of these models are inspired solely by the animal part of nature.
In particular, we know only a very limited number of examples of inspiration taken from humans.
For example, some neural networks inspired by the human brain can be used for optimization,
and evolutionary optimization techniques relate to human evolution. However, there are very
few approaches inspired directly by human behavior. Shi [7] argues: "Human
beings are social animals and are the most intelligent animals in the world. Therefore, it is
natural to expect that an optimization algorithm inspired by human creative problem solving
process will be a good optimization algorithm." Indeed, Shi's recent algorithm called brain
storm optimization and Kennedy and Eberhart's binary particle swarm optimization are rare
examples of methods inspired by human decision making. This is our motivation for the
research in this field.
Organization of the thesis
The thesis is organized into eight chapters. The previous chapter, Chapter 1, contains our
formulation of the feature selection problem. Chapter 2 describes the motivation for our work,
defines the main goals of the thesis, and contains this section. In Chapter 3, the state of the
art of the underlying topics is presented. Chapter 4 briefly describes some issues related to the
experimental work. The main problems and parts of the feature selection process are further
described in Chapter 5. The main description of the novel optimization techniques is included
in Chapter 6. Some improvements of the feature selection techniques are further proposed
and tested in Chapter 7. The thesis is concluded in Chapter 8.
State of the art
In this chapter, we first review the wide area of feature selection and its recent approaches.
First, the feature selection algorithms are divided into three different types, and both
very well-known and very recent examples of the particular classes are referenced. Next,
we give a somewhat shorter description of different, mostly heuristic, search approaches.
Finally, the relevant area of swarm intelligence, and particularly PSO, is reviewed in detail,
with a particular focus on the large number of algorithm modifications. Some important
applications are also enumerated.
Feature subset selection methods
Jain and Zongker [23] taxonomized FS methods as depicted in Fig. 3.1. In this thesis, we deal
only with the branch called suboptimal methods. Blum and Langley [30] have grouped the
feature selection methods into three generic classes, namely embedded, filter, and wrapper
methods:
Embedded methods
The embedded methods embed the selection within the design of the classifier. These methods
usually evaluate the features within the induction process and select the best ones. Examples
are methods that induce logical descriptions, the training of decision-tree-based classifiers, and
separate-and-conquer methods for learning decision lists.
Often, a greedy approach is used in the embedded methods for the selection of features. This
causes problems if there is a strong interaction among the features [21], since separate features
may seem to be irrelevant [33]. Our formulation of the feature selection problem supposes the
feature selection to be isolated from the training procedure, so we omit a detailed review
of the embedded methods.
Wrapper methods
Wrapper methods treat feature selection as a wrapper around the classifier. Popular
classifiers whose error estimates are used as the selection criterion are decision trees [56, 57],
SVMs [8, 58, 59, 60], NN classifiers [61, 62, 63, 64], the naive Bayes algorithm [57], and many
others [65].
The wrappers are described above in Section 1.1.3. Wrappers are a popular approach
to feature selection [66]. They were used in [67, 68] and [69] to select features for decision
trees. NN classifiers are expected to benefit more from wrapper feature selection, because
they do not incorporate any embedded feature selection. In [70], the OBLIVION algorithm is
described, which is based on NN classifiers. There are many other wrapper approaches for
use with NN classifiers [20, 71]. Inza et al. [72] used the wrapper approach with the naive Bayes
algorithm and the ID3 algorithm. There are also many wrapper approaches for the SVM classifier
[44]. Using weights from the SVM classification model for feature selection was suggested in
[73] and studied in detail for the linear SVM in [74]. Brank et al. [75] further generalized this
approach to any linear classifier and demonstrated it on the linear perceptron and the linear SVM.
The linear SVM is often combined with wrapper feature selection because of its simplicity
and reasonable time complexity. The novel optimization method developed here was also
applied to wrapper feature selection for SVM by Bhondekar et al. [9].
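To make the wrapper idea concrete, the sketch below scores a candidate feature mask by the cross-validated accuracy of a classifier restricted to the selected features; scikit-learn and the 1-NN classifier are illustrative choices on our part, not prescribed by the thesis.

# Sketch of a wrapper criterion: score a binary feature mask by the
# cross-validated accuracy of a classifier restricted to the selected
# features. scikit-learn and 1-NN are illustrative choices only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(mask, X, y, folds=5):
    if not mask.any():                  # empty subsets are invalid
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=1)
    scores = cross_val_score(clf, X[:, mask.astype(bool)], y, cv=folds)
    return scores.mean()

Any search strategy, from sequential search to PSO- or SITO-style optimizers, can then simply maximize wrapper_fitness over binary masks.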
Comparative studies
There have been many comparative studies performed in the area of feature selection. Hall
and Holmes compared six feature selection techniques that produce ranked lists of features [57].
They selected features for two classifiers, naive Bayes and the C4.5 decision tree, and observed
that, with respect to accuracy, the wrapper was the best attribute selection scheme if speed
was not an issue. Palombo and Narsky [88] compared the feature selection methods implemented
in the StatPatternRecognition package (http://sourceforge.net/projects/statpatrec/) and observed
the superiority of one wrapper approach known as generalized sequential forward selection. (At
each step, the subset with the smallest loss on the test set is selected by adding and removing
features, where the numbers of added and removed features are chosen by the user.) Hua
et al. [89] compared eight filter approaches and two wrapper approaches (SFS and SFFS) on
both synthetic and real data sets with a huge number of features and a small sample size. They
observed that none of the considered methods performs best across all scenarios; however,
they point out some interesting general trends. Filter methods showed better performance
for small-sample-size problems, while wrapper methods performed better when
the sample size was sufficiently large. Marinaki et al. [64] focused on tabu search guided by
an NN classifier's performance and compared different distance metrics for the classifier. They
observed that the city block metric and the Euclidean metric show the best classification
performance while keeping computational requirements low.
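Since sequential forward selection (SFS) recurs throughout these comparisons, a minimal sketch of the plain greedy variant, built on a wrapper criterion such as wrapper_fitness above, may help; stopping at a fixed subset size k is our simplification, and the generalized variant of [88] would additionally allow removals.

# Minimal sequential forward selection (SFS) on top of a wrapper criterion;
# stopping at a fixed subset size k is a simplification for this sketch.
import numpy as np

def sfs(score, d, k):
    """Greedily grow a 0/1 feature mask of size k, maximizing score(mask)."""
    mask = np.zeros(d, dtype=int)
    while mask.sum() < k:
        candidates = [j for j in range(d) if not mask[j]]
        # Pick the single feature whose addition improves the score most.
        best_j = max(candidates,
                     key=lambda j: score(np.where(np.arange(d) == j, 1, mask)))
        mask[best_j] = 1
    return mask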
Error estimation
There is a huge number of studies focusing on error estimation, mostly on cross-validation
(CV) and bootstrap (BO) techniques. Efron [24] compared leave-one-out CV, several BO
techniques, and some other methods on prediction problems. He focused on small sample
sizes and concluded that the .632 bootstrap performed best. It was also observed that the
leave-one-out technique gives nearly unbiased error estimates with very high variance.
Breiman et al. [90] used ten-fold CV for decision tree pruning and observed that it is satisfactory
for choosing the correct tree. Moreover, the difference in the error estimates for two
tree rules was observed to be much more accurate than the two estimates themselves. Jain
et al. [91] compared the zero bootstrap and leave-one-out CV for NN classifiers on artificial
data and observed that the BO estimator has a smaller confidence interval than the leave-one-out
estimator. Weiss [25] compared the zero BO estimator, the .632 BO estimator, leave-one-out CV,
and repeated stratified two-fold CV. For the 1-NN classifier, repeated stratified two-fold CV was
reported to be a stronger estimator than the other techniques. This corresponds to the results
described by Krzek [26], where repeated two-fold CV was reported to lead to the best stability
and performance of wrapper methods. Bailey and Elkan [92] observed for the FOIL inducer
that the .632 bootstrap has low variance and higher bias than leave-one-out CV. Borra and
Di Ciaccio [93] compared several CV, BO, and covariance penalty methods for the error
estimation of predictors and obtained the best results for ten-fold CV and the parametric
BO estimates.
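For reference, Efron's .632 bootstrap combines the resubstitution error with the out-of-bag bootstrap error as err_.632 = 0.368 * err_resub + 0.632 * err_oob. A minimal sketch, assuming numpy arrays and a scikit-learn-style fit/predict classifier interface:

# Sketch of Efron's .632 bootstrap error estimate:
#   err_632 = 0.368 * resubstitution error + 0.632 * out-of-bag error.
# X, y are assumed to be numpy arrays; clf has sklearn-style fit/predict.
import numpy as np

def err632(clf, X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    err_resub = np.mean(clf.fit(X, y).predict(X) != y)
    oob_errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # points left out of the sample
        if oob.size == 0:
            continue
        clf.fit(X[idx], y[idx])
        oob_errs.append(np.mean(clf.predict(X[oob]) != y[oob]))
    return 0.368 * err_resub + 0.632 * np.mean(oob_errs)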