26-11-2012, 12:48 PM
A genetic algorithm-based rule extraction system
Abstract
Individual classifiers are used to predict the class of unknown objects. However, they are usually domain specific and do not scale up well when handling data sets of huge size, high dimensionality, or imbalanced class distribution. This article introduces an accuracy-based learning system called DTGA (decision tree and genetic algorithm) that aims to improve prediction accuracy on any classification problem irrespective of domain, size, dimensionality, and class distribution. More specifically, the proposed system consists of two rule-inducing phases. In the first phase, a base classifier, C4.5 (a decision-tree-based rule inducer), is used to produce rules from the training data set, whereas a GA (genetic algorithm) in the next phase refines them with the aim of providing more accurate, higher-performance rules for prediction. The system has been compared with competent non-GA-based systems: a neural network, Naïve Bayes, a rule-based classifier using rough set theory, and C4.5 (i.e., the base classifier of DTGA), on a number of benchmark data sets collected from the UCI (University of California at Irvine) machine learning repository.
Empirical results demonstrate that the proposed hybrid approach provides marked improvement in a
number of cases.
Introduction
In recent decades, there has been growing research interest in the fields of machine learning and knowledge discovery [13]. The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of stored data contains valuable hidden knowledge, which could be used to improve the decision-making process of an organization. For instance, data about previous sales might contain interesting relationships between products and customers. The discovery of such relationships can be very useful for increasing the sales of a company. However, the number of human data analysts grows at a much smaller rate than the amount of stored data. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data, and many data mining algorithms have been developed in order to extract knowledge from large databases [5]. In practice, one of the main tasks considered in knowledge discovery is supervised classification, where the learning process is provided with a set of training examples of the target concepts (i.e., classes).
DTGA as LCS
J. Holland presented the first architecture of the LCS (learning classifier system) in 1975. Since then, a number of investigations of its architecture have been carried out to solve classification problems. In 1976 it split into two categories, depending on where the GA acts: (i) the Pittsburgh-style LCS, in which a population of separate rule sets is maintained and the GA recombines and reproduces the best of these rule sets; and (ii) the Michigan-style LCS, which consists of only a single population, with the GA focusing on selecting the best rules within that rule set. UCS (an accuracy-based classifier system) is an example of a Michigan-style LCS, specifically designed for supervised environments. Basically, it keeps the principal features of XCS (a best-action-map-based classifier system) [39], but changes the way in which accuracy is computed. In particular, the GA in XCS is run preferably on the action sets, whereas the GA in UCS, though inherited from XCS, performs crossover and mutation by selecting two classifiers from the correct set [C] with probability proportional to their fitness. The resulting offspring are then inserted into the population, leaving out the incorrect rules. Structurally, many of the design criteria are optimized in UCS.
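
As a concrete illustration of the GA step just described, the following Python fragment is a minimal sketch, not the paper's implementation: the classifier fields, the mutation rate, and the ternary condition alphabet {'0', '1', '#'} ('#' = don't care) are assumptions. It selects two parents from the correct set [C] with probability proportional to fitness, recombines them, and mutates the offspring.

import random

def select_parent(correct_set):
    # Roulette-wheel selection: probability proportional to fitness.
    total = sum(cl["fitness"] for cl in correct_set)
    pick = random.uniform(0, total)
    running = 0.0
    for cl in correct_set:
        running += cl["fitness"]
        if running >= pick:
            return cl
    return correct_set[-1]

def crossover(cond_a, cond_b):
    # One-point crossover of two equal-length condition strings.
    point = random.randint(1, len(cond_a) - 1)
    return cond_a[:point] + cond_b[point:], cond_b[:point] + cond_a[point:]

def mutate(cond, rate=0.04):
    # With a small probability, replace each gene by a random ternary symbol.
    return "".join(random.choice("01#") if random.random() < rate else g
                   for g in cond)

def ga_step(correct_set):
    # The two offspring inherit their parents' actions (class labels)
    # and are inserted into the population with a low initial fitness.
    p1, p2 = select_parent(correct_set), select_parent(correct_set)
    c1, c2 = crossover(p1["condition"], p2["condition"])
    return [{"condition": mutate(c1), "action": p1["action"], "fitness": 0.1},
            {"condition": mutate(c2), "action": p2["action"], "fitness": 0.1}]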
Proposed data-splitting technique
The data-splitting technique is an important issue in machine learning. It is a genuinely difficult task to identify how many instances are sufficient to gain the knowledge needed for making good decisions. Further, a data set may also be imbalanced, i.e., the majority class of the data set may heavily dominate the minority class; this sort of data is commonly found in the medical domain. Experimentally, the imbalanced nature of a data set is often reported as an obstacle to the induction of good classifiers. Common data-sampling techniques such as random over-sampling, random under-sampling, and the synthetic minority over-sampling technique (SMOTE) are usually employed to handle imbalanced data sets. A brief description of each is given below.
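
For orientation, here is a minimal Python sketch of the two random re-sampling strategies; the function names, the (features, label) pair representation, and the 1:1 balancing target are assumptions, not the paper's procedure.

import random

def random_oversample(examples, minority_label):
    # Duplicate randomly chosen minority examples until both classes
    # contain the same number of instances.
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    resampled = list(minority)
    while len(resampled) < len(majority):
        resampled.append(random.choice(minority))
    return majority + resampled

def random_undersample(examples, minority_label):
    # Randomly discard majority examples until both classes contain
    # the same number of instances.
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    return minority + random.sample(majority, len(minority))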
Interface
In general, the decision rules (i.e., knowledge) induced from examples are represented as logical expressions of the form IF (conditions) THEN (decision class), where the conditions in each rule are conjunctions of elementary tests on the values of attributes, and the decision part indicates the assignment of an object (one that satisfies the condition part) to a given decision class. Thus, each rule can be viewed as antecedent → consequent, where the antecedent consists of conjuncts (i.e., conditions) and the consequent is the decision (i.e., action).
In the present approach, the C4.5 rule induction algorithm adopted in the first phase extracts rules from the training examples. Each extracted rule is close to the 'IF–THEN' form, as illustrated by the example of the golf-playing problem presented in Appendix B.
Obviously, rules in such an 'IF–THEN' format create an interpretability problem when the GA is to be applied. So, to make them more convenient for the GA, we prefer to represent such rules in a table-like form (as shown in the last part of Appendix B). In particular, the Interface performs this representation of the rules, eliminating the 'IF–THEN' part. The discrete values below the attribute names in such a table specify the respective attribute values. A '*' symbol appearing in a rule simply indicates the absence of a condition on the attribute at that position (i.e., the attribute corresponding to the '*' has no importance in that rule). The '*' value for an attribute (say Ai) in a rule is placed according to Ai's position in the data set. Again, all the non-target attributes, irrespective of their presence or absence in a rule, are strictly included in every rule with a view to simpler implementation.
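
To make the tabular encoding concrete, here is a minimal sketch; the golf-style attribute names and the encode_rule helper are assumptions for illustration, not the paper's code. It maps an 'IF–THEN' rule onto a fixed-length row, one slot per non-target attribute in data-set order, with '*' in the don't-care positions.

# One slot per non-target attribute, in the order they appear in the data set.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

def encode_rule(conditions, decision):
    # conditions: {attribute: value} for the conjuncts present in the rule;
    # attributes without a condition get '*', and the decision class is
    # appended as the last element of the row.
    row = [conditions.get(attr, "*") for attr in ATTRIBUTES]
    row.append(decision)
    return row

# A rule like  IF outlook = sunny AND humidity = high THEN play = no
# becomes a row in which temperature and windy are don't-care positions:
print(encode_rule({"outlook": "sunny", "humidity": "high"}, "no"))
# ['sunny', '*', 'high', '*', 'no']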