26-11-2012, 12:48 PM
A genetic algorithm-based rule extraction system
Abstract
Individual classifiers are used to predict the class of unknown objects. However, they are usually domain specific and do not scale up well when handling data sets of huge size, high dimensionality, or imbalanced class distribution. This article introduces an accuracy-based learning system called DTGA (decision tree and genetic algorithm) that aims to improve prediction accuracy on any classification problem irrespective of domain, size, dimensionality, and class distribution. More specifically, the proposed system consists of two rule-inducing phases. In the first phase, a base classifier, C4.5 (a decision-tree-based rule inducer), is used to produce rules from the training data set, whereas a GA (genetic algorithm) in the next phase refines them with the aim of providing more accurate, higher-performance rules for prediction. The system has been compared with competent non-GA-based systems: a neural network, Naïve Bayes, a rule-based classifier using rough set theory, and C4.5 (i.e., the base classifier of DTGA), on a number of benchmark data sets collected from the UCI (University of California at Irvine) machine learning repository.
Empirical results demonstrate that the proposed hybrid approach provides marked improvement in a
number of cases.
Introduction
In recent decades, there has been growing research interest in the fields of machine learning and knowledge discovery [13]. The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of stored data contains valuable hidden knowledge, which could be used to improve the decision-making process of an organization. For instance, data about previous sales might contain interesting relationships between products and customers. The discovery of such relationships can be very useful for increasing the sales of a company. However, the number of human data analysts grows at a much smaller rate than the amount of stored data. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data, and many data mining algorithms have been developed in order to extract knowledge from large databases [5]. In practice, one of the main tasks considered in knowledge discovery is supervised classification, where the learning process is provided with a set of training examples of the target concepts (i.e., classes).
DTGA as LCS
J. Holland presented the first architecture of the LCS (learning classifier system) in 1975. Since then, a number of investigations of its architecture have been carried out to solve classification problems. In 1976 it split into two categories, depending on where the GA acts: (i) the Pittsburgh-style LCS, in which a population of separate rule sets is maintained and the GA recombines and reproduces the best of these rule sets; and (ii) the Michigan-style LCS, which consists of only a single population, with the GA focusing on selecting the best rules within that rule set. UCS (an accuracy-based classifier system) is an example of a Michigan-style LCS, specifically designed for supervised environments. Basically, it keeps the principal features of XCS (a best-action-map-based classifier system) [39], but changes the way in which accuracy is computed. In particular, the GA in XCS is run preferably on the action sets, whereas the GA in UCS, though inherited from XCS, performs crossover and mutation by selecting two classifiers from the correct set [C] with probability proportional to their fitness. The resulting offspring are then inserted into the population, leaving out the incorrect rules. Structurally, many of the design criteria are optimized in UCS.
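
As a concrete illustration of the GA step just described, the following Python fragment is a minimal sketch, not the paper's implementation: the classifier fields, the mutation rate, and the ternary condition alphabet {'0', '1', '#'} ('#' = don't care) are assumptions. It selects two parents from the correct set [C] with probability proportional to fitness, recombines them, and mutates the offspring.

import random

def select_parent(correct_set):
    # Roulette-wheel selection: probability proportional to fitness.
    total = sum(cl["fitness"] for cl in correct_set)
    pick = random.uniform(0, total)
    running = 0.0
    for cl in correct_set:
        running += cl["fitness"]
        if running >= pick:
            return cl
    return correct_set[-1]

def crossover(cond_a, cond_b):
    # One-point crossover of two equal-length condition strings.
    point = random.randint(1, len(cond_a) - 1)
    return cond_a[:point] + cond_b[point:], cond_b[:point] + cond_a[point:]

def mutate(cond, rate=0.04):
    # With a small probability, replace each gene by a random ternary symbol.
    return "".join(random.choice("01#") if random.random() < rate else g
                   for g in cond)

def ga_step(correct_set):
    # The two offspring inherit their parents' actions (class labels)
    # and are inserted into the population with a low initial fitness.
    p1, p2 = select_parent(correct_set), select_parent(correct_set)
    c1, c2 = crossover(p1["condition"], p2["condition"])
    return [{"condition": mutate(c1), "action": p1["action"], "fitness": 0.1},
            {"condition": mutate(c2), "action": p2["action"], "fitness": 0.1}]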
Proposed data-splitting technique
The data-splitting technique is an important issue in machine learning. It is a genuinely difficult task to identify how many instances are sufficient to gain the knowledge needed for making good decisions. Further, a data set may also be imbalanced, i.e., the majority class of the data set may heavily dominate the minority class; this sort of data is commonly found in the medical domain. Experimentally, the imbalanced nature of a data set is often reported as an obstacle to the induction of good classifiers. Common data-sampling techniques such as random over-sampling, random under-sampling, and the synthetic minority over-sampling technique (SMOTE) are usually employed to handle imbalanced data sets. A brief description of each is given below.
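
For orientation, here is a minimal Python sketch of the two random re-sampling strategies; the function names, the (features, label) pair representation, and the 1:1 balancing target are assumptions, not the paper's procedure.

import random

def random_oversample(examples, minority_label):
    # Duplicate randomly chosen minority examples until both classes
    # contain the same number of instances.
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    resampled = list(minority)
    while len(resampled) < len(majority):
        resampled.append(random.choice(minority))
    return majority + resampled

def random_undersample(examples, minority_label):
    # Randomly discard majority examples until both classes contain
    # the same number of instances.
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    return minority + random.sample(majority, len(minority))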
Interface
In general, the decision rules (i.e., knowledge) induced from examples are represented as logical expressions of the form IF (conditions) THEN (decision class), where the conditions in each rule are conjunctions of elementary tests on the values of attributes, and the decision part indicates the assignment of an object (one that satisfies the condition part) to a given decision class. Thus, each rule can be viewed as antecedent → consequent, where the antecedent consists of conjuncts (i.e., conditions) and the consequent is the decision (i.e., action).
In the present approach, the C4.5 rule induction algorithm adopted in the first phase extracts rules from the training examples. Each extracted rule is close to the 'IF–THEN' form, as illustrated by the example of the golf-playing problem presented in Appendix B.
Obviously, rules in such an 'IF–THEN' format create an interpretability problem when the GA is to be applied. So, to make them more convenient for the GA, we prefer to represent such rules in a table-like form (as shown in the last part of Appendix B). In particular, the Interface performs this representation of the rules, eliminating the 'IF–THEN' part. The discrete values below the attribute names in such a table specify the respective attribute values. A '*' symbol appearing in a rule simply indicates the absence of a condition on the attribute at that position (i.e., the attribute corresponding to the '*' has no importance in that rule). The '*' value for an attribute (say Ai) in a rule is placed according to Ai's position in the data set. Again, all the non-target attributes, irrespective of their presence or absence in a rule, are strictly included in every rule with a view to simpler implementation.
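
To make the tabular encoding concrete, here is a minimal sketch; the golf-style attribute names and the encode_rule helper are assumptions for illustration, not the paper's code. It maps an 'IF–THEN' rule onto a fixed-length row, one slot per non-target attribute in data-set order, with '*' in the don't-care positions.

# One slot per non-target attribute, in the order they appear in the data set.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

def encode_rule(conditions, decision):
    # conditions: {attribute: value} for the conjuncts present in the rule;
    # attributes without a condition get '*', and the decision class is
    # appended as the last element of the row.
    row = [conditions.get(attr, "*") for attr in ATTRIBUTES]
    row.append(decision)
    return row

# A rule like  IF outlook = sunny AND humidity = high THEN play = no
# becomes a row in which temperature and windy are don't-care positions:
print(encode_rule({"outlook": "sunny", "humidity": "high"}, "no"))
# ['sunny', '*', 'high', '*', 'no']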