19-04-2014, 12:08 PM
On the application of genetic programming for software engineering predictive modeling: A systematic review
Abstract
The objective of this paper is to investigate the evidence for symbolic regression using genetic
programming (GP) being an effective method for prediction and estimation in software engineering,
when compared with regression/machine learning models and other comparison groups (including
comparisons with different improvements over the standard GP algorithm). We performed a systematic
review of literature that compared genetic programming models with comparative techniques based on
different independent project variables. A total of 23 primary studies were obtained after searching
different information sources in the time span 1995–2008. The results of the review show that
symbolic regression using genetic programming has been applied in three domains within software
engineering predictive modeling: (i) software quality classification (eight primary studies);
(ii) software cost/effort/size estimation (seven primary studies); and (iii) software fault
prediction/software reliability growth modeling (eight primary studies). While there is evidence in
support of using genetic programming for software quality classification, software fault prediction
and software reliability growth modeling, the results are inconclusive for software cost/effort/size
estimation.
Introduction
Evolutionary algorithms represent a subset of the metaheuristic
approaches inspired by evolution in nature (Burke & Kendall,
2005), drawing on mechanisms such as reproduction, mutation,
crossover, natural selection and survival of the fittest. All
evolutionary algorithms share a set of common properties
(Bäck, Fogel, & Michalewicz, 2000):
1. These algorithms work with a population of solutions, utilizing
a collective learning process. This population of solutions makes
up the search space for the evolutionary algorithms.
2. The solutions are evaluated by means of a quality or fitness
value, whereby the selection process favors better solutions
over worse ones.
3. New solutions are generated by random variation operators
intended to model mutation and recombination.
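
The three properties above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any reviewed study: it evolves real-valued vectors rather than the program trees used in GP, and the function names, the tournament selection scheme, the elitism step, the parameter values and the toy fitness function are all assumptions chosen for brevity.

```python
import random


def tournament(population, fitness, rng):
    """Property 2: selection favors the fitter of two random individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(fitness, genome_length=8, pop_size=20, generations=50, seed=0):
    """Hypothetical minimal evolutionary loop that maximizes `fitness`."""
    rng = random.Random(seed)
    # Property 1: work with a population of candidate solutions.
    population = [
        [rng.uniform(-5.0, 5.0) for _ in range(genome_length)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Elitism (an assumption): carry the current best over unchanged.
        next_gen = [list(max(population, key=fitness))]
        while len(next_gen) < pop_size:
            mother = tournament(population, fitness, rng)
            father = tournament(population, fitness, rng)
            # Property 3: recombination (one-point crossover) ...
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # ... and mutation (small Gaussian noise on some genes).
            child = [
                gene + rng.gauss(0.0, 0.3) if rng.random() < 0.2 else gene
                for gene in child
            ]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


def sphere_fitness(genome):
    """Toy fitness: higher is better, maximum 0 at the all-zero vector."""
    return -sum(gene * gene for gene in genome)


best = evolve(sphere_fitness)
print(round(sphere_fitness(best), 4))
```

Because the best individual is always copied into the next generation, the best fitness in this sketch can never get worse from one generation to the next.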
Software quality classification
Our literature search found eight studies on the application of
symbolic regression using GP for software quality classification.
Six of these eight studies were co-authored by largely the same
authors, with one author taking part in each of the six. The data
sets also overlapped between studies, which indicates that the
conclusions of these studies were tied to the nature of the data
sets used. However, these studies used different variations of the
GP fitness function and also different comparison groups. This, in
our opinion, indicates distinct contributions, making them worthy
of inclusion as primary studies for this review. The importance of
good fitness functions is also highlighted by Harman (2007):
“... no matter what search technique is employed, it is the fitness
function that captures the crucial information; it differentiates a
good solution from a poor one, thereby guiding the search.”
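
To make the point about fitness functions concrete, the following sketch shows how two fitness functions can rank the same candidate classifiers differently. It is purely illustrative: the module data are made up, and the two fitness definitions, the cost values and the threshold rules are assumptions, not the fitness variations used in the reviewed studies.

```python
def accuracy_fitness(predict, data):
    """Fraction of modules classified correctly (higher is better)."""
    return sum(predict(x) == label for x, label in data) / len(data)


def cost_weighted_fitness(predict, data, fn_cost=5.0, fp_cost=1.0):
    """Negative misclassification cost (higher is better).

    Missing a fault-prone module (false negative) is assumed to cost
    more than flagging a healthy one (false positive).
    """
    cost = 0.0
    for x, label in data:
        predicted = predict(x)
        if label and not predicted:
            cost += fn_cost
        elif predicted and not label:
            cost += fp_cost
    return -cost


# Made-up data: (software metric value, is the module fault-prone?)
modules = [
    (2, False), (3, False), (4, False), (5, True),
    (6, False), (7, False), (9, True), (12, True),
]


# Two hypothetical candidate classifiers: simple threshold rules.
def strict(x):
    return x >= 8


def lenient(x):
    return x >= 5


for name, rule in (("strict", strict), ("lenient", lenient)):
    print(name,
          accuracy_fitness(rule, modules),
          cost_weighted_fitness(rule, modules))
```

On this toy data, plain accuracy prefers the strict rule, while the cost-weighted fitness prefers the lenient one, so a search guided by each would be steered toward different solutions.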