28-11-2012, 02:25 PM
Software Module Clustering as a Multi-Objective Search Problem
Software Module Clustering.pdf (Size: 7.05 MB / Downloads: 129)
Abstract
Software module clustering is the problem of automatically organizing software units into modules to improve program
structure. There has been a great deal of recent interest in search-based formulations of this problem in which module boundaries are
identified by automated search, guided by a fitness function that captures the twin objectives of high cohesion and low coupling in a
single-objective fitness function. This paper introduces two novel multi-objective formulations of the software module clustering
problem, in which several different objectives (including cohesion and coupling) are represented separately. In order to evaluate the
effectiveness of the multi-objective approach, a set of experiments was performed on 17 real-world module clustering problems. The
results of this empirical study provide strong evidence to support the claim that the multi-objective approach produces significantly
better solutions than the existing single-objective approach.
INTRODUCTION
SOFTWARE module clustering is an important and challenging
problem in software engineering. It is widely
believed that a well-modularized software system is easier
to develop and maintain [8], [27], [29]. Typically, a good
module structure is regarded as one that has a high degree of
cohesion and a low degree of coupling [8], [27], [29]. Sadly,
as software evolves, its modular structure tends to degrade
[17], necessitating a process of restructuring to regain the
cognitive coherence of previous incarnations. This paper is
concerned with automated techniques for suggesting software
clusterings, delimiting boundaries between modules
that maximize cohesion while minimizing coupling.
Many authors have considered the implications of software
modularization on many software engineering concerns.
Badly modularized software is widely regarded as a
source of problems for comprehension, increasing the time
for ongoing maintenance and testing [27], [29], [31]. The use
of cohesion and coupling to assess module structure was
first popularized by the work of Constantine and Yourdon
[8], who introduced a seven-point scale of cohesion and
coupling measurement. These levels of cohesion and their
measurement have formed the topic of much work which
has sought to define metrics to compute them and to assess
their impact on software development [1], [3], [25], [14].
AUTOMATED SOFTWARE MODULE CLUSTERING
Many metaheuristic methods have been successfully applied
to software module clustering. The field was
established by the seminal work of the Drexel group [20].
In this work, hill climbing was the primary search technique
[20], leading to the development of a tool called Bunch [19]
for automated software module clustering.
Several other metaheuristic search technologies have
been applied, including simulated annealing and genetic
algorithms [13], [18], [22]. However, these experiments have
all shown that other techniques are outperformed in both
result quality and execution time by hill climbing.
In order to formulate software engineering problems as
search problems, the representation and fitness function need
to be defined [6], [12]. In the case of module clustering,
previous work has used the Module Dependency Graph
(MDG) as a representation of the problem [20]. The MDG is
represented as a simple array mapping modules (array
indices) to clusters (array elements used to identify clusters)
[20]. The array f2; 2; 3; 2; 4; 4; 2; 3g denotes a clustering of
eight modules into three clusters, identified by the numbers 2,
3, and 4. For example, modules numbered 0, 1, 3, and 6
are all located in the same cluster (which is numbered 2).
The choice of numbers of module identifier is arbitrary, so
this clustering is equivalent to f1; 1; 3; 1; 4; 4; 1; 3g and
f3; 3; 2; 3; 4; 4; 3; 2g.
SOFTWARE MODULE CLUSTERING AS A
MULTI-OBJECTIVE PROBLEM
Existing approaches to the twin objectives of high cohesion
and low coupling have combined these two objectives into a
single-objective function, with all of the drawbacks to which
the introduction of this paper referred. Pareto optimality is
an alternative approach to handling multiple objectives that
retains the character of the problem as a multi-objective
problem. Using Pareto optimality, it is not possible to
measure “how much” better one solution is than another,
merely to determine whether one solution is better than
another. In this way, Pareto optimality combines a set of
measurements into a single ordinal scale metric.
The Equal-Size Cluster Approach
The Equal-size Cluster Approach does not attempt to
optimize for the number of clusters in the modularization.
However, this does not mean that solutions may not emerge
that happen to have a large number of clusters. Rather,
the number of clusters is left as a implicit consequence of
the other optimization objectives, allowing the search
process the freedom to choose any number of clusters
(large or small) that best suits the other explicit objectives.
However, the ECA does attempt to produce a modularization
that contains clusters of roughly equal size, thereby
decomposing the software system into roughly equal-size
modules. This tends to mitigate against small isolated
clusters and also tends to avoid the presence of one larger
“god class”-like structure.
Algorithmic Parameters
The genetic encoding used here employs the same system
as that introduced by Doval et al. for the Bunch system [10].
The crossover operator uses single-point crossover and the
mutation operator uses single-point mutation [10].
Algorithmic parameters are dependent on the number of
modules (N). The probability of crossover is 0.8 if the
population size is less than 100. Otherwise, the probability
is 1.0. The probability of mutation is 0:004log2ðNÞ. The
population size is 10N and the maximum number of
generations is 200N. The total size of archives is equal to the
size of the population.
The hill-climbing algorithm used in the experiments
reported upon here is the Bunch API [19]. The algorithm
uses the hill-climbing method from the Bunch library with
one individual. A first neighbor which produces a better
result is accepted (this is the “first ascent” hill-climbing
approach). The output files (in dot format) were generated
for the detailed level; this output is the lowest level of the
Bunch cluster tree and the one that tends to produce the
highest value of MQ.
CONCLUSION AND FUTURE WORK
This paper introduces the multi-objective approach to
software module clustering. It introduced two multi-objective
formulations of the multi-objective problems and
presented results for the application of these techniques,
comparing the results obtained with those from the existing
single-objective formulation on 17 real-world model clustering
problems
The results indicate that one of the two approaches, the
Equal-size Cluster Approach, is able to produce better
solutions than the existing single-objective solution, both in
terms of the multiple objectives it aims to satisfy, but also in
terms of the single objective upon which all previous work
has focused. This improved performance comes at a
significantly increased computational cost. For problems
in which the software engineer is prepared to wait for the
optimal reclustering of their system, the multi-objective
approach therefore has considerable merit.