An Empirical Method for Selecting Software Reliability Growth Models
Abstract
Estimating remaining defects (or failures) in software can help test managers make release
decisions during testing. Several methods exist to estimate defect content, among them a variety of
software reliability growth models (SRGMs). SRGMs have underlying assumptions that are often violated
in practice, but empirical evidence has shown that many are quite robust despite these assumption violations. The problem is that, because of assumption violations, it is often difficult to know which models to
apply in practice.
We present an empirical method for selecting SRGMs to make release decisions. The method provides
guidelines on how to select among the SRGMs to decide on the best model to use as failures are reported
during the test phase. The method applies various SRGMs iteratively during system test. They are fitted to
weekly cumulative failure data and used to estimate the expected remaining number of failures in software
after release. If the SRGMs pass proposed criteria, they may then be used to make release decisions. The
method is applied in a case study using defect reports from system testing of three releases of a large
medical record system to determine how well it predicts the expected total number of failures.
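
To make the fitting step concrete, here is a minimal sketch (not code from the paper) of fitting an SRGM to weekly cumulative failure counts; it assumes the Goel-Okumoto model and illustrative made-up data, and uses SciPy's curve_fit for the nonlinear fit:

    # Sketch: fit the Goel-Okumoto SRGM, m(t) = a*(1 - exp(-b*t)), to weekly
    # cumulative failure counts and estimate the remaining failures.
    import numpy as np
    from scipy.optimize import curve_fit

    def goel_okumoto(t, a, b):
        # Mean value function: expected cumulative failures by test week t.
        return a * (1.0 - np.exp(-b * t))

    weeks = np.arange(1, 13)  # test weeks 1..12 (illustrative, not the paper's data)
    cum_failures = np.array([12, 25, 38, 48, 57, 64, 70, 74, 78, 81, 83, 85])

    # a ~ eventual total failures, b ~ weekly detection rate; p0 is a rough guess.
    (a_hat, b_hat), _ = curve_fit(goel_okumoto, weeks, cum_failures,
                                  p0=(cum_failures[-1] * 1.2, 0.1))

    print("estimated total failures:", round(float(a_hat), 1))
    print("estimated remaining failures:", round(float(a_hat) - float(cum_failures[-1]), 1))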
Introduction
Methods that estimate remaining defects (or failures) in software can help test
managers make release decisions during testing. Various estimation models exist to
estimate the expected total number of defects (or failures) or the expected number of remaining defects (or failures). Static defect estimation models include capture–recapture models (Briand et al., 1997, 1998; Eick et al., 1992; Runeson and Wohlin, 1998; Vander Wiel and Votta, 1993; Wohlin and Runeson, 1995, 1998; Yang and Chao, 1995), curve-fitting methods such as the Detection Profile Method and the Cumulative Method (Briand et al., 1998; Wohlin and Runeson, 1998), and experience-based methods (Biyani and Santhanam, 1998; Yu et al., 1988). Software reliability growth models (SRGMs) (Goel, 1985; Goel and Okumoto, 1979; Kececioglu,
1991; Musa and Ackerman, 1989; Musa et al., 1987; Yamada et al., 1983, 1985,
1986) have also been used to estimate remaining failures.
Software Reliability Models
Both static and dynamic software reliability models exist to assess software quality. These models aid in software release decisions (Conte et al., 1986). A static model uses software metrics, such as complexity metrics or inspection results, to estimate the number of defects (or faults) in the software. A dynamic model uses
the past failure discovery rate during software execution or cumulative failure profile
over time to estimate the number of failures. It includes a time component, typically
time between failures.
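
As a hedged illustration of the dynamic-model idea, the standard NHPP (non-homogeneous Poisson process) formulation used throughout the SRGM literature (the notation is not taken from this paper) is: the expected cumulative number of failures observed by time $t$ is given by a mean value function $m(t)$ with $\lim_{t \to \infty} m(t) = a$, the expected total number of failures. After testing through time $T$, the expected number of remaining failures is
\[
  \hat{N}_{\text{remaining}} = a - m(T).
\]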
Software Reliability Models in Practice
Goel (1985) discussed the applicability and limitations of SRGMs during the software development life cycle. He proposed a step-by-step procedure for
fitting a model and applied the procedure to a real-time command and control
software system. His procedure selects an appropriate model based on an analysis of
the testing process and a model’s assumptions. A model whose assumptions are met
by the testing process is applied to obtain a fitted model. A goodness-of-fit (GOF)
test is performed to check the model fit before obtaining estimates of performance
measures to make decisions about additional testing effort. If the model does not fit,
additional data is collected or a better model is chosen. He does not describe how to
look for a better model. The problem with this method is that, in practice, many of
the models’ assumptions are violated; hence none of the models is appropriate.
Examples of common assumption violations include: performing functional testing rather than testing to an operational profile, varying test effort due to holidays and vacations, imperfect repair and introduction of new errors, calendar time instead of execution time, defect reports instead of failure reports, partial or complete scrubbing of duplicate defect reports, and failure intervals that are not statistically independent of each other.
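
To illustrate the fit-then-check loop in Goel's procedure, here is a hedged sketch that fits one concave and one S-shaped SRGM to the same cumulative failure data and compares their goodness of fit; the models, data, and sum-of-squares criterion are illustrative assumptions, not the paper's selection criteria:

    # Sketch: fit two candidate SRGMs and compare goodness of fit before
    # trusting either model's estimates (models and data are illustrative).
    import numpy as np
    from scipy.optimize import curve_fit

    def concave(t, a, b):
        # Goel-Okumoto (concave) mean value function.
        return a * (1.0 - np.exp(-b * t))

    def s_shaped(t, a, b):
        # Delayed S-shaped mean value function (Yamada et al., 1983).
        return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

    def fit_and_score(model, weeks, cum):
        # Fit the model; return its parameters and the sum of squared errors.
        popt, _ = curve_fit(model, weeks, cum,
                            p0=(cum[-1] * 1.2, 0.1), maxfev=10000)
        sse = float(np.sum((cum - model(weeks, *popt)) ** 2))
        return popt, sse

    weeks = np.arange(1, 13)
    # Made-up data with a slow start, i.e. an S-shaped cumulative profile.
    cum = np.array([3, 8, 18, 33, 50, 64, 74, 80, 84, 86, 87, 88])

    for name, model in [("concave", concave), ("s-shaped", s_shaped)]:
        popt, sse = fit_and_score(model, weeks, cum)
        print(name, "a =", round(float(popt[0]), 1), "SSE =", round(sse, 1))

    # In Goel's procedure a formal GOF test (e.g., chi-square on grouped
    # counts) would replace this raw SSE comparison; the better-fitting
    # model is the candidate carried forward to the release decision.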
Data
The failure data come from three releases of a large medical record system, consisting
of 188 software components. Each component contains a number of files. Initially,
the software consisted of 173 software components. All three releases added functionality to the product. Over the three releases, 15 components were added. Between
three and seven new components were added in each release. Many other components
were modified in all three releases as a side effect of the added functionality.
Conclusion
Results show that the selection method based on empirical data works well in choosing an SRGM that predicts the number of failures. The selection method is robust in the sense that it is able to adjust to differences in the data. This enables it to differentiate between the models: different models were selected in different releases.
At least one model of those investigated is acceptable by the time testing is
complete enough to consider stopping and releasing the software. In the first and
third release, the S-shaped models performed well in predicting the total number of
failures. These two releases had defect data that exhibited an S-shape. The data in
Release 2 was concave, rather than S-shaped. It is no surprise that the S-shaped
models did not perform well on this data. The Yamada exponential model, however,
performed very well on the data from Release 2. (Other concave models underpredict
the total number of failures.)
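
For reference, the standard mean value functions of the models named above, as commonly given in the SRGM literature (e.g., in Wood, 1996; these forms are not reproduced from this paper), are:
\[
\begin{aligned}
\text{Goel--Okumoto (concave):} \quad & m(t) = a\bigl(1 - e^{-bt}\bigr) \\
\text{Delayed S-shaped:} \quad & m(t) = a\bigl(1 - (1 + bt)\,e^{-bt}\bigr) \\
\text{Yamada exponential:} \quad & m(t) = a\bigl(1 - e^{-r\alpha(1 - e^{-\beta t})}\bigr)
\end{aligned}
\]
where $a$ is the expected total number of failures and the remaining parameters govern the failure detection rate.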
SRGMs may provide good predictions of the total number of failures or the
number of remaining failures. Wood’s empirical study (Wood, 1996) has shown that
predictions from simple models of cumulative defects based on execution time correlate well with field data. In our study, predictions from simple models based on
calendar time correlate well with data from our environment. The empirical selection
method described in this paper helped in choosing an appropriate model. We would
like to caution, though, that this does not guarantee universal success. It is always
useful to employ complementary techniques for assessment, as, for example, in
Stringfellow (2000).