25-08-2017, 09:32 PM
Automated Duplicate Detection for Bug Tracking Systems
Abstract
Bug tracking systems are important tools that guide the
maintenance activities of software developers. The utility of
these systems is hampered by an excessive number of duplicate
bug reports; in some projects, as many as a quarter of
all reports are duplicates. Developers must manually identify
duplicate bug reports, but this identification process is
time-consuming and exacerbates the already high cost of
software maintenance. We propose a system that automatically
classifies duplicate bug reports as they arrive to save
developer time. This system uses surface features, textual
semantics, and graph clustering to predict duplicate status.
Using a dataset of 29,000 bug reports from the Mozilla
project, we perform experiments that include a simulation
of a real-time bug reporting environment. Our system is
able to reduce development cost by filtering out 8% of duplicate
bug reports while allowing at least one report for
each real defect to reach developers.
Introduction
As software projects become increasingly large and
complex, it becomes more difficult to properly verify code
before shipping. Maintenance activities [12] account for
over two thirds of the life cycle cost of a software system
[3], totaling some $70 billion per year in the
United States [16]. Software maintenance is critical to software
dependability, and defect reporting is critical to modern
software maintenance.
Many software projects rely on bug reports to direct corrective
maintenance activity [1]. In open source software
projects, bug reports are often submitted by software users
or developers and collected in a database by one of several
bug tracking tools. Allowing users to report and potentially
help fix bugs is assumed to improve software quality
overall [13]. Bug tracking systems allow users to report,
describe, track, classify and comment on bug reports and
feature requests.
Motivating Example
Duplicate bug reports are such a problem in practice that
many projects have special guidelines and websites devoted
to them. The “Most Frequently Reported Bugs” page of the
Mozilla Project’s Bugzilla bug tracking system is one such
example. This webpage tracks the number of bug reports
with known duplicates and displays the most commonly reported
bugs. Ten bug equivalence classes have over 100
known duplicates and over 900 other equivalence classes
have more than 10 known duplicates each. All of these duplicates
had to be identified by hand and represent time developers
spent administering the bug report database and
performing triage rather than actually addressing defects.
Related Work
In previous work we presented a model of defect report
quality based only on surface features [6]. That model
predicted whether a bug would be triaged within a given
amount of time. This paper adopts a more semantically-rich
model, including textual information and machine learning
approaches, and is concerned with detecting duplicates
rather than predicting the final status of non-duplicate reports.
In addition, our previous work suffered from false
positives and would occasionally filter away all reports for
a given defect. The technique presented here suffers from
no such false positives in practice on a larger dataset.
Anvik et al. present a system that automatically assigns
bug reports to an appropriate human developer using text
categorization and support vector machines. They claim
that their system could aid a human triager by recommending
a set of developers for each incoming bug report
[2]. Their method correctly suggests appropriate developers
with 64% precision for Firefox, although their
datasets were smaller than ours (e.g., 10,000 Firefox reports)
and their learned model did not generalize well to
other projects. They build on previous approaches to automated
bug assignment with lower precision levels [4, 5].
Our approach is orthogonal to theirs and both might be gainfully
employed together: first our technique filters out potential
duplicates, and then the remaining real bug reports
are assigned to developers using their technique. Anvik et
al. [1] also report preliminary results for duplicate detection
using a combination of cosine similarity and top lists; their
method requires human intervention and incorrectly filtered
out 10% of non-duplicate bugs on their dataset.
Textual Analysis
Bug reports include free-form textual descriptions and titles,
and most duplicate bug reports share many of the same
words. Our first step is to define a textual distance metric
for use on titles and descriptions. We use this metric as a
key component in our identification of duplicates.
We adopt a “bag of words” approach when defining similarity
between textual data. Each text is reduced to its words and
their frequencies; positional information is not retained.
Since orderings are not preserved, some potentially important
semantic information is not available for later
use. The benefit gained is that the size of the representation
grows at most linearly with the size of the description. This
reduces processing load and is thus desirable for a real-time
system.
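As a concrete illustration, the bag-of-words representation and a textual similarity metric over it can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: the function names are hypothetical, and cosine similarity (one common choice, mentioned in the Related Work discussion of Anvik et al.) stands in for whatever weighting the full system uses.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Reduce a report's title or description to word frequencies.
    Positional (ordering) information is deliberately discarded,
    so the representation grows at most linearly with the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bags of words: the dot product
    of the frequency vectors divided by the product of their norms.
    Returns a value in [0, 1]; 1 means identical word distributions."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical reports of the same defect share many words,
# so their similarity is high despite different phrasing.
r1 = bag_of_words("browser crashes when opening a new tab")
r2 = bag_of_words("crash when a new tab is opened in the browser")
print(cosine_similarity(r1, r2))
```

Because only word counts are stored, comparing an incoming report against existing ones requires no parsing or positional indexing, which is what keeps the approach cheap enough for a real-time setting.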
Conclusion
We propose a system that automatically classifies duplicate
bug reports as they arrive to save developer time. This
system uses surface features, textual semantics, and graph
clustering to predict duplicate status. We empirically evaluated
our approach using a dataset of 29,000 bug reports
from the Mozilla project, a larger dataset than has generally
previously been reported. We show that inverse document
frequency is not useful in this task, and we simulate using
our model as a filter in a real-time bug reporting environment.
Our system is able to reduce development cost by
filtering out 8% of duplicate bug reports. It still allows at
least one report for each real defect to reach developers, and
spends only 20 seconds per incoming bug report to make
a classification. Thus, a system based upon our approach
could realistically be implemented in a production environment
with little additional effort and a potentially significant
payoff.