25-08-2017, 09:32 PM
Automated Duplicate Detection for Bug Tracking Systems
Abstract
Bug tracking systems are important tools that guide the
maintenance activities of software developers. The utility of
these systems is hampered by an excessive number of duplicate
bug reports; in some projects, as many as a quarter of
all reports are duplicates. Developers must manually identify
duplicate bug reports, but this identification process is
time-consuming and exacerbates the already high cost of
software maintenance. We propose a system that automatically
classifies duplicate bug reports as they arrive to save
developer time. This system uses surface features, textual
semantics, and graph clustering to predict duplicate status.
Using a dataset of 29,000 bug reports from the Mozilla
project, we perform experiments that include a simulation
of a real-time bug reporting environment. Our system is
able to reduce development cost by filtering out 8% of duplicate
bug reports while allowing at least one report for
each real defect to reach developers.
Introduction
As software projects become increasingly large and
complex, it becomes more difficult to properly verify code
before shipping. Maintenance activities [12] account for
over two thirds of the life cycle cost of a software system
[3], totaling some $70 billion per year in the
United States [16]. Software maintenance is critical to software
dependability, and defect reporting is critical to modern
software maintenance.
Many software projects rely on bug reports to direct corrective
maintenance activity [1]. In open source software
projects, bug reports are often submitted by software users
or developers and collected in a database by one of several
bug tracking tools. Allowing users to report and potentially
help fix bugs is assumed to improve software quality
overall [13]. Bug tracking systems allow users to report,
describe, track, classify and comment on bug reports and
feature requests.
Motivating Example
Duplicate bug reports are such a problem in practice that
many projects have special guidelines and websites devoted
to them. The “Most Frequently Reported Bugs” page of the
Mozilla Project’s Bugzilla bug tracking system is one such
example. This webpage tracks the number of bug reports
with known duplicates and displays the most commonly reported
bugs. Ten bug equivalence classes have over 100
known duplicates and over 900 other equivalence classes
have more than 10 known duplicates each. All of these duplicates
had to be identified by hand and represent time developers
spent administering the bug report database and
performing triage rather than actually addressing defects.
Related Work
In previous work we presented a model of defect report
quality based only on surface features [6]. That model
predicted whether a bug would be triaged within a given
amount of time. This paper adopts a more semantically-rich
model, including textual information and machine learning
approaches, and is concerned with detecting duplicates
rather than predicting the final status of non-duplicate reports.
In addition, our previous work suffered from false
positives and would occasionally filter away all reports for
a given defect. The technique presented here suffers from
no such false positives in practice on a larger dataset.
Anvik et al. present a system that automatically assigns
bug reports to an appropriate human developer using text
categorization and support vector machines. They claim
that their system could aid a human triager by recommending
a set of developers for each incoming bug report
[2]. Their method correctly suggests appropriate developers
with 64% precision for Firefox, although their
datasets were smaller than ours (e.g., 10,000 Firefox reports)
and their learned model did not generalize well to
other projects. They build on previous approaches to automated
bug assignment with lower precision levels [4, 5].
Our approach is orthogonal to theirs and both might be gainfully
employed together: first our technique filters out potential
duplicates, and then the remaining real bug reports
are assigned to developers using their technique. Anvik et
al. [1] also report preliminary results for duplicate detection
using a combination of cosine similarity and top lists; their
method requires human intervention and incorrectly filtered
out 10% of non-duplicate bugs on their dataset.
Textual Analysis
Bug reports include free-form textual descriptions and titles,
and most duplicate bug reports share many of the same
words. Our first step is to define a textual distance metric
for use on titles and descriptions. We use this metric as a
key component in our identification of duplicates.
We adopt a “bag of words” approach when defining similarity
between textual data. Each text is reduced to its words and
their frequencies; positional information is not retained.
Since orderings are not preserved, some potentially important
semantic information is not available for later
use. The benefit gained is that the size of the representation
grows at most linearly with the size of the description. This
reduces processing load and is thus desirable for a real-time
system.
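As a concrete illustration, the bag-of-words representation and a textual similarity metric over it can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: the function names are hypothetical, and cosine similarity (one common choice, mentioned in the Related Work discussion of Anvik et al.) stands in for whatever weighting the full system uses.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Reduce a report's title or description to word frequencies.
    Positional (ordering) information is deliberately discarded,
    so the representation grows at most linearly with the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bags of words: the dot product
    of the frequency vectors divided by the product of their norms.
    Returns a value in [0, 1]; 1 means identical word distributions."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical reports of the same defect share many words,
# so their similarity is high despite different phrasing.
r1 = bag_of_words("browser crashes when opening a new tab")
r2 = bag_of_words("crash when a new tab is opened in the browser")
print(cosine_similarity(r1, r2))
```

Because only word counts are stored, comparing an incoming report against existing ones requires no parsing or positional indexing, which is what keeps the approach cheap enough for a real-time setting.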
Conclusion
We propose a system that automatically classifies duplicate
bug reports as they arrive to save developer time. This
system uses surface features, textual semantics, and graph
clustering to predict duplicate status. We empirically evaluated
our approach using a dataset of 29,000 bug reports
from the Mozilla project, a larger dataset than has generally
previously been reported. We show that inverse document
frequency is not useful in this task, and we simulate using
our model as a filter in a real-time bug reporting environment.
Our system is able to reduce development cost by
filtering out 8% of duplicate bug reports. It still allows at
least one report for each real defect to reach developers, and
spends only 20 seconds per incoming bug report to make
a classification. Thus, a system based upon our approach
could realistically be implemented in a production environment
with little additional effort and a potentially significant
payoff.