07-09-2017, 09:36 AM
Bug tracking systems are important tools that guide the maintenance activities of software developers. The utility of these systems is hampered by an excessive number of duplicate bug reports: in some projects, as many as a quarter of all reports are duplicates. Developers must identify duplicate bug reports manually, but this process is time-consuming and adds to the already high cost of software maintenance. We propose a system that automatically classifies duplicate bug reports as they arrive, saving developer time. The system uses surface features, textual semantics, and graph clustering to predict duplicate status. Using a data set of 29,000 bug reports from the Mozilla project, we perform experiments that include a simulation of a real-time bug reporting environment. Our system is able to reduce development cost by filtering out 8% of duplicate bug reports, while still allowing at least one report of each real defect to reach developers.
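To make the idea of filtering reports as they arrive more concrete, here is a minimal sketch of the textual-similarity part only. It is not the system described above: the function names, the bag-of-words cosine measure, and the 0.6 threshold are all assumptions chosen for illustration, and the surface-feature and clustering components are left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_likely_duplicate(new_report, known_reports, threshold=0.6):
    """Return True if new_report is textually close to any known report.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    new_vec = Counter(tokenize(new_report))
    return any(
        cosine_similarity(new_vec, Counter(tokenize(old))) >= threshold
        for old in known_reports
    )


def filter_stream(reports, threshold=0.6):
    """Process reports in arrival order and keep only those that do not
    resemble a report that was already passed on."""
    passed = []
    for report in reports:
        if not is_likely_duplicate(report, passed, threshold):
            passed.append(report)
    return passed
```

In this sketch a report is dropped only when a similar report has already been passed on, so the first report in each group of similar reports always gets through. That mirrors, in a very simplified way, the requirement that at least one report of each real defect reaches developers.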