Aggregation of Stack Trace Similarities for Crash Report Deduplication

Paper title

Aggregation of Stack Trace Similarities for Crash Report Deduplication

Authors

Nikolay Karasov, Aleksandr Khvorov, Roman Vasiliev, Yaroslav Golubev, Timofey Bryksin

Abstract

The automatic collection of stack traces in bug tracking systems is an integral part of many software projects and their maintenance. However, such reports often contain a lot of duplicates, and the problem of de-duplicating them into groups arises. In this paper, we propose a new approach to solve the deduplication task and report on its use on the real-world data from JetBrains, a leading developer of IDEs and other software. Unlike most of the existing methods, which assign the incoming stack trace to a particular group in which a single most similar stack trace is located, we use the information about all the calculated similarities to the group, as well as the information about the timestamp of the stack traces. This approach to aggregating all available information shows significantly better results compared to existing solutions. The aggregation improved the results over the state-of-the-art solutions by 15 percentage points in the Recall Rate Top-1 metric on the existing NetBeans dataset and by 8 percentage points on the JetBrains data. Additionally, we evaluated a simpler k-Nearest Neighbors approach to aggregation and showed that it cannot reach the same levels of improvement. Finally, we studied what features from the aggregation contributed the most towards better quality to understand which of them to develop further. We publish the implementation of the suggested approach, and will release the newly collected industrial dataset upon acceptance to facilitate further research in the area.
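The contrast the abstract draws, between assigning a trace to the group holding its single most similar stack trace and using all similarities to each group, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `(group_id, score)` input format, and the top-k mean are hypothetical and are not the authors' implementation. The top-k mean stands in for the simpler k-Nearest-Neighbors-style aggregation the abstract mentions; the proposed aggregation and the timestamp information are not modeled.

```python
from collections import defaultdict


def assign_by_max(similarities):
    """Baseline: pick the group that contains the single most similar trace.

    `similarities` is a list of (group_id, score) pairs, one per stored trace.
    """
    best_group, _ = max(similarities, key=lambda pair: pair[1])
    return best_group


def assign_by_topk_mean(similarities, k=3):
    """Illustrative aggregation: score each group by the mean of its k best scores."""
    by_group = defaultdict(list)
    for group_id, score in similarities:
        by_group[group_id].append(score)

    def group_score(scores):
        top = sorted(scores, reverse=True)[:k]
        return sum(top) / len(top)

    return max(by_group, key=lambda g: group_score(by_group[g]))


# Hypothetical similarity scores of one incoming trace to six stored traces.
sims = [
    ("A", 0.90), ("A", 0.20), ("A", 0.10),
    ("B", 0.85), ("B", 0.80), ("B", 0.75),
]

print(assign_by_max(sims))        # group with the single best match
print(assign_by_topk_mean(sims))  # group with the best aggregated score
```

In this toy input, group A holds one very close trace but is otherwise dissimilar, while group B is consistently close. The two rules therefore disagree, which is the kind of case where using every similarity to a group can change the assignment.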
