Paper title
An Empirical Study on Data Leakage and Generalizability of Link Prediction Models for Issues and Commits
Paper authors
Paper abstract
To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches have primarily focused on improving prediction accuracy on randomly split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on both textual and metadata information of issues and commits. Next, to gauge the effect of time on model performance, we employ two splitting policies during both the training and testing phases: randomly and temporally split datasets. Finally, in pursuit of a generic model that can demonstrate high performance across a range of projects, we undertake additional fine-tuning of LinkFormer within two distinct transfer-learning settings. Our findings indicate that, to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data when training models. Furthermore, the results demonstrate that LinkFormer outperforms existing methodologies by a significant margin, achieving a 48% improvement in F1-measure within a project-based setting. Finally, the performance of LinkFormer in the cross-project setting is comparable to its average performance within the project-based scenario.
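The difference between the two splitting policies mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the splits are implemented, so everything below is an assumption for illustration only: the record fields (`issue_id`, `commit_id`, `created_at`), the function names, and the 80/20 ratio are hypothetical and are not taken from LinkFormer.

```python
import random
from datetime import datetime, timedelta


def random_split(records, test_ratio=0.2, seed=0):
    """Shuffle, then cut. Test links may predate training links,
    which is one way information from the future can leak into training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records, test_ratio=0.2):
    """Sort by timestamp, then cut. No test link is older than
    any training link, so the temporal flow of the data is preserved."""
    ordered = sorted(records, key=lambda r: r["created_at"])
    cut = len(ordered) - int(len(ordered) * test_ratio)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: ten issue-commit links, created one day apart.
start = datetime(2023, 1, 1)
links = [
    {"issue_id": i, "commit_id": f"c{i}", "created_at": start + timedelta(days=i)}
    for i in range(10)
]

train, test = temporal_split(links)
latest_train = max(r["created_at"] for r in train)
earliest_test = min(r["created_at"] for r in test)
print(latest_train <= earliest_test)  # the temporal ordering holds by construction

rand_train, rand_test = random_split(links)
print(len(rand_train), len(rand_test))
```

With a random split, nothing prevents a model from being trained on links created after the ones it is evaluated on. The temporal split rules this out by construction, which is the property the abstract argues is needed to simulate real-world use.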