Paper Title

On the Replicability and Reproducibility of Deep Learning in Software Engineering

Paper Authors

Liu, Chao; Gao, Cuiyun; Xia, Xin; Lo, David; Grundy, John; Yang, Xiaohu

Paper Abstract

Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years. This is because they can often solve many SE challenges without enormous manual feature engineering effort and complex domain knowledge. Although many DL studies have reported substantial advantages over other state-of-the-art models in effectiveness, they often ignore two factors: (1) replicability - whether the reported experimental result can be approximately reproduced with high probability using the same DL model and the same data; and (2) reproducibility - whether the reported experimental findings can be reproduced by new experiments with the same experimental protocol and DL model, but with differently sampled real-world data. Unlike traditional machine learning (ML) models, DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity with many manually set parameters and the time-consuming optimization process. In this study, we conducted a literature review on 93 DL studies recently published in twenty SE journals or conferences. Our statistics show the urgency of investigating these two factors in SE. Moreover, we re-ran four representative DL models in SE. Experimental results show the importance of replicability and reproducibility, where the reported performance of a DL model could not be replicated due to an unstable optimization process. Reproducibility could be substantially compromised if model training does not converge, or if performance is sensitive to the size of the vocabulary and testing data. It is therefore urgent for the SE community to provide long-lasting links to replication packages, enhance the stability and convergence of DL-based solutions, and avoid performance sensitivity to differently sampled data.
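A minimal sketch (not from the paper) of how the two checks described in the abstract might be operationalized: a replicability check re-runs the same model on the same data with different random seeds, while a reproducibility check keeps the protocol fixed but resamples the data. It uses scikit-learn's MLPClassifier as a stand-in for a DL model and synthetic data in place of real SE datasets; all model and data parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def run_once(X, y, seed):
    """Train once under a fixed protocol and return the F1 score on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    model.fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))

# Replicability check: same data, repeated runs that differ only in the random seed.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
replication_scores = [run_once(X, y, seed) for seed in range(5)]
print("replicability spread (std of F1 across seeds):", np.std(replication_scores))

# Reproducibility check: same protocol and model, but differently sampled data each time.
reproduction_scores = []
for sample_seed in range(5):
    Xs, ys = make_classification(n_samples=2000, n_features=20, random_state=sample_seed)
    reproduction_scores.append(run_once(Xs, ys, seed=0))
print("reproducibility spread (std of F1 across samples):", np.std(reproduction_scores))
```

A large spread in the first check would mirror the paper's finding that an unstable optimization process prevents replication; a large spread in the second would mirror sensitivity to how the data were sampled.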
