Paper Title

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Authors

Simmons, Andrew J., Barnett, Scott, Rivera-Villicana, Jessica, Bajaj, Akshat, Vasa, Rajesh

Abstract

Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow coding standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions from non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.
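
To make the Results concrete, the following is a minimal sketch of how functions with many parameters or local variables might be flagged in Python source. It is not the study's tooling: the thresholds `MAX_PARAMS` and `MAX_LOCALS`, the helper names `function_stats` and `flag_functions`, and the sample code are all assumptions chosen for illustration, since the abstract does not state which limits or tools were applied.

```python
import ast

# Hypothetical thresholds for illustration only; the abstract does not
# say which limits the study used.
MAX_PARAMS = 5
MAX_LOCALS = 15


def function_stats(source):
    """Return (name, n_params, n_locals) for every function in `source`.

    Simplification: "locals" here means the parameters plus every distinct
    name assigned anywhere inside the function body (nested functions
    included).
    """
    stats = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            params = [p.arg for p in a.posonlyargs + a.args + a.kwonlyargs]
            if a.vararg:
                params.append(a.vararg.arg)
            if a.kwarg:
                params.append(a.kwarg.arg)
            # Names that appear as assignment targets inside the function.
            assigned = {
                n.id
                for n in ast.walk(node)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
            }
            stats.append((node.name, len(params), len(assigned | set(params))))
    return stats


def flag_functions(source):
    """Names of functions that exceed either assumed threshold."""
    return [
        name
        for name, n_params, n_locals in function_stats(source)
        if n_params > MAX_PARAMS or n_locals > MAX_LOCALS
    ]


# Made-up sample: a training routine with many hyperparameters next to a
# small utility function.
SAMPLE = '''
def train(X, y, lr, epochs, batch_size, seed, verbose):
    w = 0
    return w

def add(a, b):
    return a + b
'''

if __name__ == "__main__":
    print(flag_functions(SAMPLE))
```

In this sample only `train` is flagged, because its seven parameters exceed the assumed limit of five. A function signature that exposes many hyperparameters is the kind of pattern the abstract's conjecture alludes to: a convention written for traditional software may fit Data Science code poorly.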