Paper title
World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS data
Paper authors
Paper abstract
Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data spanning the entire FLOSS ecosystem, named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and the history of the FLOSS ecosystem, and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development, leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.