Paper title
File-based localization of numerical perturbations in data analysis pipelines
Paper authors
Paper abstract
Data analysis pipelines are known to be impacted by computational conditions, presumably due to the creation and propagation of numerical errors. While this process could play a major role in the current reproducibility crisis, the precise causes of such instabilities and the path along which they propagate in pipelines are unclear. We present Spot, a tool to identify which processes in a pipeline create numerical differences when executed in different computational conditions. Spot leverages system-call interception through ReproZip to reconstruct and compare provenance graphs without pipeline instrumentation. By applying Spot to the structural pre-processing pipelines of the Human Connectome Project, we found that linear and non-linear registration are the cause of most numerical instabilities in these pipelines, which confirms previous findings.
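The file-based localization idea described in the abstract can be illustrated with a minimal sketch. It assumes that a provenance capture step has already produced, for each process, the files it read and wrote, and that file checksums are available for two computational conditions. The record structure, function names, and toy file names below are hypothetical and are not Spot's or ReproZip's API. The classification rule (identical inputs but differing outputs means the process creates a difference) is an assumption drawn from the abstract's wording.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ProcessRecord:
    """Hypothetical provenance record: one process and the files it touched."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def files_that_differ(checksums_a: Dict[str, str],
                      checksums_b: Dict[str, str]) -> Set[str]:
    """Return paths whose checksums differ (or are missing) between conditions."""
    paths = set(checksums_a) | set(checksums_b)
    return {p for p in paths if checksums_a.get(p) != checksums_b.get(p)}


def localize(processes: List[ProcessRecord],
             checksums_a: Dict[str, str],
             checksums_b: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split processes with differing outputs into creators and propagators.

    Assumed rule: a process whose inputs all match across conditions but whose
    outputs differ creates a difference; a process with differing inputs and
    differing outputs merely propagates one.
    """
    differing = files_that_differ(checksums_a, checksums_b)
    creators, propagators = [], []
    for proc in processes:
        if not any(f in differing for f in proc.outputs):
            continue  # outputs identical in both conditions: nothing to report
        if any(f in differing for f in proc.inputs):
            propagators.append(proc.name)
        else:
            creators.append(proc.name)
    return creators, propagators


# Toy three-step pipeline (file and process names are made up).
pipeline = [
    ProcessRecord("bias_correct", ("t1.nii",), ("t1_bc.nii",)),
    ProcessRecord("linear_registration", ("t1_bc.nii",), ("t1_reg.nii",)),
    ProcessRecord("segment", ("t1_reg.nii",), ("seg.nii",)),
]
condition_a = {"t1.nii": "aa", "t1_bc.nii": "bb", "t1_reg.nii": "cc", "seg.nii": "dd"}
condition_b = {"t1.nii": "aa", "t1_bc.nii": "bb", "t1_reg.nii": "c2", "seg.nii": "d2"}

creators, propagators = localize(pipeline, condition_a, condition_b)
print("creates differences:", creators)
print("propagates differences:", propagators)
```

In this toy run, the registration step reads identical inputs in both conditions yet writes a differing output, so it is flagged as the source, while the segmentation step only inherits the difference. Comparing raw checksums is a simplification: a real comparison might need to ignore file metadata such as embedded timestamps.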