Paper title
Portability of Scientific Workflows in NGS Data Analysis: A Case Study
Authors
Abstract
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtimes. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or to another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system Snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using Hadoop. The purpose of this port was to make the workflow usable for owners of low-cost hardware infrastructures, the kind of hardware for which Hadoop was designed. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and more of them unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).