diBELLA: Distributed Long Read to Long Read Alignment

Paper Title

diBELLA: Distributed Long Read to Long Read Alignment

Authors

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick

Abstract

We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed-memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high-level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.
