Paper Title
Computing Maximal Unique Matches with the r-index
Paper Authors
Abstract
In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatics contexts, such as short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the $r$-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the $r$-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the $r$-index, while preserving the space and time bounds. We add $O(r)$ additional samples of the longest common prefix (LCP) array, where $r$ is the number of equal-letter runs of the BWT, which permit the computation of the second-longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method uses up to 8 times less memory while being up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.
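To make the notion of a MUM concrete, the sketch below is a naive, quadratic-time illustration of Maximal Unique Matches between two strings: substrings occurring exactly once in each input and extendable in neither direction. It is purely didactic and bears no relation to the $r$-index-based streaming algorithm of the paper; the function name `mums` and its interface are invented for this example.

```python
from collections import defaultdict

def mums(s, t):
    """Naive illustration of Maximal Unique Matches (MUMs):
    substrings that occur exactly once in s and exactly once in t,
    and that cannot be extended left or right while remaining a
    match in both. NOT the r-index algorithm from the paper."""
    def occurrences(x):
        # positions of every substring of x
        occ = defaultdict(list)
        for i in range(len(x)):
            for j in range(i + 1, len(x) + 1):
                occ[x[i:j]].append(i)
        return occ

    occ_s, occ_t = occurrences(s), occurrences(t)
    # unique matches: exactly one occurrence in each string
    cands = {sub for sub, pos in occ_s.items()
             if len(pos) == 1 and len(occ_t.get(sub, [])) == 1}
    # keep only maximal candidates: a unique match that is a proper
    # substring of another unique match can be extended, so drop it
    return {c for c in cands if not any(c != d and c in d for d in cands)}
```

For example, `mums("abcX", "abcY")` yields `{"abc"}`: every shorter unique match such as `"ab"` is discarded because it extends to `"abc"`, and `"abcX"` does not occur in the second string. Real tools replace the quadratic substring enumeration with suffix-tree or BWT-based machinery to scale to genomes.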