RNA组件流量分解的安全性和完整性

论文标题

RNA组件流量分解的安全性和完整性

Safety and Completeness in Flow Decompositions for RNA Assembly

论文作者

Khan, Shahbaz, Kortelainen, Milla, Cáceres, Manuel, Williams, Lucia, Tomescu, Alexandru I.

论文摘要

将网络流动到加权路径中有许多应用。某些应用需要任何最佳W.R.T.的分解。某些属性，例如路径数量，稳健性或长度。许多生物信息学应用程序需要特定的分解，其中路径对应于产生流量的一些基本数据。对于实际输入，没有优化标准可以保证唯一确定正确的分解。因此，我们建议报告安全路径，即在每个流量分解中至少有一个路径的子路径。 MA，Zheng和Kingsford [Wabi 2020]解决了概率框架中的多个最佳解决方案，即非识别性。后来[Repomb 2021]，他们根据解决一个称为和量化的问题的全球标准提供了二次时算法，该标准概括了报告给定路径是否安全的问题。我们给出了第一个局部表征在有向无环图（DAG）中流动分解的安全路径，从而导致了一种实用算法，用于查找完整的安全路径集。我们评估了针对琐碎的安全算法（Unitigs，扩展的Unitigs）的算法，并评估了RNA转录本数据集中流量分解的普遍使用的启发式（贪婪宽度）。尽管保持了完美的精确度，但我们的算法报告的覆盖范围（$ \ 50 \％$ $ $ $ $ $）都比微不足道的安全算法更高。贪婪的宽度算法尽管报告了更好的覆盖范围，但在复杂图上的精度明显降低。总体而言，当数据集具有相当数量的复杂图时，我们的算法在统一度量（F-SCORE）上胜过（$ \％$ $ 20 \％$）。此外，它具有较高的时间（$ 3-5 \ times $）和空间效率（$ 1.2-2.2 \ times $），从而为流量分解的生物信息学应用提供了更好，更实用的方法。

Decomposing a network flow into weighted paths has numerous applications. Some applications require any decomposition that is optimal w.r.t. some property such as number of paths, robustness, or length. Many bioinformatic applications require a specific decomposition where the paths correspond to some underlying data that generated the flow. For real inputs, no optimization criteria guarantees to uniquely identify the correct decomposition. Therefore, we propose to report safe paths, i.e., subpaths of at least one path in every flow decomposition. Ma, Zheng, and Kingsford [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, i.e., non-identifiability. Later [RECOMB 2021], they gave a quadratic-time algorithm based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We give the first local characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the complete set of safe paths. We evaluated our algorithms against the trivial safe algorithms (unitigs, extended unitigs) and the popularly used heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. Despite maintaining perfect precision our algorithm reports significantly higher coverage ($\approx 50\%$ more) than trivial safe algorithms. The greedy-width algorithm though reporting a better coverage, has significantly lower precision on complex graphs. Overall, our algorithm outperforms (by $\approx 20\%$) greedy-width on a unified metric (F-Score) when the dataset has significant number of complex graphs. Moreover, it has superior time ($3-5\times$) and space efficiency ($1.2-2.2\times$), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.

下载PDF全文

下载文献需遵守相关版权规定

论文标题