ProDOMA: improve PROtein DOMAin classification for third-generation sequencing reads using deep learning

Paper title

ProDOMA: improve PROtein DOMAin classification for third-generation sequencing reads using deep learning

Authors

Nan, Du; Shang, Jiayu; Sun, Yanni

Abstract

Motivation: With the development of third-generation sequencing technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in third-generation sequencing data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in third-generation sequencing data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads.

Results: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for third-generation sequencing reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject unrelated DNA reads such as those from noncoding regions. In the experiments on simulated reads of protein coding sequences and real reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.

Availability: The source code and the trained model are freely available at https://github.com/strideradu/ProDOMA.

Contact: [email protected]
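To make the idea of 3-frame translation concrete, here is a minimal Python sketch of translating a DNA read in its three forward reading frames with the standard genetic code. It is an illustration only and is not taken from the ProDOMA source code: the function name `three_frame_translate`, the use of `X` for codons containing ambiguous bases, and the toy reads are all assumptions made for this example. How the three translations are then encoded and fed to the neural network is not described in the abstract and is not shown here.

```python
# Illustrative sketch only; not the ProDOMA implementation.
# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def three_frame_translate(read):
    """Translate a DNA read in forward frames 0, 1 and 2.

    Codons containing characters other than A/C/G/T (for example N)
    are written as 'X' (an assumption of this sketch). Trailing bases
    that do not fill a whole codon are dropped.
    """
    read = read.upper()
    frames = []
    for offset in range(3):
        residues = []
        for i in range(offset, len(read) - 2, 3):
            residues.append(CODON_TABLE.get(read[i:i + 3], "X"))
        frames.append("".join(residues))
    return frames


if __name__ == "__main__":
    clean = "ATGGCCAAATAA"    # toy error-free read
    noisy = "ATGTGCCAAATAA"   # same read with one inserted base (T) after ATG
    print(three_frame_translate(clean))
    print(three_frame_translate(noisy))
```

In the toy example, frame 0 of the error-free read translates to `MAK*`. After the single-base insertion, frame 0 no longer matches past the first residue, but frame 1 contains the downstream residues `AK*`. This is the sense in which each frame of a noisy read can be a "partially correct" translation, and why looking at all three frames together can retain signal that a single frame loses.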
