Paper Title

OntoProtein: Protein Pretraining With Gene Ontology Embedding

Authors

Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, Huajun Chen

Abstract

Self-supervised protein language models have proved effective at learning protein representations. With increasing computational power, current protein language models pre-trained on millions of diverse sequences can scale from millions to billions of parameters and achieve remarkable improvements. However, these prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that the informative biological knowledge in KGs can enhance protein representations. In this work, we propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, in which every node is described by gene annotation text or a protein sequence. We propose a novel contrastive learning approach with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embeddings during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models on the TAPE benchmark and yield better performance than baselines in protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.
