论文标题
利用针对性药物设计的预识别的生化语言模型
Exploiting Pretrained Biochemical Language Models for Targeted Drug Design
论文作者
论文摘要
动机:针对感兴趣的蛋白质的新颖化合物的发展是制药行业中最重要的任务之一。深层生成模型已应用于靶向分子设计,并显示出令人鼓舞的结果。最近,靶标特异性分子的产生被视为蛋白质语言与化学语言之间的翻译。但是,这种模型受到相互作用蛋白质配对的可用性的限制。另一方面,可以使用大量未标记的蛋白质序列和化学化合物,并用于训练学习有用表示的语言模型。在这项研究中,我们提出了利用预审核的生化语言模型以初始化(即温暖的开始)靶向分子产生模型。我们研究了两种温暖的开始策略:(i)一种单阶段策略,其中初始化模型是针对目标分子生成(ii)的两阶段策略进行培训的,该策略包含对分子生成的预处理,然后进行目标特定训练。我们还比较了两种解码策略来生成化合物:梁搜索和采样。 结果:结果表明,温暖启动的模型的性能优于从头开始训练的基线模型。相对于基准广泛使用的指标,这两种拟议的温暖启动策略相互取得了相似的结果。但是,对许多新型蛋白质生成的化合物进行对接评估表明,单阶段策略比两阶段策略更好地推广。此外,我们观察到,在对接评估和基准测量标准中,梁搜索的表现优于采样,用于评估复合质量。 可用性和实施:源代码可在https://github.com/boun-tabi/biochemical-lms-for-drug-design和材料中列为Zenodo,网址为https://doi.org/10.5281/zenodo.6832145.6832145
Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm start strategies: (i) a one-stage strategy where the initialized model is trained on targeted molecule generation (ii) a two-stage strategy containing a pre-finetuning on molecular generation followed by target specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145