Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins

Title

Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins

Authors

Medina-Ortiz, David; Contreras, Sebastian; Amado-Hinojosa, Juan; Torres-Almonacid, Jorge; Asenjo, Juan A.; Navarrete, Marcelo; Olivera-Nappa, Álvaro

Abstract

Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering: by knowing the effect that substituting one (or several) residues in a protein's sequence has on its overall properties, one could design a variant with a desirable function. New strategies and methodologies for creating predictive models are continually being developed. However, those that claim to be general often do not reach adequate performance, and those that target a particular task improve their predictive performance at the cost of the method's generality. Moreover, these approaches typically require a specific choice of how to encode the amino acid sequence, and there is no explicit methodological agreement on how to make that choice. To address these issues, in this work we applied clustering, embedding, and dimensionality reduction techniques to the AAIndex database to select meaningful combinations of physicochemical properties for the encoding stage. We then used the chosen sets of properties to obtain several encodings of the same sequence and subsequently applied the Fast Fourier Transform (FFT) to them. Next, we explored machine-learning models in the frequency space, using different algorithms and hyperparameters. Finally, we selected the best-performing predictive models for each set of properties and created an assembled model. We extensively tested the proposed methodology on different datasets and demonstrated that the generated assembled model achieved notably better performance metrics than models based on a single encoding and, in most cases, better than those previously reported. The proposed method is available as a Python library for non-commercial use under the GNU General Public License (GPLv3).
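
The encoding and FFT steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' library or their selected property sets: the Kyte-Doolittle hydropathy scale stands in for an AAIndex-derived property, and the function names, example sequences, and zero-padding length are all hypothetical choices made for illustration.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale, used here only as a stand-in for
# a physicochemical property chosen from AAIndex (assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale, length):
    """Map each residue to its property value and zero-pad to `length`."""
    values = np.array([scale[res] for res in sequence], dtype=float)
    padded = np.zeros(length)
    padded[: len(values)] = values
    return padded


def fft_spectrum(signal):
    """Return the magnitude spectrum of a real-valued encoded sequence."""
    return np.abs(np.fft.rfft(signal))


# Hypothetical wild-type sequence and a single-point variant (R -> L).
wild_type = "MKTAYIAKQR"
mutant = "MKTAYIAKQL"
length = 16  # padding length chosen arbitrarily for this sketch

spec_wt = fft_spectrum(encode(wild_type, KYTE_DOOLITTLE, length))
spec_mut = fft_spectrum(encode(mutant, KYTE_DOOLITTLE, length))

# Each spectrum is a fixed-length feature vector that a regression or
# classification model could be trained on.
print(spec_wt.shape, spec_mut.shape)
```

Repeating this with several property scales would give one spectrum per encoding of the same sequence, which is the kind of input the abstract describes for training one model per property set before combining them.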
