Paper title
A semi-supervised learning framework for quantitative structure-activity regression modelling
Paper authors
Paper abstract
Supervised learning models, also known as quantitative structure-activity regression (QSAR) models, are increasingly used to assist preclinical small-molecule drug discovery. The models are trained on data consisting of finite-dimensional representations of molecular structures and their corresponding target-specific activities. They can then be used to predict the activity of novel, previously unmeasured compounds. In this work we address two problems with this approach. The first is to estimate the extent to which the quality of model predictions degrades for compounds that are very different from those in the training data. The second is to adjust for the screening-dependent selection bias inherent in many training data sets; in the most extreme cases, only compounds that pass an activity-dependent screen are reported. Using a semi-supervised learning framework, we show that it is possible to make predictions that take into account the similarity of the test compounds to those in the training data and that adjust for the reporting selection bias. We illustrate this approach using publicly available structure-activity data on a large set of compounds reported by GlaxoSmithKline (the Tres Cantos AntiMalarial Set) to inhibit in vitro P. falciparum growth.