Paper Title
Attribution Methods Reveal Flaws in Fingerprint-Based Virtual Screening
Authors
Abstract
Fingerprint-based models for protein-ligand binding have demonstrated outstanding success on benchmark datasets; however, these models may not learn the correct binding rules. To assess this concern, we use in silico datasets with known binding rules to develop a general framework for evaluating model attribution. This framework identifies the fragments that a model considers necessary to achieve a particular score, sidestepping the need for the model to be differentiable. Our results confirm that high-performing models may not learn the correct binding rules, and suggest concrete steps that can remedy this situation. We show that adding fragment-matched inactive molecules (decoys) to the data reduces attribution false negatives, while attribution false positives largely arise from the background correlation structure of molecular data. Normalizing for these background correlations helps to reveal the true binding logic. Our work highlights the danger of trusting attributions from high-performing models and suggests that a closer examination of fingerprint correlation structure and better decoy selection may help reduce misattributions.
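The idea of finding fragments that a model treats as necessary for a given score, without requiring gradients, can be illustrated with a small sketch. This is not the authors' framework, only a simple single-bit ablation under stated assumptions: fingerprints are binary vectors, the model is available as a black-box scoring function, and the names `necessary_bits` and `toy_model` as well as the toy binding rule are hypothetical.

```python
from typing import Callable, List, Sequence

Fingerprint = Sequence[int]  # binary vector; 1 means the fragment is present


def necessary_bits(
    score: Callable[[Fingerprint], float],
    fingerprint: Fingerprint,
    threshold: float,
) -> List[int]:
    """Return indices of set bits whose removal drops the score below threshold.

    The model is only queried through `score`, so no gradients are needed.
    """
    if score(fingerprint) < threshold:
        return []  # the molecule does not reach the target score to begin with
    needed = []
    for i, bit in enumerate(fingerprint):
        if not bit:
            continue
        perturbed = list(fingerprint)
        perturbed[i] = 0  # ablate a single fragment
        if score(perturbed) < threshold:
            needed.append(i)
    return needed


# Hypothetical synthetic binding rule: active iff fragments 0 AND 2 are present.
def toy_model(fp: Fingerprint) -> float:
    return 1.0 if fp[0] and fp[2] else 0.0


result = necessary_bits(toy_model, [1, 1, 1, 0], threshold=0.5)
print(result)  # [0, 2]
```

Because the binding rule of the toy model is known, the attributed bits can be compared directly against it: a rule fragment missing from the result would be an attribution false negative, and a reported bit outside the rule would be an attribution false positive.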