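Revving up 13C NMR shielding predictions across chemical space: Benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules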

Paper title

Revving up 13C NMR shielding predictions across chemical space: Benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules

Authors

Amit Gupta, Sabyasachi Chakraborty, Raghunathan Ramakrishnan

Abstract

Accelerated and quantitatively accurate screening of nuclear magnetic resonance spectra across the chemical compound space of small molecules has two requirements: (1) a robust 'local' machine learning (ML) strategy that captures the effect of an atom's neighbourhood on its 'near-sighted' property, the chemical shielding; (2) an accurate reference dataset, generated with a state-of-the-art first-principles method, for training. Herein we report the QM9-NMR dataset, comprising the isotropic shielding of over 0.8 million C atoms in the 134k molecules of the QM9 dataset, in the gas phase and in five common solvent phases. Using these data for training, we present benchmark results for the prediction transferability of kernel-ridge regression models with popular local descriptors. Our best model, trained on 100k samples, accurately predicts the isotropic shielding of 50k 'hold-out' atoms with a mean error of less than $1.9$ ppm. For rapid prediction on new query molecules, the models were trained on geometries from an inexpensive theory. Furthermore, by using a $\Delta$-ML strategy, we quench the error below $1.4$ ppm. Finally, we test the transferability on non-trivial benchmark sets that include molecules comprising 10 to 17 heavy atoms and drugs.
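The two modelling ideas named in the abstract, kernel-ridge regression and $\Delta$-ML, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, the 4-dimensional random vectors stand in for local atomic descriptors, and the Laplacian kernel, the kernel width `sigma`, and the regularizer `lam` are arbitrary choices. The "direct" model learns the target value itself, while the $\Delta$ model learns only the difference between the target and a cheap baseline, then adds the baseline back at prediction time.

```python
import numpy as np


def laplacian_kernel(A, B, sigma):
    """K[i, j] = exp(-||A_i - B_j||_1 / sigma) (kernel choice is an assumption)."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_query, sigma):
    """Prediction is a kernel-weighted sum over the training points."""
    return laplacian_kernel(X_query, X_train, sigma) @ alpha


# Synthetic stand-ins (hypothetical, not QM9-NMR data):
# X        -> local descriptor of each atom
# baseline -> value from an inexpensive method
# target   -> value from the expensive reference method
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
baseline = 100.0 + 20.0 * X[:, 0] + 5.0 * np.sin(3.0 * X[:, 1])
target = baseline + 2.0 * np.tanh(X[:, 1])

train, test = slice(0, 200), slice(200, 300)
sigma, lam = 5.0, 1e-6  # illustrative hyperparameters, not tuned

# Direct learning: fit the target itself.
alpha_direct = krr_fit(X[train], target[train], sigma, lam)
pred_direct = krr_predict(X[train], alpha_direct, X[test], sigma)

# Delta learning: fit only (target - baseline), then add the baseline back.
alpha_delta = krr_fit(X[train], (target - baseline)[train], sigma, lam)
pred_delta = baseline[test] + krr_predict(X[train], alpha_delta, X[test], sigma)

mae_direct = np.mean(np.abs(pred_direct - target[test]))
mae_delta = np.mean(np.abs(pred_delta - target[test]))
print(f"direct MAE: {mae_direct:.3f}   delta MAE: {mae_delta:.3f}")
```

In this toy setting the correction is a smaller and smoother function than the target, which is the usual reason a $\Delta$ model reaches a lower error for the same number of training points. The cost is that the baseline must also be computed for every query.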
