Paper title
A smile is all you need: Predicting limiting activity coefficients from SMILES with natural language processing
Paper authors
Paper abstract
Knowledge of mixtures' phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. For an accurate and efficient prediction of activity coefficients, machine learning approaches have recently been developed. However, current machine learning approaches still extrapolate poorly for activity coefficients of unknown molecules. In this work, we introduce the SMILES-to-Properties-Transformer (SPT), a natural language processing network to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS and UNIFAC, and improving on recent machine learning approaches.
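To make the idea of treating SMILES codes as natural-language input more concrete, the following is a minimal sketch of how a solute/solvent pair might be turned into a token sequence for a transformer. The abstract does not describe the tokenization or input layout used by SPT, so everything here is an assumption for illustration: the function names `tokenize_smiles` and `encode_pair`, the regular expression, and the special tokens `<SOS>`, `<SEP>`, and `<EOS>` are all hypothetical.

```python
import re
from typing import List

# Hypothetical token pattern (not taken from the paper): bracket atoms such
# as [N+], the two-letter halogens Br and Cl, two-digit ring closures such
# as %12, and otherwise any single character.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into a list of tokens."""
    return _SMILES_TOKEN.findall(smiles)


def encode_pair(solute: str, solvent: str) -> List[str]:
    """Join solute and solvent tokens into one sequence.

    The <SOS>/<SEP>/<EOS> markers are assumed placeholders; the actual
    input layout of SPT is not given in the abstract.
    """
    return (
        ["<SOS>"]
        + tokenize_smiles(solute)
        + ["<SEP>"]
        + tokenize_smiles(solvent)
        + ["<EOS>"]
    )


if __name__ == "__main__":
    # Ethanol as the solute at infinite dilution in water as the solvent.
    print(encode_pair("CCO", "O"))
```

In a setup like the one the abstract describes, such a sequence would be mapped to integer ids and fed to the network, first while training on the COSMO-RS samples and then again during fine-tuning on the experimental data.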