Materials Transformers Language Models for Generative Materials Design: a benchmark study

Paper title

Materials Transformers Language Models for Generative Materials Design: a benchmark study

Authors

Nihang Fu, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, Jianjun Hu

Abstract

Transformer language models pre-trained on large unlabeled corpora have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns of inorganic materials. Here we train a series of seven modern transformer language models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) using the expanded formulas of materials deposited in the ICSD, OQMD, and Materials Project databases. Six datasets, with or without samples that are not charge neutral or not electronegativity balanced, are used to benchmark performance and to uncover the generation biases of modern transformer models for the generative design of materials compositions. Our extensive experiments show that materials transformers based on causal language models can generate chemically valid materials compositions, of which as many as 97.54% are charge neutral and 91.40% are electronegativity balanced, an enrichment more than six times higher than that of a baseline pseudo-random sampling algorithm. These models also demonstrate high novelty, and their potential for new materials discovery is demonstrated by their ability to recover held-out materials. We also find that the properties of the generated samples can be tailored by training the models on selected training sets such as high-bandgap materials. Our experiments further show that each model has its own preferences in terms of the properties of the generated samples, and that running time complexity varies considerably across models. We have applied our materials transformer models to discover a set of new materials, validated using DFT calculations.
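
To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes that an "expanded formula" repeats each element symbol once per atom (for example, SrTiO3 becoming `Sr Ti O O O`), and it checks charge neutrality by brute force over a small, hand-picked oxidation-state table. The table `OXIDATION_STATES` and the helpers `parse_formula`, `expand_formula`, and `is_charge_neutral` are hypothetical names introduced only for illustration; the check also assumes each element takes a single oxidation state per formula, so mixed-valence compounds would be rejected.

```python
import re
from itertools import product

# Assumption: a tiny, hand-picked table of common oxidation states.
# A real study would rely on a curated chemistry library instead.
OXIDATION_STATES = {
    "Na": (1,), "Sr": (2,), "Ti": (2, 3, 4), "Fe": (2, 3),
    "O": (-2,), "Cl": (-1,),
}


def parse_formula(formula):
    """Parse a simple formula such as 'SrTiO3' into {element: count}.

    Only integer counts without parentheses are supported.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def expand_formula(formula):
    """Repeat each element symbol once per atom, separated by spaces."""
    counts = parse_formula(formula)
    return " ".join(el for el, n in counts.items() for _ in range(n))


def is_charge_neutral(formula):
    """Return True if some assignment of one oxidation state per element sums to zero."""
    counts = parse_formula(formula)
    elements = list(counts)
    choices = (OXIDATION_STATES[el] for el in elements)
    for states in product(*choices):
        if sum(state * counts[el] for state, el in zip(states, elements)) == 0:
            return True
    return False


if __name__ == "__main__":
    for f in ("SrTiO3", "NaCl", "NaCl2"):
        print(f, "|", expand_formula(f), "|", is_charge_neutral(f))
```

Under these assumptions, a space-separated expanded formula gives a language model one token per atom, and the fraction of generated formulas passing a check like `is_charge_neutral` corresponds to the kind of validity percentage the abstract reports.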
