Paper Title
AI for Chemical Space Gap Filling and Novel Compound Generation
Paper Authors
Paper Abstract
When considering large sets of molecules, it is helpful to place them in the context of a "chemical space": a multidimensional space defined by a set of descriptors that can be used to visualize and analyze compound groupings and to identify regions that might be void of valid structures. The chemical space of all possible molecules in a given biological or environmental sample can be vast and largely unexplored, mainly because of current limitations in processing "big data" by brute-force methods (e.g., enumerating all possible compounds in a space). Recent advances in artificial intelligence (AI) have led to multiple new cheminformatics tools that use AI techniques to characterize and learn the structure and properties of molecules in order to generate plausible compounds, making regions of chemical space more accessible and explorable without brute-force methods. We used one such tool, a deep-learning software package called DarkChem, which learns a representation of the molecular structure of compounds by compressing them into a latent space. By DarkChem's design, distance in this latent space is often associated with compound similarity, which makes sparse regions interesting targets for compound generation because they may yield novel compounds. In this study, we used 1 million small molecules (less than 1000 Da) to create a representative chemical space (defined by calculated molecular properties) of all small molecules. We identified regions with few or no compounds and investigated their location in DarkChem's latent space. From these regions, we generated 694,645 valid molecules, none of which is found in any chemical database to date. These molecules filled 50.8% of the probed empty spaces in molecular property space. Generated molecules are provided in the supporting information.
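The gap-filling measurement described in the abstract can be illustrated with a small sketch. The abstract does not say how empty regions were delimited, so the uniform grid binning below is an assumption made for illustration, not the authors' procedure. The function names, the two example properties (mass and logP), and the toy values are all hypothetical, and no DarkChem call is used.

```python
from itertools import product


def cell_of(point, lows, highs, bins):
    """Map a property vector to its grid cell (a tuple of bin indices)."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        frac = (x - lo) / (hi - lo)
        # Clamp so values on or beyond the range edges fall in the edge bins.
        idx.append(min(bins - 1, max(0, int(frac * bins))))
    return tuple(idx)


def empty_cells(reference, lows, highs, bins):
    """Return the grid cells that contain no reference molecule."""
    occupied = {cell_of(p, lows, highs, bins) for p in reference}
    all_cells = set(product(range(bins), repeat=len(lows)))
    return all_cells - occupied


def fraction_filled(empty, generated, lows, highs, bins):
    """Fraction of previously empty cells hit by at least one generated molecule."""
    if not empty:
        return 0.0
    hit = {cell_of(p, lows, highs, bins) for p in generated} & empty
    return len(hit) / len(empty)


# Toy data: (mass in Da, logP). Values are invented for illustration only.
lows, highs, bins = (0.0, -5.0), (1000.0, 5.0), 2
reference = [(100.0, -2.0), (200.0, -1.0)]
generated = [(700.0, 2.0), (150.0, -3.0)]

empty = empty_cells(reference, lows, highs, bins)
filled = fraction_filled(empty, generated, lows, highs, bins)
print(f"{len(empty)} empty cells, {filled:.1%} filled")
```

In this toy case both reference molecules share one cell, leaving three of the four cells empty, and one generated molecule lands in an empty cell. A percentage like the 50.8% reported above would come from the same kind of ratio, computed over whatever region definition the study actually used.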