Paper title
GROMACS in the cloud: A global supercomputer to speed up alchemical drug design
Paper authors
Paper abstract
We assess the costs and efficiency of state-of-the-art high-performance cloud computing compared to a traditional on-premises compute cluster. Our use case is atomistic simulation carried out with the GROMACS molecular dynamics (MD) toolkit, with a focus on alchemical protein-ligand binding free energy calculations. We set up a compute cluster in the Amazon Web Services (AWS) cloud that incorporates a variety of instance types with Intel, AMD, and ARM CPUs, some with GPU acceleration. Using representative biomolecular simulation systems, we benchmark how GROMACS performs on individual instances and across multiple instances. We thereby assess which instances deliver the highest performance and which are the most cost-efficient for our use case. We find that, in terms of total costs including hardware, personnel, room, energy, and cooling, producing MD trajectories in the cloud can be as cost-efficient as an on-premises cluster, provided that optimal cloud instances are chosen. Further, we find that high-throughput ligand screening for protein-ligand binding affinity estimation can be accelerated dramatically by using global cloud resources. For a ligand screening study consisting of 19,872 independent simulations, we used all hardware that was available in the cloud at the time of the study. The computations scaled up to reach peak performance using more than 4,000 instances, 140,000 cores, and 3,000 GPUs simultaneously around the globe. Our simulation ensemble finished in about two days in the cloud, whereas weeks would be required to complete the task on a typical on-premises cluster consisting of several hundred nodes. We demonstrate that the costs of such and similar studies can be drastically reduced with a checkpoint-restart protocol that allows the use of cheap Spot pricing, and by using instance types with optimal cost-efficiency.
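The checkpoint-restart idea mentioned in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' protocol and it does not call GROMACS or AWS. It is a toy model, assuming a job that counts steps, writes its progress to a JSON file at a fixed interval, and may be stopped at any step to mimic a Spot instance being reclaimed. The names `run_with_checkpoints`, `load_checkpoint`, `save_checkpoint`, and the `interrupt_at` parameter are hypothetical and exist only for illustration. In a real workflow the role of the JSON file would be played by the simulation's own checkpoint file (GROMACS `mdrun` can continue from one via its `-cpi` option).

```python
import json
import os


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["step"]


def save_checkpoint(path, step):
    """Write the checkpoint atomically so a reclaim mid-write cannot corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"step": step}, fh)
    os.replace(tmp_path, path)


def run_with_checkpoints(total_steps, checkpoint_every, path, interrupt_at=None):
    """Advance a toy 'simulation', resuming from the last checkpoint.

    interrupt_at (hypothetical) mimics a Spot reclaim: the function returns
    as soon as that step is reached, and progress made since the last
    checkpoint is lost on the next start.
    """
    step = load_checkpoint(path)
    while step < total_steps:
        if interrupt_at is not None and step >= interrupt_at:
            return step  # instance reclaimed before the job finished
        step += 1  # stand-in for one unit of real simulation work
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step)
    return step


if __name__ == "__main__":
    import tempfile

    ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
    reached = run_with_checkpoints(1000, 100, ckpt, interrupt_at=250)
    print("interrupted at step", reached, "last checkpoint", load_checkpoint(ckpt))
    reached = run_with_checkpoints(1000, 100, ckpt)
    print("finished at step", reached)
```

In this sketch the work lost to an interruption is bounded by the checkpoint interval. That bound is the design trade-off: shorter intervals waste less compute when an instance is reclaimed, but spend more time writing checkpoints.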