Running a Pre-Exascale, Geographically Distributed, Multi-Cloud Scientific Simulation

Paper Title

Running a Pre-Exascale, Geographically Distributed, Multi-Cloud Scientific Simulation

Authors

Igor Sfiligoi, Frank Wuerthwein, Benedikt Riedel, David Schultz

Abstract

As we approach the Exascale era, it is important to verify that existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototyping and urgent computing. Using the elasticity of the Cloud, we put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with IceCube's photon propagation simulation as the chosen application. This was not purely a demonstration run: it was also used to produce valuable and much-needed scientific results for the IceCube collaboration. To reach the desired scale, we aggregated GPU resources across 8 GPU models from many geographic regions across Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we reached a peak of over 51k GPUs, corresponding to almost 380 PFLOP32s, for a total integrated compute of about 100k GPU hours. In this paper we describe the setup and the problems that were discovered and overcome, and give a short description of the actual science output of the exercise.
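
The headline figures in the abstract can be put in perspective with some back-of-the-envelope arithmetic. The sketch below is not from the paper: it takes the rounded values quoted above ("over 51k" GPUs, "almost 380" PFLOP32s, "about 100k" GPU hours) as exact inputs, and the function names are hypothetical. It derives the average FP32 throughput per GPU and the number of hours the integrated compute would represent if the pool had stayed at peak size throughout.

```python
# Rounded figures quoted in the abstract; treated as exact here for illustration only.
PEAK_GPUS = 51_000          # "over 51k GPUs"
PEAK_PFLOP32S = 380.0       # "almost 380 PFLOP32s"
TOTAL_GPU_HOURS = 100_000   # "about 100k GPU hours"


def avg_tflops_per_gpu(peak_pflops: float, gpus: int) -> float:
    """Average FP32 throughput per GPU in TFLOP/s (1 PFLOP/s = 1000 TFLOP/s)."""
    return peak_pflops * 1000.0 / gpus


def equivalent_hours_at_peak(gpu_hours: float, gpus: int) -> float:
    """Hours the pool would need at constant peak size to deliver gpu_hours."""
    return gpu_hours / gpus


if __name__ == "__main__":
    per_gpu = avg_tflops_per_gpu(PEAK_PFLOP32S, PEAK_GPUS)
    hours = equivalent_hours_at_peak(TOTAL_GPU_HOURS, PEAK_GPUS)
    print(f"Average per-GPU FP32 throughput: {per_gpu:.2f} TFLOP/s")
    print(f"Equivalent duration at peak size: {hours:.2f} h")
```

Under these assumptions the average comes out at roughly 7.5 TFLOP32s per GPU across the 8 GPU models, and the integrated compute equals about two hours at peak size. The second number is only an equivalence, since a real pool ramps up and down and does not hold its peak.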

Paper Title