Paper title
Massively scalable stencil algorithm
Paper authors
Paper abstract
Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with cache-based memory hierarchies because of low data reuse across memory accesses. This work shows that, for stencil computations, a novel algorithm built on a localized communication strategy effectively exploits the Cerebras WSE-2, which has no cache hierarchy. The study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in numerical simulation for earth modeling. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm, historically memory bound, becomes compute bound. This allows the implementation to achieve near-perfect weak scaling, reaching up to 503 TFLOPs on the WSE-2, a figure that otherwise only full clusters can deliver.
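To make the kernel concrete, the following is a minimal sketch of a 25-point stencil update for the 3D wave equation. It is not the paper's WSE-2 implementation. It assumes the common star-shaped layout (the center point plus 4 neighbors in each direction along each of the 3 axes, giving 1 + 3 × 8 = 25 points), the standard 8th-order central-difference coefficients for the second derivative, and a second-order leapfrog time step. The names `stencil_offsets`, `laplacian_at`, and `wave_step` are hypothetical and chosen for illustration only.

```python
# Assumption: star-shaped stencil of radius 4 per axis (1 + 3 * 8 = 25 points).
RADIUS = 4

# Standard 8th-order central-difference weights for the second derivative.
# COEFFS[0] is the center weight, COEFFS[r] the weight at distance r.
COEFFS = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]


def stencil_offsets():
    """Return the (di, dj, dk) offsets touched by the stencil."""
    offsets = [(0, 0, 0)]
    for axis in range(3):
        for r in range(1, RADIUS + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * r
                offsets.append(tuple(off))
    return offsets


def laplacian_at(u, i, j, k, h=1.0):
    """Approximate the Laplacian of u (nested lists u[i][j][k]) at one point.

    The point must be at least RADIUS cells away from every boundary.
    """
    # The center point contributes once per axis, hence the factor 3.
    total = 3.0 * COEFFS[0] * u[i][j][k]
    for r in range(1, RADIUS + 1):
        total += COEFFS[r] * (
            u[i - r][j][k] + u[i + r][j][k]
            + u[i][j - r][k] + u[i][j + r][k]
            + u[i][j][k - r] + u[i][j][k + r]
        )
    return total / (h * h)


def wave_step(u_prev, u_curr, velocity, dt, h):
    """One leapfrog step: u_next = 2*u_curr - u_prev + (v*dt)^2 * Laplacian(u_curr).

    Assumes a cubic grid and a constant velocity; boundary cells are left at 0.
    """
    n = len(u_curr)
    u_next = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    factor = (velocity * dt) ** 2
    for i in range(RADIUS, n - RADIUS):
        for j in range(RADIUS, n - RADIUS):
            for k in range(RADIUS, n - RADIUS):
                u_next[i][j][k] = (
                    2.0 * u_curr[i][j][k]
                    - u_prev[i][j][k]
                    + factor * laplacian_at(u_curr, i, j, k, h)
                )
    return u_next
```

Each output point reads 25 input values but produces a single result, which illustrates why the kernel tends to be memory bound on cache-based machines: neighboring points along two of the three axes are far apart in linear memory. A sketch like this would be called as `wave_step(u_prev, u_curr, velocity, dt, h)` once per time step, swapping the roles of the buffers between steps.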