GPU acceleration of CaNS for massively-parallel direct numerical simulations of canonical fluid flows

Paper title

GPU acceleration of CaNS for massively-parallel direct numerical simulations of canonical fluid flows

Authors

Pedro Costa, Everett Phillips, Luca Brandt, Massimiliano Fatica

Abstract

This work presents the GPU acceleration of the open-source code CaNS for very fast massively-parallel simulations of canonical fluid flows. The distinct feature of the many-CPU Navier-Stokes solver in CaNS is its fast direct solver for the second-order finite-difference Poisson equation, based on the method of eigenfunction expansions. The solver implements all the boundary conditions valid for this type of problem in a unified framework. Here, we extend the solver for GPU-accelerated clusters using CUDA Fortran. The porting makes extensive use of CUF kernels and has been greatly simplified by the unified memory feature of CUDA Fortran, which handles the data migration between host (CPU) and device (GPU) without defining new arrays in the source code. The overall implementation has been validated against benchmark data for turbulent channel flow, and its performance has been assessed on an NVIDIA DGX-2 system (16 Tesla V100 32 GB GPUs, connected with NVLink via NVSwitch). The wall-clock time per time step of the GPU-accelerated implementation is impressively small when compared to its CPU implementation on state-of-the-art many-CPU clusters, as long as the domain partitioning is sufficiently small that the data resides mostly on the GPUs. The implementation has been made freely available and open-source under the terms of an MIT license.
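
The eigenfunction-expansion idea mentioned in the abstract can be illustrated with a small sketch. This is not the CaNS implementation (which is written in Fortran and handles several boundary-condition types); it is a minimal NumPy example that assumes a 2D domain, periodic boundary conditions in both directions, and a zero-mean right-hand side. The function names `fd_eigenvalues`, `solve_poisson_periodic`, and `fd_laplacian` are hypothetical and chosen only for illustration.

```python
import numpy as np


def fd_eigenvalues(n, h):
    """Eigenvalues of the periodic second-order finite-difference
    second-derivative operator on n points with spacing h."""
    k = np.arange(n)
    return (2.0 * np.cos(2.0 * np.pi * k / n) - 2.0) / h**2


def solve_poisson_periodic(f, hx, hy):
    """Directly solve the discrete Poisson equation lap(p) = f.

    Assumes periodic boundaries in both directions and a zero-mean f;
    the returned solution is normalized to have zero mean.
    """
    nx, ny = f.shape
    # Eigenvalue of the 2D discrete Laplacian for each Fourier mode pair.
    lam = fd_eigenvalues(nx, hx)[:, None] + fd_eigenvalues(ny, hy)[None, :]
    f_hat = np.fft.fft2(f)
    lam[0, 0] = 1.0  # placeholder to avoid dividing by zero for the mean mode
    p_hat = f_hat / lam
    p_hat[0, 0] = 0.0  # fix the undetermined constant: zero-mean solution
    return np.fft.ifft2(p_hat).real


def fd_laplacian(p, hx, hy):
    """Apply the periodic second-order finite-difference Laplacian."""
    d2x = (np.roll(p, 1, axis=0) - 2.0 * p + np.roll(p, -1, axis=0)) / hx**2
    d2y = (np.roll(p, 1, axis=1) - 2.0 * p + np.roll(p, -1, axis=1)) / hy**2
    return d2x + d2y
```

Because the Fourier modes are exact eigenvectors of the periodic finite-difference operator, dividing by the discrete eigenvalues (rather than by the continuous wavenumbers squared) inverts the second-order scheme to round-off, which can be checked by applying `fd_laplacian` to the result.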
