Paper Title
Systolic Computing on GPUs for Productive Performance

Authors

Hongbo Rong, Xiaochen Hao, Yun Liang, Lidong Xu, Hong H. Jiang, Pradeep Dubey

Abstract
We propose a language and compiler to productively build high-performance {\it software systolic arrays} that run on GPUs. Based on a rigorous mathematical foundation (uniform recurrence equations and space-time transforms), our language has a high abstraction level and covers a wide range of applications. A programmer {\it specifies} a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler; the compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs. In this way, productivity and performance are achieved at the same time. This approach neatly combines loop transformations, data shuffling, and vector register allocation into a single framework. Meanwhile, many other optimizations can be applied as well; the compiler composes the optimizations together to generate efficient code. We implemented the approach on Intel GPUs. This is the first system that allows productive construction of systolic arrays on GPUs. We allow multiple projections, arbitrary projection directions, and linear schedules, which can express most, if not all, systolic arrays in practice. Experiments with 1- and 2-D convolution on an Intel GEN9.5 GPU have demonstrated the generality of the approach, and its productivity in expressing various systolic designs for finding the best candidate. Although our systolic arrays are purely software running on generic SIMD hardware, compared with the GPU's specialized hardware samplers that perform the same convolutions, some of our best designs are up to 59\% faster. Overall, this approach holds promise for productive high-performance computing on GPUs.
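To make the systolic idea in the abstract concrete, the sketch below (our own illustration, not the paper's language, compiler, or generated GPU code) simulates a 1-D convolution on a linear systolic array: each processing element (PE) holds one weight stationary, the current input sample is fed to all PEs each cycle, and partial sums march one PE per cycle toward the last PE, where finished outputs drain. The function name and the weight-stationary design choice are our assumptions for illustration.

```python
def conv1d_systolic(x, w):
    """Compute y[i] = sum_k w[k] * x[i + k] on a simulated linear
    systolic array (illustrative sketch, not the paper's system).

    PE k keeps weight w[k] stationary. At cycle t every PE sees input
    x[t]; PE k adds w[k] * x[t] to the partial sum received from PE
    k-1 (PE 0 starts a fresh output). After K-1 cycles of pipeline
    fill, one completed output drains from the last PE per cycle.
    """
    K = len(w)
    psum = [0] * K          # partial-sum register inside each PE
    y = []
    for t, xt in enumerate(x):
        # One parallel cycle: each PE reads its left neighbor's
        # previous partial sum (old psum) and accumulates w[k] * x[t].
        psum = [(psum[k - 1] if k > 0 else 0) + w[k] * xt
                for k in range(K)]
        if t >= K - 1:      # pipeline full: last PE emits y[t - K + 1]
            y.append(psum[-1])
    return y

# Edge-detection-style kernel over a short signal:
print(conv1d_systolic([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

This corresponds to one particular projection of the 2-D (output index, weight index) dataflow onto a 1-D array along the weight axis; the paper's system lets the programmer specify such projections (including multiple projections and arbitrary projection directions) and leaves their implementation on SIMD lanes and vector registers to the compiler.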
