Paper Title

Automatic Kernel Generation for Volta Tensor Cores

Authors

Somashekaracharya G. Bhaskaracharya, Julien Demouth, Vinod Grover

Abstract

A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning compilers. When compiling to a GPU target, these computations can be individually mapped to manually tuned implementations provided by libraries such as cuBLAS and cuDNN. These libraries also provide off-the-shelf support for targeting tensor cores in NVIDIA GPUs, which can lead to huge performance boosts through their specialized support for mixed-precision matrix math. Alternatively, tensor cores can be programmed directly using CUDA APIs or inline assembly instructions, which opens up the possibility of generating efficient CUDA kernels automatically for such computations. Automatic kernel generation is particularly crucial when it is beneficial to generate efficient code for an entire computation graph by fusing several operations into a single device function instead of invoking a separate kernel for each of them. Polyhedral compilation techniques provide a systematic approach for the analysis and transformation of a sequence of affine loop nests. In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition and ReLU activation. Experimental evaluation of these techniques shows that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups of up to 2.55X.
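The abstract mentions two routes for programming tensor cores: documented CUDA APIs and inline assembly, the latter being what the paper's generated kernels use. As a rough illustration of the kind of fused kernel being described, the sketch below uses the public CUDA WMMA API (not the paper's inline-assembly approach, and not its actual generated code) to compute one 16x16 tile of D = ReLU(A * B + bias) per warp on a Volta-class GPU (sm_70 or newer). The kernel name, tile-to-warp mapping, and launch configuration are illustrative assumptions, and all matrix dimensions are assumed to be multiples of 16.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative sketch only: one warp per block computes one 16x16 tile of
// D = ReLU(A * B + bias), with A (MxK) and B (KxN) in half precision and
// accumulation in float. M, N, K are assumed to be multiples of 16.
__global__ void fused_wmma_gemm_bias_relu(const half *A, const half *B,
                                          const float *bias, float *D,
                                          int M, int N, int K) {
    int tileRow = blockIdx.y * 16;  // first row of this warp's output tile
    int tileCol = blockIdx.x * 16;  // first column of this warp's output tile

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;
    wmma::fill_fragment(accFrag, 0.0f);

    // Accumulate over the K dimension, 16 elements at a time, on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileRow * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileCol, N);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    }

    // Fused epilogue: stage the tile in shared memory, then apply bias
    // addition and ReLU before writing back, instead of launching a
    // separate pointwise kernel on the GEMM result.
    __shared__ float tile[16 * 16];
    wmma::store_matrix_sync(tile, accFrag, 16, wmma::mem_row_major);
    __syncwarp();
    for (int i = threadIdx.x; i < 16 * 16; i += warpSize) {
        int row = i / 16, col = i % 16;
        float v = tile[i] + bias[tileCol + col];
        D[(tileRow + row) * N + (tileCol + col)] = v > 0.0f ? v : 0.0f;
    }
}
```

A launch such as fused_wmma_gemm_bias_relu<<<dim3(N / 16, M / 16), 32>>>(A, B, bias, D, M, N, K) would assign one warp to each output tile. A realistic kernel of the kind the paper generates would additionally tile for shared memory, assign several WMMA tiles per warp, and overlap global-memory loads with tensor-core math.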
