Paper Title
Efficient Quantized Sparse Matrix Operations on Tensor Cores
Paper Authors
Paper Abstract
The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory costs. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements on data layout and the lack of support for efficiently manipulating low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning, with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average a 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and a 1.43x speedup over the state of the art with comparable accuracy for end-to-end sparse Transformer inference.
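For readers unfamiliar with the two kernels named in the abstract, the following is a minimal NumPy sketch of their reference semantics only; it is not Magicube's API. The int8 operands accumulated in int32, the matrix shapes, and the random masks are illustrative assumptions chosen to mirror the low-precision-integer setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative low-precision operands: int8 values, int32 accumulation.
M, K, N = 4, 8, 4
A = rng.integers(-8, 8, size=(M, K)).astype(np.int8)
B = rng.integers(-8, 8, size=(K, N)).astype(np.int8)

# Assumed sparsity patterns: one over A (for SpMM), one over the output (for SDDMM).
mask_A = rng.random((M, K)) < 0.5   # ~50% of A's entries kept
mask_C = rng.random((M, M)) < 0.5   # output nonzero pattern for SDDMM

# SpMM: sparse matrix times dense matrix, C = A_sparse @ B.
A_sparse = np.where(mask_A, A, 0).astype(np.int32)
C_spmm = A_sparse @ B.astype(np.int32)

# SDDMM: dense-dense product A @ A.T, sampled at the nonzeros of mask_C.
D = A.astype(np.int32) @ A.astype(np.int32).T
C_sddmm = np.where(mask_C, D, 0)

print(C_spmm)
print(C_sddmm)
```

In SpMM only the input matrix is sparse and the output is dense, whereas in SDDMM the dense-dense product is evaluated only at the positions of a given sparsity pattern; an optimized library would of course store the sparse operands in a compressed format rather than as masked dense arrays.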