TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

Paper title

TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

Authors

Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, Dingwen Tao

Abstract

Linear algebra operations are widely used in big data analytics and scientific computing. Much work has been done on optimizing linear algebra operations on GPUs for regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully utilizing the memory bandwidth and computing power; therefore, they can only achieve sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix-matrix multiplication on GPUs. Both focus on optimizing linear algebra operations in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and test them on several modern NVIDIA GPU micro-architectures. Experiments show that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.1x to 3x and improves the memory bandwidth utilization and computing power utilization by 8% to 47.6% and 7% to 37.3%, respectively, when the regular-shaped matrix size is relatively large or medium; and (2) TSM2L speeds up the computation by 1.1x to 3.5x and improves the memory bandwidth utilization by up to 55% when the regular-shaped matrix size is relatively small.
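As a rough illustration of why tall-and-skinny inputs stress memory bandwidth differently from regular-shaped ones, the sketch below estimates arithmetic intensity (floating-point operations per byte moved) for a matrix product. It is not taken from the paper: the function name `arithmetic_intensity` is hypothetical, and the estimate assumes an idealized model in which every element of A, B, and C crosses the memory bus exactly once.

```python
def arithmetic_intensity(m: int, inner: int, n: int, elem_bytes: int = 8) -> float:
    """Estimate FLOPs per byte for C = A @ B, with A (m x inner) and B (inner x n).

    Assumption (idealized, not from the paper): each element of A, B and C
    is read or written exactly once, and elements are elem_bytes wide
    (8 bytes for double precision).
    """
    flops = 2 * m * inner * n                      # one multiply and one add per term
    bytes_moved = (m * inner + inner * n + m * n) * elem_bytes
    return flops / bytes_moved


# Regular-shaped case: square matrices.
square = arithmetic_intensity(1024, 1024, 1024)

# TSM2R-like case: large regular-shaped matrix times a tall-and-skinny one
# (10240 x 10240 times 10240 x 4; the sizes are illustrative only).
skinny = arithmetic_intensity(10240, 10240, 4)

print(f"square: {square:.2f} FLOPs/byte")
print(f"tall-and-skinny: {skinny:.2f} FLOPs/byte")
```

Under this model the square product performs tens of operations per byte, while the tall-and-skinny product performs roughly one, so the second case is far more likely to be limited by memory bandwidth than by compute throughput.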
