论文标题
具有成对扰动和多扫维维的有效平行CP分解
Efficient parallel CP decomposition with pairwise perturbation and multi-sweep dimension tree
论文作者
论文摘要
具有交替的最小二乘(ALS)的CP张量分解由矩阵调度次数Khatri-Rao产品(MTTKRP)内核占主导地位,这是建立二次优化亚问题所必需的。最先进的并行ALS实现使用维度树,以避免每个ALS扫描中MTTKRP的冗余计算。在本文中,我们提出了两种新的并行算法来加速CP-ALS。我们介绍了多扫二维树(MSDT)算法,该算法需要每次n(N-1)/N扫描一次n输入张量和第一次合同输入矩阵之间的收缩。相对于以前最佳的方法,该算法将领先的计算成本降低了2(n-1)/n。此外,我们引入了一种更加沟通的方法,以使近似CP-ALS算法(成对扰动)并行。该技术对子问题使用扰动校正,而不是重新计算收缩,并渐近地加速ALS。我们的基准结果表明,与在Stampede2 SuperCutter上运行的1024个处理器上运行的最先进的尺寸树相比,每次减少时间的MSDT速度达到1.25倍,对成对扰动的速度达到1.25倍。
CP tensor decomposition with alternating least squares (ALS) is dominated in cost by the matricized-tensor times Khatri-Rao product (MTTKRP) kernel that is necessary to set up the quadratic optimization subproblems. State-of-art parallel ALS implementations use dimension trees to avoid redundant computations across MTTKRPs within each ALS sweep. In this paper, we propose two new parallel algorithms to accelerate CP-ALS. We introduce the multi-sweep dimension tree (MSDT) algorithm, which requires the contraction between an order N input tensor and the first-contracted input matrix once every (N-1)/N sweeps. This algorithm reduces the leading order computational cost by a factor of 2(N-1)/N relative to the best previously known approach. In addition, we introduce a more communication-efficient approach to parallelizing an approximate CP-ALS algorithm, pairwise perturbation. This technique uses perturbative corrections to the subproblems rather than recomputing the contractions, and asymptotically accelerates ALS. Our benchmark results show that the per-sweep time achieves 1.25X speed-up for MSDT and 1.94X speed-up for pairwise perturbation compared to the state-of-art dimension trees running on 1024 processors on the Stampede2 supercomputer.