Paper title
Adaptively restarted block Krylov subspace methods with low-synchronization skeletons
Paper authors
Paper abstract
With the recent realization of exascale performance by Oak Ridge National Laboratory's Frontier supercomputer, reducing communication in kernels like QR factorization has become even more imperative. Low-synchronization Gram-Schmidt methods, first introduced in [K. Świrydowicz, J. Langou, S. Ananthan, U. Yang, and S. Thomas, Low Synchronization Gram-Schmidt and Generalized Minimum Residual Algorithms, Numer. Lin. Alg. Appl., Vol. 28(2), e2343, 2020], have been shown to improve the scalability of the Arnoldi method in high-performance distributed computing. Block versions of low-synchronization Gram-Schmidt show further potential for speeding up algorithms, because batching columns makes it possible to maximize cache usage with matrix-matrix operations. In this work, low-synchronization block Gram-Schmidt variants from [E. Carson, K. Lund, M. Rozložník, and S. Thomas, Block Gram-Schmidt algorithms and their stability properties, Lin. Alg. Appl., 638, pp. 150--195, 2022] are transformed into block Arnoldi variants for use in block full orthogonalization methods (BFOM) and block generalized minimal residual methods (BGMRES). An adaptive restarting heuristic is developed to handle instabilities that arise as the condition number of the Krylov basis increases. The performance, accuracy, and stability of these methods are assessed via a flexible benchmarking tool written in MATLAB. The modularity of the tool additionally permits generalized block inner products, such as the global inner product.
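To make the closing remark about block inner products concrete, the following is a minimal sketch in Python, not taken from the paper's MATLAB tool. It assumes real-valued block vectors stored as lists of rows, and it assumes the unscaled trace convention for the global inner product (some references scale it by 1/s or embed the scalar as a multiple of the identity). The function names `classical_block_inner` and `global_inner` are hypothetical.

```python
# Sketch only: real-valued n-by-s block vectors stored as lists of rows.
# Function names are hypothetical, not part of the paper's benchmarking tool.

def classical_block_inner(X, Y):
    """Return the s-by-s matrix X^T Y (classical block inner product)."""
    s = len(X[0])
    return [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(s)]
            for i in range(s)]

def global_inner(X, Y):
    """Return the scalar trace(X^T Y), the Frobenius inner product.

    Assumption: unscaled trace convention; other normalizations exist.
    """
    return sum(a * b for x, y in zip(X, Y) for a, b in zip(x, y))

# Two 3-by-2 block vectors (n = 3 rows, s = 2 columns).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

G = classical_block_inner(X, Y)  # an s-by-s matrix of inner products
g = global_inner(X, Y)           # a single scalar
```

Under these assumptions the global inner product equals the trace of the classical one, so it collapses an s-by-s block of coefficients into one scalar, which is why the two choices lead to different block Arnoldi variants.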