Paper Title
Convolutional Neural Network Training with Distributed K-FAC
Paper Authors
Paper Abstract
Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18-25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.
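For concreteness, the sketch below illustrates the K-FAC preconditioning idea the abstract refers to: a layer's Fisher block is approximated by a Kronecker product of an input-activation covariance factor A and a pre-activation-gradient covariance factor G, so the preconditioned gradient can be computed factor-wise without ever forming the full Fisher matrix. This is a minimal single-layer NumPy sketch, not the paper's distributed implementation; the function name, damping value, and array shapes are illustrative assumptions.

```python
import numpy as np

def kfac_precondition(grad, a, g, damping=1e-3):
    """Precondition one fully-connected layer's gradient with K-FAC factors.

    grad : (out, in)    weight gradient for the layer
    a    : (batch, in)  inputs (activations) to the layer
    g    : (batch, out) gradients w.r.t. the layer's pre-activations
    """
    batch = a.shape[0]
    A = a.T @ a / batch   # input-activation covariance factor
    G = g.T @ g / batch   # pre-activation-gradient covariance factor

    # Tikhonov damping keeps the small factor matrices invertible.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))

    # (A ⊗ G)^{-1} vec(grad) == vec(G^{-1} grad A^{-1}),
    # so the natural-gradient step only needs the two small inverses.
    return G_inv @ grad @ A_inv
```

In the distributed setting described by the abstract, the per-layer factor computations and inversions are what get assigned to different workers (the layer-wise distribution strategy), and how often the factors are recomputed versus reused is what the dynamic K-FAC update decoupling controls.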