Paper Title

Convolutional Neural Network Training with Distributed K-FAC

Paper Authors

J. Gregory Pauloski, Zhao Zhang, Lei Huang, Weijia Xu, Ian T. Foster

Abstract

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18-25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.
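As added context (not part of the paper's text), the minimal NumPy sketch below illustrates the core K-FAC preconditioning idea for a single fully connected layer: the layer's Fisher block is approximated by the Kronecker product of two small covariance factors, one built from layer inputs and one from back-propagated output gradients, and the gradient is preconditioned by their inverses. All names, shapes, and the damping value here are illustrative assumptions; the paper's distributed implementation further studies layer-wise factor distribution, inverse-free second-order evaluation, and decoupled factor updates, none of which are shown.

```python
# Minimal K-FAC preconditioning sketch for one fully connected layer.
# Assumes NumPy only; shapes and the damping value are illustrative.
import numpy as np

def kfac_precondition(grad_W, a_in, g_out, damping=1e-3):
    """Precondition a weight gradient with Kronecker-factored curvature.

    K-FAC approximates the layer's Fisher block as A ⊗ G, where
    A = E[a aᵀ] comes from layer inputs and G = E[g gᵀ] from gradients
    w.r.t. layer outputs.  The natural-gradient step then uses
    (A ⊗ G)⁻¹ vec(∇W) = G⁻¹ ∇W A⁻¹, so only two small factors are
    inverted instead of the full Fisher matrix.
    """
    batch = a_in.shape[0]
    A = a_in.T @ a_in / batch              # input-covariance factor
    G = g_out.T @ g_out / batch            # output-gradient covariance factor
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv          # preconditioned gradient

# Toy usage: a layer with 8 inputs and 4 outputs on a batch of 32 samples.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 8))           # layer inputs
g = rng.standard_normal((32, 4))           # gradients w.r.t. layer outputs
grad_W = g.T @ a / 32                      # plain (SGD) weight gradient
precond_grad = kfac_precondition(grad_W, a, g)
print(precond_grad.shape)                  # (4, 8), same shape as grad_W
```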
