Paper Title
Convolutional Neural Network Training with Distributed K-FAC
Paper Authors
Paper Abstract
Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18-25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.
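For concreteness, the sketch below illustrates the K-FAC preconditioning idea the abstract refers to: a layer's Fisher block is approximated by a Kronecker product of an input-activation covariance factor A and a pre-activation-gradient covariance factor G, so the preconditioned gradient can be computed factor-wise without ever forming the full Fisher matrix. This is a minimal single-layer NumPy sketch, not the paper's distributed implementation; the function name, damping value, and array shapes are illustrative assumptions.

```python
import numpy as np

def kfac_precondition(grad, a, g, damping=1e-3):
    """Precondition one fully-connected layer's gradient with K-FAC factors.

    grad : (out, in)    weight gradient for the layer
    a    : (batch, in)  inputs (activations) to the layer
    g    : (batch, out) gradients w.r.t. the layer's pre-activations
    """
    batch = a.shape[0]
    A = a.T @ a / batch   # input-activation covariance factor
    G = g.T @ g / batch   # pre-activation-gradient covariance factor

    # Tikhonov damping keeps the small factor matrices invertible.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))

    # (A ⊗ G)^{-1} vec(grad) == vec(G^{-1} grad A^{-1}),
    # so the natural-gradient step only needs the two small inverses.
    return G_inv @ grad @ A_inv
```

In the distributed setting described by the abstract, the per-layer factor computations and inversions are what get assigned to different workers (the layer-wise distribution strategy), and how often the factors are recomputed versus reused is what the dynamic K-FAC update decoupling controls.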