Paper Title

Extrapolation for Large-batch Training in Deep Learning

Paper Authors

Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

Paper Abstract

Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work proposes to add small perturbations to the model parameters when computing the stochastic gradients and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood; it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we propose instead to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.
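
To make the extrapolation (extragradient) idea concrete, below is a minimal PyTorch sketch of a generic extragradient-style SGD step: a stochastic gradient is first used to move to an extrapolated point, the gradient is then re-evaluated there, and that second gradient is applied from the original parameters. This is an illustrative sketch of the general extragradient technique under simplifying assumptions, not the authors' exact algorithm or released code; the function name extragradient_sgd_step, the step sizes, and the toy linear-regression setup are all hypothetical.

```python
import torch

def extragradient_sgd_step(model, loss_fn, batch, lr=0.1, extrap_lr=0.1):
    """One extragradient-style SGD step (illustrative sketch, not the paper's exact scheme)."""
    inputs, targets = batch

    # Keep a copy of the current parameters w_t.
    saved = [p.detach().clone() for p in model.parameters()]

    # 1) Extrapolation: w_tilde = w_t - extrap_lr * g(w_t), using a stochastic gradient.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= extrap_lr * g

    # 2) Evaluate the stochastic gradient at the extrapolated point w_tilde.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, list(model.parameters()))

    # 3) Update from the ORIGINAL parameters: w_{t+1} = w_t - lr * g(w_tilde).
    with torch.no_grad():
        for p, p0, g in zip(model.parameters(), saved, grads):
            p.copy_(p0 - lr * g)

    return loss.item()

# Hypothetical toy usage: a linear model on random data.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(10):
    extragradient_sgd_step(model, loss_fn, (x, y), lr=0.05, extrap_lr=0.05)
```

This sketch reuses the same mini-batch for both gradient evaluations; the paper's unified framework and its variants (e.g., how smoothing is incorporated and how batches are sampled) may differ in such details.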
