Paper Title
The Break-Even Point on Optimization Trajectories of Deep Neural Networks
Paper Authors
Paper Abstract
The early phase of training of deep neural networks is critical for their final performance. In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the optimization trajectory. We argue for the existence of the "break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD. In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the covariance of gradients. These effects are beneficial from the optimization perspective and become visible after the break-even point. Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers. In short, our work shows that key properties of the loss surface are strongly influenced by SGD in the early phase of training. We argue that studying the impact of the identified effects on generalization is a promising future direction.
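To make the quantities mentioned in the abstract concrete, below is a minimal sketch (not the authors' code) of how one might estimate the variance of the gradient and a conditioning proxy for the covariance of per-example gradients at a single point on an optimization trajectory. The toy logistic-regression model, the synthetic data, and the top-k eigenvalue ratio used as the conditioning proxy are all illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

# Illustrative sketch: measure (1) the variance of per-example gradients and
# (2) a conditioning proxy for the covariance of gradients K, for a toy
# logistic-regression model at one iterate w. Assumptions: synthetic data,
# w = 0 as the iterate, and lambda_1 / lambda_k of K as the conditioning proxy.

rng = np.random.default_rng(0)
n, d = 256, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)  # current point on the optimization trajectory

# Per-example gradients of the logistic loss: g_i = (sigmoid(x_i . w) - y_i) * x_i
p = 1.0 / (1.0 + np.exp(-X @ w))
per_example_grads = (p - y)[:, None] * X          # shape (n, d)

g_mean = per_example_grads.mean(axis=0)           # full-batch gradient
centered = per_example_grads - g_mean

# Gradient "variance": mean squared deviation of per-example gradients
# from the full-batch gradient.
grad_variance = np.mean(np.sum(centered ** 2, axis=1))

# Covariance of gradients K and a simple conditioning proxy
# (ratio of its largest to its k-th largest eigenvalue; k = 5 is arbitrary).
K = centered.T @ centered / n
eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
k = 5
conditioning_proxy = eigvals[0] / eigvals[k - 1]

print(f"gradient variance        : {grad_variance:.4f}")
print(f"lambda_1 / lambda_{k} of K : {conditioning_proxy:.2f}")
```

Tracking these two statistics over training steps, under different initial learning rates, is one way to visualize the kind of "break-even point" behavior the abstract describes.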