Paper Title
How Many Factors Influence Minima in SGD?
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) is often applied to train Deep Neural Networks (DNNs), and research efforts have been devoted to investigating the convergent dynamics of SGD and the minima found by SGD. The influencing factors identified in the literature include the learning rate, batch size, Hessian, and gradient covariance, and stochastic differential equations are used to model SGD and establish the relationships among these factors for characterizing minima found by SGD. It has been found that the ratio of batch size to learning rate is a main factor characterizing the underlying SGD dynamics; however, the influence of other important factors, such as the Hessian and gradient covariance, is not entirely agreed upon. This paper describes the factors and relationships in the recent literature and presents numerical findings on the relationships. In particular, it confirms the four-factor and general relationship results obtained in Wang (2019), while the three-factor and associated relationship results found in Jastrzębski et al. (2018) may not hold beyond the considered special case.
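The batch-size-to-learning-rate ratio matters because the mini-batch gradient noise covariance scales as 1/B, so the noise injected per SGD step scales with η/B in the SDE view. The sketch below is a minimal illustration of that scaling, not code from the paper: the toy linear-regression problem, the point at which the noise is measured, and the helper names (per_example_grads, minibatch_grad_cov) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear regression with Gaussian data.
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.5 * rng.normal(size=N)

theta = np.zeros(d)  # measure gradient noise at a fixed parameter point


def per_example_grads(theta):
    # Gradient of 0.5 * (x_i . theta - y_i)^2 for every example i.
    residuals = X @ theta - y          # shape (N,)
    return residuals[:, None] * X      # shape (N, d)


def minibatch_grad_cov(theta, batch_size, n_draws=20000):
    # Empirical covariance of the mini-batch gradient,
    # sampling examples with replacement.
    grads = per_example_grads(theta)
    draws = np.empty((n_draws, d))
    for t in range(n_draws):
        idx = rng.integers(0, N, size=batch_size)
        draws[t] = grads[idx].mean(axis=0)
    return np.cov(draws, rowvar=False)


cov_small = minibatch_grad_cov(theta, batch_size=8)
cov_large = minibatch_grad_cov(theta, batch_size=32)

# If the noise covariance is ~ C / B, the trace ratio should be close
# to 32 / 8 = 4: quadrupling B cuts the noise by a factor of four,
# which is why the ratio eta / B governs the effective noise scale.
print("trace ratio (B=8 vs B=32):", np.trace(cov_small) / np.trace(cov_large))
```

With sampling with replacement, the mini-batch gradient covariance is exactly C/B, so the printed trace ratio should land near 4; this is the 1/B scaling that makes η/B, rather than η or B alone, the quantity highlighted in the abstract.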