Paper Title
On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width
Paper Authors
Paper Abstract
The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H containing negative eigenvalues. We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between. In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G. To explain this phenomenon, we show that when initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to H, such that the curvature of the loss surface is strictly positive in the direction of the gradient.
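To make the quantities in the abstract concrete: the decomposition referred to is presumably the standard generalized Gauss-Newton split, in which G = Jᵀ (∂²ℓ/∂f²) J (positive semidefinite whenever the loss ℓ is convex in the network output f) and H collects the remaining second-derivative terms of the network itself. The sketch below is not code from the paper; the two-layer tanh network, the MSE loss, and all function names are illustrative assumptions. It estimates the curvature of the loss along the gradient direction, gᵀ(G + H)g / ‖g‖², and the part of it contributed by G alone, gᵀGg / ‖g‖², using Hessian-vector and Jacobian-vector products in JAX. Under a common fan-in initialization, widening the hidden layer should bring the two numbers closer together, consistent with the claim that the curvature along the gradient is dominated by G.

import jax
import jax.numpy as jnp

def net(params, x):
    # A small two-layer tanh network; the hidden size plays the role of "width".
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def loss_fn(params, x, y):
    # Mean squared error: convex in the network output, so the GGN term is PSD.
    return 0.5 * jnp.mean((net(params, x) - y) ** 2)

def flatten(tree):
    # Concatenate all parameter arrays into a single vector.
    return jnp.concatenate([jnp.ravel(a) for a in jax.tree_util.tree_leaves(tree)])

def curvature_along_gradient(params, x, y):
    # g^T (G + H) g / ||g||^2 via a forward-over-reverse Hessian-vector product.
    g = jax.grad(loss_fn)(params, x, y)
    _, hvp = jax.jvp(jax.grad(lambda p: loss_fn(p, x, y)), (params,), (g,))
    g_vec = flatten(g)
    return jnp.vdot(g_vec, flatten(hvp)) / jnp.vdot(g_vec, g_vec)

def ggn_curvature_along_gradient(params, x, y):
    # g^T G g / ||g||^2 with G = J^T (d^2 loss / d f^2) J.  For the MSE above,
    # d^2 loss / d f^2 = I / f.size, so the quadratic form reduces to ||J g||^2 / f.size.
    g = jax.grad(loss_fn)(params, x, y)
    _, Jg = jax.jvp(lambda p: net(p, x), (params,), (g,))
    g_vec = flatten(g)
    return jnp.vdot(Jg, Jg) / Jg.size / jnp.vdot(g_vec, g_vec)

# Usage on random data, with He-style fan-in scaling as one of the "common
# initialization methodologies" the abstract mentions.
key = jax.random.PRNGKey(0)
width, d_in, n = 512, 10, 64
x = jax.random.normal(jax.random.fold_in(key, 0), (n, d_in))
y = jax.random.normal(jax.random.fold_in(key, 1), (n, 1))
w1 = jax.random.normal(jax.random.fold_in(key, 2), (d_in, width)) * jnp.sqrt(2.0 / d_in)
w2 = jax.random.normal(jax.random.fold_in(key, 3), (width, 1)) * jnp.sqrt(2.0 / width)
params = (w1, w2)
print(curvature_along_gradient(params, x, y))      # full curvature along the gradient
print(ggn_curvature_along_gradient(params, x, y))  # GGN (positive semidefinite) part only

Comparing the two printed values at initialization, and again across different widths, is one simple way to probe the abstract's claim that the gradient is approximately orthogonal to H in over-parameterized networks.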