Paper Title
Restricted Strong Convexity of Deep Learning Models with Smooth Activations
Paper Authors
Paper Abstract
We consider the problem of optimizing deep learning models with smooth activation functions. While there exist influential results from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish an $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of the predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC-based analysis does not need the ``near initialization'' perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result establishing geometric convergence of GD based on RSC for deep learning models, thus providing an alternative sufficient condition for convergence that does not depend on the widely used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.
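As a rough illustration of the convergence mechanism described in the abstract (a hedged sketch, not the paper's exact statement; the RSC constant $\alpha$, the smoothness constant $\beta$, and the restricted set $S$ are generic placeholders): if the empirical loss $\mathcal{L}$ satisfies the RSC condition
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta), \theta' - \theta \rangle + \frac{\alpha}{2}\|\theta' - \theta\|_2^2 \quad \text{for all } \theta, \theta' \in S,$$
together with $\beta$-smoothness, and the GD iterates remain in $S$, then gradient descent with step size $1/\beta$ obeys
$$\mathcal{L}(\theta_{t+1}) - \mathcal{L}^* \leq \Big(1 - \frac{\alpha}{\beta}\Big)\big(\mathcal{L}(\theta_t) - \mathcal{L}^*\big),$$
i.e., the optimality gap shrinks geometrically, without appealing to the NTK regime.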