Paper Title
Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent
Paper Authors
Paper Abstract
Deep unfolding is a promising deep-learning technique in which an iterative algorithm is unrolled into a deep network architecture with trainable parameters. In the case of gradient descent algorithms, the training process often yields an acceleration of convergence with learned non-constant step-size parameters whose behavior is neither intuitive nor interpretable from conventional theory. In this paper, we provide a theoretical interpretation of the learned step sizes of deep-unfolded gradient descent (DUGD). We first prove that the training process of DUGD reduces not only the mean squared error loss but also the spectral radius related to the convergence rate. Next, we show that minimizing an upper bound of the spectral radius naturally leads to the Chebyshev steps, a step-size sequence based on Chebyshev polynomials. Numerical experiments confirm that the Chebyshev steps qualitatively reproduce the learned step-size parameters in DUGD, which provides a plausible interpretation of the learned parameters. Additionally, we show that the Chebyshev steps achieve the lower bound of the convergence rate of first-order methods in a specific limit, without learning parameters or momentum terms.
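The Chebyshev-step construction mentioned in the abstract admits a short numerical illustration. The sketch below is not the authors' code; the problem instance, array sizes, and helper names such as chebyshev_steps are illustrative assumptions. It takes the step sizes to be the reciprocals of the Chebyshev nodes on the eigenvalue interval [lam_min, lam_max] of A^T A, and compares plain gradient descent under this schedule with a constant step size on a small least-squares problem.

# Minimal sketch (assumptions noted above): Chebyshev step-size schedule
# for gradient descent on f(x) = 0.5 * ||Ax - b||^2.
import numpy as np

def chebyshev_steps(lam_min, lam_max, T):
    """Step sizes gamma_t = 1 / x_t, where x_t are the Chebyshev nodes
    (roots of the degree-T Chebyshev polynomial) mapped onto [lam_min, lam_max]."""
    t = np.arange(T)
    nodes = 0.5 * (lam_max + lam_min) + 0.5 * (lam_max - lam_min) * np.cos(
        (2 * t + 1) * np.pi / (2 * T)
    )
    return 1.0 / nodes

def gradient_descent(A, b, steps):
    """Gradient descent with a per-iteration step-size schedule."""
    x = np.zeros(A.shape[1])
    for gamma in steps:
        x = x - gamma * A.T @ (A @ x - b)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 50)) / np.sqrt(300)
x_true = rng.standard_normal(50)
b = A @ x_true

# Eigenvalue range of A^T A (eigvalsh returns eigenvalues in ascending order).
eigvals = np.linalg.eigvalsh(A.T @ A)
lam_min, lam_max = eigvals[0], eigvals[-1]

T = 25
x_cheb = gradient_descent(A, b, chebyshev_steps(lam_min, lam_max, T))
x_const = gradient_descent(A, b, np.full(T, 2.0 / (lam_min + lam_max)))

print("error with Chebyshev steps:", np.linalg.norm(x_cheb - x_true))
print("error with constant step  :", np.linalg.norm(x_const - x_true))

On such a well-conditioned instance, the non-constant Chebyshev schedule reduces the residual error noticeably faster than the best constant step size over the same number of iterations, which is the qualitative behavior the paper attributes to the learned step sizes in DUGD.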