Title
Formal guarantees for heuristic optimization algorithms used in machine learning
Authors
Abstract
Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for the large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic schedules that change the step size at each iteration. Momentum has also been widely employed in ML tasks to accelerate the training process. Yet there is a gap in our theoretical understanding of these methods. In this work, we start to close this gap by providing formal guarantees for a few heuristic optimization methods and by proposing improved algorithms.

First, we analyze a generalized version of the AdaGrad step sizes (Delayed AdaGrad) in both the convex and non-convex settings, showing that these step sizes allow the algorithms to adapt automatically to the noise level of the stochastic gradients. We give, for the first time, sufficient conditions for Delayed AdaGrad to achieve almost-sure convergence of the gradients to zero. Moreover, we present a high-probability analysis of Delayed AdaGrad and its momentum variant in the non-convex setting.

Second, we analyze SGD with exponential and cosine step sizes, which are empirically successful but lack theoretical support. We provide the very first convergence guarantees for them in the smooth, non-convex setting, with and without the Polyak-Łojasiewicz (PL) condition, and we show that they also adapt to the noise level under the PL condition.

Third, we study the last iterate of momentum methods. We prove the first lower bound in the convex setting for the last iterate of SGD with constant momentum. Moreover, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms with increasing momentum and shrinking updates, and we show that their last iterate achieves optimal convergence for unconstrained convex stochastic optimization problems.
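To make the step-size rules mentioned in the abstract concrete, below is a minimal sketch in Python, assuming the standard textbook forms of these schedules: an exponential step size eta_t = eta_0 * alpha^t, a cosine step size eta_t = eta_0 * (1 + cos(pi * t / T)) / 2, and a Delayed-AdaGrad-style global step size that divides a constant by the square root of the sum of past squared gradient norms, excluding the current gradient. The function names and constants are illustrative and are not taken from the thesis.

```python
import numpy as np

# Illustrative step-size schedules for SGD (assumed standard forms; constants are arbitrary).

def exponential_step(eta0, alpha, t):
    """Exponential step size: eta_t = eta0 * alpha**t, with 0 < alpha < 1."""
    return eta0 * alpha ** t

def cosine_step(eta0, t, T):
    """Cosine step size: eta_t = eta0 * (1 + cos(pi * t / T)) / 2."""
    return eta0 * (1 + np.cos(np.pi * t / T)) / 2

def delayed_adagrad_step(c, b, past_sq_norms):
    """Delayed-AdaGrad-style global step size:
    eta_t = c / sqrt(b**2 + sum of squared gradient norms up to iteration t-1).
    The current gradient is not included, hence 'delayed'."""
    return c / np.sqrt(b ** 2 + np.sum(past_sq_norms))

def sgd(grad, x0, T, step_rule):
    """Plain SGD loop: x_{t+1} = x_t - eta_t * g_t with the chosen step-size rule."""
    x = np.asarray(x0, dtype=float)
    past_sq_norms = []
    for t in range(T):
        g = grad(x)
        if step_rule == "exp":
            eta = exponential_step(1.0, 0.99, t)
        elif step_rule == "cosine":
            eta = cosine_step(1.0, t, T)
        else:  # "delayed_adagrad"
            eta = delayed_adagrad_step(1.0, 1.0, past_sq_norms)
        past_sq_norms.append(float(np.dot(g, g)))
        x = x - eta * g
    return x

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
if __name__ == "__main__":
    for rule in ("exp", "cosine", "delayed_adagrad"):
        print(rule, sgd(lambda x: x, x0=[5.0, -3.0], T=100, step_rule=rule))
```

The sketch only illustrates how the schedules set eta_t; it does not reproduce the momentum variants, the high-probability analysis, or the FTRL-based algorithms studied in the thesis.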