Paper Title
A Sharp Convergence Rate for the Asynchronous Stochastic Gradient Descent
Paper Authors
Paper Abstract
We give a sharp convergence rate for the asynchronous stochastic gradient descent (ASGD) algorithm when the loss function is a perturbed quadratic function, based on the stochastic modified equations introduced in [An et al., Stochastic modified equations for the asynchronous stochastic gradient descent, arXiv:1805.08244]. We prove that when the number of local workers is larger than the expected staleness, ASGD is more efficient than stochastic gradient descent. Our theoretical result also suggests that longer delays lead to slower convergence rates. In addition, the learning rate cannot be smaller than a threshold inversely proportional to the expected staleness.
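To make the setting described in the abstract concrete, below is a minimal, hypothetical simulation of ASGD on a simple quadratic loss with noisy, stale gradients. The loss f(x) = 0.5·||x||², the uniform delay model, and all parameter values (dim, tau_max, lr, noise_std) are illustrative assumptions for this sketch, not the exact setting or analysis of the paper.

```python
import numpy as np

# Illustrative ASGD simulation: the server applies stochastic gradients that
# were computed at stale (delayed) iterates, mimicking asynchronous workers.
# All quantities below are assumed for the sake of the example.

rng = np.random.default_rng(0)

dim = 10            # problem dimension
num_steps = 2000    # number of asynchronous updates applied by the server
tau_max = 8         # maximum staleness; expected staleness is roughly tau_max / 2
lr = 0.05           # learning rate (per the abstract, it cannot be too small
                    # relative to the expected staleness)
noise_std = 0.1     # standard deviation of the stochastic gradient noise

x = rng.normal(size=dim)   # current iterate held by the parameter server
history = [x.copy()]       # past iterates, used to form stale gradients

for step in range(num_steps):
    # A worker's gradient was computed at a stale iterate x_{t - tau},
    # where the delay tau is random (the "staleness" in the abstract).
    tau = rng.integers(0, min(tau_max, len(history) - 1) + 1)
    stale_x = history[-1 - tau]

    # Stochastic gradient of the quadratic loss 0.5 * ||x||^2 at the stale point.
    grad = stale_x + noise_std * rng.normal(size=dim)

    # The server applies the stale gradient immediately (asynchronous update).
    x = x - lr * grad
    history.append(x.copy())

print("final squared distance to optimum:", float(np.dot(x, x)))
```

Increasing tau_max in this toy simulation slows the decay of the squared distance to the optimum, which is consistent with the abstract's claim that longer delays result in slower convergence.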