Paper Title
Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise
Paper Authors
Paper Abstract
We introduce a general framework for nonlinear stochastic gradient descent (SGD) for scenarios in which the gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, such as clipped, normalized, signed, or quantized gradients, but we also consider novel nonlinearity choices. We establish strong convergence guarantees for the considered class of methods, assuming a strongly convex cost function with Lipschitz continuous gradients, under very general assumptions on the gradient noise. Most notably, we show that, for a nonlinearity with bounded outputs and for gradient noise that may not have finite moments of order greater than one, the nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost function's optimality gap, converges to zero at rate~$O(1/t^\zeta)$, $\zeta \in (0,1)$. In contrast, for the same noise setting, the linear SGD generates a sequence with unbounded variances. Furthermore, for nonlinearities that can be decoupled component-wise, such as the sign gradient or component-wise clipping, we show that the nonlinear SGD asymptotically (locally) achieves an $O(1/t)$ rate in the weak convergence sense, and we explicitly quantify the corresponding asymptotic variance. Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state-of-the-art alternatives on real data sets with heavy-tail noise.
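The iteration the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the quadratic cost, the Pareto-tailed noise model, the step-size schedule $a_t = a_0/t$, and the specific clipping threshold are all illustrative assumptions, chosen only to show the nonlinear update $x_{t+1} = x_t - a_t \Psi(\nabla f(x_t) + n_t)$ with bounded-output choices of $\Psi$ next to the linear (identity) baseline.

```python
# A minimal, self-contained sketch (not the authors' code) of nonlinear SGD,
#     x_{t+1} = x_t - a_t * Psi( grad f(x_t) + noise_t ),
# with three bounded-output nonlinearities Psi (component-wise clipping, sign,
# normalization) and the identity baseline. Cost, noise model, and step sizes
# are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 10
A = np.diag(np.linspace(1.0, 5.0, d))  # f(x) = 0.5 * x^T A x, strongly convex
x0 = 2.0 * np.ones(d)                  # start away from the minimizer x* = 0

def grad(x):
    return A @ x

def heavy_tail_noise(size):
    # Symmetric Pareto-type noise with tail index ~1.05: finite first moment,
    # infinite variance -- mimicking "no finite moments of order > 1".
    return (rng.pareto(1.05, size) + 1.0) * rng.choice([-1.0, 1.0], size)

# Candidate nonlinearities Psi; all but "linear" have bounded outputs.
nonlinearities = {
    "clipped":    lambda g: np.clip(g, -1.0, 1.0),           # component-wise clipping
    "sign":       lambda g: np.sign(g),                       # sign gradient
    "normalized": lambda g: g / (np.linalg.norm(g) + 1e-12),  # normalized gradient
    "linear":     lambda g: g,                                # plain SGD baseline
}

def run(psi, T=10_000, a0=1.0):
    x = x0.copy()
    for t in range(1, T + 1):
        x = x - (a0 / t) * psi(grad(x) + heavy_tail_noise(d))
    return float(np.sum(x ** 2))  # squared distance to x* = 0

for name, psi in nonlinearities.items():
    errs = [run(psi) for _ in range(5)]
    print(f"{name:>10s}: squared errors over 5 runs: {np.round(errs, 4)}")
```

In runs of this sketch one would expect the three bounded nonlinearities to keep the squared error small and stable across repetitions, while the linear baseline's error fluctuates wildly from run to run, consistent with the abstract's claim that linear SGD has unbounded variances under such noise.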