Paper Title
Gradient Methods Never Overfit On Separable Data
Paper Authors
Paper Abstract
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for $T$ iterations on a dataset of size $m$, both the empirical risk and the generalization error decrease at an essentially optimal rate of $\tilde{\mathcal{O}}(1/\gamma^2 T)$ up until $T\approx m$, at which point the generalization error remains fixed at an essentially optimal level of $\tilde{\mathcal{O}}(1/\gamma^2 m)$ regardless of how large $T$ is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.
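The setting described in the abstract can be illustrated with a small simulation: gradient descent on the empirical logistic loss (an exponentially-tailed loss) over linearly separable data drives the empirical risk down, and the normalized iterate separates the data rather than overfitting. This is a minimal sketch, not the paper's experimental setup; the data construction, the margin threshold `0.5`, the step size `eta`, and the name `w_star` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with a guaranteed margin along an
# (illustrative) ground-truth direction w_star.
d = 5
w_star = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(200, d))
X = X[np.abs(X @ w_star) > 0.5]   # enforce margin >= 0.5 w.r.t. w_star
y = np.sign(X @ w_star)

def empirical_risk(w):
    """Empirical risk under the logistic loss log(1 + exp(-y <w, x>))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

# Plain gradient descent on the empirical logistic risk.
w = np.zeros(d)
eta = 0.5
risks = []
for _ in range(2000):
    margins = y * (X @ w)
    # Gradient of the logistic loss at margin z is -y * x / (1 + e^z).
    coeff = 1.0 / (1.0 + np.exp(margins))
    grad = -((coeff * y)[:, None] * X).mean(axis=0)
    w -= eta * grad
    risks.append(empirical_risk(w))

# The empirical risk decreases, and the normalized iterate attains a
# positive margin on every training example (no margin violations).
direction = w / np.linalg.norm(w)
print(risks[-1] < risks[0], np.min(y * (X @ direction)) > 0)
```

Note that the iterate's norm keeps growing with $T$ while its direction stabilizes, which is why running for more iterations does not hurt generalization in this regime.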