Paper Title

Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization

Paper Authors

Zikai Xiong, Robert M. Freund

Abstract

The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
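To make the abstract's idea concrete, here is a minimal sketch (not the authors' algorithm) of the Frank-Wolfe method for a least-squares empirical risk over the ℓ1 ball. Because the least-squares Hessian is constant, the gradient after each step can be updated exactly through its Taylor expansion, grad(x + γd) = grad(x) + γHd, instead of recomputing $A^\top(Ax - b)$ from the full data each iteration; this loosely illustrates how second-order structure can reduce per-iteration dependence on $n$. The problem sizes, radius `delta`, and the classic $2/(k+2)$ step size are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: Frank-Wolfe for the least-squares loss
#   f(x) = (1/2n) * ||Ax - b||^2
# over the l1 ball {x : ||x||_1 <= delta}. Sizes are arbitrary.
rng = np.random.default_rng(0)
n, d, delta = 200, 50, 5.0          # n observations, d features (assumed)
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

H = A.T @ A / n                     # constant Hessian of the quadratic loss
x = np.zeros(d)                     # feasible start (inside the l1 ball)
grad = A.T @ (A @ x - b) / n        # one exact gradient at the start

for k in range(200):
    # Linear minimization over the l1 ball: a signed, scaled basis vector.
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros(d)
    s[i] = -delta * np.sign(grad[i])
    step = 2.0 / (k + 2)            # classic open-loop FW step size
    direction = s - x
    x = x + step * direction
    # Taylor-expanded gradient update; exact here since f is quadratic.
    grad = grad + step * (H @ direction)

loss = 0.5 * np.mean((A @ x - b) ** 2)
```

The incremental gradient update costs $O(d^2)$ per iteration versus $O(nd)$ for recomputing it from the data, which is the flavor of savings the abstract describes; for non-quadratic losses such as logistic loss, the Taylor-expanded gradient is only an approximation, and the paper's methods control that approximation error.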
