双重优化的两次计算框架：复杂性分析和对演员评论的应用

论文标题

双重优化的两次计算框架：复杂性分析和对演员评论的应用

A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

论文作者

Hong, Mingyi, Wai, Hoi-To, Wang, Zhaoran, Yang, Zhuoran

论文摘要

本文分析了两次计算的随机算法框架，以进行双光线优化。二杆优化是一类表现出两级结构的问题，其目标是将带有变量的外部目标函数最小化，该变量被限制为（内部）优化问题的最佳解决方案。我们考虑内部问题不受约束且强烈凸的情况，而外部问题受到约束并具有平稳的目标函数。我们提出了一个两次尺度的随机近似（TTSA）算法，以解决这种二聚体问题。在该算法中，内部问题使用具有较大步长的随机梯度更新，而投影的随机梯度更新具有较小的步骤大小，用于外部问题。我们在各种环境下分析TTSA算法的收敛速率：当外部问题强烈凸（分别〜〜弱凸）时，TTSA算法会发现$ \ Mathcal {o}（O}（k^{ - 2/3}）解决方案，其中$ k $是总迭代编号。作为一种应用，我们表明，两次自然参与者 - 批评近端优化算法可以看作是我们TTSA框架的特殊情况。重要的是，与全球最佳策略相比，自然参与者批评算法的汇总速率（k^{ - 1/4}）$在预期的折扣奖励方面汇总。

This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.

下载PDF全文

下载文献需遵守相关版权规定

论文标题