Paper Title
An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods
Paper Authors
Paper Abstract
In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance reduction into the NPG update. Our improvements follow from the observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG applies to NPG as well, and the global convergence analysis of NPG helps establish the global convergence of (variance-reduced) PG methods. Our analysis carefully integrates the advantages of these two lines of work. Thanks to this improvement, we also make variance reduction for NPG possible, with both global convergence and an efficient finite-sample complexity.
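For context, the three update rules discussed in the abstract can be summarized as follows. This is a hedged sketch in the standard notation of this literature, not taken verbatim from the paper: the step size \eta, mini-batch size b, gradient estimator g, and importance weight \omega are assumptions introduced here for illustration.

% PG: plain gradient ascent on the expected return J(\theta)
\[ \theta_{t+1} = \theta_t + \eta\, \widehat{\nabla} J(\theta_t) \]

% NPG: precondition the gradient by the Fisher information matrix of the policy,
% F(\theta) = \mathbb{E}\big[\nabla \log \pi_\theta(a \mid s)\, \nabla \log \pi_\theta(a \mid s)^{\top}\big],
% which the abstract assumes to be positive definite
\[ \theta_{t+1} = \theta_t + \eta\, F(\theta_t)^{-1}\, \widehat{\nabla} J(\theta_t) \]

% SRVR-style variance reduction: a recursive gradient estimator built from a
% mini-batch of b trajectories \tau_i, where \omega(\tau_i) is an importance
% weight correcting for the trajectories being sampled under \theta_t rather
% than the previous iterate \theta_{t-1}
\[ v_t = v_{t-1} + \frac{1}{b} \sum_{i=1}^{b} \Big( g(\tau_i \mid \theta_t) - \omega(\tau_i)\, g(\tau_i \mid \theta_{t-1}) \Big) \]

SRVR-NPG, as proposed in the paper, combines the latter two ideas: the recursive estimator v_t stands in for \widehat{\nabla} J(\theta_t) inside the Fisher-preconditioned update.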