Paper Title
Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization
Paper Authors
Paper Abstract
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization -- an algorithmic scheme that encourages exploration -- and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain limited even for the tabular setting. This paper develops $\textit{non-asymptotic}$ convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly -- or even quadratically once it enters a local region around the optimal policy -- when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
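The algorithm the abstract refers to is the entropy-regularized NPG update under softmax parameterization, which with exact policy evaluation reduces to a soft-policy-iteration-style multiplicative-weights update. The following is a minimal, hypothetical Python sketch on a randomly generated tabular MDP; the state/action counts, discount factor, entropy weight τ, learning rate η, and iteration counts are illustrative assumptions rather than values taken from the paper, and the update formula is the multiplicative-weights form commonly associated with this method.

```python
import numpy as np

# Minimal sketch: entropy-regularized NPG (soft-policy-iteration-style update)
# on a small random tabular MDP with exact (fixed-point) policy evaluation.
# All concrete values below are illustrative assumptions, not from the paper.

rng = np.random.default_rng(0)
S, A = 5, 3                        # number of states / actions (assumed)
gamma, tau, eta = 0.9, 0.1, 0.5    # discount, entropy weight, learning rate
                                   # (eta chosen with eta <= (1 - gamma) / tau)

P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                 # reward r(s, a)

def soft_policy_evaluation(pi, num_sweeps=2000):
    """Evaluate the entropy-regularized value function
    V_tau^pi(s) = E[ sum_t gamma^t ( r(s_t, a_t) - tau * log pi(a_t | s_t) ) ]
    by iterating the (gamma-contractive) soft Bellman evaluation operator."""
    V = np.zeros(S)
    for _ in range(num_sweeps):
        Q = r + gamma * P @ V                            # Q_tau^pi(s, a)
        V = np.sum(pi * (Q - tau * np.log(pi)), axis=1)  # soft state value
    return Q, V

# Multiplicative-weights form of the entropy-regularized NPG update under
# softmax parameterization (up to notation):
#   pi_{t+1}(a|s) ∝ pi_t(a|s)^{1 - eta*tau/(1-gamma)} * exp(eta * Q_tau^t(s,a) / (1-gamma))
pi = np.full((S, A), 1.0 / A)          # start from the uniform policy
alpha = 1.0 - eta * tau / (1.0 - gamma)
for t in range(200):
    Q, V = soft_policy_evaluation(pi)
    logits = alpha * np.log(pi) + eta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

print("regularized value estimate per state:", V)
```

In this sketch the policy evaluation step is exact (up to the fixed-point tolerance), matching the abstract's "access to exact policy evaluation" regime; replacing `soft_policy_evaluation` with a noisy estimator corresponds to the inexact setting whose stability the paper also analyzes.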