Paper Title
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee
Paper Authors
Paper Abstract
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and are therefore very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, in which each policy update can take any variant of a policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the globally optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform existing primal-dual baseline algorithms.
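As a concrete illustration of the alternating update described above, here is a minimal, self-contained sketch of the CRPO idea on a toy constrained bandit (a one-state problem with a single cost constraint), using the tabular-softmax form of the natural policy gradient step. The problem data, tolerance, step size, and variable names are illustrative assumptions introduced here, not values from the paper, and this is not the authors' implementation.

```python
# Sketch of CRPO on a toy constrained bandit with a tabular softmax policy.
# Assumptions (not from the paper): known expected rewards/costs, a single
# constraint, exact expectations instead of sampled estimates, constant step size.
import numpy as np

rewards = np.array([1.0, 0.8, 0.2])   # expected reward of each action
costs   = np.array([0.9, 0.3, 0.1])   # expected cost of each action
limit   = 0.4                          # constraint: E_pi[cost] <= limit
eta     = 0.05                         # tolerance on constraint violation
alpha   = 0.5                          # step size
T       = 200                          # number of CRPO iterations

theta = np.zeros(3)                    # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(T):
    pi = softmax(theta)
    expected_cost = float(pi @ costs)
    if expected_cost <= limit + eta:
        # Constraint (approximately) satisfied: take an NPG step on the reward.
        # For a one-state softmax policy, the NPG step adds the reward advantages.
        theta = theta + alpha * (rewards - float(pi @ rewards))
    else:
        # Constraint rectification: take the same kind of step to decrease the cost.
        theta = theta - alpha * (costs - expected_cost)

pi = softmax(theta)
print("final policy:", np.round(pi, 3))
print("expected reward:", round(float(pi @ rewards), 3))
print("expected cost:", round(float(pi @ costs), 3), "(limit:", limit, ")")
```

In the paper the costs and advantages are estimated from samples and the NPG step is taken in a function-approximation setting; the exact expectations and constant step size above are simplifications chosen to keep the sketch short and runnable.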