Paper Title
Constrained Variational Policy Optimization for Safe Reinforcement Learning
Paper Authors
Paper Abstract
Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying them to safety-critical applications. Previous primal-dual style approaches suffer from instability issues and lack optimality guarantees. This paper overcomes these issues from the perspective of probabilistic inference. We introduce a novel Expectation-Maximization approach to naturally incorporate constraints during policy learning: 1) a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step); 2) the policy parameters are improved within a trust region based on the optimal variational distribution (M-step). The proposed algorithm decomposes the safe RL problem into a convex optimization phase and a supervised learning phase, which yields more stable training performance. A wide range of experiments on continuous robotic tasks shows that the proposed method achieves significantly better constraint satisfaction and sample efficiency than baselines. The code is available at https://github.com/liuzuxin/cvpo-safe-rl.
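To make the E-step/M-step decomposition described above more concrete, below is a minimal NumPy/SciPy sketch of what such an E-step could look like on a toy batch of sampled Q-values. It is based only on the high-level description in the abstract, not on the released code at the repository above: the specific dual objective, the variable names (`e_step`, `eps_kl`, `cost_limit`), and the per-state cost threshold are all illustrative assumptions.

```python
# Illustrative sketch of an E-step for a constrained EM-style policy update.
# All names and the exact dual form are assumptions, not the authors' code.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy batch: N states, K candidate actions sampled from the current policy,
# with a reward critic q_r and a cost critic q_c evaluated on each (s, a).
N, K = 32, 16
q_r = rng.normal(size=(N, K))          # reward Q-values
q_c = rng.uniform(0, 2, size=(N, K))   # cost Q-values (non-negative)

eps_kl = 0.1      # KL radius of the E-step trust region (assumed value)
cost_limit = 1.0  # expected-cost threshold (assumed value)

def dual(params):
    """Convex dual objective over the temperature eta and multiplier lam."""
    eta, lam = params
    # Trade off reward against the cost constraint for each sampled action.
    adv = (q_r - lam * q_c) / eta
    # Stable log-mean-exp over the K sampled actions per state.
    m = adv.max(axis=1, keepdims=True)
    logmean = np.log(np.mean(np.exp(adv - m), axis=1)) + m.squeeze(1)
    return eta * eps_kl + lam * cost_limit + eta * np.mean(logmean)

def e_step():
    """Solve the convex dual, then form closed-form non-parametric weights."""
    res = minimize(dual, x0=np.array([1.0, 1.0]),
                   bounds=[(1e-6, None), (0.0, None)], method="L-BFGS-B")
    eta, lam = res.x
    logits = (q_r - lam * q_c) / eta
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)   # variational weights per state

weights = e_step()

# M-step (sketch): the parametric policy would then be updated by weighted
# maximum likelihood on the sampled actions, subject to a KL trust region,
# i.e., a supervised-learning problem; omitted here for brevity.
print("E-step weight rows sum to:", weights.sum(axis=1)[:3])
```

In this reading, the E-step reduces to a small convex problem over two scalars (a temperature and a constraint multiplier), after which the variational distribution is available in closed form as per-state weights, and the M-step becomes a weighted supervised-learning update of the policy parameters.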