Paper Title
Optimistic Distributionally Robust Policy Optimization
Paper Authors
Paper Abstract
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), as widely employed policy-based reinforcement learning (RL) methods, are prone to converging to sub-optimal solutions because they limit the policy representation to a particular parametric distribution class. To address this issue, we develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRPO) algorithm, which effectively utilizes the Optimistic Distributionally Robust Optimization (DRO) approach to solve the trust-region-constrained optimization problem without parameterizing the policies. Our algorithm improves upon TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability. Moreover, it achieves a globally optimal policy update, which is not guaranteed by prevailing policy-based RL algorithms. Experiments across tabular domains and robotic locomotion tasks demonstrate the effectiveness of our approach.
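To make the idea of a non-parametric trust-region update concrete, below is a minimal sketch, not the authors' implementation: it shows the closed-form, KL-constrained policy improvement step for a single state in a tabular setting, where the old policy is re-weighted by exponentiated advantages. The function name `kl_trust_region_update` and the symbols `pi_old`, `advantages`, and `tau` are illustrative assumptions; the ODRPO-specific DRO machinery described in the paper is not reproduced here.

```python
# Hedged sketch (assumed tabular setting), not the paper's ODRPO update.
# The KL-trust-region improvement step for one state has a closed form:
# re-weight the old action distribution by exponentiated advantages, with
# a temperature `tau` that controls the effective trust-region size.
import numpy as np

def kl_trust_region_update(pi_old: np.ndarray,
                           advantages: np.ndarray,
                           tau: float = 1.0) -> np.ndarray:
    """Return an updated action distribution for a single state.

    pi_old     : shape (num_actions,), current policy probabilities.
    advantages : shape (num_actions,), estimated advantages A(s, a).
    tau        : temperature; a larger tau keeps the update closer to
                 pi_old, i.e., a tighter effective KL trust region.
    """
    # Exponentiated-advantage re-weighting in log space;
    # subtract the max logit for numerical stability.
    logits = np.log(pi_old + 1e-12) + advantages / tau
    logits -= logits.max()
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Example: a 3-action state where action 1 has the highest advantage,
# so its probability increases while the update stays close to pi_old.
pi_old = np.array([0.5, 0.3, 0.2])
advantages = np.array([-0.1, 0.4, 0.0])
print(kl_trust_region_update(pi_old, advantages, tau=0.5))
```

Because the update is computed per state directly over the action simplex rather than through a parametric policy network, each step is a global optimum of its constrained sub-problem, which is the property the abstract contrasts with parametric TRPO/PPO updates.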