Paper Title

Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Paper Author

Zhang, Shenao

Paper Abstract

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing the complexity measure of the model. However, the complexity might grow exponentially for the simplest nonlinear models, where global convergence is impossible within a finite number of iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model that the current policy is greedily optimized upon will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of the model value. Without harmful sampling procedures, CDPO can still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously. Empirical results also validate the exploration efficiency of CDPO.
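The abstract describes CDPO as a two-step update per iteration: a Referential Update that greedily optimizes the policy under a model sampled from the posterior, followed by a Conservative Update that maximizes the expected model value over the posterior. The following is a minimal Python sketch of that structure, written only from the abstract's description; the helper names (`sample_model`, `argmax_policy`, `value`, `expectation`) are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of one CDPO iteration, inferred from the abstract.
# Every helper passed in (sample_model, argmax_policy, value, expectation)
# is a hypothetical placeholder, not the authors' code.

def cdpo_iteration(policy, posterior, sample_model, argmax_policy, value, expectation):
    # Referential Update: greedily optimize the policy under a single
    # reference model drawn from the posterior (imitating the PSRL mechanism).
    reference_model = sample_model(posterior)
    reference_policy = argmax_policy(
        objective=lambda pi: value(pi, reference_model),
        init=policy,
    )

    # Conservative Update: maximize the *expected* model value over the
    # posterior, starting from the reference policy, which bounds the
    # randomness of the update.
    new_policy = argmax_policy(
        objective=lambda pi: expectation(posterior, lambda model: value(pi, model)),
        init=reference_policy,
    )
    return new_policy
```

Per the abstract, the conservative step averages the value over the model posterior instead of committing to a single sampled model, which is what tempers the aggressive updates and over-exploration attributed to plain PSRL-style greedy optimization.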
