Paper Title
BRPO: Batch Residual Policy Optimization
Paper Authors
Paper Abstract
In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by no more than a fixed maximum degree that is the same at every state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, high-confidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves state-of-the-art performance on a number of tasks.
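As an illustrative sketch only (the exact parameterization and performance bound used in the paper may differ), a residual policy with a state-action-dependent deviation can be written as a multiplicative correction to the behavior policy $\pi_b$, with a per-state budget replacing the single global constraint:

\[
\pi_\theta(a \mid s) \;\propto\; \pi_b(a \mid s)\,\exp\!\big(\Delta_\theta(s,a)\big),
\qquad
D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_b(\cdot \mid s)\big) \le \epsilon(s) \;\; \forall s,
\]

where the correction $\Delta_\theta(s,a)$ and the budget $\epsilon(s)$ (both hypothetical notation here) are learned jointly so as to maximize a lower bound on policy performance, rather than fixing one deviation constant for every state.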