Paper Title
Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Paper Authors
Paper Abstract
Compared to their on-policy counterparts, off-policy model-free deep reinforcement learning algorithms can improve data efficiency by repeatedly reusing previously gathered data. However, off-policy learning becomes challenging as the discrepancy between the underlying distributions of the agent's current policy and the collected data grows. Although well-studied importance sampling and off-policy policy gradient techniques have been proposed to compensate for this discrepancy, they usually require collections of long trajectories and introduce additional problems, such as vanishing or exploding gradients or the discarding of many useful experiences, which ultimately increases computational complexity. Moreover, their generalization to continuous action domains or to policies approximated by deterministic deep neural networks is strictly limited. To overcome these limitations, we introduce a novel policy similarity measure that mitigates the effects of this discrepancy in continuous control. Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks. Theoretical and empirical studies demonstrate that it achieves "safe" off-policy learning and substantially improves upon the state of the art, attaining higher returns in fewer steps than competing methods through an effective schedule of the learning rate in Q-learning and policy optimization.
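To make the idea of a single-step off-policy correction for deterministic policies concrete, here is a minimal, hypothetical sketch in Python. It assumes a Gaussian-kernel similarity between the action the current deterministic policy would take and the action stored in the replay buffer, and uses that weight to scale a one-step TD update of Q. The kernel choice, the names `policy_similarity` and `corrected_td_increment`, and all hyperparameters are illustrative assumptions and do not reflect the paper's actual similarity measure.

```python
import numpy as np


def policy_similarity(policy_action, behavior_action, sigma=0.5):
    """Kernel similarity in [0, 1]: 1.0 when the current deterministic policy
    reproduces the stored behavior action, decaying as the actions diverge.
    (Hypothetical Gaussian kernel; not the paper's actual measure.)"""
    diff = np.asarray(policy_action, dtype=float) - np.asarray(behavior_action, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)))


def corrected_td_increment(q, q_target, policy, transition, gamma=0.99, lr=1e-3):
    """One-step TD increment for Q(s, a), scaled by the similarity weight so
    that transitions unlikely under the current policy contribute less."""
    s, a, r, s_next, done = transition
    a_next = policy(s_next)                       # deterministic policy action
    target = r + (1.0 - done) * gamma * q_target(s_next, a_next)
    weight = policy_similarity(policy(s), a)      # single-step off-policy correction
    td_error = target - q(s, a)
    return lr * weight * td_error


# Toy usage with 1-D state/action stand-ins for the function approximators.
if __name__ == "__main__":
    q = lambda s, a: 0.0
    q_target = lambda s, a: 0.0
    policy = lambda s: np.array([0.1 * s])
    transition = (1.0, np.array([0.3]), 1.0, 2.0, 0.0)
    print(corrected_td_increment(q, q_target, policy, transition))
```

Because the weight in this sketch depends only on the single stored transition, no trajectory-level importance ratios are required, which is the property the abstract contrasts with trajectory-based importance sampling approaches.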