Paper Title
The Effect of Multi-step Methods on Overestimation in Deep Reinforcement Learning
Paper Authors
Paper Abstract
Multi-step (also called n-step) methods in reinforcement learning (RL) have been shown, both theoretically and empirically, to be more efficient than 1-step methods in tasks using a tabular representation of the value function, owing to faster propagation of the reward signal. Recently, research in Deep Reinforcement Learning (DRL) has also shown that multi-step methods improve learning speed and final performance in applications where the value function and policy are represented with deep neural networks. However, there is a lack of understanding of what actually contributes to this performance boost. In this work, we analyze the effect of multi-step methods on alleviating the overestimation problem in DRL, where multi-step experiences are sampled from a replay buffer. Specifically, building on top of Deep Deterministic Policy Gradient (DDPG), we propose Multi-step DDPG (MDDPG), in which different step sizes are manually set, and its variant Mixed Multi-step DDPG (MMDDPG), in which an average over different multi-step backups is used as the update target of the Q-value function. Empirically, we show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup, which consequently leads to better final performance and faster learning. We also discuss the advantages and disadvantages of different ways of performing multi-step expansion to reduce approximation error, and expose the tradeoff between overestimation and underestimation that underlies offline multi-step methods. Finally, we compare the computational resource needs of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art algorithm proposed to address overestimation in actor-critic methods, with those of our proposed methods, since they show comparable final performance and learning speed.
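As a minimal sketch of the backup targets described above (the notation below, including the target networks Q' and \mu', the discount \gamma, and the maximum step size N, is assumed for illustration and is not taken verbatim from the paper), an n-step target for a DDPG-style critic is typically written as

    y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \, r_{t+k} + \gamma^n \, Q'\big(s_{t+n}, \mu'(s_{t+n})\big),

and a mixed target that averages backups of different lengths, in the spirit of the MMDDPG description, as

    y_t^{\mathrm{mix}} = \frac{1}{N} \sum_{n=1}^{N} y_t^{(n)}.

Here the rewards r_{t}, \dots, r_{t+n-1} and the state s_{t+n} would come from an n-step segment sampled from the replay buffer; the paper body, not this sketch, defines the exact formulation used in the experiments.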