Paper title

Expected Eligibility Traces

Authors

Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, André Barreto, Diana Borsa

Abstract

The question of how to determine which states and actions are responsible for a certain outcome is known as the credit assignment problem and remains a central research question in reinforcement learning and artificial intelligence. Eligibility traces enable efficient credit assignment to the recent sequence of states and actions experienced by the agent, but not to counterfactual sequences that could also have led to the current state. In this work, we introduce expected eligibility traces. Expected traces make it possible, with a single update, to update states and actions that could have preceded the current state, even if they did not do so on this occasion. We discuss when expected traces provide benefits over classic (instantaneous) traces in temporal-difference learning, and show that sometimes substantial improvements can be attained. We provide a way to smoothly interpolate between instantaneous and expected traces by a mechanism similar to bootstrapping, which ensures that the resulting algorithm is a strict generalisation of TD($\lambda$). Finally, we discuss possible extensions and connections to related ideas, such as successor features.
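
To make the contrast between instantaneous and expected traces concrete, here is a minimal tabular sketch in Python. The abstract does not state the update rules, so everything below is an assumption made for illustration: the toy chain environment, the function name `td_chain`, the step sizes `alpha` and `beta`, and the choice to estimate the expected trace as a running average `z[s]` of the instantaneous trace observed whenever state `s` is visited. The interpolation mechanism mentioned in the abstract is not covered.

```python
import numpy as np


def td_chain(n_states=5, episodes=2000, alpha=0.1, beta=0.1,
             gamma=1.0, lam=0.9, use_expected=False, seed=0):
    """Tabular TD(lambda) on a toy chain, with instantaneous or expected traces.

    Hypothetical environment (not from the paper): each episode starts in a
    uniformly random state, moves one step right per time step, and receives
    reward 1 on leaving the last state. With gamma=1 every true value is 1.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)                # value estimates
    z = np.zeros((n_states, n_states))    # z[s]: running average of e_t seen at S_t = s
    for _ in range(episodes):
        e = np.zeros(n_states)            # instantaneous (accumulating) trace
        s = int(rng.integers(n_states))
        while s < n_states:
            # Decay the trace and mark the current state as eligible.
            e *= gamma * lam
            e[s] += 1.0
            s_next = s + 1
            terminal = s_next == n_states
            reward = 1.0 if terminal else 0.0
            v_next = 0.0 if terminal else v[s_next]
            delta = reward + gamma * v_next - v[s]   # TD error
            # Assumed estimator: move z[s] toward the trace just observed.
            z[s] += beta * (e - z[s])
            # Credit either the states actually visited (e) or the states
            # that typically precede s (z[s]).
            trace = z[s] if use_expected else e
            v += alpha * delta * trace
            s = s_next
    return v


if __name__ == "__main__":
    print("instantaneous traces:", td_chain(use_expected=False))
    print("expected traces:     ", td_chain(use_expected=True))
```

In this sketch, `z[s]` averages traces over every episode that passed through `s`, so an episode that starts in the middle of the chain still adjusts the values of earlier states it never visited. That is the counterfactual credit assignment the abstract describes, under the assumptions listed above.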
