Paper Title


Perception-Prediction-Reaction Agents for Deep Reinforcement Learning

Paper Authors

Adam Stooke, Valentin Dalibard, Siddhant M. Jayakumar, Wojciech M. Czarnecki, Max Jaderberg

Paper Abstract


We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The reaction core incorporates new observations with input from the slow core to produce the agent's policy; the perception core accesses only short-term observations and informs the slow core; lastly, the prediction core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting Perception-Prediction-Reaction (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DMLab-30, particularly in tasks requiring long-term memory. We further show significant improvements in Capture the Flag, an environment requiring agents to acquire a complicated mixture of skills over long time scales. In a series of ablation experiments, we probe the importance of each component of the PPR agent, establishing that the entire, novel combination is necessary for this intriguing result.
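The information flow the abstract describes can be made concrete with a schematic sketch. This is not the paper's implementation: the recurrent cores are stand-in tanh cells rather than LSTMs, the slow-tick period and all names (`ppr_step`, `aux_loss`, `period=8`) are illustrative assumptions, and the auxiliary loss is sketched as a sum of KL divergences pulling the three policy heads together.

```python
import numpy as np

def cell(state, inputs, scale=0.1):
    # Stand-in recurrent cell (the paper uses LSTM cores).
    return np.tanh(scale * (state + sum(inputs)))

def ppr_step(t, obs, slow, reaction, perception, prediction, period=8):
    """One timestep of the (simplified) PPR information flow.

    - slow core: ticks only every `period` steps, reading the perception core,
      so information flows more easily over long time spans;
    - reaction core: sees the new observation AND the slow core -> policy;
    - perception core: sees only short-term observations, informs the slow core;
    - prediction core: sees only the slow core's long-term memory.
    """
    perception = cell(perception, [obs])
    if t % period == 0:                      # slow core ticks
        slow = cell(slow, [perception])
    reaction = cell(reaction, [obs, slow])   # recent + long-term information
    prediction = cell(prediction, [slow])    # long-term memory only
    return slow, reaction, perception, prediction

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aux_loss(pi_reaction, pi_perception, pi_prediction):
    # Regularize the three policy heads against each other, enacting the
    # prior that the policy should be expressible from either recent or
    # long-term memory alone. KL-based penalty is an assumption here.
    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))
    return kl(pi_reaction, pi_perception) + kl(pi_reaction, pi_prediction)

# Roll the hierarchy forward for a few steps on random observations.
d = 4
slow = reaction = perception = prediction = np.zeros(d)
rng = np.random.default_rng(0)
for t in range(16):
    obs = rng.normal(size=d)
    slow, reaction, perception, prediction = ppr_step(
        t, obs, slow, reaction, perception, prediction)

loss = aux_loss(softmax(reaction), softmax(perception), softmax(prediction))
print(loss)
```

During training, only the reaction core's policy acts in the environment; the perception and prediction heads exist to shape its representations through the auxiliary loss, which is why the asymmetry in what each core can see matters.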
