Paper Title
Perception-Prediction-Reaction Agents for Deep Reinforcement Learning
Paper Authors
Paper Abstract
We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The \emph{reaction} core incorporates new observations with input from the slow core to produce the agent's policy; the \emph{perception} core accesses only short-term observations and informs the slow core; lastly, the \emph{prediction} core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting \emph{Perception-Prediction-Reaction} (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DMLab-30, particularly in tasks requiring long-term memory. We further show significant improvements in Capture the Flag, an environment requiring agents to acquire a complicated mixture of skills over long time scales. In a series of ablation experiments, we probe the importance of each component of the PPR agent, establishing that the entire, novel combination is necessary for this intriguing result.
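Below is a minimal sketch, in PyTorch, of how the temporal hierarchy and auxiliary policy-regularization loss described in the abstract could be wired. The module names, layer sizes, slow-tick period (slow_period), and the use of a symmetric KL penalty are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PPRSketch(nn.Module):
    # Minimal sketch of a Perception-Prediction-Reaction style agent.
    # Layer sizes, the slow-tick period, and the exact wiring are
    # illustrative assumptions, not the published architecture.
    def __init__(self, obs_dim, hidden_dim, n_actions, slow_period=16):
        super().__init__()
        self.slow_period = slow_period
        # Slow-ticking core: long-term memory, updated every `slow_period` steps.
        self.slow = nn.LSTMCell(hidden_dim, hidden_dim)
        # Fast-ticking cores with asymmetric inputs.
        self.perception = nn.LSTMCell(obs_dim, hidden_dim)             # short-term observations only
        self.prediction = nn.LSTMCell(hidden_dim, hidden_dim)          # long-term (slow) memory only
        self.reaction = nn.LSTMCell(obs_dim + hidden_dim, hidden_dim)  # observations + slow memory
        # One policy head per core; the reaction head drives behaviour.
        self.pi = nn.ModuleDict({
            "perception": nn.Linear(hidden_dim, n_actions),
            "prediction": nn.Linear(hidden_dim, n_actions),
            "reaction": nn.Linear(hidden_dim, n_actions),
        })

    def forward(self, obs_seq, state):
        # obs_seq: [T, B, obs_dim]; state: dict mapping core name -> (h, c).
        logits = {name: [] for name in self.pi}
        for t in range(obs_seq.shape[0]):
            obs = obs_seq[t]
            slow_h = state["slow"][0]
            state["perception"] = self.perception(obs, state["perception"])
            state["prediction"] = self.prediction(slow_h, state["prediction"])
            state["reaction"] = self.reaction(torch.cat([obs, slow_h], dim=-1),
                                              state["reaction"])
            for name in self.pi:
                logits[name].append(self.pi[name](state[name][0]))
            # The slow core ticks infrequently, summarising the perception
            # core's short-term view into long-term memory.
            if (t + 1) % self.slow_period == 0:
                state["slow"] = self.slow(state["perception"][0], state["slow"])
        return {name: torch.stack(seq) for name, seq in logits.items()}, state


def ppr_auxiliary_loss(logits):
    # Symmetric KL between the behaviour (reaction) policy and the two
    # auxiliary policies, enacting the prior that the policy should be
    # expressible from recent or long-term memory alone.
    p = F.log_softmax(logits["reaction"].flatten(0, 1), dim=-1)
    loss = 0.0
    for name in ("perception", "prediction"):
        q = F.log_softmax(logits[name].flatten(0, 1), dim=-1)
        loss = loss + F.kl_div(q, p, log_target=True, reduction="batchmean")
        loss = loss + F.kl_div(p, q, log_target=True, reduction="batchmean")
    return loss

In training, a loss of this kind would be added, with some weight, to the usual actor-critic objective computed from the reaction core's policy; how gradients are stopped or shared between cores is a further design choice not specified in this sketch.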