Paper Title
Perception-Prediction-Reaction Agents for Deep Reinforcement Learning
Paper Authors
Paper Abstract
We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The \emph{reaction} core incorporates new observations with input from the slow core to produce the agent's policy; the \emph{perception} core accesses only short-term observations and informs the slow core; lastly, the \emph{prediction} core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting \emph{Perception-Prediction-Reaction} (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DMLab-30, particularly in tasks requiring long-term memory. We further show significant improvements in Capture the Flag, an environment requiring agents to acquire a complicated mixture of skills over long time scales. In a series of ablation experiments, we probe the importance of each component of the PPR agent, establishing that the entire, novel combination is necessary for this intriguing result.
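Below is a minimal sketch, in PyTorch, of how the temporal hierarchy and auxiliary policy-regularization loss described in the abstract could be wired. The module names, layer sizes, slow-tick period (slow_period), and the use of a symmetric KL penalty are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PPRSketch(nn.Module):
    # Minimal sketch of a Perception-Prediction-Reaction style agent.
    # Layer sizes, the slow-tick period, and the exact wiring are
    # illustrative assumptions, not the published architecture.
    def __init__(self, obs_dim, hidden_dim, n_actions, slow_period=16):
        super().__init__()
        self.slow_period = slow_period
        # Slow-ticking core: long-term memory, updated every `slow_period` steps.
        self.slow = nn.LSTMCell(hidden_dim, hidden_dim)
        # Fast-ticking cores with asymmetric inputs.
        self.perception = nn.LSTMCell(obs_dim, hidden_dim)             # short-term observations only
        self.prediction = nn.LSTMCell(hidden_dim, hidden_dim)          # long-term (slow) memory only
        self.reaction = nn.LSTMCell(obs_dim + hidden_dim, hidden_dim)  # observations + slow memory
        # One policy head per core; the reaction head drives behaviour.
        self.pi = nn.ModuleDict({
            "perception": nn.Linear(hidden_dim, n_actions),
            "prediction": nn.Linear(hidden_dim, n_actions),
            "reaction": nn.Linear(hidden_dim, n_actions),
        })

    def forward(self, obs_seq, state):
        # obs_seq: [T, B, obs_dim]; state: dict mapping core name -> (h, c).
        logits = {name: [] for name in self.pi}
        for t in range(obs_seq.shape[0]):
            obs = obs_seq[t]
            slow_h = state["slow"][0]
            state["perception"] = self.perception(obs, state["perception"])
            state["prediction"] = self.prediction(slow_h, state["prediction"])
            state["reaction"] = self.reaction(torch.cat([obs, slow_h], dim=-1),
                                              state["reaction"])
            for name in self.pi:
                logits[name].append(self.pi[name](state[name][0]))
            # The slow core ticks infrequently, summarising the perception
            # core's short-term view into long-term memory.
            if (t + 1) % self.slow_period == 0:
                state["slow"] = self.slow(state["perception"][0], state["slow"])
        return {name: torch.stack(seq) for name, seq in logits.items()}, state


def ppr_auxiliary_loss(logits):
    # Symmetric KL between the behaviour (reaction) policy and the two
    # auxiliary policies, enacting the prior that the policy should be
    # expressible from recent or long-term memory alone.
    p = F.log_softmax(logits["reaction"].flatten(0, 1), dim=-1)
    loss = 0.0
    for name in ("perception", "prediction"):
        q = F.log_softmax(logits[name].flatten(0, 1), dim=-1)
        loss = loss + F.kl_div(q, p, log_target=True, reduction="batchmean")
        loss = loss + F.kl_div(p, q, log_target=True, reduction="batchmean")
    return loss

In training, a loss of this kind would be added, with some weight, to the usual actor-critic objective computed from the reaction core's policy; how gradients are stopped or shared between cores is a further design choice not specified in this sketch.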