Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information

Paper Title

Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information

Authors

Riashat Islam, Manan Tomar, Alex Lamb, Yonathan Efroni, Hongyu Zang, Aniket Didolkar, Dipendra Misra, Xin Li, Harm van Seijen, Remi Tachet des Combes, John Langford

Abstract

Learning to control an agent from data collected offline in a rich pixel-based visual observation space is vital for real-world applications of reinforcement learning (RL). A major challenge in this setting is the presence of input information that is hard to model and irrelevant to controlling the agent. This problem has been approached by the theoretical RL community through the lens of exogenous information, i.e., any control-irrelevant information contained in observations. For example, a robot navigating in busy streets needs to ignore irrelevant information, such as other people walking in the background, textures of objects, or birds in the sky. In this paper, we focus on the setting with visually detailed exogenous information, and introduce new offline RL benchmarks offering the ability to study this problem. We find that contemporary representation learning techniques can fail on datasets where the noise is a complex and time-dependent process, which is prevalent in practical applications. To address this, we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO). Although this objective is simple and requires no reward, we show theoretically and empirically that the representation it creates greatly outperforms baselines.
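To make the idea of a multi-step inverse model concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common formulation in which the model predicts the action taken at time t from encodings of the observations at times t and t+k. The linear encoder, the linear prediction head, the conditioning on k, the uniform sampling of k, and every function and variable name are hypothetical choices made for illustration; the abstract states only that multi-step inverse models are used and that no reward is required.

```python
import math
import random

def encode(obs, W):
    """Hypothetical linear encoder phi: observation vector -> latent vector."""
    return [sum(w * x for w, x in zip(row, obs)) for row in W]

def inverse_logits(z_t, z_tk, k, V):
    """Hypothetical linear head scoring each action from (phi(x_t), phi(x_{t+k}), k)."""
    feats = z_t + z_tk + [float(k)]
    return [sum(v * f for v, f in zip(row, feats)) for row in V]

def cross_entropy(logits, action):
    """Negative log-probability of `action` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[action]

def multi_step_inverse_loss(observations, actions, W, V, max_k, rng):
    """Average loss for predicting a_t from x_t and x_{t+k}, k sampled from 1..max_k.

    Only observations and actions are used; no reward appears anywhere.
    """
    total, count = 0.0, 0
    for t in range(len(actions)):
        k = rng.randint(1, max_k)  # assumed: uniform sampling of the step gap
        if t + k >= len(observations):
            continue  # skip pairs that run past the end of the trajectory
        z_t = encode(observations[t], W)
        z_tk = encode(observations[t + k], W)
        total += cross_entropy(inverse_logits(z_t, z_tk, k, V), actions[t])
        count += 1
    return total / count

# Synthetic offline trajectory: T actions and T + 1 observations (random data).
rng = random.Random(0)
obs_dim, latent_dim, n_actions, T = 4, 2, 3, 20
observations = [[rng.gauss(0, 1) for _ in range(obs_dim)] for _ in range(T + 1)]
actions = [rng.randrange(n_actions) for _ in range(T)]
W = [[rng.gauss(0, 0.1) for _ in range(obs_dim)] for _ in range(latent_dim)]
V = [[rng.gauss(0, 0.1) for _ in range(2 * latent_dim + 1)] for _ in range(n_actions)]
loss = multi_step_inverse_loss(observations, actions, W, V, max_k=3, rng=rng)
```

The sketch only evaluates the loss on random data. In practice the encoder would presumably be a neural network over pixels, trained by gradient descent on this loss, with the learned encoder then reused by an offline RL algorithm.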
