Paper Title

Safe Deep RL in 3D Environments using Human Feedback

Paper Authors

Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike

Paper Abstract

Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it has been unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans: whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.
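The pipeline the abstract describes can be sketched in miniature. Everything below is an illustrative assumption rather than the paper's implementation: a 1-D toy world stands in for the 3D environment, a lookup table stands in for the pixel-based neural simulator, and `human_policy`/`human_label` are hypothetical stand-ins for the human contractors. The key idea it illustrates is that the reward model is learned entirely against the learned simulator, so the agent can be penalised for unsafe states before it ever visits them in the real environment.

```python
# Hypothetical minimal sketch of the ReQueST-style pipeline, not the
# paper's implementation. Toy 1-D world: states >= UNSAFE are unsafe.
import random

random.seed(0)
UNSAFE = 5

def human_policy(s):
    # The demonstrator never enters the unsafe region (s >= UNSAFE).
    return -1 if s >= UNSAFE - 1 else random.choice([-1, 1])

def true_step(s, a):
    return s + a

# 1. Collect safe human trajectories in the real environment.
trajectories = []
for _ in range(50):
    s, traj = 0, []
    for _ in range(20):
        a = human_policy(s)
        s2 = true_step(s, a)
        traj.append((s, a, s2))
        s = s2
    trajectories.append(traj)

# 2. "Learn" a simulator from the safe demos: a lookup table here,
#    standing in for the paper's pixel-based neural dynamics model.
sim = {(s, a): s2 for traj in trajectories for s, a, s2 in traj}

# 3. Learn a reward model from human feedback on simulator rollouts:
#    the labeller penalises unsafe states the agent has never visited.
def human_label(s):
    return -1 if s >= UNSAFE else (1 if s > 0 else 0)

reward_model = {s2: human_label(s2) for s2 in sim.values()}

# 4. Act greedily against the learned simulator + reward model, so
#    unsafe states are avoided from the first real-environment step.
def plan(s):
    options = [(a, sim[(s, a)]) for a in (-1, 1) if (s, a) in sim]
    if not options:
        return -1  # conservative fallback where the model is uncertain
    return max(options, key=lambda o: reward_model[o[1]])[0]

s, visited = 0, []
for _ in range(30):
    s = true_step(s, plan(s))
    visited.append(s)
# The planner keeps the agent inside the demonstrated safe region.
```

Because the demonstrations never enter the unsafe region, the learned simulator contains no transition into it, and the reward model labels any such hypothetical state negatively; both mechanisms keep the resulting policy safe, mirroring the order-of-magnitude safety improvement the paper reports at toy scale.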
