Paper Title

Quantifying Differences in Reward Functions

Paper Authors

Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

Paper Abstract

For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment, but the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.
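A rough Monte Carlo sketch of a sample-based reward comparison in the spirit of EPIC is given below: each reward is approximately canonicalized so that potential shaping cancels out, and the two canonicalized rewards are then compared via a Pearson distance over samples from a coverage distribution of transitions. The helper names, toy reward functions, and coverage distribution are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
# Illustrative sketch only: names and toy rewards are hypothetical, not the
# authors' implementation from https://github.com/HumanCompatibleAI/evaluating-rewards.
import numpy as np


def canonicalize(reward, states, actions, next_states, gamma, rng, n_mean=128):
    """Approximately canonicalize a vectorized reward(s, a, s') function.

    Expectation terms are estimated with n_mean samples drawn from the same
    coverage distribution that produced states/actions/next_states.
    """
    mean_s = rng.choice(states, size=n_mean)
    mean_a = rng.choice(actions, size=n_mean)
    mean_sp = rng.choice(states, size=n_mean)

    def mean_reward_from(s_batch):
        # Estimate E_{A,S'}[reward(s, A, S')] for each state s in s_batch.
        return np.array(
            [reward(np.full(n_mean, s), mean_a, mean_sp).mean() for s in s_batch]
        )

    base = reward(states, actions, next_states)
    return (
        base
        + gamma * mean_reward_from(next_states)
        - mean_reward_from(states)
        - gamma * reward(mean_s, mean_a, mean_sp).mean()
    )


def epic_like_distance(reward_a, reward_b, states, actions, next_states, gamma, seed=0):
    """Pearson distance between canonicalized rewards on the coverage samples."""
    rng = np.random.default_rng(seed)
    ca = canonicalize(reward_a, states, actions, next_states, gamma, rng)
    cb = canonicalize(reward_b, states, actions, next_states, gamma, rng)
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 1-D state/action space; the coverage distribution is uniform on [-1, 1].
    states = rng.uniform(-1, 1, size=500)
    actions = rng.uniform(-1, 1, size=500)
    next_states = rng.uniform(-1, 1, size=500)
    gamma = 0.9

    base = lambda s, a, sp: -(s ** 2) - 0.1 * a ** 2
    # Potential shaping (Phi(s) = s^3) leaves the optimal policy unchanged, so the
    # distance to `base` should be near zero; an unrelated reward should not be.
    shaped = lambda s, a, sp: base(s, a, sp) + gamma * sp ** 3 - s ** 3
    unrelated = lambda s, a, sp: np.sin(5.0 * a)

    print("shaped vs base:   ", epic_like_distance(base, shaped, states, actions, next_states, gamma))
    print("unrelated vs base:", epic_like_distance(base, unrelated, states, actions, next_states, gamma))
```

In this toy run the shaped reward should come out at a distance near zero from the base reward, consistent with the invariance property described in the abstract, while the unrelated reward should score substantially higher.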
