Paper Title
Fast Adaptation via Policy-Dynamics Value Functions
Paper Authors
Paper Abstract
Standard RL algorithms assume fixed environment dynamics and require a significant amount of interaction to adapt to new environments. We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. PD-VF explicitly estimates the cumulative reward in a space of policies and environments. An ensemble of conventional RL policies is used to gather experience on training environments, from which embeddings of both policies and environments can be learned. Then, a value function conditioned on both embeddings is trained. At test time, a few actions are sufficient to infer the environment embedding, enabling a policy to be selected by maximizing the learned value function (which requires no additional environment interaction). We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains. Code available at https://github.com/rraileanu/policy-dynamics-value-functions.
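The following is a minimal sketch of the test-time selection step described in the abstract, assuming a trained value function over (policy embedding, environment embedding) pairs, an environment encoder that maps a few observed transitions to an environment embedding, and a set of candidate policy embeddings (e.g. from the training ensemble). All class and function names here are hypothetical, not the authors' API, and the discrete candidate search stands in for whatever optimization over the policy space the paper actually uses.

```python
# Hypothetical sketch of PD-VF-style test-time adaptation (not the authors' code).
import torch
import torch.nn as nn


class ValueFunction(nn.Module):
    """Estimates cumulative reward from a (policy embedding, environment embedding) pair."""

    def __init__(self, policy_dim: int, env_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(policy_dim + env_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z_pi: torch.Tensor, z_env: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z_pi, z_env], dim=-1)).squeeze(-1)


def infer_env_embedding(encoder: nn.Module, transitions: torch.Tensor) -> torch.Tensor:
    """Infer the environment embedding from features of a few observed transitions."""
    with torch.no_grad():
        return encoder(transitions)


def select_policy_embedding(
    value_fn: ValueFunction, z_env: torch.Tensor, candidate_z_pi: torch.Tensor
) -> torch.Tensor:
    """Pick the candidate policy embedding that maximizes the learned value function.

    The search happens entirely in embedding space, so no further environment
    interaction is needed once z_env has been inferred.
    """
    with torch.no_grad():
        values = value_fn(candidate_z_pi, z_env.expand(candidate_z_pi.size(0), -1))
    return candidate_z_pi[values.argmax()]


if __name__ == "__main__":
    policy_dim, env_dim, traj_feat = 8, 8, 16
    value_fn = ValueFunction(policy_dim, env_dim)
    env_encoder = nn.Sequential(nn.Linear(traj_feat, 32), nn.ReLU(), nn.Linear(32, env_dim))

    transitions = torch.randn(1, traj_feat)      # features from a few test-time actions
    z_env = infer_env_embedding(env_encoder, transitions)
    candidates = torch.randn(10, policy_dim)     # e.g. embeddings of the training ensemble
    best_z_pi = select_policy_embedding(value_fn, z_env, candidates)
    print(best_z_pi.shape)
```

In this sketch both the value function and the encoder would be trained on experience gathered by the policy ensemble in the training environments; only the few transitions used to infer `z_env` are collected at test time.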