Paper Title

Parameter-Based Value Functions

Authors

Francesco Faccio, Louis Kirsch, Jürgen Schmidhuber

Abstract

Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods.
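
To make the core idea concrete, the sketch below shows one possible (assumed, not the authors') PyTorch implementation of a parameter-based state-value function V(s, θ): a critic that takes a state concatenated with the flattened policy parameters, is fit by Monte Carlo regression onto observed returns, and is then reused for "zero-shot" policy improvement by gradient ascent on the critic with respect to the policy parameters. All names (`PBVF`, `policy_to_vector`, `improve_policy`) and hyperparameters here are hypothetical illustrations of the abstract's description.

```python
# Minimal sketch of a Parameter-Based Value Function (PBVF), assuming a
# state vector of size state_dim and a policy with n_policy_params weights.
import torch
import torch.nn as nn

class PBVF(nn.Module):
    """Critic V(s, theta): input is a state concatenated with policy parameters."""
    def __init__(self, state_dim, n_policy_params, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_policy_params, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, policy_params):
        # state: (batch, state_dim), policy_params: (batch, n_policy_params)
        return self.net(torch.cat([state, policy_params], dim=-1))

def policy_to_vector(policy):
    """Flatten all parameters of a policy network into one vector."""
    return torch.cat([p.detach().flatten() for p in policy.parameters()])

def mc_critic_loss(pbvf, s0, theta, observed_return):
    """Monte Carlo target: regress V(s0, theta) onto the return obtained
    by running the policy with parameters theta from state s0."""
    return nn.functional.mse_loss(pbvf(s0, theta), observed_return)

def improve_policy(pbvf, s0, theta, steps=100, lr=1e-2):
    """'Zero-shot' improvement: gradient ascent on the learned critic with
    respect to the policy parameters, without further environment steps."""
    theta = theta.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-pbvf(s0, theta).mean()).backward()
        opt.step()
    return theta.detach()
```

Because the critic generalizes across policies, the same learned V can score parameter vectors it was never trained on, which is what allows the gradient-ascent step above to propose new policies without additional interaction with the environment.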
