Model-Based Reinforcement Learning with Value-Targeted Regression

Paper title

Model-Based Reinforcement Learning with Value-Targeted Regression

Authors

Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin F. Yang

Abstract

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_\theta = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model-based RL algorithm built on the optimism principle: in each episode, the set of models that are 'consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting \emph{values}, as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, the total number of steps and the dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
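
The value-targeted regression step for the linear-mixture case can be illustrated with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it reads the consistency criterion as a ridge regression whose features are the expected next values under each basis model $P_i$ and whose targets are the observed next-state values. The toy basis models, the fixed value function `V`, the regularizer `lam`, and the radius `beta` are all hypothetical, and the sketch uses a single fixed `V` rather than the latest value estimate at each step; the optimistic planning step is omitted.

```python
import random

def vtr_features(basis, V, s, a):
    """x_i = sum_{s'} P_i(s'|s,a) * V(s'): expected next value under basis model i."""
    return [sum(p * v for p, v in zip(P[s][a], V)) for P in basis]

def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (A[r][n] - sum(A[r][k] * z[k] for k in range(r + 1, n))) / A[r][r]
    return z

def fit_theta(basis, V, transitions, lam=1.0):
    """Ridge regression of observed next-state values V(s') on the features x."""
    d = len(basis)
    M = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    w = [0.0] * d
    for s, a, s_next in transitions:
        x = vtr_features(basis, V, s, a)
        y = V[s_next]  # regression target: value of the observed next state
        for i in range(d):
            w[i] += x[i] * y
            for j in range(d):
                M[i][j] += x[i] * x[j]
    return solve(M, w), M

def in_confidence_set(theta, theta_hat, M, beta):
    """Check (theta - theta_hat)^T M (theta - theta_hat) <= beta (beta is hypothetical)."""
    diff = [t - th for t, th in zip(theta, theta_hat)]
    n = len(diff)
    quad = sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))
    return quad <= beta

# Toy problem with 2 states, 2 actions, d = 2 basis models (all numbers hypothetical).
# P[s][a] is the distribution over next states.
P1 = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
P2 = [[[0.1, 0.9], [0.7, 0.3]], [[0.4, 0.6], [0.8, 0.2]]]
theta_true = [0.3, 0.7]
V = [1.0, 3.0]

rng = random.Random(0)
transitions = []
for _ in range(4000):
    s, a = rng.randrange(2), rng.randrange(2)
    # Probability of landing in state 1 under the true mixture model.
    p_next1 = sum(t * P[s][a][1] for t, P in zip(theta_true, [P1, P2]))
    s_next = 1 if rng.random() < p_next1 else 0
    transitions.append((s, a, s_next))

theta_hat, M = fit_theta([P1, P2], V, transitions)
print("estimated theta:", theta_hat)
```

Because the regression only targets values, it needs $d$-dimensional features regardless of how many states or actions there are, which is consistent with the state- and action-independent regret bound stated in the abstract.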
