Paper Title

Model-Based Reinforcement Learning with Value-Targeted Regression

Paper Authors

Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin F. Yang

Paper Abstract

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_{\theta} = \sum_{i=1}^{d} \theta_{i} P_{i}$. We propose a model-based RL algorithm based on the optimism principle: in each episode, the set of models that are "consistent" with the data collected so far is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting \emph{values}, as determined by the last value estimate, along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$, and $d$ are the horizon, the total number of steps, and the dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound of $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
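To make the consistency criterion concrete, the following is a minimal sketch of the regression step for the linear-mixture case only. It is not the authors' implementation: the toy state space, the base models `BASE`, the fixed value estimate `V`, the ridge parameter `lam`, and all function names are assumptions introduced for illustration. Each transition $(s, a, s')$ contributes a feature vector $x_i = \sum_{s''} P_i(s'' \mid s, a) V(s'')$ and a target $V(s')$, and $\theta$ is estimated by ridge regression. The confidence set around the estimate and the optimistic planning step described in the abstract are omitted, and in the actual algorithm the value estimate changes over time rather than staying fixed.

```python
import random

# Hypothetical toy problem: 3 states, 2 actions, d = 2 base models.
N_STATES, ACTIONS = 3, (0, 1)

# BASE[i][(s, a)] is the next-state distribution P_i(. | s, a) of base model i.
BASE = [
    {(s, a): [0.8, 0.1, 0.1] if a == 0 else [0.1, 0.8, 0.1]
     for s in range(N_STATES) for a in ACTIONS},
    {(s, a): [0.1, 0.1, 0.8] if a == 0 else [0.3, 0.3, 0.4]
     for s in range(N_STATES) for a in ACTIONS},
]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def features(s, a, V):
    """x_i = expected value of V under base model i at (s, a)."""
    return [sum(p * v for p, v in zip(P[(s, a)], V)) for P in BASE]


def vtr_estimate(transitions, V, lam=1.0):
    """Ridge regression of the targets V(s') on the features x(s, a)."""
    d = len(BASE)
    M = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    w = [0.0] * d
    for s, a, s_next in transitions:
        x = features(s, a, V)
        y = V[s_next]  # value-targeted regression target
        for i in range(d):
            w[i] += x[i] * y
            for j in range(d):
                M[i][j] += x[i] * x[j]
    return solve(M, w)


# Demo: sample transitions from the true mixture and estimate theta.
random.seed(0)
THETA_TRUE = [0.3, 0.7]   # assumed ground truth for the toy problem
V = [0.0, 1.0, 2.0]       # stand-in for the latest value estimate


def sample_next(s, a):
    probs = [sum(t * P[(s, a)][k] for t, P in zip(THETA_TRUE, BASE))
             for k in range(N_STATES)]
    return random.choices(range(N_STATES), weights=probs)[0]


data, s = [], 0
for _ in range(2000):
    a = random.choice(ACTIONS)
    s_next = sample_next(s, a)
    data.append((s, a, s_next))
    s = s_next

theta_hat = vtr_estimate(data, V)
print(theta_hat)
```

Note that the regression only has to predict scalar values, so the size of the estimation problem is governed by $d$ rather than by the number of states or actions, which is in line with the abstract's remark that the regret bound is independent of those quantities.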
