Paper Title
From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
Paper Authors
Paper Abstract
We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$, where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, and $T$ the number of episodes, which matches the lower bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and on $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum, which may be of independent interest. We then explain how Bayes-UCBVI can easily be extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).
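To make the core mechanism concrete, below is a minimal, hypothetical Python sketch of the posterior-quantile idea described in the abstract, assuming a tabular setting where transition counts for a state-action pair are available. The function name, parameters, and the Monte-Carlo approximation of the quantile are our own illustration, not the paper's pseudocode.

```python
import numpy as np

def optimistic_q_upper_bound(reward, counts, next_values, quantile=0.99,
                             n_samples=1000, rng=None):
    """Illustrative sketch (not the paper's algorithm): an optimistic
    estimate of Q(s, a) at one stage, taken as a quantile of the posterior
    over the one-step backup r + p @ V.

    reward      -- observed mean reward r(s, a), a float
    counts      -- Dirichlet pseudo-counts over next states for (s, a),
                   shape (S,), e.g. prior pseudo-counts plus visit counts
    next_values -- current value estimates V at the next stage, shape (S,)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample next-state distributions p ~ Dirichlet(counts): the posterior
    # over transitions in a tabular MDP under a Dirichlet prior.
    p = rng.dirichlet(counts, size=n_samples)   # shape (n_samples, S)
    # Posterior samples of the one-step backup r + p @ V.
    backups = reward + p @ next_values          # shape (n_samples,)
    # The chosen posterior quantile serves as the optimistic Q-value,
    # playing the role of a bonus-free upper confidence bound.
    return np.quantile(backups, quantile)
```

Drawing weights from $\mathrm{Dirichlet}(1,\dots,1)$ over observed transitions, rather than from a posterior over a known state space, is exactly the Bayesian bootstrap of Rubin (1981), which hints at the link the abstract invokes for extending the method beyond the tabular setting.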