Paper Title
From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
Paper Authors
Paper Abstract
We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$, where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, and $T$ the number of episodes, which matches the lower bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and on $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum, which may be of independent interest. We then explain how Bayes-UCBVI can easily be extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).
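To make the core mechanism concrete, below is a minimal, hypothetical Python sketch of the posterior-quantile idea described in the abstract, assuming a tabular setting where transition counts for a state-action pair are available. The function name, parameters, and the Monte-Carlo approximation of the quantile are our own illustration, not the paper's pseudocode.

```python
import numpy as np

def optimistic_q_upper_bound(reward, counts, next_values, quantile=0.99,
                             n_samples=1000, rng=None):
    """Illustrative sketch (not the paper's algorithm): an optimistic
    estimate of Q(s, a) at one stage, taken as a quantile of the posterior
    over the one-step backup r + p @ V.

    reward      -- observed mean reward r(s, a), a float
    counts      -- Dirichlet pseudo-counts over next states for (s, a),
                   shape (S,), e.g. prior pseudo-counts plus visit counts
    next_values -- current value estimates V at the next stage, shape (S,)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample next-state distributions p ~ Dirichlet(counts): the posterior
    # over transitions in a tabular MDP under a Dirichlet prior.
    p = rng.dirichlet(counts, size=n_samples)   # shape (n_samples, S)
    # Posterior samples of the one-step backup r + p @ V.
    backups = reward + p @ next_values          # shape (n_samples,)
    # The chosen posterior quantile serves as the optimistic Q-value,
    # playing the role of a bonus-free upper confidence bound.
    return np.quantile(backups, quantile)
```

Drawing weights from $\mathrm{Dirichlet}(1,\dots,1)$ over observed transitions, rather than from a posterior over a known state space, is exactly the Bayesian bootstrap of Rubin (1981), which hints at the link the abstract invokes for extending the method beyond the tabular setting.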