Paper Title
An Information-Theoretic Analysis of Bayesian Reinforcement Learning
Paper Authors
Paper Abstract
Building on the framework introduced by Xu and Raginsky [1] for supervised learning problems, we study the best achievable performance for model-based Bayesian reinforcement learning problems. To this end, we define the minimum Bayesian regret (MBR) as the difference between the maximum expected cumulative reward obtainable by knowing the environment and its dynamics and the maximum obtainable by learning from the collected data. We specialize this definition to reinforcement learning problems modeled as Markov decision processes (MDPs) whose kernel parameters are unknown to the agent and whose uncertainty is expressed by a prior distribution. A method for deriving upper bounds on the MBR is presented, and specific bounds based on the relative entropy and the Wasserstein distance are given. We then focus on two particular cases of MDPs: the multi-armed bandit (MAB) problem and online optimization with partial feedback. For the latter, we show that our bounds recover from below the current information-theoretic bounds of Russo and Van Roy [2].
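To make the MBR definition in the abstract concrete, here is one plausible way to formalize it; the notation (\Theta for the unknown kernel parameter, R_t for the reward at step t, T for the horizon, \Pi for the class of policies acting on the collected data) is ours for illustration and is not taken from the paper:

% Sketch of the MBR definition described in the abstract.
% Notation is assumed, not the paper's: \Theta is the unknown MDP kernel
% parameter (drawn from the prior), R_t the reward at step t, T the horizon,
% and \Pi the class of policies that act only on the collected data.
\[
\mathrm{MBR}_T \;=\;
\underbrace{\mathbb{E}_{\Theta}\!\left[\,\sup_{\pi}\,
\mathbb{E}\!\left[\sum_{t=1}^{T} R_t \,\middle|\, \Theta\right]\right]}_{\text{agent that knows the dynamics}}
\;-\;
\underbrace{\sup_{\pi \in \Pi}\,
\mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right]}_{\text{agent that learns from data}} .
\]

In the first term the supremum sits inside the expectation because the optimal policy there may depend on the realized parameter \Theta, while in the second term the policy can depend only on the observed history; hence \mathrm{MBR}_T \ge 0.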