Paper Title

When Is Partially Observable Reinforcement Learning Not Scary?

Paper Authors

Qinghua Liu, Alan Chung, Csaba Szepesvári, Chi Jin

Paper Abstract

Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
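
To make the "weakly revealing" idea from the abstract concrete, the condition can be phrased in terms of the singular values of the emission (observation) matrix: observations must not be so uninformative that distinct latent states become indistinguishable. The snippet below is a minimal, illustrative sketch, not code from the paper; the function name, the threshold `alpha`, and the single-step form of the condition are assumptions for illustration (the paper's overcomplete setting, with more latent states than observations, would require a multi-step analogue).

```python
# Minimal sketch of a singular-value check for a "weakly revealing"-style condition.
# Assumption: the single-step form, requiring the S-th largest singular value of the
# emission matrix O (shape: num_observations x num_latent_states) to be at least alpha.
import numpy as np

def is_weakly_revealing(O: np.ndarray, alpha: float) -> bool:
    """Return True if the S-th largest singular value of O is >= alpha."""
    num_states = O.shape[1]
    singular_values = np.linalg.svd(O, compute_uv=False)
    # With fewer nonzero singular values than latent states (e.g. more states
    # than observations), this single-step condition cannot hold.
    if len(singular_values) < num_states:
        return False
    return bool(singular_values[num_states - 1] >= alpha)

# Example: a 3-observation, 2-state emission matrix (columns are distributions).
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
print(is_weakly_revealing(O, alpha=0.2))  # True: both states are well distinguished
```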
