Paper Title

Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Paper Authors

Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu

Paper Abstract

Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works. OOM-UCB achieves an optimal sample complexity of $\tilde{\mathcal{O}}(1/\varepsilon^2)$ for finding an $\varepsilon$-optimal policy, along with being polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.
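For concreteness, below is a minimal, hypothetical sketch in Python of the setting the abstract refers to: a tabular episodic POMDP together with a check of the undercomplete condition (at least as many observations as latent states, with a well-conditioned emission matrix). The class name, the `alpha` threshold, and the random instantiation are illustrative assumptions rather than anything specified in the paper, and the sketch does not implement OOM-UCB itself.

```python
import numpy as np

# Illustrative sketch of an episodic tabular POMDP (not the paper's OOM-UCB).
# "Undercomplete" here means the number of observations is at least the number
# of latent states and the emission matrix O has full column rank, so that
# observations carry enough information to identify the latent state
# distribution.

class TabularPOMDP:
    def __init__(self, n_states, n_actions, n_obs, horizon, seed=0):
        rng = np.random.default_rng(seed)
        self.horizon = horizon
        # T[a][s, s'] = P(s' | s, a): latent state transition probabilities
        self.T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
        # O[o, s] = P(o | s): emission probabilities, one column per latent state
        self.O = rng.dirichlet(np.ones(n_obs), size=n_states).T
        # r[s, a]: reward for taking action a in latent state s
        self.r = rng.uniform(size=(n_states, n_actions))

    def is_undercomplete(self, alpha=1e-3):
        """Check the undercomplete condition: n_obs >= n_states and the
        smallest singular value of the emission matrix is bounded away
        from zero (alpha is an illustrative threshold)."""
        n_obs, n_states = self.O.shape
        sigma_min = np.linalg.svd(self.O, compute_uv=False)[-1]
        return n_obs >= n_states and sigma_min >= alpha


# Example: 10 observations over 3 latent states, as in the undercomplete regime.
pomdp = TabularPOMDP(n_states=3, n_actions=2, n_obs=10, horizon=5)
print(pomdp.is_undercomplete())
```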
