Paper Title
Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
Paper Authors
Abstract
Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks, yet they perform well in many others. In fact, in practice they are often selected as the top choice due to their simplicity. But for which tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called the myopic exploration gap, denoted by alpha, that captures a structural property of the MDP, the exploration policy, and the given value function class. We show that the sample complexity of myopic exploration scales quadratically with the inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete examples that the myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.
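For context, the myopic exploration policies named in the abstract act only on the current value estimates, without any exploration bonus or planning for information gain. Below is a minimal sketch of the epsilon-greedy and softmax rules in Python; the function names and NumPy usage are illustrative assumptions of ours, not code from the paper.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the greedy action under the current Q-estimates (exploit)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore uniformly
    return int(np.argmax(q_values))               # exploit current estimates

def softmax_action(q_values, temperature, rng=None):
    """Sample an action from a Boltzmann (softmax) distribution over Q-estimates."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())         # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Usage: four actions, explore 10% of the time (epsilon-greedy) or with temperature 0.5 (softmax).
q = np.array([0.1, 0.5, 0.3, 0.2])
a_eps = epsilon_greedy_action(q, epsilon=0.1)
a_soft = softmax_action(q, temperature=0.5)
```

Both rules are "myopic" in the sense analyzed by the paper: the amount of exploration is fixed in advance (by epsilon or the temperature) rather than adapted to which states remain poorly understood.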