Paper Title
On Gap-dependent Bounds for Offline Reinforcement Learning
Paper Authors
Paper Abstract
This paper presents a systematic study of gap-dependent sample complexity in offline reinforcement learning. Prior work showed that when the density ratio between an optimal policy and the behavior policy is upper bounded (the optimal policy coverage assumption), the agent can achieve an $O\left(\frac{1}{ε^2}\right)$ rate, which is also minimax optimal. We show that, under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{ε}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function. Furthermore, we show that when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive (the uniform optimal policy coverage assumption), the sample complexity of identifying an optimal policy is independent of $\frac{1}{ε}$. Lastly, we present nearly-matching lower bounds to complement our gap-dependent upper bounds.
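For context, here is a minimal sketch of the quantities these assumptions refer to, as they are commonly defined in the offline RL literature; the notation below is illustrative and may differ from the paper's own definitions. Writing $d^{π^*}$ for the occupancy measure of an optimal policy and $μ$ for that of the behavior policy, the optimal policy coverage assumption bounds the density ratio
$$C^* = \max_{s,a} \frac{d^{π^*}(s,a)}{μ(s,a)} < \infty,$$
and the sub-optimality gap of the optimal $Q$-function, together with its minimum positive value, is typically defined as
$$\mathrm{gap}(s,a) = V^*(s) - Q^*(s,a), \qquad Δ_{\min} = \min_{(s,a):\,\mathrm{gap}(s,a) > 0} \mathrm{gap}(s,a).$$
Under such definitions, the improvement from $O\left(\frac{1}{ε^2}\right)$ to $O\left(\frac{1}{ε}\right)$ claimed in the abstract applies when the minimum positive gap $Δ_{\min}$ exists and is bounded away from zero.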