Paper Title

Model-Free Reinforcement Learning with the Decision-Estimation Coefficient

Authors

Foster, Dylan J., Golowich, Noah, Qian, Jian, Rakhlin, Alexander, Sekhari, Ayush

Abstract

We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally.
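The abstract describes Estimation-to-Decisions as a reduction: at each round, an online estimator supplies an estimate, and the decision maker plays the distribution minimizing a minimax objective that trades expected suboptimality against (γ-weighted) estimation divergence. As a purely illustrative aid, here is a minimal toy sketch of that per-round objective for a two-armed bandit with a finite model class; the model class, the estimate, the value of γ, and the use of a squared-error divergence are all assumptions for the example, not the paper's actual construction (which is stated for general divergences such as squared Hellinger distance).

```python
import numpy as np

# Hypothetical finite model class: each row gives a model's mean reward per action.
models = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
m_hat = np.array([0.5, 0.5])  # current estimate from an online estimation oracle (assumed)
gamma = 4.0                   # exploration parameter trading regret against estimation error

def dec_objective(p, model):
    # Per-model objective: expected suboptimality under p, minus gamma times a
    # squared-error divergence between the model and the current estimate.
    best = model.max()
    subopt = p @ (best - model)
    div = p @ (model - m_hat) ** 2
    return subopt - gamma * div

# With two actions, a distribution is p = (q, 1 - q); grid-search the minimax value.
qs = np.linspace(0.0, 1.0, 1001)
values = [max(dec_objective(np.array([q, 1.0 - q]), m) for m in models) for q in qs]
best_q = qs[int(np.argmin(values))]
print(best_q, min(values))
```

The grid search stands in for the (generally nontrivial) min-max optimization over decision distributions; larger γ makes the divergence penalty dominate, pushing the chosen distribution toward more exploratory play.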
