Paper Title
Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
Paper Authors
Paper Abstract
We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning non-stationary MDPs via the conventional optimistic exploration technique presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism.
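To illustrate the confidence widening idea described in the abstract, the sketch below shows how a sliding-window L1 confidence radius for an estimated transition distribution might be enlarged by an extra widening term. This is a minimal illustrative sketch only: the function name, the Weissman-style radius, and all constants are assumptions chosen for exposition, not the exact quantities used by SWUCRL2-CW.

```python
import numpy as np

def widened_confidence_radius(counts, n_states, window, delta, eta):
    """Illustrative sketch (assumed constants, not the paper's exact bound):
    sliding-window L1 confidence radius for an estimated transition
    distribution p_hat(.|s,a), enlarged by an extra widening term eta
    that injects the additional optimism mentioned in the abstract."""
    n = max(counts, 1)  # visits to (s, a) inside the sliding window
    # Weissman-style radius for an L1 ball around the empirical estimate (assumed form)
    base = np.sqrt(2 * n_states * np.log(window / delta) / n)
    # Confidence widening: eta > 0 keeps the confidence set strictly larger than usual
    return base + eta

# The (widened) confidence set for the transition kernel at (s, a) is then
#   { p : || p - p_hat(.|s,a) ||_1 <= widened_confidence_radius(...) },
# and an optimistic policy is computed over this enlarged set.
```

Under this reading, setting eta = 0 recovers an ordinary sliding-window optimistic construction, while a positive eta is the extra optimism that the abstract credits with handling the challenge specific to non-stationary MDPs.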