Title
Reinforcement Learning with a Terminator
Authors
Abstract
We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.
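To make the abstract's core idea concrete, below is a minimal Python sketch of a one-step value backup that combines the two ingredients the abstract names: optimism with respect to the estimated termination probability, and a dynamic discount factor that incorporates that probability. All names here (`q_backup_with_termination`, `term_prob`, `term_bonus`) are hypothetical illustrations, not the paper's actual code or API; the exact form of the confidence bonus and discounting in the paper may differ.

```python
import numpy as np

def q_backup_with_termination(q_next, rewards, term_prob, term_bonus, gamma=0.99):
    """Hedged sketch of a TerMDP-style Q backup (hypothetical, not the paper's code).

    q_next     -- array of bootstrapped next-state values, e.g. max_a Q(s', a)
    rewards    -- array of immediate rewards
    term_prob  -- estimated per-state probability that the external
                  (non-Markovian) observer terminates the episode
    term_bonus -- optimism bonus (confidence-bound width) that shrinks the
                  termination estimate, making the agent optimistic w.r.t.
                  termination
    """
    # Optimistic (lower-confidence) estimate of the termination probability.
    p_term = np.clip(term_prob - term_bonus, 0.0, 1.0)

    # Dynamic discount factor: future value only counts if the episode
    # survives past the current step.
    effective_gamma = gamma * (1.0 - p_term)

    return rewards + effective_gamma * q_next

# Example usage with toy numbers: a state likely to trigger termination
# (p = 0.5) contributes far less bootstrapped value than a safe one.
targets = q_backup_with_termination(
    q_next=np.array([10.0, 10.0]),
    rewards=np.array([1.0, 1.0]),
    term_prob=np.array([0.5, 0.05]),
    term_bonus=0.02,
)
print(targets)  # the high-termination state gets a much smaller target
```

The design intuition, as the abstract describes it, is that scaling the discount by the survival probability makes states that the observer is likely to interrupt less valuable to the agent, while the optimism bonus prevents the agent from being overly pessimistic about poorly estimated termination probabilities.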