Paper Title
Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes
Paper Authors
Paper Abstract
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end, we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that, unless stationarity assumptions are imposed, they are limited to episodic tasks, which prevents them from being implemented fully online, a desirable property for systems that need to adapt to new tasks and/or environments during deployment. The main requirement for a policy gradient algorithm to work is that the gradient estimate at any point in time is an ascent direction for the initial value function. In this work we establish that this is indeed the case, which allows us to show convergence of the online algorithm to the critical points of the initial value function. A numerical example demonstrates the ability of our online algorithm to learn to solve a navigation and surveillance problem in which an agent must loop between two goal locations. This example corroborates our theoretical findings about the ascent directions of subsequent stochastic gradients. It also shows how the agent running our online algorithm succeeds in learning to navigate, following a continuing cyclic trajectory that does not comply with the standard stationarity assumptions in the literature for non-episodic training.
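For reference, a minimal sketch (under illustrative assumptions, not the paper's exact RKHS construction) of the kind of unbiased stochastic policy gradient the abstract refers to: for a generic parameterized policy \(\pi_\theta\) with discount factor \(\gamma\) and discounted value function \(V(\theta)\), the policy gradient theorem gives

\[
  \nabla_\theta V(\theta)
  \;=\;
  \mathbb{E}_{\pi_\theta}\!\left[
    \sum_{t=0}^{\infty} \gamma^{t}\,
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
    Q^{\pi_\theta}(s_t, a_t)
  \right],
\]

and one standard way to obtain an unbiased stochastic estimate is to draw a random time index \(T \sim \mathrm{Geom}(1-\gamma)\) and form

\[
  \hat{g}
  \;=\;
  \frac{1}{1-\gamma}\,
  \nabla_\theta \log \pi_\theta(a_T \mid s_T)\,
  \hat{Q}(s_T, a_T),
\]

where \(\hat{Q}\) is an unbiased rollout estimate of \(Q^{\pi_\theta}(s_T, a_T)\); taking the expectation over \(T\) recovers \(\nabla_\theta V(\theta)\). The symbols \(\theta\), \(T\), and \(\hat{Q}\) here are generic placeholders; the paper's online, RKHS-based estimator and its ascent-direction guarantees are developed in the body of the paper.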