Paper Title
Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets
Paper Authors
Paper Abstract
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
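To make the "recursive policy update" view of eUDRL/GCSL concrete, the following is a minimal sketch in a tabular, exact (infinite-sample) setting. It uses a hypothetical one-step toy problem, not the counterexample from the paper: the transition matrix `P`, the uniform `goal_prior`, and the helper `eudrl_step` are illustrative assumptions. Each update conditions the behaviour induced by the current goal-conditioned policy on the goal that was actually reached (hindsight relabelling) and takes the result as the next policy.

```python
import numpy as np

# Hypothetical one-step toy setting (illustration only, not the paper's counterexample):
# a single start state, two actions, and two terminal "goal" states.
# P[a, g] = probability that action a leads to terminal state g.
P = np.array([
    [0.9, 0.1],   # action 0 mostly reaches goal 0
    [0.4, 0.6],   # action 1 mostly reaches goal 1
])
n_actions, n_goals = P.shape
goal_prior = np.full(n_goals, 1.0 / n_goals)   # commanded goals are sampled uniformly

def eudrl_step(pi):
    """One exact (infinite-sample) eUDRL/GCSL-style update.

    pi[g, a] is the current probability of taking action a when goal g is commanded.
    The new policy is the induced behaviour conditioned on the goal actually reached:
        pi'(a | g') proportional to  sum_g goal_prior(g) * pi(a | g) * P(g' | a)
    """
    joint = goal_prior[:, None, None] * pi[:, :, None] * P[None, :, :]  # shape (g, a, g')
    reached = joint.sum(axis=(0, 1))                   # marginal probability of reaching g'
    return joint.sum(axis=0).T / reached[:, None]      # rows indexed by the reached goal g'

pi = np.full((n_goals, n_actions), 1.0 / n_actions)    # start from the uniform policy
for _ in range(50):
    pi = eudrl_step(pi)

print("policy after 50 exact updates:\n", np.round(pi, 3))
print("greedy-optimal action per commanded goal:", P.argmax(axis=0))
```

With these made-up transition probabilities the iterates settle on a fixed point that assigns only about 0.36 probability to the greedy-optimal action for commanded goal 0, i.e. less than the uniform starting policy, even though no sampling noise or function approximation is involved. This is only meant to show how the recursion can be written down and iterated exactly; the rigorous analysis of when and why such iterates fail to reach the optimal policy, and the actual divergence counterexample, are given in the paper.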