Paper Title


On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs

Paper Authors

Yi Wan, Richard S. Sutton

Abstract


We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require that the solution set of the average-reward optimality equation has only one degree of freedom, which is not necessarily true for weakly communicating MDPs. To the best of our knowledge, our results are the first to show that average-reward off-policy control algorithms converge in weakly communicating MDPs. As a direct extension, we show that the average-reward options algorithms for temporal abstraction introduced by Wan, Naik, & Sutton (2021b) converge if the semi-MDP induced by the options is weakly communicating.
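For readers unfamiliar with the first algorithm named above, the sketch below shows the tabular Differential Q-learning update as it is usually stated (Wan, Naik, & Sutton 2021a). The problem size, step sizes, and variable names (num_states, num_actions, alpha, eta) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch of the tabular Differential Q-learning update
# (Wan, Naik, & Sutton 2021a). Problem size, step sizes, and names
# are hypothetical, chosen only for illustration.

num_states, num_actions = 5, 2   # hypothetical MDP size
alpha = 0.1                      # step size for the value estimates
eta = 1.0                        # ratio of reward-rate to value step size

Q = np.zeros((num_states, num_actions))  # differential action-value estimates
avg_reward = 0.0                          # estimate of the optimal reward rate

def differential_q_update(s, a, r, s_next):
    """Apply one off-policy update for an observed transition (s, a, r, s_next)."""
    global avg_reward
    # TD error subtracts the current reward-rate estimate from the reward.
    delta = r - avg_reward + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    # The reward-rate estimate is adjusted by the same TD error.
    avg_reward += eta * alpha * delta
```

RVI Q-learning differs mainly in replacing the learned avg_reward term with a reference function of Q (for example, the value of a fixed state-action pair), which is why both analyses hinge on the structure of the solution set of the average-reward optimality equation.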
