Paper Title

$O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games

Paper Authors

Yuepeng Yang, Cong Ma

Paper Abstract

We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves the $\tilde{O}(T^{-5/6})$ convergence rate recently shown in Zhang et al. (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
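To make the algorithmic setup in the abstract concrete, the sketch below shows one plausible instantiation for a tabular discounted zero-sum Markov game: per-state OFTRL with an entropy regularizer run on the backed-up payoff matrices $Q_t(s) = r(s) + \gamma P(s) V_t$, combined with an incremental ("smooth") value update. This is a minimal illustration, not the paper's exact algorithm: the discount factor, the constant learning rate `eta`, the smoothing weights $2/(t+1)$, and the update order are assumptions chosen for readability, whereas the paper's analysis relies on specific OFTRL weights and step-size schedules.

```python
import numpy as np

def oftrl_smooth_value_sketch(r, P, gamma=0.99, eta=0.1, T=1000):
    """Illustrative sketch (not the paper's exact scheme) of per-state OFTRL
    with an entropy regularizer plus a smooth value update.

    r: (S, A, B) array of rewards to the max-player.
    P: (S, A, B, S) array of transition probabilities.
    Returns the final per-state policies (x, y) and value estimate V.
    """
    S, A, B = r.shape
    V = np.zeros(S)                       # smoothed value estimates
    cum_x = np.zeros((S, A))              # cumulative gradients, max-player
    cum_y = np.zeros((S, B))              # cumulative gradients, min-player
    x = np.full((S, A), 1.0 / A)          # max-player policies (uniform start)
    y = np.full((S, B), 1.0 / B)          # min-player policies

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    for t in range(1, T + 1):
        alpha = 2.0 / (t + 1)             # illustrative smoothing weight
        V_new = V.copy()
        for s in range(S):
            # One-step payoff matrix backed up through the smoothed value.
            Q = r[s] + gamma * P[s] @ V   # shape (A, B)
            g_x = Q @ y[s]                # linear payoff seen by the max-player
            g_y = -Q.T @ x[s]             # the min-player maximizes -Q^T x
            cum_x[s] += g_x
            cum_y[s] += g_y
            # OFTRL with an entropy regularizer: the latest gradient enters
            # both the cumulative sum and the optimistic prediction term.
            x[s] = softmax(eta * (cum_x[s] + g_x))
            y[s] = softmax(eta * (cum_y[s] + g_y))
            # Smooth value update: move V(s) a small step toward the current
            # policies' one-step value instead of overwriting it.
            V_new[s] = (1 - alpha) * V[s] + alpha * (x[s] @ Q @ y[s])
        V = V_new
    return x, y, V
```

The smooth value update is what couples the per-state OFTRL dynamics across states: each matrix game's payoff changes only slowly as $V_t$ is averaged in, which is the structure the abstract's regret-sum and path-length arguments exploit.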
