Paper Title


Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus

Paper Authors

Qiwen Cui, Simon S. Du

Paper Abstract


This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle, which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity only scales with $\sum_{i=1}^m A_i$, where $A_i$ is the action size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $\prod_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy that is close to the best strategy in $\Pi$. In this setting, the sample complexity only scales with $\log |\Pi|$ instead of $\sum_{i=1}^m A_i$.
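To make the "curse of multiagents" in the abstract concrete, the arithmetic sketch below (with hypothetical action sizes chosen only for illustration) compares how the joint action space $\prod_{i=1}^m A_i$, which point-wise bonuses must cover, grows versus the sum $\sum_{i=1}^m A_i$ that the strategy-wise bonus depends on.

```python
# Hypothetical example: m = 5 players, each with A_i = 10 actions.
# Point-wise bonuses scale with the product of action sizes (joint space),
# while the strategy-wise bonus scales with their sum.
import math

action_sizes = [10, 10, 10, 10, 10]  # assumed A_i values, for illustration only

joint_space = math.prod(action_sizes)  # prod_i A_i: size of the joint action space
per_player_sum = sum(action_sizes)     # sum_i A_i: strategy-wise scaling

print(joint_space, per_player_sum)  # prints: 100000 50
```

Even for this small example, the product is 2000x larger than the sum, and the gap grows exponentially in the number of players.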
