Paper Title
Anytime PSRO for Two-Player Zero-Sum Games
Paper Authors
Paper Abstract
Policy Space Response Oracles (PSRO) is a multi-agent reinforcement learning algorithm that has achieved state-of-the-art performance in very large two-player zero-sum games. PSRO is based on the tabular double oracle (DO) method, an algorithm that is guaranteed to converge to a Nash equilibrium, but may increase exploitability from one iteration to the next. We propose anytime double oracle (ADO), a tabular double oracle algorithm for two-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from one iteration to the next. Unlike DO, in which the restricted distribution is based on the restricted game formed by each player's strategy sets, ADO finds the restricted distribution for each player that minimizes its exploitability against any policy in the full, unrestricted game. We also propose a method of finding this restricted distribution via a no-regret algorithm updated against best responses, called RM-BR DO. Finally, we propose anytime PSRO (APSRO), a version of ADO that calculates best responses via reinforcement learning. In experiments on Leduc poker and random normal form games, we show that our methods achieve far lower exploitability than DO and PSRO and decrease exploitability monotonically.
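The two ingredients described in the abstract, a restricted distribution computed by a no-regret method played against full-game best responses (the RM-BR idea), and double-oracle population expansion, can be sketched on a small normal-form game. The sketch below is illustrative only and is not the authors' implementation; the random payoff matrix, function names, starting populations, and iteration counts are all assumptions made for the example.

```python
# Illustrative sketch only (not the paper's code): RM-BR-style restricted
# distributions plus double-oracle population expansion on a random
# two-player zero-sum normal-form game. Names and hyperparameters are assumed.
import numpy as np


def lift(sigma, population, n):
    """Embed a distribution over a strategy population into the full game."""
    dist = np.zeros(n)
    for p, w in zip(population, sigma):
        dist[p] += w
    return dist


def rm_br_restricted_dist(payoff, population, iters=2000):
    """Regret matching over `population`, where on each inner step the opponent
    plays a best response in the FULL game (the RM-BR idea from the abstract).

    payoff[i, j] = payoff to this player for pure strategy i vs opponent pure j.
    Returns the average distribution over `population`.
    """
    k, n = len(population), payoff.shape[0]
    regrets, avg = np.zeros(k), np.zeros(k)
    for _ in range(iters):
        pos = np.maximum(regrets, 0.0)
        sigma = pos / pos.sum() if pos.sum() > 0 else np.full(k, 1.0 / k)
        avg += sigma
        dist = lift(sigma, population, n)
        # Opponent's full-game best response minimizes this player's payoff.
        br = int(np.argmin(dist @ payoff))
        utils = payoff[population, br]
        regrets += utils - sigma @ utils
    return avg / avg.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.uniform(-1.0, 1.0, size=(10, 10))   # row player's payoffs; column player gets -A
    pop_row, pop_col = [0], [0]                  # arbitrary starting populations
    for t in range(8):
        sig_row = rm_br_restricted_dist(A, pop_row)
        sig_col = rm_br_restricted_dist(-A.T, pop_col)
        x = lift(sig_row, pop_row, A.shape[0])
        y = lift(sig_col, pop_col, A.shape[1])
        # Exploitability (NashConv): sum of both players' full-game best-response gains.
        expl = (A @ y).max() - (x @ A).min()
        print(f"iter {t}: |row pop|={len(pop_row)}, |col pop|={len(pop_col)}, "
              f"exploitability={expl:.3f}")
        # Double-oracle expansion: add each player's full-game best response
        # to the opponent's current restricted distribution.
        br_row, br_col = int(np.argmax(A @ y)), int(np.argmin(x @ A))
        if br_row not in pop_row:
            pop_row.append(br_row)
        if br_col not in pop_col:
            pop_col.append(br_col)
```

In this toy version the restricted distribution is chosen to perform well against the opponent's unrestricted best response, rather than being a Nash equilibrium of the restricted game as in standard DO/PSRO; that distinction is the point the abstract makes, though the sketch makes no claim about matching the paper's reported behavior.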