A Unified Perspective on Deep Equilibrium Finding

Paper Title

A Unified Perspective on Deep Equilibrium Finding

Authors

Xinrun Wang, Jakub Cerny, Shuxin Li, Chang Yang, Zhuyun Yin, Hau Chan, Bo An

Abstract

Extensive-form games provide a versatile framework for modeling interactions of multiple agents subject to imperfect observations and stochastic events. In recent years, two paradigms, policy space response oracles (PSRO) and counterfactual regret minimization (CFR), have shown that extensive-form games can indeed be solved efficiently. Both are capable of leveraging deep neural networks to tackle the scalability issues inherent to extensive-form games, and we refer to them as deep equilibrium-finding algorithms. Even though PSRO and CFR share some similarities, they are often regarded as distinct, and the question of which is superior remains unresolved. Instead of answering this question directly, in this work we propose a unified perspective on deep equilibrium finding that generalizes both PSRO and CFR. Our four main contributions include: i) a novel response oracle (RO) which computes Q values as well as reaching probability values and baseline values; ii) two transform modules -- a pre-transform and a post-transform -- represented by neural networks, which transform the outputs of RO to a latent additive space (LAS), and then the LAS to action probabilities for execution; iii) two average oracles -- local average oracle (LAO) and global average oracle (GAO) -- where LAO operates on LAS and GAO is used for evaluation only; and iv) a novel method inspired by fictitious play that optimizes the transform modules and average oracles, and automatically selects the optimal combination of components of the two frameworks. Experiments on the Leduc poker game demonstrate that our approach can outperform both frameworks.
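The pipeline named in the abstract (RO outputs, pre-transform, LAS, LAO, post-transform) can be illustrated with a small hand-written sketch. This is not the paper's implementation: in the paper the transforms are neural networks, whereas here they are fixed functions chosen, as an assumption, to mimic the familiar regret-matching step from CFR. All function names are hypothetical.

```python
from typing import List


def pre_transform(q_values: List[float], baseline: float, reach_prob: float) -> List[float]:
    """Hypothetical pre-transform: map RO outputs into an additive space.

    Assumption: we use a reach-weighted advantage (Q minus baseline),
    which resembles an instantaneous regret in CFR.
    """
    return [reach_prob * (q - baseline) for q in q_values]


def local_average(las_sum: List[float], las_new: List[float]) -> List[float]:
    """Hypothetical local average oracle: accumulate vectors in the additive space."""
    return [s + x for s, x in zip(las_sum, las_new)]


def post_transform(las: List[float]) -> List[float]:
    """Hypothetical post-transform: regret matching.

    Normalizes the positive part of the accumulated vector into action
    probabilities, falling back to uniform when nothing is positive.
    """
    positive = [max(x, 0.0) for x in las]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(las)] * len(las)
    return [p / total for p in positive]


if __name__ == "__main__":
    # One update at a single information set with three actions.
    accumulated = [0.0, 0.0, 0.0]
    step = pre_transform(q_values=[1.0, 0.0, -1.0], baseline=0.0, reach_prob=0.5)
    accumulated = local_average(accumulated, step)
    print(post_transform(accumulated))
```

Swapping these fixed functions for learned modules, and choosing how the accumulated vector is averaged, is the kind of design freedom the abstract attributes to the unified view.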
