Paper Title

Efficient Competitive Self-Play Policy Optimization

Authors

Yuanyi Zhong, Yuan Zhou, Jian Peng

Abstract

Reinforcement learning from self-play has recently reported many successes. Self-play, where the agents compete with themselves, is often used to generate training data for iterative policy improvement. In previous work, heuristic rules are designed to choose an opponent for the current learner. Typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules may be inefficient in practice and sometimes do not guarantee convergence even in the simplest matrix games. In this paper, we propose a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We observe that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle-point optimization literature. Our method trains several agents simultaneously and intelligently pairs them as opponents for one another, following simple adversarial rules derived from a principled perturbation-based saddle optimization method. We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we further show the empirical superiority of our method over baseline methods that rely on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo, with neural-net policy function approximators.
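
The abstract describes the method only at a high level. The NumPy sketch below illustrates, under assumptions of our own rather than anything taken from the paper, the two ingredients it names: the view of the Nash equilibrium as the saddle point of the payoff f(x, y) = xᵀAy, and a small population of agents in which each learner trains against the peer it currently fares worst against. The rock-paper-scissors payoff matrix, population size, step size, and the concrete selection rule are all illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch, NOT the paper's algorithm: it only illustrates the two
# ingredients the abstract names, on rock-paper-scissors. The payoff matrix,
# population size, step size, and the concrete "train against the peer you
# currently do worst against" rule are all assumptions made for illustration.
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

# Zero-sum payoff for the row (max) player; the unique Nash equilibrium,
# i.e. the saddle point of f(x, y) = x^T A y, is uniform for both players.
A = np.array([[ 0.,  1., -1.],
              [-1.,  0.,  1.],
              [ 1., -1.,  0.]])

rng = np.random.default_rng(0)
n_agents, lr, steps = 4, 0.05, 5000
X = rng.dirichlet(np.ones(3), n_agents)  # row strategies, one per agent
Y = rng.dirichlet(np.ones(3), n_agents)  # column strategies, one per agent
x_avg, y_avg = np.zeros(3), np.zeros(3)

for _ in range(steps):
    newX, newY = X.copy(), Y.copy()
    for i in range(n_agents):
        # Adversarial opponent selection: x_i trains against the peer y_j
        # that exploits it the most, and symmetrically for y_i.
        j = int(np.argmin(X[i] @ A @ Y.T))   # most adversarial column peer
        k = int(np.argmax(X @ A @ Y[i]))     # most adversarial row peer
        newX[i] = project_simplex(X[i] + lr * (A @ Y[j]))    # ascent step
        newY[i] = project_simplex(Y[i] - lr * (A.T @ X[k]))  # descent step
    X, Y = newX, newY
    x_avg += X.mean(axis=0) / steps
    y_avg += Y.mean(axis=0) / steps

# Exploitability (duality gap) of the time-averaged strategies: it is 0
# exactly at the saddle point, so small values mean near-equilibrium play.
gap = (A @ y_avg).max() - (A.T @ x_avg).min()
print(np.round(x_avg, 3), np.round(y_avg, 3), round(float(gap), 4))
```

Reporting time-averaged strategies is a standard device in saddle-point optimization: the last iterates of gradient descent-ascent can cycle around the equilibrium in bilinear games, while the averages approach the saddle point.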
