Paper Title


Decentralized Nash Equilibria Learning for Online Game with Bandit Feedback

Paper Authors

Min Meng, Xiuxian Li, Jie Chen

Abstract


This paper studies distributed online bandit learning of generalized Nash equilibria for online games, where the cost functions of all players and the coupled constraints are time-varying. Only the values, rather than full information, of the cost and local constraint functions are revealed to local players gradually. The goal of each player is to selfishly minimize its own cost function, with no future information, subject to a strategy-set constraint and time-varying coupled inequality constraints. To this end, a distributed online algorithm based on mirror descent and one-point bandit feedback is designed for seeking generalized Nash equilibria of the online game. It is shown that the devised online algorithm achieves sublinear expected regret and accumulated constraint violation if the path variation of the generalized Nash equilibrium sequence is sublinear. Furthermore, the proposed algorithm is extended to the scenario of delayed bandit feedback, that is, the values of the cost and constraint functions are disclosed to local players with time delays. It is also demonstrated that the online algorithm with delayed bandit feedback still achieves sublinear expected regret and accumulated constraint violation under some conditions on the path variation and delays. Simulations are presented to illustrate the effectiveness of the theoretical results.
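As a rough illustration of the two ingredients the abstract names, the sketch below combines a standard one-point bandit gradient estimator with a Euclidean mirror-descent (projected gradient) update for a single player. This is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes a Euclidean mirror map, a ball-shaped strategy set, and an illustrative quadratic cost; the step sizes and exploration radius are hypothetical choices, and the coupled constraints and delayed-feedback extension are omitted.

```python
import numpy as np

def one_point_gradient_estimate(cost, x, delta, rng):
    """Standard one-point bandit estimator: the player only observes the
    scalar cost at a single perturbed point. With u uniform on the unit
    sphere, g_hat = (d / delta) * cost(x + delta * u) * u is (approximately)
    an unbiased gradient estimate of a smoothed version of the cost."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    return (d / delta) * cost(x + delta * u) * u

def mirror_descent_step(x, grad, eta, radius):
    """Mirror-descent update with the Euclidean mirror map, which reduces to
    a projected gradient step; the ball of the given radius stands in for
    the player's strategy set."""
    y = x - eta * grad
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

# Illustrative run on a fixed quadratic cost (assumed, not from the paper).
rng = np.random.default_rng(0)
x = np.zeros(2)
cost = lambda z: np.sum((z - np.array([0.5, -0.3])) ** 2)
for t in range(1, 2001):
    g = one_point_gradient_estimate(cost, x, delta=0.05, rng=rng)
    x = mirror_descent_step(x, g, eta=0.1 / np.sqrt(t), radius=1.0)
```

In the paper's setting each player would run such an update on its own time-varying cost while also tracking the coupled inequality constraints; the decaying step size mirrors the usual requirement for sublinear regret, and the one-point estimator is what makes the scheme work with bandit (function-value-only) feedback.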
