Paper Title
Reinforcement Learning in Nonzero-sum Linear Quadratic Deep Structured Games: Global Convergence of Policy Optimization
Paper Authors
Paper Abstract
We study model-based and model-free policy optimization in a class of nonzero-sum stochastic dynamic games called linear quadratic (LQ) deep structured games. In such games, players interact with each other through a set of weighted averages (linear regressions) of the states and actions. In this paper, we restrict attention to homogeneous weights; however, for the special case of an infinite population, the obtained results extend to asymptotically vanishing weights, wherein the players learn the sequential weighted mean-field equilibrium. Despite the non-convexity of the optimization in the policy space and the fact that policy optimization does not generally converge in game settings, we prove that the proposed model-based and model-free policy gradient descent and natural policy gradient descent algorithms converge globally to the subgame perfect Nash equilibrium. To the best of our knowledge, this is the first result to provide a global convergence proof of policy optimization in a nonzero-sum LQ game. One of the salient features of the proposed algorithms is that their parameter space is independent of the number of players, and when the dimension of the state space is significantly larger than that of the action space, they offer a more computationally efficient alternative to algorithms that plan and learn in the action space. Finally, some simulations are provided to numerically verify the obtained theoretical results.
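As a rough illustration of the model-free policy optimization idea described above, the sketch below runs simultaneous zeroth-order (gradient-free) policy gradient descent over linear feedback gains in a small symmetric LQ game in which players are coupled through the average state and action. All dimensions, coefficients, the homogeneous weights 1/n, and the gain parameterization are assumptions chosen for illustration; this is a minimal sketch, not the paper's model or its convergence-guaranteed algorithms.

import numpy as np

# Hypothetical symmetric LQ game with mean-field-style coupling (illustrative only).
rng = np.random.default_rng(0)
n, T = 5, 30                          # number of players, horizon
a, b, d, e = 0.9, 0.5, 0.05, 0.1      # local and average-state/action dynamics coefficients
q, r, s = 1.0, 0.5, 1.0               # cost weights on x_i^2, u_i^2, (x_i - xbar)^2
X0 = rng.normal(size=(8, n))          # fixed batch of initial states, for a noiseless cost

def costs(gains):
    """Finite-horizon cost of each player, averaged over the batch X0.
    Player i applies the linear feedback u_i = -k1_i * x_i - k2_i * xbar."""
    total = np.zeros(n)
    for x0 in X0:
        x = x0.copy()
        for _ in range(T):
            xbar = x.mean()                              # weighted average with homogeneous weights 1/n
            u = -gains[:, 0] * x - gains[:, 1] * xbar    # linear feedback policies
            total += q * x**2 + r * u**2 + s * (x - xbar)**2
            x = a * x + b * u + d * xbar + e * u.mean()  # noiseless dynamics for simplicity
    return total / len(X0)

def player_grad(gains, i, radius=0.05, n_samples=10):
    """Two-point zeroth-order estimate of dJ_i/d(gains[i]), holding the other players fixed."""
    grad = np.zeros(2)
    for _ in range(n_samples):
        v = rng.normal(size=2)
        v /= np.linalg.norm(v)                           # random direction on the unit sphere
        plus, minus = gains.copy(), gains.copy()
        plus[i] += radius * v
        minus[i] -= radius * v
        grad += (costs(plus)[i] - costs(minus)[i]) / (2 * radius) * 2 * v / n_samples
    return grad

gains = np.zeros((n, 2))                                 # gains (k1_i, k2_i) for each player
for it in range(80):
    grads = np.stack([player_grad(gains, i) for i in range(n)])
    gains -= 1e-3 * grads                                # simultaneous gradient steps
    if it % 20 == 0:
        print(it, round(costs(gains).mean(), 3))

Note that each player's policy here is described by two scalar gains regardless of n, echoing the abstract's point that the parameter space of the proposed algorithms is independent of the number of players.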