Paper Title
D3C: Reducing the Price of Anarchy in Multi-Agent Learning
Paper Authors
Paper Abstract
In multiagent systems, the complex interaction of fixed incentives can lead agents to outcomes that are poor (inefficient) not only for the group, but also for each individual. The price of anarchy is a technical, game-theoretic quantity that measures the inefficiency arising in these scenarios -- it compares the welfare achievable through perfect coordination against that achieved by self-interested agents at a Nash equilibrium. We derive a differentiable upper bound on a price of anarchy that agents can cheaply estimate during learning. Equipped with this estimator, agents can adjust their incentives in a way that improves the efficiency incurred at a Nash equilibrium. Agents do so by learning to mix their reward (equiv. negative loss) with that of other agents, following the gradient of our derived upper bound. We refer to this approach as D3C. In the case where agent incentives are differentiable, D3C resembles the celebrated Win-Stay, Lose-Shift strategy from behavioral game theory, thereby establishing a connection between the global goal of maximum welfare and an established agent-centric learning rule. In the non-differentiable setting, as is common in multiagent reinforcement learning, we show the upper bound can be reduced via evolutionary strategies, until a compromise is reached in a distributed fashion. We demonstrate that D3C improves outcomes for each agent and the group as a whole on several social dilemmas, including a traffic network exhibiting Braess's paradox, a prisoner's dilemma, and several multiagent domains.
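To make the price-of-anarchy definition above concrete, the sketch below computes it for a standard prisoner's dilemma, one of the social dilemmas mentioned in the abstract. The payoff values and the reward-ratio convention (optimal welfare divided by the welfare at the worst pure Nash equilibrium) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: price of anarchy in a 2x2 prisoner's dilemma.
# Payoff matrix is a conventional choice, not the paper's experiment.

# rewards[(a1, a2)] = (r1, r2); actions: 0 = Cooperate, 1 = Defect
rewards = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}

def is_pure_nash(profile):
    """A profile is a pure Nash equilibrium if no player can gain
    by unilaterally switching to another action."""
    for player in range(2):
        for deviation in range(2):
            alt = list(profile)
            alt[player] = deviation
            if rewards[tuple(alt)][player] > rewards[profile][player]:
                return False
    return True

# Welfare = sum of both players' rewards at a joint action.
welfare = {a: sum(r) for a, r in rewards.items()}

opt = max(welfare.values())                          # 6, at (C, C)
nash_profiles = [a for a in rewards if is_pure_nash(a)]  # only (D, D)
worst_nash = min(welfare[a] for a in nash_profiles)  # 2, at (D, D)

# Price of anarchy (reward convention): coordinated optimum vs. worst Nash.
poa = opt / worst_nash
print(poa)  # 3.0 -- coordination yields 3x the welfare of the equilibrium
```

A price of anarchy of 3 here means self-interested play forfeits two thirds of the achievable welfare; D3C's reward mixing aims to push such equilibria toward the coordinated optimum.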