Paper Title
Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications
Paper Authors
Paper Abstract
Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative reward. Existing MMAB models mostly assume that when more than one player pulls the same arm, the players either collide and obtain zero reward, or do not collide and gain independent rewards; both assumptions are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension of the collision and non-collision settings. Each shareable arm has finite shareable resources and a "per-load" reward random variable, both of which are unknown to players. The reward from a shareable arm equals the "per-load" reward multiplied by the minimum of the number of players pulling the arm and the arm's maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable-arm problem under these two feedback types, respectively, and prove that both algorithms achieve logarithmic regret that is tight in the number of rounds. We conduct simulations to validate both algorithms' performance and show their utility in wireless networking and edge computing.
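The abstract's reward model can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name, the Bernoulli per-load draw, and all parameter names are assumptions introduced here; the abstract only specifies that the arm's reward is the per-load reward times the minimum of the player count and the arm's capacity.

```python
import random


def shareable_arm_reward(num_players: int, capacity: int, per_load_mean: float) -> float:
    """Sketch of a shareable arm's total reward (illustrative names).

    The total reward equals a random "per-load" reward multiplied by
    min(number of players pulling the arm, the arm's shareable capacity).
    The per-load reward is drawn here as a Bernoulli(per_load_mean)
    variable purely for illustration; the paper leaves its distribution
    unknown to the players.
    """
    per_load_reward = 1.0 if random.random() < per_load_mean else 0.0
    return per_load_reward * min(num_players, capacity)
```

With `per_load_mean = 1.0` the randomness disappears, so five players on an arm with capacity three earn a total reward of three: the arm saturates, which is exactly the regime lying between full collision (zero reward) and fully independent rewards.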