在概率无限的对抗攻击下，坚固的随机匪徒算法

论文标题

在概率无限的对抗攻击下，坚固的随机匪徒算法

Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack

论文作者

Guan, Ziwei, Ji, Kaiyi, Bucci Jr, Donald J, Hu, Timothy Y, Palombo, Joseph, Liston, Michael, Liang, Yingbin

论文摘要

在各种攻击模型下，对多臂匪徒形式主义进行了广泛的研究，在这种模型中，对手可以修改向玩家揭示的奖励。先前的研究集中在攻击值在每轮界限或发生的可能性消失的情况下。这些模型不会捕获强大的对手，这些对手会灾难性地扰乱所揭示的奖励。本文调查了攻击模型，在该模型中，每个回合都具有一定概率的对手攻击，如果攻击值可以是任意且无限的，则它会攻击。此外，攻击值不一定遵循统计分布。我们提出了一种新型的基于样本的基于样本和探索的UCB算法（称为MED-E-UCB），以及一种基于中位数的$ε$ - 果岭算法（称为Med-$ $ $ε$ -Greedy）。这两种算法对上述攻击模型都非常强大。更具体地说，我们表明，这两种算法实现了$ \ Mathcal {O}（\ log T）$ pseudo-regret（即，没有攻击的最佳遗憾）。我们还提供了$ \ Mathcal {o}（\ log t）$遗憾的高概率保证。只要攻击概率不超过一定的恒定阈值，这些界限就可以在任意和无限的奖励扰动下实现。我们提供了拟议算法的多个综合模拟，以验证这些主张并展示现有技术以实现sublrinear遗憾的能力。我们还提供了使用多个软件定义的无线电在认知无线电设置中运行的算法的实验结果。

The multi-armed bandit formalism has been extensively studied under various attack models, in which an adversary can modify the reward revealed to the player. Previous studies focused on scenarios where the attack value either is bounded at each round or has a vanishing probability of occurrence. These models do not capture powerful adversaries that can catastrophically perturb the revealed reward. This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks. Furthermore, the attack value does not necessarily follow a statistical distribution. We propose a novel sample median-based and exploration-aided UCB algorithm (called med-E-UCB) and a median-based $ε$-greedy algorithm (called med-$ε$-greedy). Both of these algorithms are provably robust to the aforementioned attack model. More specifically we show that both algorithms achieve $\mathcal{O}(\log T)$ pseudo-regret (i.e., the optimal regret without attacks). We also provide a high probability guarantee of $\mathcal{O}(\log T)$ regret with respect to random rewards and random occurrence of attacks. These bounds are achieved under arbitrary and unbounded reward perturbation as long as the attack probability does not exceed a certain constant threshold. We provide multiple synthetic simulations of the proposed algorithms to verify these claims and showcase the inability of existing techniques to achieve sublinear regret. We also provide experimental results of the algorithm operating in a cognitive radio setting using multiple software-defined radios.

下载PDF全文

下载文献需遵守相关版权规定

论文标题