Paper Title

Dueling Bandits: From Two-dueling to Multi-dueling

Paper Authors

Yihan Du, Siwei Wang, Longbo Huang

Paper Abstract

We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of $O(\ln T)$. MultiSBM-Feedback not only has an optimal $O(\ln T)$ regret, but also reduces the constant factor by almost a half compared to benchmark results. Then, we consider the general multi-dueling case and develop an efficient algorithm MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an $O(\ln T)$ regret bound and the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.

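The abstract summarizes the setting and the regret guarantees without spelling out a selection rule. As a rough illustration of what one round of multi-dueling looks like, the sketch below implements a simplified RUCB-style step: maintain pairwise win counts, form optimistic (upper-confidence) estimates of each preference probability, and pick a comparison set of m arms. This is a toy under stated assumptions, not the paper's DoublerBAI, MultiSBM-Feedback, or MultiRUCB; the function `select_comparison_set`, the exploration parameter `alpha`, and the synthetic preference matrix are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def select_comparison_set(wins, t, m, alpha=0.51):
    """Pick m arms to duel at round t from pairwise win counts (RUCB-flavored sketch).

    wins[i, j] = number of times arm i has beaten arm j so far.
    """
    K = wins.shape[0]
    n = wins + wins.T                                    # comparisons per pair
    p_hat = np.where(n > 0, wins / np.maximum(n, 1), 0.5)
    # Optimistic (upper-confidence) estimate of P(arm i beats arm j).
    ucb = p_hat + np.sqrt(alpha * np.log(max(t, 2)) / np.maximum(n, 1))
    np.fill_diagonal(ucb, 0.5)
    # Candidate arms: those that could still plausibly beat every other arm.
    candidates = np.flatnonzero((ucb >= 0.5).all(axis=1))
    champion = int(rng.choice(candidates)) if candidates.size else int(rng.integers(K))
    # Fill the rest of the set with the champion's strongest optimistic rivals.
    rivals = [a for a in np.argsort(ucb[:, champion])[::-1] if a != champion][: m - 1]
    return [champion] + [int(a) for a in rivals]


# Toy run: 5 arms, compare m = 3 arms per round, synthetic preference matrix.
K, m, T = 5, 3, 1000
true_pref = rng.uniform(0.3, 0.7, size=(K, K))           # P(i beats j), illustrative only
wins = np.zeros((K, K))
for t in range(1, T + 1):
    chosen = select_comparison_set(wins, t, m)
    for idx, i in enumerate(chosen):
        for j in chosen[idx + 1:]:
            if rng.random() < true_pref[i, j]:
                wins[i, j] += 1
            else:
                wins[j, i] += 1
```

The confidence radius $\sqrt{\alpha \ln t / n_{ij}}$ shrinks logarithmically in $t$, which mirrors the $O(\ln T)$ regret bounds quoted in the abstract; the paper's actual algorithms and analysis should be consulted for the precise selection rules and constants.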