Paper Title

Dueling Bandits: From Two-dueling to Multi-dueling

Paper Authors

Yihan Du, Siwei Wang, Longbo Huang

Paper Abstract

We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of $O(\ln T)$. MultiSBM-Feedback not only has an optimal $O(\ln T)$ regret, but also reduces the constant factor by almost a half compared to benchmark results. Then, we consider the general multi-dueling case and develop an efficient algorithm MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an $O(\ln T)$ regret bound and the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.

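The abstract summarizes the setting and the regret guarantees without spelling out a selection rule. As a rough illustration of what one round of multi-dueling looks like, the sketch below implements a simplified RUCB-style step: maintain pairwise win counts, form optimistic (upper-confidence) estimates of each preference probability, and pick a comparison set of m arms. This is a toy under stated assumptions, not the paper's DoublerBAI, MultiSBM-Feedback, or MultiRUCB; the function `select_comparison_set`, the exploration parameter `alpha`, and the synthetic preference matrix are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def select_comparison_set(wins, t, m, alpha=0.51):
    """Pick m arms to duel at round t from pairwise win counts (RUCB-flavored sketch).

    wins[i, j] = number of times arm i has beaten arm j so far.
    """
    K = wins.shape[0]
    n = wins + wins.T                                    # comparisons per pair
    p_hat = np.where(n > 0, wins / np.maximum(n, 1), 0.5)
    # Optimistic (upper-confidence) estimate of P(arm i beats arm j).
    ucb = p_hat + np.sqrt(alpha * np.log(max(t, 2)) / np.maximum(n, 1))
    np.fill_diagonal(ucb, 0.5)
    # Candidate arms: those that could still plausibly beat every other arm.
    candidates = np.flatnonzero((ucb >= 0.5).all(axis=1))
    champion = int(rng.choice(candidates)) if candidates.size else int(rng.integers(K))
    # Fill the rest of the set with the champion's strongest optimistic rivals.
    rivals = [a for a in np.argsort(ucb[:, champion])[::-1] if a != champion][: m - 1]
    return [champion] + [int(a) for a in rivals]


# Toy run: 5 arms, compare m = 3 arms per round, synthetic preference matrix.
K, m, T = 5, 3, 1000
true_pref = rng.uniform(0.3, 0.7, size=(K, K))           # P(i beats j), illustrative only
wins = np.zeros((K, K))
for t in range(1, T + 1):
    chosen = select_comparison_set(wins, t, m)
    for idx, i in enumerate(chosen):
        for j in chosen[idx + 1:]:
            if rng.random() < true_pref[i, j]:
                wins[i, j] += 1
            else:
                wins[j, i] += 1
```

The confidence radius $\sqrt{\alpha \ln t / n_{ij}}$ shrinks logarithmically in $t$, which mirrors the $O(\ln T)$ regret bounds quoted in the abstract; the paper's actual algorithms and analysis should be consulted for the precise selection rules and constants.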