Paper title
Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Paper authors
Paper abstract
We study stochastic structured bandits for minimizing regret. The fact that the popular optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality (asymptotic optimality for short) has recently drawn the attention of researchers. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon $n$) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an $\omega(1)$ term w.r.t. $n$ in their regret, failing to adapt to the "easiness" of the instance. In this paper, we focus on the finite hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer by introducing a new algorithm called CRush Optimism with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the informative arms indicated by a pessimistic hypothesis. Our finite-time analysis shows that CROP $(i)$ achieves constant-factor asymptotic optimality and, thanks to its forced-exploration-free design, $(ii)$ adapts to bounded regret, and $(iii)$ has a regret bound that scales not with $K$ but with an effective number of arms $K_\psi$ that we introduce. We also discuss a problem class where CROP can be exponentially better than existing algorithms in \textit{nonasymptotic} regimes. This problem class also reveals a surprising fact: even a clairvoyant oracle who plays according to the asymptotically optimal arm-pull scheme may suffer linear worst-case regret.
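The finite-hypothesis setting the abstract refers to can be made concrete with a small toy sketch. The code below is a hypothetical illustration only, not the CROP algorithm: the environment is one of a finite set of candidate mean vectors, and a naive plug-in learner picks the hypothesis closest to the empirical means and plays its best arm. All names (`hypotheses`, `run`, the count-weighted score) are invented for this sketch; CROP's actual hypothesis-elimination and informative-arm logic is given in the paper.

```python
import random

# Toy finite-hypothesis bandit (hypothetical sketch, NOT the CROP algorithm).
# Each hypothesis is a vector of Bernoulli arm means; exactly one is true.
hypotheses = [
    [0.9, 0.1, 0.5],  # hypothesis 0 (the true one below)
    [0.1, 0.9, 0.5],  # hypothesis 1
    [0.5, 0.5, 0.9],  # hypothesis 2
]
TRUE_HYP = 0
K = len(hypotheses[0])


def pull(arm, rng):
    """Bernoulli reward drawn from the true hypothesis."""
    return 1.0 if rng.random() < hypotheses[TRUE_HYP][arm] else 0.0


def run(n, seed=0):
    """Plug-in strategy: guess the hypothesis nearest the empirical means
    (count-weighted squared distance) and play its greedy arm.
    Returns the cumulative pseudo-regret after n rounds."""
    rng = random.Random(seed)
    counts = [0] * K
    sums = [0.0] * K
    regret = 0.0
    best_mean = max(hypotheses[TRUE_HYP])
    for _ in range(n):
        # Empirical means; unpulled arms default to 0.5.
        means = [sums[a] / counts[a] if counts[a] else 0.5 for a in range(K)]

        def score(h):
            # Distance of hypothesis h to the data, weighted by pull counts.
            return sum(counts[a] * (h[a] - means[a]) ** 2 for a in range(K))

        guess = min(hypotheses, key=score)
        arm = max(range(K), key=lambda a: guess[a])  # greedy w.r.t. the guess
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
        regret += best_mean - hypotheses[TRUE_HYP][arm]
    return regret
```

This naive plug-in rule only ever plays the greedy arm of its current guess; the abstract's point is that a careful algorithm may instead need to pull *informative* arms that discriminate between hypotheses, which is what CROP's pessimistic hypothesis is used for.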