生存强盗问题

论文标题

生存强盗问题

The Survival Bandit Problem

论文作者

Riou, Charles, Honda, Junya, Sugiyama, Masashi

论文摘要

我们介绍并研究了多军匪徒问题（MAB）的新变体，称为生存匪徒问题（S-MAB）。在这两个问题上，目的是最大化所谓的累积奖励，但在这种新变体中，如果累积奖励低于预设阈值，则该过程会中断。 MAB的这种简单但未开发的扩展是从许多实际应用中遵循的。例如，当对自愿患者相互测试两种药物时，人们的健康受到威胁，如果发生严重的副作用，或者如果治疗未消除疾病综合症，则必须能够中断实验。从理论的角度来看，S-MAB是mab的第一个变体，该过程可能会中断或不会中断。我们首先将S-MAB正式化，然后将其目标定义为所谓的生存后悔的最小化，这自然会概括MAB的遗憾。然后，我们表明，S-MAB的目标比MAB要困难得多，从某种意义上说，与MAB相反，没有任何政策可以实现相当小的（即sublinear）生存的遗憾。取而代之的是，我们从帕累托（Pareto）的意义上最大程度地减少了生存后悔，即，我们寻求一项政策，其累计奖励在某些问题实例而不为另一个问题牺牲的情况下无法得到改善。为此，我们确定了生存后悔中的两个关键组成部分：没有毁灭的遗憾（与MAB中的遗憾相对应），并且该程序被中断的可能性称为毁灭的可能性。我们在毁灭的概率上得出了一个下限，以及其毁灭概率与下限匹配的策略。最后，基于对这些政策的加倍窍门，我们得出了一项政策，从而将帕累托（Perotto）等人的帕托（Pareto）意义上的生存后悔最小化，从而回答了Perotto等人的开放问题。（Colt 2019）。

We introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While in both problems, the objective is to maximize the so-called cumulative reward, in this new variant, the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB follows from many practical applications. For example, when testing two medicines against each other on voluntary patients, people's health are at stake, and it is necessary to be able to interrupt experiments if serious side effects occur or if the disease syndromes are not dissipated by the treatment. From a theoretical perspective, the S-MAB is the first variant of the MAB where the procedure may or may not be interrupted. We start by formalizing the S-MAB and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. Then, we show that the objective of the S-MAB is considerably more difficult than the MAB, in the sense that contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. For that purpose, we identify two key components in the survival regret: the regret given no ruin (which corresponds to the regret in the MAB), and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick on those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, giving an answer to an open problem by Perotto et al. (COLT 2019).

下载PDF全文

下载文献需遵守相关版权规定

论文标题