论文标题
基于变更检测的汤普森采样框架,用于非平板土匪
A Change-Detection Based Thompson Sampling Framework for Non-Stationary Bandits
论文作者
论文摘要
我们考虑一个非平稳的两臂强盗框架,并提出了基于变更检测的汤普森采样(TS)算法,称为带更改检测(TS-CD)的TS,以跟踪动态环境。非平稳性是使用泊松到达过程建模的,该过程改变了每个到达时的奖励的平均值。拟议的策略将手臂奖励的经验平均值与对其历史的奖励平均值进行了比较。当经验平均值与均值大于阈值的值偏离均值估计值时,它会检测到变化。然后,我们表征了在时间窗口的持续时间内,匪徒必须保持固定的时间范围内,以使TS-CD保持静止状态,以便在发生时成功地检测到变化。因此,我们的结果突出了泊松到达过程的参数的上限,TS-CD以高概率实现了渐近遗憾的最佳性。最后,我们通过测试无线网络中的无线电访问技术(大鼠)选择的边缘控制来验证TS-CD的功效。我们的结果表明,TS-CD不仅胜过经典的最大大鼠选择策略,而且还表现出其他为非平稳环境而设计的主动自适应和被动自适应强盗算法。
We consider a non-stationary two-armed bandit framework and propose a change-detection based Thompson sampling (TS) algorithm, named TS with change-detection (TS-CD), to keep track of the dynamic environment. The non-stationarity is modeled using a Poisson arrival process, which changes the mean of the rewards on each arrival. The proposed strategy compares the empirical mean of the recent rewards of an arm with the estimate of the mean of the rewards from its history. It detects a change when the empirical mean deviates from the mean estimate by a value larger than a threshold. Then, we characterize the lower bound on the duration of the time-window for which the bandit framework must remain stationary for TS-CD to successfully detect a change when it occurs. Consequently, our results highlight an upper bound on the parameter for the Poisson arrival process, for which the TS-CD achieves asymptotic regret optimality with high probability. Finally, we validate the efficacy of TS-CD by testing it for edge-control of radio access technique (RAT)-selection in a wireless network. Our results show that TS-CD not only outperforms the classical max-power RAT selection strategy but also other actively adaptive and passively adaptive bandit algorithms that are designed for non-stationary environments.