Paper Title
A Linear Bandit for Seasonal Environments
Paper Authors
Paper Abstract
Contextual bandit algorithms are extremely popular and widely used in recommendation systems to provide online personalised recommendations. A recurrent assumption is the stationarity of the reward function, which is rather unrealistic in most real-world applications. In music recommendation, for instance, people's music taste can abruptly change during certain events, such as Halloween or Christmas, and revert to their previous taste soon after. We therefore need an algorithm that can promptly react to these changes. Moreover, we would like to leverage rewards already observed during earlier stationary periods, which may reoccur, without restarting the learning process from scratch. A growing literature addresses the non-stationarity of the reward, providing algorithms that quickly adapt to a changing environment. To the best of our knowledge, however, no existing algorithm handles seasonal changes of the reward function. Here we present a contextual bandit algorithm that detects and adapts to abrupt changes of the reward function and reuses previous estimates whenever the environment falls back to a previously observed state. We show that the proposed method can outperform state-of-the-art algorithms for non-stationary environments. We ran experiments on both synthetic and real datasets.
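The abstract describes the core idea only at a high level: a linear bandit that detects abrupt reward changes and reuses estimates from previously observed stationary periods. The sketch below is one illustrative way to realise that idea in Python; it is not the authors' algorithm, and the class name `SeasonalLinUCB`, the windowed mean-absolute-error change test, and all thresholds are assumptions introduced for exposition.

```python
# Minimal sketch (assumed design, not the paper's exact method): a LinUCB-style
# linear bandit that monitors recent prediction error to detect abrupt reward
# changes, archives the sufficient statistics of past stationary periods, and
# reactivates the stored model that best explains the recent rewards.
import numpy as np
from collections import deque


class SeasonalLinUCB:
    def __init__(self, dim, alpha=1.0, lam=1.0, window=50, error_threshold=2.0):
        self.dim = dim
        self.alpha = alpha                    # exploration weight
        self.lam = lam                        # ridge regularisation
        self.window = window                  # observations used for detection/matching
        self.error_threshold = error_threshold  # illustrative value, needs tuning
        self.memory = []                      # (A, b) pairs from past stationary periods
        self.recent = deque(maxlen=window)    # recent (context, reward) pairs
        self._reset_model()

    def _reset_model(self):
        self.A = self.lam * np.eye(self.dim)  # regularised design matrix X^T X + lam I
        self.b = np.zeros(self.dim)           # X^T y

    @staticmethod
    def _theta(A, b):
        return np.linalg.solve(A, b)

    def select(self, contexts):
        """Pick the arm with the highest upper confidence bound (one context per arm)."""
        theta = self._theta(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.recent.append((x, reward))
        # Change detection: a persistent rise in prediction error on the
        # recent window signals an abrupt shift of the reward function.
        if (len(self.recent) == self.window
                and self._recent_error(self.A, self.b) > self.error_threshold):
            self._handle_change()
        self.A += np.outer(x, x)
        self.b += reward * x

    def _recent_error(self, A, b):
        theta = self._theta(A, b)
        return float(np.mean([abs(r - x @ theta) for x, r in self.recent]))

    def _handle_change(self):
        # Archive the current model, then reuse the stored model that best
        # explains the recent window, or restart from scratch if none fits.
        self.memory.append((self.A.copy(), self.b.copy()))
        best_A, best_b = min(self.memory, key=lambda ab: self._recent_error(*ab))
        if self._recent_error(best_A, best_b) <= self.error_threshold:
            self.A, self.b = best_A.copy(), best_b.copy()
        else:
            self._reset_model()
        self.recent.clear()
```

A caller would loop over rounds: pass the per-arm context vectors to `select`, play the returned arm, and feed the observed reward to `update`. The archived `(A, b)` statistics are what let the learner resume a previously seen seasonal regime instead of relearning it from scratch, which is the behaviour the abstract motivates.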