Paper Title

Learning from eXtreme Bandit Feedback

Paper Authors

Lopez, Romain, Dhillon, Inderjit S., Jordan, Michael I.

Paper Abstract

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
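The abstract does not spell out the estimator's formula, but the core idea of restricting importance weighting to the top-p actions of the logging policy can be illustrated with a minimal sketch. The numpy code below is an assumption-laden illustration, not the paper's exact sIS/POXM implementation: it contrasts a standard inverse-propensity-scoring (IPS) estimate with a selective variant that only weights instances whose logged action lies in the logging policy's top-p set, renormalizing the logging propensities over that subset. The function names, the zero default reward for excluded instances, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def ips_value(pi_new, pi_log, actions, rewards):
    """Standard inverse-propensity-scoring (IPS) value estimate.

    pi_new, pi_log: (n, K) arrays of action probabilities under the new
    and logging policies; actions, rewards: length-n logged bandit data.
    """
    idx = np.arange(len(actions))
    weights = pi_new[idx, actions] / pi_log[idx, actions]  # importance weights
    return np.mean(weights * rewards)

def selective_is_value(pi_new, pi_log, actions, rewards, p):
    """Illustrative selective importance sampling sketch (not the paper's
    exact sIS formula): importance weights are used only when the logged
    action falls in the top-p actions of the logging policy, with the
    logging propensities renormalized over that small subset; all other
    instances contribute an assumed default reward of zero.
    """
    idx = np.arange(len(actions))
    # top-p actions of the logging policy for each instance
    top_p = np.argsort(-pi_log, axis=1)[:, :p]
    in_subset = (top_p == actions[:, None]).any(axis=1)
    # probability mass the logging policy places on the selected subset
    subset_mass = np.take_along_axis(pi_log, top_p, axis=1).sum(axis=1)
    weights = pi_new[idx, actions] * subset_mass / pi_log[idx, actions]
    return np.mean(np.where(in_subset, weights * rewards, 0.0))

# Example usage with synthetic data (hypothetical sizes):
rng = np.random.default_rng(0)
n, K = 1000, 50
pi_log = rng.dirichlet(np.ones(K), size=n)
pi_new = rng.dirichlet(np.ones(K), size=n)
actions = np.array([rng.choice(K, p=pi_log[i]) for i in range(n)])
rewards = rng.binomial(1, 0.1, size=n).astype(float)
print(ips_value(pi_new, pi_log, actions, rewards))
print(selective_is_value(pi_new, pi_log, actions, rewards, p=5))
```

The design intuition sketched here mirrors the abstract's description: by conditioning on a small, data-dependent action subset (a form of Rao-Blackwellization), the importance weights stay bounded even when the full action space has millions of actions, trading a controlled bias for a large reduction in variance.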
