Paper Title

Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

Authors

Yifan Lin, Yuhao Wang, Enlu Zhou

Abstract

In this paper we consider the contextual multi-armed bandit problem with linear payoffs under a risk-averse criterion. At each round, a context is revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm to the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left((1+\rho+\frac{1}{\rho})\, d \ln T \ln\frac{K}{\delta} \sqrt{dKT^{1+2\epsilon}\ln\frac{K}{\delta}\,\frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0<\epsilon<\frac{1}{2}$ and $0<\delta<1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.
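For concreteness, below is a minimal sketch (not the authors' implementation) of Thompson Sampling for the disjoint linear model scored by a mean-variance criterion. It assumes Gaussian reward noise with known variance, a standard-normal prior per arm, and the common convention $MV = \rho \cdot \text{mean} - \text{variance}$ with risk tolerance $\rho$; the names `DisjointLinTS`, `select_arm`, and `update` are illustrative, and the variant analyzed in the paper includes details not shown here.

```python
import numpy as np

class DisjointLinTS:
    """Sketch of Thompson Sampling for the disjoint linear model:
    each arm k keeps its own Gaussian posterior over a d-dim parameter."""

    def __init__(self, n_arms, dim, noise_var=1.0, rho=1.0, seed=0):
        self.noise_var = noise_var  # known reward-noise variance (assumption)
        self.rho = rho              # risk tolerance in MV = rho * mean - variance
        self.rng = np.random.default_rng(seed)
        # Per-arm posterior in precision form: theta_k ~ N(B_k^{-1} f_k, B_k^{-1}).
        # Prior is N(0, I), i.e. B_k = I, f_k = 0.
        self.B = [np.eye(dim) for _ in range(n_arms)]
        self.f = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, contexts):
        """contexts: (n_arms, dim) array, one feature vector per arm."""
        scores = []
        for k, x in enumerate(contexts):
            cov = np.linalg.inv(self.B[k])
            theta = self.rng.multivariate_normal(cov @ self.f[k], cov)
            mean_hat = x @ theta                    # sampled mean reward
            var_hat = x @ cov @ x + self.noise_var  # plug-in reward variance
            scores.append(self.rho * mean_hat - var_hat)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Standard Bayesian linear-regression update for the pulled arm only.
        self.B[arm] += np.outer(x, x) / self.noise_var
        self.f[arm] += reward * x / self.noise_var
```

As a purely illustrative check, the sketch can be run on synthetic linear rewards with hypothetical per-arm parameters:

```python
rng = np.random.default_rng(1)
true_theta = rng.normal(size=(3, 5))       # hypothetical per-arm parameters
agent = DisjointLinTS(n_arms=3, dim=5, rho=2.0)
for t in range(1000):
    contexts = rng.normal(size=(3, 5))     # a context revealed for each arm
    k = agent.select_arm(contexts)
    r = contexts[k] @ true_theta[k] + rng.normal()
    agent.update(k, contexts[k], r)
```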
