Paper Title

Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

Authors

Yifan Lin, Yuhao Wang, Enlu Zhou

Abstract

In this paper we consider the contextual multi-armed bandit problem with linear payoffs under a risk-averse criterion. At each round, a context is revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm to the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left((1+\rho+\frac{1}{\rho})\, d \ln T \ln\frac{K}{\delta} \sqrt{dKT^{1+2\epsilon}\ln\frac{K}{\delta}\,\frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0<\epsilon<\frac{1}{2}$ and $0<\delta<1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.
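For concreteness, below is a minimal sketch (not the authors' implementation) of Thompson Sampling for the disjoint linear model scored by a mean-variance criterion. It assumes Gaussian reward noise with known variance, a standard-normal prior per arm, and the common convention $MV = \rho \cdot \text{mean} - \text{variance}$ with risk tolerance $\rho$; the names `DisjointLinTS`, `select_arm`, and `update` are illustrative, and the variant analyzed in the paper includes details not shown here.

```python
import numpy as np

class DisjointLinTS:
    """Sketch of Thompson Sampling for the disjoint linear model:
    each arm k keeps its own Gaussian posterior over a d-dim parameter."""

    def __init__(self, n_arms, dim, noise_var=1.0, rho=1.0, seed=0):
        self.noise_var = noise_var  # known reward-noise variance (assumption)
        self.rho = rho              # risk tolerance in MV = rho * mean - variance
        self.rng = np.random.default_rng(seed)
        # Per-arm posterior in precision form: theta_k ~ N(B_k^{-1} f_k, B_k^{-1}).
        # Prior is N(0, I), i.e. B_k = I, f_k = 0.
        self.B = [np.eye(dim) for _ in range(n_arms)]
        self.f = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, contexts):
        """contexts: (n_arms, dim) array, one feature vector per arm."""
        scores = []
        for k, x in enumerate(contexts):
            cov = np.linalg.inv(self.B[k])
            theta = self.rng.multivariate_normal(cov @ self.f[k], cov)
            mean_hat = x @ theta                    # sampled mean reward
            var_hat = x @ cov @ x + self.noise_var  # plug-in reward variance
            scores.append(self.rho * mean_hat - var_hat)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Standard Bayesian linear-regression update for the pulled arm only.
        self.B[arm] += np.outer(x, x) / self.noise_var
        self.f[arm] += reward * x / self.noise_var
```

As a purely illustrative check, the sketch can be run on synthetic linear rewards with hypothetical per-arm parameters:

```python
rng = np.random.default_rng(1)
true_theta = rng.normal(size=(3, 5))       # hypothetical per-arm parameters
agent = DisjointLinTS(n_arms=3, dim=5, rho=2.0)
for t in range(1000):
    contexts = rng.normal(size=(3, 5))     # a context revealed for each arm
    k = agent.select_arm(contexts)
    r = contexts[k] @ true_theta[k] + rng.normal()
    agent.update(k, contexts[k], r)
```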
