Paper Title

Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path

Authors

Yihan Du, Siwei Wang, Longbo Huang

Abstract

In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes $K$. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.
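
To make the objective concrete, below is a minimal sketch of the standard (lower-tail) CVaR definition and the Bellman-style recursion that an iterated-CVaR objective suggests; the symbols $\alpha$, $V^\pi_h$, $P$, and $H$ are illustrative and the paper's exact notation may differ.

$$
\mathrm{CVaR}_\alpha(X) \;=\; \sup_{z \in \mathbb{R}} \Big\{ z - \tfrac{1}{\alpha}\,\mathbb{E}\big[(z - X)^{+}\big] \Big\},
$$

i.e., the expected value of $X$ restricted to its worst $\alpha$-fraction of outcomes. An iterated-CVaR value function would then replace the usual expectation over next states with a CVaR at every step:

$$
V^\pi_h(s) \;=\; r\big(s, \pi_h(s)\big) \;+\; \mathrm{CVaR}_\alpha^{\,s' \sim P(\cdot \mid s,\, \pi_h(s))}\big( V^\pi_{h+1}(s') \big),
\qquad V^\pi_{H+1}(\cdot) \equiv 0.
$$

As $\alpha \to 0$, $\mathrm{CVaR}_\alpha$ approaches the essential infimum, which recovers the Worst Path objective of maximizing the minimum possible cumulative reward.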
