Paper Title
Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning
Paper Authors
Paper Abstract
Keeping risk under control is often more crucial than maximizing expected rewards in real-world decision-making situations, such as finance, robotics, autonomous driving, etc. The most natural choice of risk measure is variance, which penalizes upside volatility as much as the downside part. Instead, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is more suitable for risk-averse purposes. This paper aims to optimize the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. the steady reward distribution. Since semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods are not directly applicable to MSV problems. To tackle this challenge, we resort to Perturbation Analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on the policy gradient theory and the trust region method. Finally, we conduct diverse experiments, from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods.
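For orientation, a minimal sketch of the standard (downside) semivariance and a mean-semivariance objective under the steady reward distribution is given below; the trade-off coefficient \lambda is an assumed risk-aversion weight and the exact formulation in the paper may differ.

\[
\eta_\pi = \mathbb{E}_{R \sim d_\pi}[R], \qquad
\mathrm{SV}_\pi = \mathbb{E}_{R \sim d_\pi}\!\left[\big(\min\{R - \eta_\pi,\, 0\}\big)^2\right], \qquad
\max_\pi \; \eta_\pi - \lambda\, \mathrm{SV}_\pi ,
\]

where d_\pi denotes the steady reward distribution induced by policy \pi and \lambda > 0 controls the degree of risk aversion. Because the mean \eta_\pi appears inside \mathrm{SV}_\pi, the penalty term depends on the policy itself, which is consistent with the abstract's observation that the criterion does not satisfy a standard Bellman equation and must instead be handled through a sequence of RL problems with a policy-dependent reward.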