Paper Title
Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification
Authors
Abstract
In the field of reinforcement learning, because of the high cost and risk of policy training in the real world, policies are trained in a simulation environment and transferred to the corresponding real-world environment. However, the simulation environment does not perfectly mimic the real-world environment, leading to model misspecification. Multiple studies have reported significant deterioration of policy performance in real-world environments. In this study, we focus on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values, called the uncertainty parameter set. The aim is to optimize the worst-case performance over the uncertainty parameter set in order to guarantee performance in the corresponding real-world environment. To solve this optimization, we propose an off-policy actor-critic approach called the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent-descent approach. Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method exhibits worst-case performance superior to several baseline approaches.
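For clarity, the worst-case objective the abstract describes can be written as a max-min problem. The notation below is ours, not necessarily the paper's: pi is the policy, omega an uncertainty parameter drawn from the uncertainty parameter set Omega, gamma the discount factor, and J the expected discounted return in the environment instantiated by omega.

$$
\max_{\pi} \; \min_{\omega \in \Omega} \; J(\pi, \omega),
\qquad
J(\pi, \omega) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; \pi, \omega \right].
$$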
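The abstract states that M2TD3 solves this problem by simultaneous gradient ascent-descent. The sketch below is only a toy illustration of that update scheme on an analytic saddle objective, not the paper's algorithm: M2TD3 itself applies such updates to actor and uncertainty-parameter estimates through TD3-style off-policy critics. The objective f, its gradients, and the step size lr are our own illustrative choices.

```python
# Toy illustration of simultaneous gradient ascent-descent (assumed scheme;
# not the paper's implementation). We maximize over theta (stand-in for the
# policy) and minimize over omega (stand-in for the uncertainty parameter)
# on a simple saddle objective f(theta, omega) = -theta^2 + theta*omega + omega^2,
# which is concave in theta and convex in omega, with saddle point at (0, 0).

def grad_theta(theta, omega):
    # Partial derivative of f with respect to theta.
    return -2.0 * theta + omega

def grad_omega(theta, omega):
    # Partial derivative of f with respect to omega.
    return theta + 2.0 * omega

theta, omega = 1.0, -1.0  # arbitrary starting point
lr = 0.05                 # shared step size for both players

for step in range(500):
    g_t = grad_theta(theta, omega)
    g_o = grad_omega(theta, omega)
    # Simultaneous update: both variables move using gradients evaluated
    # at the same current point, theta ascending and omega descending.
    theta += lr * g_t
    omega -= lr * g_o

print(theta, omega)  # both values approach the saddle point (0, 0)
```

Because the updates are simultaneous rather than alternating, neither player gets to react to the other's new position within a step; on this quadratic saddle the coupled iteration spirals into the equilibrium, which is the behavior the max-min formulation requires.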