Paper Title

Hybrid Value Estimation for Off-policy Evaluation and Offline Reinforcement Learning

Paper Authors

Xue-Kun Jin, Xu-Hui Liu, Shengyi Jiang, Yang Yu

Paper Abstract

Value function estimation is an indispensable subroutine in reinforcement learning, and it becomes more challenging in the offline setting. In this paper, we propose Hybrid Value Estimation (HVE) to reduce value estimation error; it trades off bias and variance by balancing between the value estimation from offline data and that from the learned model. Theoretical analysis shows that HVE enjoys a better error bound than the direct methods. HVE can be leveraged in both off-policy evaluation and offline reinforcement learning settings, and we therefore provide two concrete algorithms, Off-policy HVE (OPHVE) and Model-based Offline HVE (MOHVE), respectively. Empirical evaluations on MuJoCo tasks corroborate the theoretical claims: OPHVE outperforms other off-policy evaluation methods on all three metrics measuring estimation effectiveness, while MOHVE achieves performance better than or comparable to state-of-the-art offline reinforcement learning algorithms. We hope that HVE can shed some light on further research on reinforcement learning from fixed data.
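
To make the bias-variance trade-off concrete, the sketch below is a minimal, hypothetical illustration, not the paper's actual HVE algorithm: it rolls a learned dynamics and reward model forward for a fixed horizon under the target policy and then bootstraps the tail return with a value function fitted on the offline data, so the rollout horizon controls how much the estimate relies on the model versus the data. The policy, model_step, and data_value_fn interfaces are assumptions introduced purely for illustration.

def hybrid_value_estimate(s0, policy, model_step, data_value_fn,
                          horizon, gamma=0.99):
    """Hypothetical h-step hybrid estimate (illustrative only, not the
    paper's exact HVE): roll a learned dynamics/reward model forward for
    `horizon` steps under `policy`, then bootstrap the tail with a value
    function fitted on the offline data.  A short horizon leans on the
    data-based estimate; a long horizon leans on the learned model."""
    ret, discount, s = 0.0, 1.0, s0
    for _ in range(horizon):
        a = policy(s)
        s, r = model_step(s, a)          # learned dynamics + reward model
        ret += discount * r
        discount *= gamma
    return ret + discount * data_value_fn(s)  # offline-data value bootstrap

# Toy usage with stand-in components (purely illustrative).
if __name__ == "__main__":
    policy = lambda s: 0.1 * s                       # deterministic policy
    model_step = lambda s, a: (0.9 * s + a, -s * s)  # toy dynamics and reward
    data_value_fn = lambda s: -10.0 * s * s          # value fitted from data
    print(hybrid_value_estimate(1.0, policy, model_step, data_value_fn, horizon=5))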
