Paper Title

Data Valuation for Offline Reinforcement Learning

Paper Authors

Amir Abolfazli, Gregory Palmer, Daniel Kudenko

Paper Abstract

The success of deep reinforcement learning (DRL) hinges on the availability of training data, which is typically obtained via a large number of environment interactions. In many real-world scenarios, costs and risks are associated with gathering these data. The field of offline reinforcement learning addresses these issues by outsourcing the collection of data to a domain expert or a carefully monitored program and subsequently searching for a batch-constrained optimal policy. With the emergence of data markets, an alternative to constructing a dataset in-house is to purchase external data. However, while state-of-the-art offline reinforcement learning approaches have shown a lot of promise, they currently rely on carefully constructed datasets that are well aligned with the intended target domains. This raises questions regarding the transferability and robustness of an offline reinforcement learning agent trained on externally acquired data. In this paper, we empirically evaluate the ability of current state-of-the-art offline reinforcement learning approaches to cope with source-target domain mismatch within two MuJoCo environments, finding that current state-of-the-art offline reinforcement learning algorithms underperform in the target domain. To address this, we propose data valuation for offline reinforcement learning (DVORL), which allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms. The results show that our method outperforms offline reinforcement learning baselines on two MuJoCo environments.
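To make the general idea concrete, below is a minimal sketch (not the authors' implementation) of valuation-based filtering of an offline dataset before training: each transition receives a usefulness score, and only the highest-valued transitions are handed to the offline RL algorithm. The valuation model, the toy dataset, and the downstream trainer are hypothetical placeholders; DVORL's actual estimator is defined in the paper and is not reproduced here.

```python
# Sketch of data-valuation-based transition selection for offline RL.
# All components here are illustrative stand-ins, not the DVORL method itself.
import numpy as np

rng = np.random.default_rng(0)

# A toy offline dataset of (state, action, reward, next_state) transitions.
num_transitions, state_dim, action_dim = 1000, 17, 6
dataset = {
    "states": rng.normal(size=(num_transitions, state_dim)),
    "actions": rng.normal(size=(num_transitions, action_dim)),
    "rewards": rng.normal(size=num_transitions),
    "next_states": rng.normal(size=(num_transitions, state_dim)),
}

def estimate_transition_values(dataset):
    # Placeholder for a data valuation model that scores how useful each
    # transition is for the target domain; random scores stand in here.
    return rng.uniform(size=len(dataset["rewards"]))

def select_top_fraction(dataset, values, keep_fraction=0.5):
    # Keep only the highest-valued transitions for offline RL training.
    k = int(keep_fraction * len(values))
    keep_idx = np.argsort(values)[-k:]
    return {key: arr[keep_idx] for key, arr in dataset.items()}

values = estimate_transition_values(dataset)
filtered = select_top_fraction(dataset, values, keep_fraction=0.5)
print(f"kept {len(filtered['rewards'])} of {num_transitions} transitions")
# The filtered dataset would then replace the raw externally acquired data
# as input to an offline RL algorithm (e.g., a batch-constrained method).
```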
