Paper Title

Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

Authors

Taku Yamagata, Ahmed Khalil, Raul Santos-Rodriguez

Abstract

Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach with a transformer architecture and shows competitive performance on several benchmarks. However, DT lacks stitching ability, one of the critical abilities that lets offline RL learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset contains only sub-optimal trajectories. Conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have this limitation; however, they suffer from unstable learning behaviour, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT), which addresses the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). It uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT on the relabelled data. Our approach efficiently exploits the benefits of both approaches, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, showing competitive performance gains.
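
The relabelling step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a state-value estimate for every step of a trajectory is already available from some Q-learning procedure, that returns are undiscounted, and that the estimate is used as a lower bound on the return-to-go. The function name `relabel_returns_to_go` and its arguments are hypothetical.

```python
from typing import List, Sequence


def relabel_returns_to_go(
    rewards: Sequence[float],
    state_values: Sequence[float],
) -> List[float]:
    """Relabel the return-to-go of one trajectory (illustrative sketch only).

    rewards[t] is the reward received at step t.
    state_values[t] is an assumed, pre-computed value estimate for the state
    at step t (for example, the maximum Q-value over actions).
    """
    if len(rewards) != len(state_values):
        raise ValueError("rewards and state_values must have the same length")

    relabelled = [0.0] * len(rewards)
    next_rtg = 0.0  # return-to-go after the final step is zero

    # Walk the trajectory backwards so a raised value propagates to earlier steps.
    for t in reversed(range(len(rewards))):
        rtg = rewards[t] + next_rtg
        # Assumption: treat the value estimate as a lower bound on the return-to-go.
        rtg = float(max(rtg, state_values[t]))
        relabelled[t] = rtg
        next_rtg = rtg

    return relabelled


if __name__ == "__main__":
    # A sub-optimal trajectory whose middle state is estimated to be worth more
    # than the return actually collected from it.
    print(relabel_returns_to_go([0.0, 0.0, 1.0], [0.0, 5.0, 0.0]))
```

When every value estimate is below the observed return-to-go, the function leaves the original labels unchanged. Under the assumptions above, this is how a sequence model trained on the relabelled data could be conditioned on returns that no single trajectory in the dataset achieved.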
