Paper Title

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Paper Authors

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, Sergey Levine

Abstract

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
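To make the update described in the abstract concrete, below is a minimal PyTorch-style sketch of the advantage-weighted maximum-likelihood actor objective at the core of AWAC. This is an illustrative sketch, not the authors' released implementation: the function and argument names (`awac_actor_loss`, `policy`, `q_network`, `temperature`) are assumptions, the policy is assumed to return a `torch.distributions` object, and the advantage's expectation term is approximated with a single policy sample.

```python
import torch

def awac_actor_loss(policy, q_network, states, actions, temperature=1.0):
    """Advantage-weighted maximum-likelihood actor update (illustrative sketch).

    `states` and `actions` are batches sampled from a replay buffer that is
    seeded with the offline dataset and grown with online experience.
    `temperature` plays the role of the Lagrange-multiplier-like coefficient
    (lambda) that scales the advantage inside the exponential weight.
    """
    dist = policy(states)  # assumed: returns a torch.distributions.Distribution

    with torch.no_grad():
        # Advantage of the buffer action under the current critic:
        # A(s, a) = Q(s, a) - E_{a' ~ pi(.|s)}[Q(s, a')], with the expectation
        # approximated here by a single action sampled from the current policy.
        q_buffer = q_network(states, actions).squeeze(-1)
        value = q_network(states, dist.sample()).squeeze(-1)
        weights = torch.exp((q_buffer - value) / temperature)

    # Weighted maximum likelihood over buffer actions: actions with higher
    # estimated advantage receive exponentially larger weight. The critic is
    # trained separately with a standard TD (Bellman) backup, which supplies
    # the sample-efficient dynamic programming component.
    log_prob = dist.log_prob(actions)
    return -(log_prob * weights).mean()
```

Because the actor update is a weighted supervised objective over data already in the buffer, the same loss can be applied unchanged during purely offline pretraining and during subsequent online fine-tuning.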
