Paper Title

Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Paper Authors

Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, Derek Nowrouzezahrai

Paper Abstract

Adversarial Imitation Learning alternates between learning a discriminator -- which tells apart the expert's demonstrations from generated ones -- and a generator's policy to produce trajectories that can fool this discriminator. This alternated optimization is known to be delicate in practice since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the one from the previous generator's iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator's policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively cuts by half the implementation and computational burden of Adversarial Imitation Learning algorithms by removing the Reinforcement Learning phase altogether. We show on a variety of tasks that our simpler approach is competitive with prevalent Imitation Learning methods.
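
To make the mechanism concrete, below is a minimal PyTorch-style sketch of a transition-level variant of this idea (close in spirit to the paper's ASAF-1 setting). It assumes a hypothetical CategoricalPolicy object exposing a log_prob(states, actions) method; it illustrates the structured-discriminator update described in the abstract and is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical discrete-action policy, included only to make the sketch self-contained;
# the paper's experiments use their own policy networks.
class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def log_prob(self, states, actions):
        # log pi(a|s) for each (state, action) pair in the batch
        log_pi = F.log_softmax(self.net(states), dim=-1)
        return log_pi.gather(1, actions.long().unsqueeze(1)).squeeze(1)


def discriminator_logit(pi_new, pi_old, states, actions):
    # Structured discriminator D(s,a) = pi_new(a|s) / (pi_new(a|s) + pi_old(a|s));
    # its logit reduces to log pi_new(a|s) - log pi_old(a|s).
    with torch.no_grad():
        log_old = pi_old.log_prob(states, actions)  # previous generator, kept frozen
    return pi_new.log_prob(states, actions) - log_old


def discriminator_update(pi_new, pi_old, optimizer, expert_batch, generated_batch):
    # Standard binary cross-entropy: expert transitions labelled 1, generated ones 0.
    # Because the discriminator is parameterized by pi_new, this single update also
    # improves the imitation policy -- no reinforcement-learning step follows it.
    exp_logit = discriminator_logit(pi_new, pi_old, *expert_batch)
    gen_logit = discriminator_logit(pi_new, pi_old, *generated_batch)
    loss = (F.binary_cross_entropy_with_logits(exp_logit, torch.ones_like(exp_logit))
            + F.binary_cross_entropy_with_logits(gen_logit, torch.zeros_like(gen_logit)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After a few such updates, pi_new is used to collect the next batch of trajectories and takes the role of pi_old for the following iteration, which is the sense in which the discriminator update "solves the generator's optimization problem for free".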
