Paper Title

Explaining Fast Improvement in Online Imitation Learning

Paper Authors

Yan, Xinyan; Boots, Byron; Cheng, Ching-An

Paper Abstract

Online imitation learning (IL) is an algorithmic framework that leverages interactions with expert policies for efficient policy optimization. Here policies are optimized by performing online learning on a sequence of loss functions that encourage the learner to mimic expert actions, and if the online learning has no regret, the agent can provably learn an expert-like policy. Online IL has demonstrated empirical successes in many applications, and interestingly, its policy improvement speed observed in practice is usually much faster than existing theory suggests. In this work, we provide an explanation of this phenomenon. Let $\xi$ denote the policy class bias and assume the online IL loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of online IL with stochastic feedback, the policy improves in $\tilde{O}(1/N + \sqrt{\xi/N})$ in both expectation and high probability. In other words, we show that adopting a sufficiently expressive policy class in online IL has two benefits: both the policy improvement speed increases and the performance bias decreases.
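For intuition, below is a minimal sketch of the kind of online IL loop the abstract describes: a DAgger-style procedure that repeatedly rolls out the learner, incurs a convex, smooth, non-negative imitation loss on the visited states, and applies a no-regret (online gradient descent) update. The linear policy, toy dynamics, and helper names (`expert_action`, `env_step`) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a DAgger-style online IL loop (illustrative only; the
# linear policy, toy dynamics, and expert model are assumptions, not the
# paper's setup).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, HORIZON, ROUNDS = 4, 2, 20, 50

W_expert = rng.normal(size=(ACTION_DIM, STATE_DIM))  # hypothetical expert policy
W = np.zeros((ACTION_DIM, STATE_DIM))                # learner policy (linear)

def expert_action(s):
    return W_expert @ s

def env_step(s, a):
    # Toy dynamics used only to generate states for this demo.
    return 0.9 * s + 0.1 * rng.normal(size=STATE_DIM) + 0.05 * a.mean()

for n in range(1, ROUNDS + 1):
    # 1) Roll out the current learner policy to collect states.
    s, states = rng.normal(size=STATE_DIM), []
    for _ in range(HORIZON):
        states.append(s)
        s = env_step(s, W @ s)

    # 2) Per-round imitation loss: squared error to expert actions on the
    #    learner's own state distribution (convex, smooth, non-negative).
    grad, loss = np.zeros_like(W), 0.0
    for s in states:
        err = W @ s - expert_action(s)            # stochastic feedback
        loss += 0.5 * (err @ err) / len(states)
        grad += np.outer(err, s) / len(states)

    # 3) No-regret update: online gradient descent with a decaying step size.
    W -= (1.0 / np.sqrt(n)) * grad
    if n % 10 == 0:
        print(f"round {n:3d}  imitation loss {loss:.4f}")
```

In this sketch the per-round losses are exactly realizable by the linear policy class (bias $\xi \approx 0$), so the imitation loss should shrink quickly with the number of rounds, matching the fast-improvement regime the abstract analyzes.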
