Paper Title
Strictly Batch Imitation Learning by Energy-based Distribution Matching
Paper Authors
Paper Abstract
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment. This *strictly batch imitation learning* problem arises wherever live experimentation is costly, such as in healthcare. One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting. But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient. We argue that a good solution should be able to explicitly parameterize a policy (i.e. respecting action conditionals), implicitly learn from rollout dynamics (i.e. leveraging state marginals), and -- crucially -- operate in an entirely offline fashion. To address this challenge, we propose a novel technique by *energy-based distribution matching* (EDM): By identifying parameterizations of the (discriminative) model of a policy with the (generative) energy function for state distributions, EDM yields a simple but effective solution that equivalently minimizes a divergence between the occupancy measure for the demonstrator and a model thereof for the imitator. Through experiments with application to control and healthcare settings, we illustrate consistent performance gains over existing algorithms for strictly batch imitation learning.
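To make the abstract's central idea concrete, here is a minimal PyTorch sketch (not taken from the paper) of a discrete-action policy whose logits are reused as negative state energies, so the same parameters define both the action conditionals and a generative model of the state marginals. The names (`PolicyEBM`, `edm_style_loss`) are hypothetical, and the training objective shown -- behavioral cloning plus an SGLD-based energy term on states -- is an illustrative assumption rather than the paper's exact EDM objective.

```python
# Minimal sketch of the energy-based distribution matching idea described above.
# All names are hypothetical; the specific loss is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyEBM(nn.Module):
    """Discrete-action policy whose logits double as negative state energies."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # action logits f_theta(s, .)

    def action_log_probs(self, s):
        # Discriminative view: pi_theta(a|s) = softmax_a f_theta(s, a)
        return F.log_softmax(self(s), dim=-1)

    def state_energy(self, s):
        # Generative view: E_theta(s) = -logsumexp_a f_theta(s, a),
        # so the implied state marginal is p_theta(s) ∝ exp(-E_theta(s)).
        return -torch.logsumexp(self(s), dim=-1)


def edm_style_loss(model, states, actions, n_sgld=20, step_size=0.01, noise=0.005):
    """Behavioral cloning on action conditionals + energy term on state marginals."""
    # (1) Discriminative term: fit the demonstrated action conditionals.
    bc_loss = F.nll_loss(model.action_log_probs(states), actions)

    # (2) Generative term: lower the energy of demonstrated states and raise the
    # energy of "negative" states drawn from the model via a short SGLD chain
    # initialized at the demonstration batch (an illustrative choice).
    neg = states.detach().clone().requires_grad_(True)
    for _ in range(n_sgld):
        energy_sum = model.state_energy(neg).sum()
        (grad,) = torch.autograd.grad(energy_sum, neg)
        neg = (neg - step_size * grad
               + noise * torch.randn_like(neg)).detach().requires_grad_(True)
    ebm_loss = (model.state_energy(states).mean()
                - model.state_energy(neg.detach()).mean())

    return bc_loss + ebm_loss


# Usage on a toy batch of demonstrated (state, action) pairs -- entirely offline.
model = PolicyEBM(state_dim=4, n_actions=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
loss = edm_style_loss(model, states, actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The point mirrored from the abstract is that both terms are computed entirely from the batch of demonstrated state-action pairs: no reward signal, transition model, or further environment interaction is needed.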