Paper title

Deconfounding Imitation Learning with Variational Inference

Authors

Risto Vuorio, Pim de Haan, Johann Brehmer, Hanno Ackermann, Daniel Dijkman, Taco Cohen

Abstract

Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. In previous work, to work around the confounding problem, policies have been trained using query access to the expert's policy or inverse reinforcement learning (IRL). However, both approaches have drawbacks as the expert's policy may not be available and IRL can be unstable in practice. Instead, we propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy. We prove that using this method, under strong assumptions, the identification of the correct imitation learning policy is theoretically possible from expert demonstrations alone. In practice, we focus on a setting with less strong assumptions where we use exploration data for learning the inference model. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve an asymptotically optimal imitation performance.
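The core idea in the abstract, inferring the expert's latent information and conditioning the policy on it, can be illustrated with a toy sketch. This is not the paper's algorithm: it swaps the variational inference model for exact Bayesian inference over a binary latent, and the two-armed setting, the reward probabilities, and all function names are assumptions made up for illustration.

```python
import math

# Hypothetical toy setting: a latent theta in {0, 1} names the "good" arm.
P_REWARD_MATCH = 0.8  # assumed P(reward = 1 | arm == theta)
P_REWARD_MISS = 0.2   # assumed P(reward = 1 | arm != theta)


def posterior_over_latent(history, prior=(0.5, 0.5)):
    """Exact Bayes posterior over theta given (arm, reward) pairs.

    Stands in for the learned inference model; only environment feedback
    (rewards) is treated as evidence about theta, not the choice of arm.
    """
    log_post = [math.log(p) for p in prior]
    for arm, reward in history:
        for theta in (0, 1):
            p_one = P_REWARD_MATCH if arm == theta else P_REWARD_MISS
            log_post[theta] += math.log(p_one if reward == 1 else 1.0 - p_one)
    peak = max(log_post)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def latent_conditional_policy(theta):
    """Hypothetical expert-like policy pi(a | theta): pull the arm theta names."""
    return [1.0 if arm == theta else 0.0 for arm in (0, 1)]


def agent_action_probs(history):
    """Marginalize the latent-conditional policy over the inferred posterior."""
    post = posterior_over_latent(history)
    return [
        sum(post[theta] * latent_conditional_policy(theta)[arm] for theta in (0, 1))
        for arm in (0, 1)
    ]
```

Calling `agent_action_probs([(0, 1), (0, 1), (1, 0)])` returns action probabilities weighted by how strongly the observed rewards support each value of the latent. The design point is that the agent's own arm choices shift the posterior only through the rewards they produce, which is one simple way to avoid treating self-generated actions as evidence about the hidden variable.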
