Data-efficient Hindsight Off-policy Option Learning

Paper Title

Data-efficient Hindsight Off-policy Option Learning

Authors

Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel, Nicolas Heess, Martin Riedmiller

Abstract

We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.
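The abstract's phrase "infers likely option choices" through a "dynamic programming inference procedure" can be illustrated with a generic forward recursion over latent options, similar in spirit to the forward pass of a hidden Markov model. The sketch below is an assumption-laden illustration, not the paper's exact formulation: the function name `option_forward_pass`, its three inputs, and the specific terminate-then-reselect model are hypothetical choices made here for clarity. It also omits the backpropagation step entirely and only shows the inference part in plain Python.

```python
def option_forward_pass(action_lik, switch_prob, new_option_prob):
    """Filtered posterior over options at each step of one trajectory.

    All three arguments are hypothetical inputs for this sketch, each a
    list of per-step lists indexed as [t][o]:
      action_lik[t][o]      likelihood of the observed action at step t
                            under option o
      switch_prob[t][o]     probability that option o terminates before
                            step t (ignored at t = 0)
      new_option_prob[t][o] probability of selecting option o whenever an
                            option is (re)selected at step t
    """
    num_steps = len(action_lik)
    num_options = len(action_lik[0])
    posteriors = []
    prev = None
    for t in range(num_steps):
        if prev is None:
            # First step: an option is always freshly selected.
            prior = list(new_option_prob[t])
        else:
            # Probability mass of previous options that terminate now.
            terminated = sum(
                prev[o] * switch_prob[t][o] for o in range(num_options)
            )
            # Either the same option continues, or some option
            # terminated and option o was selected afresh.
            prior = [
                prev[o] * (1.0 - switch_prob[t][o])
                + terminated * new_option_prob[t][o]
                for o in range(num_options)
            ]
        # Weight by how well each option explains the observed action,
        # then normalise (assumes at least one option has non-zero mass).
        unnorm = [prior[o] * action_lik[t][o] for o in range(num_options)]
        total = sum(unnorm)
        post = [u / total for u in unnorm]
        posteriors.append(post)
        prev = post
    return posteriors


# Two options, two steps; the options never terminate in this toy case.
example = option_forward_pass(
    action_lik=[[0.9, 0.1], [0.5, 0.5]],
    switch_prob=[[0.0, 0.0], [0.0, 0.0]],
    new_option_prob=[[0.5, 0.5], [0.5, 0.5]],
)
```

Because every step only needs the previous step's posterior, the recursion runs in time linear in the trajectory length and can be applied to any logged trajectory, which is one way to read the "hindsight" and "off-policy" aspects mentioned in the abstract. In a differentiable framework the same recursion could in principle be backpropagated through, but that part is not shown here.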
