Policy Transfer via Kinematic Domain Randomization and Adaptation

Paper Title

Policy Transfer via Kinematic Domain Randomization and Adaptation

Authors

Ioannis Exarchos, Yifeng Jiang, Wenhao Yu, C. Karen Liu

Abstract

Transferring reinforcement learning policies trained in physics simulation to real hardware remains a challenge, known as the "sim-to-real" gap. Domain randomization is a simple yet effective technique for addressing dynamics discrepancies between source and target domains, but its success generally depends on heuristics and trial and error. In this work, we investigate the impact of randomized parameter selection on policy transferability across different types of domain discrepancies. Contrary to the common practice in which kinematic parameters are carefully measured while dynamic parameters are randomized, we found that virtually randomizing kinematic parameters (e.g., link lengths) during training in simulation generally outperforms dynamic randomization. Based on this finding, we introduce a new domain adaptation algorithm that exploits variation in simulated kinematic parameters. Our algorithm, Multi-Policy Bayesian Optimization, trains an ensemble of universal policies conditioned on virtual kinematic parameters and efficiently adapts to the target environment using a limited number of target-domain rollouts. We showcase our findings on a simulated quadruped robot in five different target environments covering different aspects of domain discrepancies.
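The adaptation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a hypothetical ensemble of three policies, two made-up virtual kinematic parameters with invented bounds, and a synthetic `target_rollout` function in place of a real target-domain rollout. It also substitutes plain random search for Bayesian optimization to stay dependency-free. It only shows the shape of the problem, namely choosing a (policy, kinematic parameters) pair under a fixed rollout budget.

```python
import random

# Hypothetical bounds for two virtual kinematic parameters (link-length scales).
BOUNDS = [(0.8, 1.2), (0.8, 1.2)]
NUM_POLICIES = 3  # size of the toy ensemble (assumed, not from the paper)


def sample_kinematics(rng):
    """Draw one virtual kinematic parameter vector uniformly within BOUNDS."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]


def target_rollout(policy_id, kin):
    """Toy stand-in for one target-domain rollout; returns an episode return.

    A real rollout would run universal policy `policy_id`, conditioned on
    `kin`, in the target environment. Here the return peaks at a made-up
    optimum that differs per policy.
    """
    best = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.95)][policy_id]
    return -sum((k - b) ** 2 for k, b in zip(kin, best)) - 0.01 * policy_id


def adapt(budget, seed=0):
    """Pick the best (policy, kinematics) pair within a rollout budget.

    Random search is used here as a simple substitute for Bayesian
    optimization; each loop iteration spends exactly one rollout.
    """
    rng = random.Random(seed)
    best_score, best_choice = float("-inf"), None
    for _ in range(budget):
        policy_id = rng.randrange(NUM_POLICIES)
        kin = sample_kinematics(rng)
        score = target_rollout(policy_id, kin)
        if score > best_score:
            best_score, best_choice = score, (policy_id, kin)
    return best_choice, best_score


if __name__ == "__main__":
    choice, score = adapt(budget=20)
    print("selected policy:", choice[0])
    print("virtual kinematic parameters:", choice[1])
    print("toy return:", score)
```

Because the policies are conditioned on virtual kinematic parameters, the search space stays low-dimensional, which is what makes adaptation with only a handful of rollouts plausible. A Bayesian optimizer would replace the uniform sampling in `adapt` with proposals guided by a surrogate model of the return.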
