Paper Title
SafeAPT: Safe Simulation-to-Real Robot Learning using Diverse Policies Learned in Simulation
Paper Authors
Paper Abstract
The framework of simulation-to-real learning, i.e., learning policies in simulation and transferring those policies to the real world, is one of the most promising approaches to data-efficient learning in robotics. However, due to the inevitable reality gap between simulation and the real world, a policy learned in simulation may not always generate safe behaviour on the real robot. As a result, during adaptation of the policy in the real world, the robot may damage itself or cause harm to its surroundings. In this work, we introduce a novel learning algorithm called SafeAPT that leverages a diverse repertoire of policies evolved in simulation and transfers the most promising safe policy to the real robot through episodic interaction. To achieve this, SafeAPT iteratively learns a probabilistic reward model as well as a safety model, using real-world observations combined with simulated experiences as priors. Then, it performs Bayesian optimization on the repertoire with the reward model while maintaining the specified safety constraint using the safety model. SafeAPT allows a robot to adapt safely to a wide range of goals with the same repertoire of policies evolved in simulation. We compare SafeAPT with several baselines, in both simulated and real robotic experiments, and show that SafeAPT finds high-performance policies within a few minutes in the real world while minimizing safety violations during the interactions.
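The abstract describes a concrete loop: learn probabilistic reward and safety models with simulated values as priors, then run safety-constrained Bayesian optimization over a discrete repertoire of policies. The sketch below illustrates one plausible reading of that loop. It is not the authors' implementation: the descriptor representation, the Gaussian-process residual models with the simulation as an additive prior mean, the UCB/LCB acquisition, and all function names (e.g. `evaluate_on_robot`) are assumptions made for illustration.

```python
# Illustrative sketch of a SafeAPT-style adaptation loop (hypothetical API).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def safe_adaptation(descriptors, sim_rewards, sim_safety,
                    evaluate_on_robot, safety_threshold,
                    n_episodes=20, beta=2.0):
    """Safety-constrained Bayesian optimization over a discrete repertoire.

    descriptors: (N, D) array, one behavior descriptor per repertoire policy.
    sim_rewards, sim_safety: (N,) simulated values, used as prior means.
    evaluate_on_robot(i) -> (reward, safety) from one real-world episode.
    """
    # GPs model the *residual* between real and simulated values, so the
    # simulation acts as a prior: where no real data exists, predictions
    # fall back to the simulated reward and safety values.
    gp_reward = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3)
    gp_safety = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3)

    X, dr, ds = [], [], []          # tried descriptors, observed residuals
    best_idx, best_reward = None, -np.inf

    for _ in range(n_episodes):
        if X:
            mu_r, sig_r = gp_reward.predict(descriptors, return_std=True)
            mu_s, sig_s = gp_safety.predict(descriptors, return_std=True)
        else:                        # no real data yet: prior only
            mu_r = mu_s = np.zeros(len(descriptors))
            sig_r = sig_s = np.ones(len(descriptors))

        reward_ucb = sim_rewards + mu_r + beta * sig_r   # optimistic reward
        safety_lcb = sim_safety + mu_s - beta * sig_s    # pessimistic safety

        # Choose the most promising policy among those predicted safe with
        # high confidence; if none qualifies, pick the likeliest-safe one.
        safe = safety_lcb >= safety_threshold
        if safe.any():
            cand = np.where(safe)[0]
            i = int(cand[np.argmax(reward_ucb[cand])])
        else:
            i = int(np.argmax(safety_lcb))

        reward, safety = evaluate_on_robot(i)            # one real episode
        if safety >= safety_threshold and reward > best_reward:
            best_idx, best_reward = i, reward

        # Update both models with the new real-world observation.
        X.append(descriptors[i])
        dr.append(reward - sim_rewards[i])
        ds.append(safety - sim_safety[i])
        gp_reward.fit(np.array(X), np.array(dr))
        gp_safety.fit(np.array(X), np.array(ds))

    return best_idx, best_reward
```

Modeling residuals rather than raw values is one way to realize "simulated experiences as priors": the same repertoire and models can then be reused across goals by swapping the reward definition, matching the abstract's claim of adaptation to a wide range of goals from a single repertoire.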