Paper Title

Preventing Imitation Learning with Adversarial Policy Ensembles

Paper Authors

Albert Zhan, Stas Tiomkin, Pieter Abbeel

Paper Abstract

Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy. Policies, such as those of humans or of deployed robots, can be cloned without their owners' consent. How can we protect against external observers cloning our proprietary policies? To answer this question, we introduce a new reinforcement learning framework in which we train an ensemble of near-optimal policies whose demonstrations are guaranteed to be useless to an external observer. We formulate this idea as a constrained optimization problem, where the objective is to improve the proprietary policies while deteriorating the virtual policy of an eventual external observer. We design a tractable algorithm that solves this optimization problem by modifying the standard policy gradient algorithm. Our formulation can be interpreted through the lenses of confidentiality and adversarial behaviour, which enables a broader perspective on this work. We demonstrate the existence of "non-clonable" ensembles, which provide a solution to the above optimization problem and are computed by our modified policy gradient algorithm. To our knowledge, this is the first work on the protection of policies in reinforcement learning.
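To make the construction concrete, below is a minimal, self-contained sketch of the idea, not the authors' implementation. An ensemble of tabular softmax policies is trained on a toy two-goal chain MDP so that each member stays near-optimal while the "virtual" policy an observer would clone from pooled demonstrations, approximated here as the unweighted per-state mixture of the members' action distributions, performs poorly. The chain MDP, the mixture model of the observer, the hyperparameters, and the finite-difference gradient ascent are all illustrative assumptions standing in for the paper's analytically modified policy gradient algorithm.

```python
# Sketch of a "non-clonable" ensemble on a toy two-goal chain MDP.
# Assumption: the external observer's cloned policy is modeled as the
# per-state mixture of the ensemble members; the paper's modified policy
# gradient is replaced by finite-difference ascent on an exact objective.
import numpy as np

S, A, K = 9, 2, 2              # chain states, actions (0 = left, 1 = right), ensemble size
START, T = S // 2, 10          # start in the middle of the chain; finite horizon
LAM, LR, EPS = 2.0, 0.5, 1e-4  # penalty weight, ascent step size, finite-difference step

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def value(pi):
    # Exact finite-horizon value of the start state under policy pi:
    # reward 1 for stepping onto either end of the chain, which is absorbing.
    v = np.zeros(S)
    for _ in range(T):
        nxt = np.zeros(S)
        for s in range(1, S - 1):
            for a, s2 in ((0, s - 1), (1, s + 1)):
                r = 1.0 if s2 in (0, S - 1) else 0.0
                nxt[s] += pi[s, a] * (r + v[s2])
        v = nxt
    return v[START]

def objective(theta):
    # Constrained-optimization surrogate: mean member return, minus a penalty
    # on the return of the observer's cloned ("virtual") mixture policy.
    pis = softmax(theta)
    member = np.mean([value(p) for p in pis])
    observer = value(pis.mean(axis=0))
    return member - LAM * observer

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(K, S, A))  # softmax logits per ensemble member
for _ in range(300):
    # Finite-difference gradient ascent; cheap here because the
    # parameterization is tabular and the objective is exact.
    base, grad = objective(theta), np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        theta[idx] += EPS
        grad[idx] = (objective(theta) - base) / EPS
        theta[idx] -= EPS
    theta += LR * grad

pis = softmax(theta)
print("member returns:", [round(value(p), 3) for p in pis])
print("observer (mixture) return:", round(value(pis.mean(axis=0)), 3))
```

With these toy settings, runs of the sketch typically split the ensemble into one member that walks left and one that walks right: each member reaches a goal quickly and is near-optimal on its own, while their mixture random-walks and often fails to reach either goal within the horizon. This is exactly the behaviour the abstract describes, where demonstrations pooled across the ensemble are useless to a cloning observer.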
