Paper Title
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents
Paper Authors
Paper Abstract
Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks, in which the adversary only observes the world state and treats the target agent as any other part of the environment. However, this does not account for additional structure in the problem. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies in which an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents in two-player games and text-generating language models. (2) We demonstrate that these policies can achieve higher initial and asymptotic performance against a target agent than black-box controls. Code is available at https://github.com/thestephencasper/lm_white_box_attacks.
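To make the white-box setup concrete, below is a minimal sketch of an adversarial policy that conditions on both the world state and the target's internal activations. This is an illustration of the idea only, not the repository's implementation; all class names, dimensions, and architectures here are hypothetical.

```python
# Hedged sketch (not the authors' code): a white-box adversarial policy
# observes the world state AND the frozen target agent's internal
# activations at each timestep; a black-box policy would see only the
# world state. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TargetAgent(nn.Module):
    """Stand-in target policy; its hidden layer plays the role of the
    'internal state' exposed to the white-box attacker."""
    def __init__(self, obs_dim=8, hidden_dim=32, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, world_state):
        internal = torch.tanh(self.encoder(world_state))  # visible to attacker
        action_logits = self.head(internal)
        return action_logits, internal

class WhiteBoxAdversary(nn.Module):
    """Adversarial policy conditioned on world state + target internals."""
    def __init__(self, obs_dim=8, target_hidden_dim=32, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + target_hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, act_dim),
        )

    def forward(self, world_state, target_internal):
        # Concatenating the target's internals gives the attacker
        # "mind-reading" access on top of the ordinary observation.
        return self.net(torch.cat([world_state, target_internal], dim=-1))

target = TargetAgent()
adversary = WhiteBoxAdversary()

world_state = torch.randn(1, 8)
with torch.no_grad():  # the target is frozen; only the adversary trains
    _, target_internal = target(world_state)
adv_action_logits = adversary(world_state, target_internal)
```

Per the abstract, only the adversary would then be trained, with the target's reward as the quantity to minimize; the choice of RL training algorithm (e.g., a policy-gradient method) is not specified here and is left as an assumption.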