Paper Title
Adversarial Feature Desensitization
Paper Authors
Paper Abstract
Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs that can drastically impair a network's performance. Many defense methods have been proposed to improve the robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness that builds on insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims to learn features that are invariant to adversarial perturbations of the inputs. This is achieved through a game in which we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e., features that cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
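
The abstract describes a two-player game: a feature extractor and task classifier are trained to stay predictive, while a discriminator tries to tell natural from adversarial features and the extractor is trained to fool it. The sketch below illustrates that idea in PyTorch-style pseudocode. All names (`features`, `task_head`, `discriminator`, `pgd_attack`) and the exact combination of losses are assumptions for illustration only and may differ from the authors' implementation in the linked repository.

```python
# Minimal sketch of the adversarial feature desensitization game (assumed form).
import torch
import torch.nn.functional as F

def afd_training_step(features, task_head, discriminator, x, y,
                      opt_main, opt_disc, pgd_attack):
    # 1) Craft adversarial examples against the current task model
    #    (pgd_attack is a placeholder for any white-box attack).
    x_adv = pgd_attack(lambda inp: task_head(features(inp)), x, y)

    # 2) Discriminator update: learn to separate natural features (label 1)
    #    from adversarial features (label 0). Features are detached so only
    #    the discriminator is updated here.
    z_nat, z_adv = features(x).detach(), features(x_adv).detach()
    d_nat, d_adv = discriminator(z_nat), discriminator(z_adv)
    d_loss = (F.binary_cross_entropy_with_logits(d_nat, torch.ones_like(d_nat))
              + F.binary_cross_entropy_with_logits(d_adv, torch.zeros_like(d_adv)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 3) Feature extractor / task head update: remain predictive on natural
    #    and adversarial inputs while fooling the discriminator, pushing the
    #    features to become insensitive to the perturbation.
    z_adv = features(x_adv)
    task_loss = (F.cross_entropy(task_head(features(x)), y)
                 + F.cross_entropy(task_head(z_adv), y))
    d_out = discriminator(z_adv)
    fool_loss = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    opt_main.zero_grad()
    (task_loss + fool_loss).backward()
    opt_main.step()
```

Here `opt_main` is assumed to optimize only the feature extractor and task head, and `opt_disc` only the discriminator, so the two players are updated in alternation as in standard adversarial (GAN-style) training.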