Paper Title
Proper Network Interpretability Helps Adversarial Robustness in Classification
Paper Authors
Paper Abstract
Recent works have empirically shown that there exist adversarial examples that can be hidden from neural network interpretability (namely, making network interpretation maps visually similar), or that interpretability itself is susceptible to adversarial attacks. In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy, as confirmed by experiments on MNIST, CIFAR-10, and Restricted ImageNet. Spurred by that, we develop an interpretability-aware defensive scheme built only on promoting robust interpretation (without resorting to adversarial loss minimization). We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods, particularly against attacks with large perturbations.