Paper Title
Diffusion Models for Adversarial Purification
Paper Authors
Paper Abstract
Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model. These methods do not make assumptions on the form of attack and the classification model, and thus can defend pre-existing classifiers against unseen threats. However, their performance currently falls behind adversarial training methods. In this work, we propose DiffPure that uses diffusion models for adversarial purification: Given an adversarial example, we first diffuse it with a small amount of noise following a forward diffusion process, and then recover the clean image through a reverse generative process. To evaluate our method against strong adaptive attacks in an efficient and scalable way, we propose to use the adjoint method to compute full gradients of the reverse generative process. Extensive experiments on three image datasets including CIFAR-10, ImageNet and CelebA-HQ with three classifier architectures including ResNet, WideResNet and ViT demonstrate that our method achieves the state-of-the-art results, outperforming current adversarial training and adversarial purification methods, often by a large margin. Project page: https://diffpure.github.io.
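The purification procedure the abstract describes (diffuse the adversarial input with a small amount of noise via the forward process, then run the reverse generative process back to a clean image) can be sketched as follows. This is a minimal, illustrative sketch assuming a DDPM-style variance-preserving discretization, not the paper's implementation: the `score` function is a hypothetical stand-in that is exact only for standard-Gaussian toy data (a real purifier would use a trained score network), and names like `purify`, `t_star`, and the schedule constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                            # total diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_diffuse(x0, t_star):
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t_star]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def score(x_t, t):
    """Toy score: if clean data are N(0, I), the noised marginal is also
    N(0, I), whose score is -x. A trained score network goes here."""
    return -x_t

def reverse_denoise(x_t, t_star):
    """Ancestral DDPM reverse steps from t_star back to 0."""
    x = x_t
    for t in range(t_star, -1, -1):
        # Convert score to predicted noise: eps_hat = -sqrt(1 - a_bar) * score
        eps_hat = -np.sqrt(1.0 - alpha_bars[t]) * score(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
               / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def purify(x_adv, t_star=100):
    """DiffPure idea: small forward diffusion, then reverse generation."""
    return reverse_denoise(forward_diffuse(x_adv, t_star), t_star)

x_adv = rng.standard_normal((8,)) + 0.3  # toy "adversarially perturbed" input
x_pure = purify(x_adv, t_star=100)
print(x_pure.shape)
```

The key design choice, per the abstract, is that `t_star` is small relative to `T`: enough noise to wash out the adversarial perturbation, but little enough that the reverse process recovers an image close to the original rather than sampling a new one.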