Paper Title

ATRO: Adversarial Training with a Rejection Option

Paper Authors

Masahiro Kato, Zhenghang Cui, Yoshihiro Fukuhara

Paper Abstract

This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples. Although recent machine learning algorithms achieve high prediction performance, they are empirically vulnerable to adversarial examples: slightly perturbed data samples that are wrongly classified. In real-world applications, adversarial attacks using such examples can cause serious problems. To address this problem, various methods have been proposed to obtain classifiers that are robust against adversarial examples. Adversarial training is one such method; it trains a classifier to minimize the worst-case loss under adversarial attacks. In this paper, to obtain a classifier that is more reliable under adversarial attacks, we propose Adversarial Training with a Rejection Option (ATRO). By applying the adversarial training objective to both a classifier and a rejection function simultaneously, a classifier trained by ATRO can abstain from classification when it lacks sufficient confidence to classify a test data point. We examine the feasibility of the framework using the surrogate maximum hinge loss and establish a generalization bound for linear models. Furthermore, we empirically confirm the effectiveness of ATRO using various models and real-world datasets.
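To make the abstract's training objective concrete, below is a minimal sketch of how an ATRO-style objective could be instantiated for a linear binary classifier. It uses one common form of the max-hinge surrogate from Cortes et al.'s learning-with-rejection framework (a hinge-style upper bound that couples the classifier score f and the rejection score r), with an inner PGD loop approximating the worst-case perturbation. All names and hyperparameter values here (mh_loss, pgd_attack, cost, alpha, beta, eps) are illustrative assumptions, not the authors' exact formulation or configuration.

```python
# Minimal sketch of ATRO-style training for a linear binary classifier,
# assuming a max-hinge surrogate and a PGD inner maximization.
import torch

def mh_loss(f_out, r_out, y, cost=0.3, alpha=1.0, beta=1.0):
    """Max-hinge surrogate: max(1 + (alpha/2)(r - y*f), cost*(1 - beta*r), 0).
    Rejecting (r <= 0) incurs a fixed cost instead of the classification loss."""
    zeros = torch.zeros_like(f_out)
    term1 = 1.0 + 0.5 * alpha * (r_out - y * f_out)
    term2 = cost * (1.0 - beta * r_out)
    return torch.max(torch.max(term1, term2), zeros).mean()

def pgd_attack(x, y, f, r, eps=0.1, steps=10, lr=0.02):
    """Inner maximization: search an L-inf ball for a perturbation
    that maximizes the surrogate loss of the current (f, r) pair."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = mh_loss(f(x + delta).squeeze(-1), r(x + delta).squeeze(-1), y)
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()   # gradient ascent step
            delta.clamp_(-eps, eps)           # project back into the ball
        delta.grad.zero_()
    return delta.detach()

# Toy data with labels y in {-1, +1}.
torch.manual_seed(0)
x = torch.randn(128, 2)
y = torch.sign(x[:, 0] + 0.5 * x[:, 1])

f = torch.nn.Linear(2, 1)   # classifier score
r = torch.nn.Linear(2, 1)   # rejection score (abstain when r(x) <= 0)
opt = torch.optim.SGD(list(f.parameters()) + list(r.parameters()), lr=0.1)

for epoch in range(50):
    delta = pgd_attack(x, y, f, r)   # approximate worst-case perturbation
    opt.zero_grad()                  # discard grads accumulated by the attack
    loss = mh_loss(f(x + delta).squeeze(-1), r(x + delta).squeeze(-1), y)
    loss.backward()
    opt.step()

# At test time: predict sign(f(x)) if r(x) > 0, otherwise abstain.
```

Because both f and r appear in the worst-case loss, the minimization trains the rejection function on adversarially perturbed inputs as well, which is what lets the trained classifier abstain on points it cannot classify confidently under attack.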
