Paper Title
Identifying Adversarial Attacks on Text Classifiers
Paper Authors
Paper Abstract
The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification -- determining if a given text has been adversarially manipulated and by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing content and presentation of text; language model properties, determining which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers.
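To make the three feature classes concrete, the sketch below (not the paper's implementation) shows minimal examples of each, assuming PyTorch, Hugging Face transformers, and GPT-2 as a stand-in language model; the function names, the specific surface statistics, and the output-level classifier features are illustrative assumptions, and the paper's actual features (including internal node activations) are more extensive.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical choice of language model; the paper does not prescribe GPT-2 here.
lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def text_properties(text):
    # Surface-level features capturing the content and presentation of the text.
    return {
        "num_chars": len(text),
        "num_words": len(text.split()),
        "frac_non_ascii": sum(ord(c) > 127 for c in text) / max(len(text), 1),
        "frac_upper": sum(c.isupper() for c in text) / max(len(text), 1),
    }

def language_model_properties(text):
    # Per-token log-probabilities under the language model, summarized so that
    # improbable tokens (e.g. unusual substitutions) show up as low values.
    ids = lm_tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return {"mean_log_prob": token_lp.mean().item(),
            "min_log_prob": token_lp.min().item()}

def target_model_properties(class_probs):
    # Output-level features of the attacked text classifier; the paper additionally
    # uses internal node activations, which are omitted here for brevity.
    p = sorted(class_probs, reverse=True)
    return {"top_prob": p[0], "margin": p[0] - p[1]}

Feature dictionaries like these could then be concatenated into a single vector per input and fed to an ordinary classifier for the attack detection and labeling tasks described above.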