WeDef: Weakly Supervised Backdoor Defense for Text Classification

Paper Title

WeDef: Weakly Supervised Backdoor Defense for Text Classification

Authors

Lesheng Jin, Zihan Wang, Jingbo Shang

Abstract

Existing backdoor defense methods are only effective for limited trigger types. To defend against different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework, WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words can be considered independent of the triggers. Therefore, a weakly supervised text classifier trained on only the poisoned documents, without their labels, will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1) iteratively refine the weak classifier based on the reliable samples and (2) train a binary poison classifier by distinguishing the most unreliable samples from the most reliable samples. Finally, we train the sanitized model on the samples that the poison classifier predicts as benign. Extensive experiments show that WeDef is effective against popular trigger-based attacks (e.g., words, sentences, and paraphrases), outperforming existing defense methods.
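
The reliability criterion described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: simple seed-word counting stands in for the weakly supervised classifier, and the seed words, class names, and function names (`SEED_WORDS`, `weak_predict`, `split_by_reliability`) are hypothetical. The iterative refinement and the binary poison classifier are not shown.

```python
from collections import Counter

# Hypothetical seed words per class (assumption for illustration only).
SEED_WORDS = {
    "positive": {"good", "great", "excellent"},
    "negative": {"bad", "awful", "terrible"},
}


def weak_predict(text, seed_words=SEED_WORDS):
    """Return the class whose seed words occur most often, or None if none occur.

    This label-free rule is a stand-in for a weakly supervised classifier;
    ties are broken by dictionary order.
    """
    tokens = Counter(text.lower().split())
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in seed_words.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def split_by_reliability(samples, seed_words=SEED_WORDS):
    """Split (text, label) pairs by agreement between weak prediction and label."""
    reliable, unreliable = [], []
    for text, label in samples:
        prediction = weak_predict(text, seed_words)
        if prediction == label:
            reliable.append((text, label))
        else:
            unreliable.append((text, label))
    return reliable, unreliable
```

In this sketch, a sample whose given label disagrees with the seed-word prediction (for example, a negative review relabeled as positive by an attacker) lands in the unreliable set, which is the signal the abstract's sanitization phases build on.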

