Paper Title
Improving the Adversarial Robustness of NLP Models by Information Bottleneck
Paper Authors
Paper Abstract
Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features: features that are highly predictive but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones, using the information bottleneck theory. Through extensive experiments, we show that models trained with our information bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods, while suffering almost no drop in clean accuracy on the SST-2, AGNEWS, and IMDB datasets.
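To make the idea concrete, below is a minimal sketch of a variational information bottleneck (VIB) objective of the kind the abstract describes, written in PyTorch. The module names, dimensions, and the `beta` value are illustrative assumptions, not the paper's actual implementation: a Gaussian bottleneck is placed on top of a generic sentence representation, and the loss trades off label prediction against compression of the input.

```python
# A minimal VIB sketch (illustrative, not the paper's implementation).
# Assumes `x` is a precomputed sentence embedding, e.g. a 768-d [CLS] vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    def __init__(self, input_dim=768, bottleneck_dim=128, num_classes=2):
        super().__init__()
        # Encoder maps the sentence representation to the parameters of a
        # diagonal Gaussian posterior q(z|x) over the bottleneck variable z.
        self.mu = nn.Linear(input_dim, bottleneck_dim)
        self.logvar = nn.Linear(input_dim, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.classifier(z), mu, logvar

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    # Task term: keep z predictive of the label (preserves I(z; y)).
    ce = F.cross_entropy(logits, labels)
    # Compression term: KL(q(z|x) || N(0, I)) penalizes information about
    # the input retained in z, squeezing out easily manipulated features.
    kl = -0.5 * torch.mean(
        torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return ce + beta * kl

# Usage sketch with random stand-in data:
model = VIBClassifier()
x = torch.randn(8, 768)            # batch of sentence embeddings
y = torch.randint(0, 2, (8,))      # binary labels, e.g. SST-2 sentiment
logits, mu, logvar = model(x)
loss = vib_loss(logits, y, mu, logvar)
loss.backward()
```

Here `beta` controls the compression/prediction trade-off: larger values discard more input information, which is the lever the abstract credits for removing non-robust features at some potential cost to clean accuracy.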