Paper Title
Unfooling Perturbation-Based Post Hoc Explainers
Paper Authors
Paper Abstract
Monumental advancements in artificial intelligence (AI) have lured the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary of the lack of transparency in their decision-making processes. Perturbation-based post hoc explainers offer a model-agnostic means of interpreting these systems while requiring only query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise: how can we audit these black box systems, and how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial attacks on perturbation-based explainers. We propose algorithms for the detection (CAD-Detect) and defense (CAD-Defend) of these attacks, which are aided by our novel conditional anomaly detection approach, KNN-CAD. We demonstrate that our approach successfully detects whether a black box system adversarially conceals its decision-making process and mitigates the adversarial attack on real-world data for the prevalent explainers, LIME and SHAP.
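To make the threat model and the detection idea concrete, the following is a minimal, self-contained Python sketch. It is not the paper's actual CAD-Detect/CAD-Defend or KNN-CAD implementation; the class and function names, thresholds, and synthetic data below are illustrative assumptions. It builds an adversarial "scaffold" black box that answers honestly only on queries resembling real data, so that the off-distribution queries generated by perturbation-based explainers such as LIME and SHAP are routed to an innocuous surrogate, and it shows a simple k-nearest-neighbor distance score an auditor could use to flag that routing.

# Minimal sketch (assumed names, not the paper's code): an adversarial scaffold
# that conceals a biased model from perturbation-based explainers, plus a
# k-NN distance score that separates real queries from explainer perturbations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic "real" data; feature 0 plays the role of a sensitive attribute.
X_real = rng.normal(0.0, 1.0, size=(500, 5))

def biased_model(X):
    # Decision depends almost entirely on the sensitive feature 0.
    return (X[:, 0] > 0).astype(int)

def innocuous_model(X):
    # Decision ignores feature 0 entirely.
    return (X[:, 1] + X[:, 2] > 0).astype(int)

# Attacker's off-manifold test: explainer perturbations fall off the data
# manifold, so a crude nearest-neighbor distance check suffices here.
_nn_attack = NearestNeighbors(n_neighbors=1).fit(X_real)

class AdversarialScaffold:
    """Black box that answers with the biased model only on realistic queries."""
    def predict(self, X):
        dist, _ = _nn_attack.kneighbors(X)
        on_manifold = dist[:, 0] < 1.5  # illustrative threshold
        return np.where(on_manifold, biased_model(X), innocuous_model(X))

# Auditor's side: a simple k-NN anomaly score over reference data. The paper's
# KNN-CAD is a *conditional* anomaly detector; this sketch only scores inputs.
_nn_audit = NearestNeighbors(n_neighbors=5).fit(X_real)

def knn_anomaly_score(X):
    dist, _ = _nn_audit.kneighbors(X)
    return dist.mean(axis=1)  # larger = more anomalous

# LIME/SHAP-style perturbations: noisy samples around one instance.
x0 = X_real[0]
perturbations = x0 + rng.normal(0.0, 2.0, size=(200, 5))

scaffold = AdversarialScaffold()
print("Agreement with biased model on real data:     ",
      (scaffold.predict(X_real) == biased_model(X_real)).mean())
print("Agreement with biased model on perturbations: ",
      (scaffold.predict(perturbations) == biased_model(perturbations)).mean())
print("Mean k-NN score, real data:      %.2f" % knn_anomaly_score(X_real).mean())
print("Mean k-NN score, perturbations:  %.2f" % knn_anomaly_score(perturbations).mean())

In this toy setting the scaffold agrees with the biased model on real data but reverts to the innocuous surrogate on explainer-style perturbations, while the k-NN score cleanly separates the two query populations; this is the kind of distributional signal a detection approach such as CAD-Detect could plausibly exploit.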