Paper Title
Deceptive AI Explanations: Creation and Detection
Paper Authors
Paper Abstract
Artificial intelligence (AI) comes with great opportunities but can also pose significant risks. Automatically generated explanations for decisions can increase transparency and foster trust, especially for systems based on automated predictions by AI models. However, given economic incentives to create dishonest AI, for example, to what extent can we trust explanations? To address this issue, our work investigates how AI models (i.e., deep learning) and existing instruments for increasing the transparency of AI decisions can be used to create and detect deceptive explanations. As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM, a well-established explanation technique for neural networks. We then evaluate the effect of deceptive explanations on users in an experiment with 200 participants. Our findings confirm that deceptive explanations can indeed fool humans. However, given sufficient domain knowledge, machine learning (ML) methods can detect even seemingly minor deception attempts with accuracy exceeding 80%. Without domain knowledge, one can still infer inconsistencies in the explanations in an unsupervised manner, given basic knowledge of the predictive model under scrutiny.
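To make the setting concrete, the sketch below illustrates the general idea described in the abstract: computing Grad-CAM-style token attributions for a text classifier and then perturbing them to highlight a misleading token. This is a minimal illustration in PyTorch under assumed names (`TinyTextCNN`, `gradcam_tokens`, `deceptive_cam` are hypothetical), not the authors' implementation or their actual manipulation strategy.

```python
# Illustrative sketch only: Grad-CAM-style token attributions for a tiny text CNN,
# plus a hypothetical "deceptive" variant that shifts relevance onto an arbitrary token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextCNN(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, n_filters=16, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):
        x = self.emb(tokens).transpose(1, 2)      # (batch, emb_dim, seq_len)
        self.feat = self.conv(x)                  # keep feature maps for Grad-CAM
        self.feat.retain_grad()
        pooled = F.relu(self.feat).max(dim=2).values
        return self.fc(pooled)

def gradcam_tokens(model, tokens, target_class):
    """Grad-CAM over the conv feature maps, projected back onto tokens."""
    logits = model(tokens)
    model.zero_grad()
    logits[0, target_class].backward()
    grads = model.feat.grad[0]                    # (n_filters, seq_len)
    weights = grads.mean(dim=1, keepdim=True)     # per-channel importance
    cam = F.relu((weights * model.feat[0]).sum(dim=0))
    return (cam / (cam.max() + 1e-8)).detach()    # normalized per-token relevance

def deceptive_cam(cam, fake_idx, strength=0.8):
    """Hypothetical manipulation: move relevance mass onto a misleading token."""
    out = (1 - strength) * cam
    out[fake_idx] = cam.max()
    return out / (out.max() + 1e-8)

tokens = torch.randint(0, 1000, (1, 12))
model = TinyTextCNN()
honest = gradcam_tokens(model, tokens, target_class=1)
deceived = deceptive_cam(honest, fake_idx=5)
print(honest.numpy().round(2))
print(deceived.numpy().round(2))
```

Detecting such manipulations can then be framed, as the abstract suggests, as a supervised task: train a classifier on pairs of inputs and (honest vs. altered) attribution maps, or, without labeled data, check whether an explanation is consistent with the model's behavior under perturbations of the highlighted tokens.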