Paper Title
Model extraction from counterfactual explanations
Paper Authors
Paper Abstract
Post-hoc explanation techniques refer to a posteriori methods that can be used to explain how black-box machine learning models produce their outcomes. Among these techniques, counterfactual explanations are becoming one of the most popular ways to achieve this objective. In particular, in addition to highlighting the most important features used by the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, by doing so, they also leak non-trivial information about the model itself, which raises privacy concerns. In this work, we demonstrate how an adversary can leverage the information provided by counterfactual explanations to mount high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations. An empirical evaluation of the proposed attack on black-box models trained on real-world datasets demonstrates that it achieves high-fidelity and high-accuracy extraction even under low query budgets.
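To make the attack's intuition concrete, below is a minimal, self-contained Python sketch, not the paper's actual algorithm: the target model, the query distribution, and the deliberately naive random-walk counterfactual generator are all illustrative assumptions. The adversary records each query's predicted label together with its counterfactual (which carries the opposite label) and trains a surrogate model on both; fidelity is then measured as label agreement with the target on fresh inputs.

```python
# Hypothetical sketch of a model-extraction attack exploiting counterfactual
# explanations. The target model and counterfactual generator below are
# toy stand-ins, not the method described in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in for the black-box target model the adversary can only query.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                       random_state=0).fit(X, y)

def counterfactual(x, model, step=0.05, max_steps=200):
    """Naive counterfactual search: random-walk from x until the
    predicted label flips (a toy stand-in for a real CF generator)."""
    label = model.predict(x.reshape(1, -1))[0]
    cf = x.copy()
    for _ in range(max_steps):
        candidate = cf + step * rng.standard_normal(x.shape)
        if model.predict(candidate.reshape(1, -1))[0] != label:
            return candidate
        cf = candidate
    return None  # no counterfactual found within the step budget

# Adversary: spend a small query budget, keeping both the query points
# and their counterfactuals (which lie near the decision boundary).
budget = 100
queries = rng.standard_normal((budget, X.shape[1]))
X_sur, y_sur = [], []
for x in queries:
    y_hat = target.predict(x.reshape(1, -1))[0]
    X_sur.append(x); y_sur.append(y_hat)
    cf = counterfactual(x, target)
    if cf is not None:  # the counterfactual has the opposite (binary) label
        X_sur.append(cf); y_sur.append(1 - y_hat)

# Train the surrogate (extracted) model on query + counterfactual pairs.
surrogate = LogisticRegression().fit(np.array(X_sur), np.array(y_sur))

# Fidelity: agreement between surrogate and target on fresh inputs.
X_test = rng.standard_normal((1000, X.shape[1]))
fidelity = (surrogate.predict(X_test) == target.predict(X_test)).mean()
print(f"fidelity ~ {fidelity:.3f}")
```

Because each counterfactual sits just across the decision boundary and carries the opposite label, every query effectively yields two informative training points, which is the intuition behind achieving high fidelity under low query budgets.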