Paper Title

Rectify ViT Shortcut Learning by Visual Saliency

Paper Authors

Chong Ma, Lin Zhao, Yuzhong Chen, David Weizhong Liu, Xi Jiang, Tuo Zhang, Xintao Hu, Dinggang Shen, Dajiang Zhu, Tianming Liu

Paper Abstract

Shortcut learning is common in deep learning models but harmful: it leads to degenerated feature representations and consequently jeopardizes the model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts, which are predominantly driven by background-related factors. For example, in medical imaging, eye-gaze data from radiologists is an effective form of human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive, and sometimes even impractical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model is adopted to predict saliency maps for input image samples. The saliency maps are then used to distill the most informative image patches. In the proposed SGT, self-attention among image patches focuses only on the distilled informative ones. Considering that this distillation operation may cause a loss of global information, we further introduce, in the last encoder layer, a residual connection that captures the self-attention across all image patches. Experimental results on four independent public datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye-gaze data and achieves much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring human-prior-knowledge-derived visual saliency to rectify shortcut learning.
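The patch-distillation step described in the abstract can be pictured concretely. Below is a minimal sketch, not the authors' implementation: it assumes a precomputed saliency map, non-overlapping 16x16 patches in standard row-major ViT order, and a fixed keep ratio; `distill_patches`, `keep_ratio`, and all other names are hypothetical.

```python
# A minimal sketch (assumptions noted above, not the authors' code) of
# selecting the most salient patch tokens before ViT self-attention.
import torch
import torch.nn.functional as F

def distill_patches(patch_tokens, saliency_map, keep_ratio=0.5, patch=16):
    """Keep the patch tokens whose average saliency is highest.

    patch_tokens: (B, N, D) patch embeddings, N = (H/patch) * (W/patch)
    saliency_map: (B, H, W) saliency in [0, 1] from any saliency model
    """
    B, N, D = patch_tokens.shape
    # Average saliency per patch via non-overlapping pooling; the
    # row-major flattening matches the usual ViT patch ordering.
    per_patch = F.avg_pool2d(
        saliency_map.unsqueeze(1), kernel_size=patch
    ).flatten(1)                                   # (B, N)
    k = max(1, int(keep_ratio * N))
    top = per_patch.topk(k, dim=1).indices         # (B, k)
    idx = top.unsqueeze(-1).expand(-1, -1, D)      # (B, k, D)
    return patch_tokens.gather(1, idx), top

# Toy usage: an 8x8 grid of 16x16 patches on a 128x128 image.
tokens = torch.randn(2, 64, 192)
saliency = torch.rand(2, 128, 128)
kept, kept_idx = distill_patches(tokens, saliency, keep_ratio=0.25)
print(kept.shape)  # torch.Size([2, 16, 192])
```

This sketch covers only the selection step; in the design the abstract describes, the last encoder layer additionally attends over all image patches through a residual connection so that global information is not lost.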
