Paper Title
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Paper Authors
Paper Abstract
Vision transformers have achieved significant improvements on various vision tasks, but the quadratic interactions between their tokens significantly reduce computational efficiency. Recently, many pruning methods have been proposed to remove redundant tokens and obtain efficient vision transformers. However, existing studies mainly focus on token importance, preserving locally attentive tokens while completely ignoring global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces FLOPs by 35% with only a 0.2% drop in accuracy. Notably, benefiting from maintained token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
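The decouple-then-merge idea described in the abstract can be sketched roughly as follows. This is a simplified illustration only: the function name, the top-k split, and fusing all inattentive tokens into a single attention-weighted token are assumptions on our part, and the paper's full method additionally matches homogeneous attentive tokens, which this sketch omits.

```python
import numpy as np

def decouple_and_merge(tokens, cls_attn, keep_ratio=0.5):
    """Hypothetical sketch of class-attention-based token decoupling/merging.

    tokens:   (N, D) patch token embeddings (class token excluded).
    cls_attn: (N,)   attention weights from the class token to each patch token.

    Keeps the top-k attentive tokens and fuses the remaining inattentive
    tokens into one token weighted by their class-attention scores.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(cls_attn)[::-1]        # sort by class attention, descending
    attentive_idx = order[:k]                 # most discriminative tokens
    inattentive_idx = order[k:]               # candidates for merging
    attentive = tokens[attentive_idx]
    if inattentive_idx.size:
        w = cls_attn[inattentive_idx]
        w = w / w.sum()                       # normalize attention weights
        fused = (w[:, None] * tokens[inattentive_idx]).sum(axis=0, keepdims=True)
        return np.concatenate([attentive, fused], axis=0)
    return attentive
```

With `keep_ratio=0.5`, eight input tokens would be reduced to four attentive tokens plus one fused inattentive token, so the subsequent transformer blocks attend over five tokens instead of eight.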