Paper Title

Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Paper Authors

Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, Eric Xing

Paper Abstract

This paper explores the feasibility of finding an optimal sub-model from a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It can search a sub-structure from the original model end-to-end across multiple dimensions, including the input tokens, MHSA and MLP modules with state-of-the-art performance. Our method is based on a learnable and unified $\ell_1$ sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions. The searching process is highly efficient through a single-shot training scheme. For instance, on DeiT-S, ViT-Slim only takes ~43 GPU hours for the searching process, and the searched structure is flexible with diverse dimensionalities in different modules. Then, a budget threshold is employed according to the requirements of accuracy-FLOPs trade-off on running devices, and a re-training process is performed to obtain the final model. The extensive experiments show that our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
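To make the search procedure described above more concrete, here is a minimal sketch of the core idea: attach learnable slimming coefficients to a searchable dimension (the MLP hidden channels in this example), train with a weighted $\ell_1$ penalty so the coefficients compete globally in a single-shot pass, and then apply a budget threshold to decide which channels survive before re-training. This is an illustrative assumption based on the abstract, not the authors' released implementation; the names `SlimMLP`, `sparsity_penalty`, `budget_threshold`, and `keep_ratio` are hypothetical.

```python
# Minimal sketch of an L1-regularized, single-shot slimming search (assumed names,
# not the official ViT-Slim code at https://github.com/Arnav0400/ViT-Slim).
import torch
import torch.nn as nn


class SlimMLP(nn.Module):
    """Transformer MLP block whose hidden dimension carries a learnable soft mask."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()
        # One slimming coefficient per hidden channel, initialised to 1 (keep everything).
        self.mask = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)) * self.mask)


def sparsity_penalty(model: nn.Module, factor: float = 1e-4) -> torch.Tensor:
    """Weighted L1 penalty over all slimming coefficients; added to the task loss
    during the single-shot search so coefficients are pushed toward zero."""
    terms = [m.mask.abs().sum() for m in model.modules() if isinstance(m, SlimMLP)]
    return factor * torch.stack(terms).sum()


@torch.no_grad()
def budget_threshold(model: nn.Module, keep_ratio: float = 0.6) -> dict:
    """Choose a global threshold so roughly `keep_ratio` of coefficients survive,
    and return a boolean keep-mask per slimmable module."""
    scores = torch.cat(
        [m.mask.abs().flatten() for m in model.modules() if isinstance(m, SlimMLP)]
    )
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return {
        name: (m.mask.abs() >= threshold)
        for name, m in model.named_modules()
        if isinstance(m, SlimMLP)
    }


if __name__ == "__main__":
    block = SlimMLP(dim=384, hidden_dim=1536)       # DeiT-S-like sizes
    x = torch.randn(2, 197, 384)                    # (batch, tokens, dim)
    task_loss = block(x).pow(2).mean()              # stand-in for the real training loss
    loss = task_loss + sparsity_penalty(block)      # joint single-shot search objective
    loss.backward()
    keep = budget_threshold(block, keep_ratio=0.6)  # channels kept under the FLOPs budget
    print({name or "root": int(m.sum()) for name, m in keep.items()})
```

In the paper's setting the same kind of coefficient is shared across input tokens, MHSA, and MLP dimensions with pre-defined weighting factors per dimension; after thresholding, the surviving channels define the searched sub-structure, whose weights are sliced out and re-trained to produce the final model.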
